
Making the Work Flow with Kepler

Fall 2010

John Porter (VCR), Chau-Chin Lin (TERN), Jennifer Holm (LUQ), Ben Leinfelder (NCEAS)

During the summer of 2010 as part of the “Second Analytical Workshop on Dynamic Plot Application and Tool Design” in Kuala Lumpur, Malaysia, we had an opportunity to employ the Kepler scientific workflow tool in an authentic research context. The workshop brought together experts on tropical forests and ecological informatics to use innovative computational tools to examine how the biodiversity and spatial structure of forests change between locations around the world (see the Fall 2010 LTER Network News for more information about the workshop). Here we will focus on how we used Kepler and the challenges and opportunities it provided.

The challenge we faced in the workshop was the integration of somewhat heterogeneous datasets from mapped forest plots. The forest plots used in this study are long-term permanent plots set up in locations from North America to Central America and many parts of Asia. Each site is managed by different scientists, which leads to discrepancies in data gathering between locales. The data had some similarities (all had taxonomic designations, tree measurements and coordinates), but differed in the detail of the taxonomic data, in how the status of stems was designated (live, dead, main, secondary), in how the data were structured (some were in a single table, others in multiple tables) and in the names of fields. These data needed to be ingested, converted into standard forms, processed statistically to produce new summary data structures, and analyzed. The lessons that follow may apply to any group that works with large, dissimilar datasets from around the globe.
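
As an illustration of the conversion into standard forms described above, the following Python sketch maps two differently structured plot tables onto one schema. It is not the workshop's code (that work was done in R inside Kepler); the site names, field names, status codes and sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical per-site field names mapped onto one standard schema.
FIELD_MAPS = {
    "site_a": {"sp": "species", "dbh_cm": "dbh_cm", "gx": "x_m",
               "gy": "y_m", "status": "status"},
    "site_b": {"Species": "species", "DBH": "dbh_cm", "X": "x_m",
               "Y": "y_m", "Alive": "status"},
}

# Hypothetical per-site status codes mapped onto a shared vocabulary.
STATUS_MAPS = {
    "site_a": {"A": "live", "D": "dead"},
    "site_b": {"1": "live", "0": "dead"},
}


def standardize(site, csv_text):
    """Return the rows of one site as dicts that use the standard field names."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        # Rename every source field to its standard name.
        row = {std: raw[src].strip() for src, std in FIELD_MAPS[site].items()}
        # Translate the site-specific status code and parse the measurements.
        row["status"] = STATUS_MAPS[site][row["status"]]
        for field in ("dbh_cm", "x_m", "y_m"):
            row[field] = float(row[field])
        row["site"] = site
        rows.append(row)
    return rows


# Made-up sample rows, one per site, purely for illustration.
site_a_csv = "sp,dbh_cm,gx,gy,status\nSpecies A,12.5,3.0,4.5,A\n"
site_b_csv = "Species,DBH,X,Y,Alive\nSpecies B,30.1,10.0,2.0,0\n"

combined = standardize("site_a", site_a_csv) + standardize("site_b", site_b_csv)
```

Keeping the site-specific knowledge in lookup tables rather than in the processing code means that adding a site only requires a new pair of mappings.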

We chose to accomplish these tasks using the Kepler workflow system. Kepler was selected because it was freely available and had good support for Ecological Metadata Language (EML) and for the “R” statistical language. Kepler workflows appear as a set of interconnected boxes (“actors” in Kepler terminology), with each actor having one or more “ports” where connections can be made. Kepler has actors that provide automated EML data ingestion, execution of “R” programs, XML-stylesheet processing, text manipulation capabilities, and displays, among many other functions. For the RExpression actor, ports equate to scalar and vector elements that share the same name as the port. Ports can be used to output many forms of data, but not all of them are listed in the pull-down menu, so if Kepler cannot automatically determine the appropriate data type, it may take some experimentation to find an output type that matches the input of other actors. In Kepler a “director” controls the order of operations for actors. For analytical functions, as compared to modeling, this is almost always the “SDF” (synchronous data flow) director, with the number of iterations set to 1.
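
The actor, port and director vocabulary can be made concrete with a toy model. The Python sketch below is not Kepler's API (Kepler is a Java application); the `Actor` class, the `run_once` function and the sample values are hypothetical. Connections are modeled simply as shared port names, and the listed order of actors stands in for the schedule that a director would compute.

```python
class Actor:
    """A processing step with named input and output ports (toy model)."""

    def __init__(self, name, inputs, outputs, func):
        self.name = name
        self.inputs = inputs    # names of the ports read when the actor fires
        self.outputs = outputs  # names of the ports written when it fires
        self.func = func        # returns one value per output port

    def fire(self, tokens):
        # Read one token from each input port, compute, write the outputs.
        args = [tokens[port] for port in self.inputs]
        results = self.func(*args)
        for port, value in zip(self.outputs, results):
            tokens[port] = value


def run_once(actors, tokens):
    """Fire every actor exactly once, in order: a single iteration."""
    for actor in actors:
        actor.fire(tokens)
    return tokens


# Made-up stem diameters flow from ingestion to a summary to a display string.
workflow = [
    Actor("ingest", [], ["dbh"], lambda: ([12.5, 30.1, 8.0],)),
    Actor("summarize", ["dbh"], ["mean_dbh"],
          lambda dbh: (sum(dbh) / len(dbh),)),
    Actor("display", ["mean_dbh"], ["report"],
          lambda mean: ("mean dbh = %.1f cm" % mean,)),
]

result = run_once(workflow, {})
```

With the number of iterations set to 1, each step runs once from start to finish, which is the behavior one expects of an analysis script.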

During the course of the workflow development we needed to address several issues. The first was that using the EML actor to ingest very large data files caused Kepler to crash with a Java stack overflow exception. Because actors intercommunicate using data “tokens” held in memory, data files larger than 35 MB would cause the workflow to fail after protracted processing. To address this problem, we used the XML stylesheet processor actor to transform the EML document directly into an R script. The large data files were parsed and loaded dynamically from a specified location on disk and saved as an R workspace. Subsequent R actors were able to load this workspace without the need to pass the voluminous data values via Kepler ports. An additional issue was that error reporting in the RExpression actor was rudimentary; workflow execution might fail, but Kepler would not provide specific error messages from R regarding the nature of the problem. Therefore, we typically wrote and debugged R code outside Kepler before adding it to the workflow. The active Kepler developer community has been informed of these issues and always encourages user feedback, so hopefully future versions of Kepler will resolve these problems.
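
The idea of turning metadata into a loading script can be sketched as follows. In the workshop this was done with an XSLT stylesheet inside Kepler; the sketch below swaps in Python's `xml.etree.ElementTree` purely for illustration. The metadata fragment is a heavily simplified, hypothetical stand-in for a real EML document (which is namespaced and far more detailed), and the table name, file name and paths are assumptions.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical EML-like description of one data table.
EML_FRAGMENT = """
<dataTable>
  <entityName>plot_stems</entityName>
  <physical>
    <objectName>plot_stems.csv</objectName>
    <dataFormat>
      <textFormat>
        <numHeaderLines>1</numHeaderLines>
        <simpleDelimited>
          <fieldDelimiter>,</fieldDelimiter>
        </simpleDelimited>
      </textFormat>
    </dataFormat>
  </physical>
  <attributeList>
    <attribute><attributeName>species</attributeName></attribute>
    <attribute><attributeName>dbh_cm</attributeName></attribute>
    <attribute><attributeName>x_m</attributeName></attribute>
  </attributeList>
</dataTable>
"""


def eml_to_r_script(eml_text, data_dir, workspace_path):
    """Build R code that reads the described table from disk and saves it."""
    table = ET.fromstring(eml_text)
    name = table.findtext("entityName")
    file_name = table.findtext("physical/objectName")
    text_format = "physical/dataFormat/textFormat/"
    skip = int(table.findtext(text_format + "numHeaderLines"))
    sep = table.findtext(text_format + "simpleDelimited/fieldDelimiter")
    columns = [a.findtext("attributeName") for a in table.iter("attribute")]
    col_vector = ", ".join('"%s"' % column for column in columns)
    return "\n".join([
        # R reads the file itself, so no data values pass through the caller.
        '%s <- read.table("%s/%s", sep="%s", skip=%d, header=FALSE,'
        % (name, data_dir, file_name, sep, skip),
        "    col.names=c(%s), stringsAsFactors=FALSE)" % col_vector,
        # Later steps can call load() on this workspace file.
        'save(%s, file="%s")' % (name, workspace_path),
    ])


script = eml_to_r_script(EML_FRAGMENT, "/data", "/data/plot_stems.RData")
```

Only the short generated script and a file path travel between steps; the data values themselves stay on disk until R reads them, which is what keeps large files out of the in-memory tokens.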

Once these issues were addressed, the Kepler workflows worked quite well for ingesting and processing the data (Figure 1). The workflows successfully processed the data while also effectively communicating the analysis to the workshop group in a portable format. Workflows originally created in Virginia were revised in Taiwan and run in Malaysia. As relatively new Kepler users, we did find that there was a significant learning curve. Early workflows took about four times as long to create as performing the processing “manually” would have. For later workflows the time required dropped dramatically, primarily because Kepler made it easy to adapt and reuse workflows. As a graphical tool, Kepler is advantageous in that it organizes the entire analysis and allows researchers to see both the high-level process and each detailed step. Workflow parameters make it easy to customize an analysis, and text annotations can highlight pertinent notes about the workflow before it is distributed as a single, self-contained file. These features are especially important for complex analyses being developed by large and spatially distributed working groups, where clear communication and rapid development are essential to the collaboration.

Figure 1: Kepler Workflow for ingesting and processing data from a forest plot.
