Using the GCE Data Toolbox as an EML-compatible workflow engine for PASTA
Wade Sheldon (GCE)
The GCE Data Toolbox for MATLAB was initially developed in 2000 to process, quality control and document environmental data collected at the then-new Georgia Coastal Ecosystems LTER site (Sheldon, 2001). Development of this software framework has continued steadily since then, adding graphical user interface dialogs (Sheldon, 2002), data indexing and search (Sheldon, 2005), web-based data mining (Sheldon, 2006; Sheldon, 2011b), dynamic QA/QC (Sheldon, 2008), and a growing suite of tools for automating data harvesting and publishing (Sheldon et al., 2013; Gries et al., 2013). We began distributing a compiled version of the toolbox to the public in 2002, and in 2010 we released the complete source code under an open-source GPL license (Sheldon, 2011a). Today, the GCE Data Toolbox is used at multiple LTER sites and other research programs across the world for a wide variety of environmental data management tasks, and we are actively working to make it a more generalized tool for the scientific community (Chamblee et al., 2013).
The toolbox can be leveraged in many ways, but it has proven particularly useful for designing automated data processing, quality control and synthesis workflows (Sheldon et al., 2013; Cary and Chamblee, 2013; Gries et al., 2013). Key factors include broad data format support, a flexible metadata templating system, dynamic rule-based QA/QC, automated metadata generation and metadata-based semantic processing (fig. 1). Consequently, the GCE Data Toolbox was one of the technologies chosen for a 2012 LTER NIS workshop convened to test the PASTA Framework for running analytical workflows (see http://im.lternet.edu/im_practices/data_management/nis_workflows). During the workshop, the lack of built-in support for EML metadata proved to be a significant barrier to fully utilizing the toolbox for PASTA workflows; however, complete EML support has since been implemented. This article describes how the GCE Data Toolbox can now be used as a complete workflow engine for PASTA and other EML-compatible frameworks.
Figure 1. GCE Data Toolbox Data Set Editor and Metadata Template Editor GUI applications, with controls and options for managing metadata content and QA/QC rules to apply to harvested data.
EML-based Data Retrieval
The PASTA framework, as implemented to create the LTER Network Data Portal (https://portal.lternet.edu), is based on the EML data package concept and uses EML physical and attribute metadata to retrieve, validate and identify data objects that are stored in the system (Servilla et al., 2006). In order to download data from PASTA into the GCE Data Toolbox, code was developed to request an EML metadata document for a specified packageId using a PASTA web service. An XSLT stylesheet (EMLdataset2mfile.xsl) was then developed to transform the EML document into a native MATLAB program capable of downloading the described tabular data objects and parsing the data and metadata into MATLAB arrays, using the file structure and attribute metadata to generate appropriate command syntax. This approach was inspired by John Porter's PASTAprog, which transforms EML documents to generate R, SAS and SPSS code for retrieving and analyzing the data, and in fact this MATLAB stylesheet is now provided as an option for the PASTAprog web service at VCR and in the LTER Network Data Portal (see http://im.lternet.edu/im_practices/data_management/nis_workflows/PASTAprog). The generated program is saved as a MATLAB function m-file that can be run interactively or called in a workflow script. This function m-file is fully documented and can be archived along with the data, providing a means to re-download the same data in the future as well as useful provenance metadata for any workflows that leverage the data.
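The first step described above, requesting the EML document for a given packageId, can be sketched in a few lines. The sketch below uses Python rather than MATLAB purely for illustration, and it is not the toolbox's own code. The base URL and the /metadata/eml/{scope}/{identifier}/{revision} path are assumptions about the PASTA web service layout and should be checked against the PASTA API guide; the function names are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the PASTA package service (not stated in the article)
PASTA_BASE = "https://pasta.lternet.edu/package"


def parse_package_id(package_id):
    """Split a 'scope.identifier.revision' packageId into its three parts."""
    # rsplit from the right so any dots inside the scope would be preserved
    scope, identifier, revision = package_id.rsplit(".", 2)
    return scope, int(identifier), int(revision)


def eml_metadata_url(package_id, base=PASTA_BASE):
    """Build the (assumed) URL that returns the EML document for a packageId."""
    scope, identifier, revision = parse_package_id(package_id)
    return f"{base}/metadata/eml/{quote(scope)}/{identifier}/{revision}"


def fetch_eml(package_id):
    """Download the EML document as text (requires network access)."""
    with urlopen(eml_metadata_url(package_id), timeout=30) as response:
        return response.read().decode("utf-8")
```

Keeping URL construction separate from the download makes the request easy to log alongside the data, which serves the same provenance purpose as archiving the generated m-file.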
When the m-file function is called, a generic MATLAB data object (i.e. struct variable) is returned containing parsed metadata and data arrays organized into named fields. The data can be analyzed using standard MATLAB commands independently of the GCE Data Toolbox software; however, an import function was developed (eml2gce.m) to simplify transforming the parsed data and metadata into a toolbox-compatible data structure containing typed data, formatted documentation and attribute metadata, QA/QC rules and qualifier flags. Additional helper functions and GUI dialogs are also available to simplify mining data from PASTA and other EML repositories over the Internet, including the KNB Metacat and local site catalogs (fig. 2). Virtually any tabular text data that are properly described in EML (i.e. as dataTable entities and attributes) can now be retrieved into the GCE Data Toolbox with a single button press or workflow command, using only the structural metadata in EML to guide downloading, parsing and documenting of the data.
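To illustrate how structural metadata alone can guide parsing, the following Python sketch reads the header-line count, field delimiter and attribute list from a pared-down dataTable fragment and uses them to type the columns of a small delimited text table. It is an illustration only, not the logic of EMLdataset2mfile.xsl or eml2gce.m. The fragment omits elements a schema-valid EML document would require, and the table contents, attribute names and the read_table function are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Simplified, NOT schema-valid dataTable fragment (hypothetical example content)
EML_FRAGMENT = """\
<dataTable>
  <entityName>example_table</entityName>
  <physical>
    <dataFormat>
      <textFormat>
        <numHeaderLines>1</numHeaderLines>
        <simpleDelimited>
          <fieldDelimiter>,</fieldDelimiter>
        </simpleDelimited>
      </textFormat>
    </dataFormat>
  </physical>
  <attributeList>
    <attribute>
      <attributeName>Site</attributeName>
      <measurementScale><nominal/></measurementScale>
    </attribute>
    <attribute>
      <attributeName>Salinity</attributeName>
      <measurementScale><ratio/></measurementScale>
    </attribute>
  </attributeList>
</dataTable>
"""

DATA_TEXT = "Site,Salinity\nA,28.5\nB,31.0\n"


def read_table(eml_xml, data_text):
    """Parse delimited text into named, typed columns using EML metadata."""
    table = ET.fromstring(eml_xml)
    text_format = table.find("physical/dataFormat/textFormat")
    n_header = int(text_format.findtext("numHeaderLines", default="0"))
    delimiter = text_format.findtext("simpleDelimited/fieldDelimiter", default=",")

    names, numeric = [], []
    for attribute in table.findall("attributeList/attribute"):
        names.append(attribute.findtext("attributeName"))
        scale = attribute.find("measurementScale")
        # The first child of measurementScale names the scale type
        kind = scale[0].tag if scale is not None and len(scale) else "nominal"
        numeric.append(kind in ("ratio", "interval"))

    columns = {name: [] for name in names}
    rows = list(csv.reader(io.StringIO(data_text), delimiter=delimiter))
    for row in rows[n_header:]:  # skip the header lines declared in the EML
        for name, is_numeric, value in zip(names, numeric, row):
            columns[name].append(float(value) if is_numeric else value)
    return columns


if __name__ == "__main__":
    print(read_table(EML_FRAGMENT, DATA_TEXT))
```

A fuller implementation would also need to handle missing-value codes, date/time format strings, escaped delimiters such as tab characters, and XML namespaces, all of which this sketch ignores.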
Figure 2. GCE Data Toolbox metadata content displayed in the Metadata Editor application, and styled as plain text, generic toolbox XML and EML.
EML Generation for Derived Data
Many types of workflows only need to read data from PASTA to create an analytical product, report or graph. Such workflows were the focus of the 2012 NIS workshop described above. However, workflows that synthesize data from one or more data sets in PASTA to create a derived PASTA data set, or that archive processed primary data in PASTA, are also potentially useful and were envisioned for PASTA since conception. In 2013 we attempted to create such synthesis workflows during a follow-on PASTA workflows workshop at NTL (http://intranet2.lternet.edu/content/leveraging-pasta-and-eml-based-workflow-tools-lter-data-synthesis), but quickly ran into problems efficiently generating EML metadata for the derived products. Early plans for PASTA development included a "metadata factory" web service that could be used to programmatically generate EML in scripting environments. Unfortunately, that service was scaled back to provide only provenance metadata fragments, requiring the workflow developer to generate the majority of the EML content, including the complex but critically important attribute metadata. Manually authoring EML metadata using oXygen or Morpho proved too tedious and time-consuming to complete during the workshop, and was not recommended as a best practice for workflow development. A more automated approach for generating EML in workflows was clearly needed.
The GCE Data Toolbox is very adept at generating metadata for derived data sets during processing, by meshing metadata from source data sets and automatically creating attribute metadata for derived variables added by toolbox functions. The toolbox data model (i.e. GCE Data Structure; https://gce-svn.marsci.uga.edu/trac/GCE_Toolbox/wiki/DataModel) supports detailed, structured documentation and attribute metadata with fields based on the ESA's FLED report recommendations (Gross and Pake, 1995; Michener, 2000). A flexible metadata styling system is also available for transforming metadata into formatted text and XML documents (fig. 3). However, the toolbox data model pre-dates EML 2 by several years and intrinsically stores metadata content at lower granularity than EML requires (particularly personnel information), making cross-walking toolbox metadata to EML difficult and error-prone.
Figure 3. GCE Data Toolbox metadata content displayed in the Metadata Editor application, and styled as plain text, generic toolbox XML and EML.
In 2014 these difficulties were overcome, and support for producing fine-grained EML metadata was added to the GCE Data Toolbox. Native attribute descriptors (e.g. data type, variable type, number type, precision, code definitions, Q/C criteria) are automatically mapped to EML measurementScale equivalents, and units can be documented as custom units (complete with auto-generated STMML definitions in additionalMetadata), or can be mapped to user-specified standard and custom units managed in lookup-table data sets provided with toolbox downloads (i.e. EMLUnitDictionary.mat, EMLUnitMap.mat). Therefore, once data are successfully loaded into the GCE Data Toolbox and described, EML with congruent data tables can be generated with no additional effort, removing a huge barrier to uploading data to PASTA. A GUI dialog is now provided with the toolbox for generating complete EML data packages, or just dataTable and attributeList XML fragments for inclusion in separately generated EML, depending on the desired use case (fig. 4).
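The kind of descriptor-to-measurementScale mapping described above can be sketched as follows. This is a Python illustration under stated assumptions, not the toolbox's actual mapping rules: the variable type and number type values ("datetime", "nominal", "ordinal", "code", "text", "discrete", "continuous") are stand-ins for the toolbox's native descriptors, the standard_units set stands in for a unit lookup table, and the sketch glosses over the choice between the interval and ratio scales by always emitting ratio for numeric data.

```python
import xml.etree.ElementTree as ET


def measurement_scale(variable_type, number_type, unit="", precision=0,
                      definition="", format_string="YYYY-MM-DD",
                      standard_units=frozenset()):
    """Build an EML measurementScale element from simple attribute descriptors.

    All descriptor values are hypothetical stand-ins for native toolbox metadata.
    """
    scale = ET.Element("measurementScale")
    if variable_type == "datetime":
        date_time = ET.SubElement(scale, "dateTime")
        ET.SubElement(date_time, "formatString").text = format_string
    elif variable_type in ("nominal", "ordinal", "code", "text"):
        kind = "ordinal" if variable_type == "ordinal" else "nominal"
        domain = ET.SubElement(ET.SubElement(scale, kind), "nonNumericDomain")
        text_domain = ET.SubElement(domain, "textDomain")
        ET.SubElement(text_domain, "definition").text = definition or "text"
    else:
        # Numeric data: always 'ratio' here (a simplification of this sketch)
        ratio = ET.SubElement(scale, "ratio")
        unit_element = ET.SubElement(ratio, "unit")
        tag = "standardUnit" if unit in standard_units else "customUnit"
        ET.SubElement(unit_element, tag).text = unit
        # Convert a count of decimal places into an EML precision value
        ET.SubElement(ratio, "precision").text = f"{10 ** -precision:.{precision}f}"
        numeric = ET.SubElement(ratio, "numericDomain")
        ET.SubElement(numeric, "numberType").text = (
            "integer" if number_type == "discrete" else "real")
    return scale


if __name__ == "__main__":
    element = measurement_scale("data", "continuous", unit="meter",
                                precision=2, standard_units={"meter"})
    print(ET.tostring(element, encoding="unicode"))
```

Units not found in the standard set fall through to customUnit, which in a complete EML document would also need a matching STMML definition in additionalMetadata.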
Figure 4. GUI dialog for generating formatted data files and corresponding EML metadata documents or fragments for uploading to PASTA or another metadata repository. Note that specific authorization metadata for non-public users cannot currently be specified using the GUI alone, but can be specified when calling the corresponding command-line function programmatically in a workflow.
The GCE Data Toolbox provides a rich set of tools for processing raw data as well as importing and synthesizing existing data sets. For example, the toolbox can be used to programmatically mine data from the USGS National Water Information System, NOAA Global Historical Climatology Network repository, NOAA Hydrometeorological Automated Data System, LTER ClimDB/HydroDB, and DataTurbine servers directly over the Internet, providing a wealth of ready-to-use data for large-scale synthesis projects. Once data are imported, tools are provided for scaling and summarizing data by aggregation, binning and date-time resampling, as well as gap-filling, filtering and sub-setting data. Multiple data sets can also be integrated using database-style joins on key columns, as well as metadata-aware merges for concatenating related tables. For time-series data sets, overlapping date ranges in merged data can be removed automatically, and records can be padded to create a monotonic time series to simplify gap-filling and analysis. Unit conversion, data type transformation, date/time reformatting, geographic coordinate re-projection and other common data harmonization operations are also fully supported.
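The time-series merge and padding operations mentioned above can be illustrated with a minimal Python sketch. It is not the toolbox's algorithm: it assumes that all timestamps already fall on a regular grid defined by the step argument, that the second series should win wherever the two date ranges overlap, and that padded records carry None as their value. The function name and data are hypothetical.

```python
from datetime import datetime, timedelta


def merge_and_pad(series_a, series_b, step):
    """Merge two {timestamp: value} series and pad gaps with None.

    Assumes every timestamp lies on a regular grid with spacing `step`.
    """
    merged = dict(series_a)
    merged.update(series_b)  # second series wins where date ranges overlap
    if not merged:
        return []
    times = sorted(merged)
    padded = []
    current = times[0]
    while current <= times[-1]:
        # Missing grid points become placeholder records for later gap-filling
        padded.append((current, merged.get(current)))
        current += step
    return padded


if __name__ == "__main__":
    start = datetime(2014, 1, 1)
    hour = timedelta(hours=1)
    first = {start: 1.0, start + hour: 2.0}
    second = {start + hour: 2.5, start + 3 * hour: 4.0}
    for timestamp, value in merge_and_pad(first, second, hour):
        print(timestamp.isoformat(), value)
```

Padding first and filling afterwards keeps the two concerns separate: the padded series records exactly where observations were missing before any interpolation is applied.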
All of the operations described above can be performed using interactive GUI applications, but can also be scripted and run on a scheduled basis. The GCE Data Toolbox Wiki provides extensive documentation on getting started with this toolbox (https://gce-svn.marsci.uga.edu/trac/GCE_Toolbox/wiki/Documentation), as well as a quick-start guide to functions commonly used to build workflow scripts (https://gce-svn.marsci.uga.edu/trac/GCE_Toolbox/wiki/API). A graphical workflow-builder application is also envisioned for the toolbox in the future, but may require supplemental funding to implement.
Once EML-described data sets are generated from interactive or scripted workflows, the data objects need to be deposited in a web-accessible directory and the EML document uploaded to PASTA. To date we have only tested uploads using the LTER Network Data Portal web interface, but PASTA provides a comprehensive web service API that can be leveraged to script the entire evaluation and upload process for frequently run workflows (https://pasta.lternet.edu/package/docs/api). Note that the POST and PUT commands necessary to upload EML documents require HTTPS and authentication, which are not supported by MATLAB's native HTTP functions (urlread, urlwrite). It is therefore necessary to call other programs via the MATLAB "system()" function to accomplish these steps. The simplest strategy is to install the cURL executable (http://curl.haxx.se/) on your system, which provides a rich set of command-line options for interacting with the PASTA API. The relevant cURL commands are also described in the PASTA API guide and draft workflow best practices guide (Gries et al., 2013b).
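As a rough illustration of the cURL approach, the Python sketch below assembles an argument list for submitting an EML document to PASTA for evaluation. The endpoint URL, the distinguished-name style of the user identifier and the function name are assumptions made for this example and should be verified against the PASTA API guide; the cURL options themselves (-i, -u, -X, -H, --data-binary) are standard. The same command, written as a single string, is what would be passed to MATLAB's system() function.

```python
import shlex

# Assumed evaluation endpoint; confirm against the PASTA API guide
PASTA_EVALUATE_URL = "https://pasta.lternet.edu/package/evaluate/eml"


def curl_evaluate_command(eml_path, user_dn, password, url=PASTA_EVALUATE_URL):
    """Return a cURL argument list that POSTs an EML file for evaluation."""
    return [
        "curl",
        "-i",                                  # include response headers
        "-u", f"{user_dn}:{password}",         # HTTP basic authentication
        "-X", "POST",
        "-H", "Content-Type: application/xml",
        "--data-binary", f"@{eml_path}",       # send the file contents unmodified
        url,
    ]


if __name__ == "__main__":
    command = curl_evaluate_command(
        "package.xml",                          # hypothetical EML file
        "uid=demo,o=LTER,dc=ecoinformatics,dc=org",  # hypothetical user
        "secret",
    )
    # Print a shell-quoted form; run it with subprocess.run(command, check=True)
    print(shlex.join(command))
```

Building the command as a list rather than a concatenated string avoids quoting problems when paths or credentials contain spaces. Note that a password supplied on the command line may be visible to other users of the same machine, so scheduled workflows may prefer a protected credentials file.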
The primary rationale for the LTER Network's adoption of EML 2 as its metadata standard in 2003 was facilitating computer-mediated data analysis and integration. Unfortunately, the extensive effort required to upgrade legacy metadata content and management systems and the sheer complexity of this XML specification kept the focus on producing rather than using EML for much of the decade since. LTER sites can now produce EML metadata for the core data they archive, and PASTA has been implemented to provide stable access to version-controlled LTER EML documents and data, but the original goal for EML has proved elusive.
The NIS workflow workshops held in 2012 and 2013 demonstrated that effective workflows can indeed be built using EML-described data in PASTA and common research tools such as R, SAS, Kepler, and MATLAB. Now that EML with congruent, PASTA-compatible data files can be generated for synthetic data products automatically, hopefully we can return to this original goal and take full advantage of both EML and PASTA for LTER synthesis projects. The addition of EML support to the GCE Data Toolbox also provides LTER sites and other environmental programs with a practical solution for quality controlling and documenting streaming sensor data for archiving in an EML-compliant data repository (e.g. KNB Metacat, PASTA or another DataONE node) without the need to implement a full-fledged metadata management system (MMS), and provides new options for sites that have adopted an MMS like DEIMS or Metabase.
In other words, these improvements in the GCE Data Toolbox, along with the advancements made by the LTER Network Office NIS developers and the rest of the LTER information management community, go most of the way toward fulfilling our original vision from 2003, and all the way toward making workflow-driven analysis and archiving of streaming data a reality.
References
Cary, R. and Chamblee, J. 2013. Coweeta LTER Upgrades Sensor Stations by Implementing the GCE Data Toolbox for Matlab to Stream Data. In: LTER Databits - Information Management Newsletter of the Long Term Ecological Research Network: Spring 2013. LTER Network, Albuquerque, NM. (http://databits.lternet.edu/spring-2013)
Chamblee, J., Sheldon, W. and Cary, R. 2013. GCE and CWT Host Successful Workshop to Demonstrate, Improve, and Promote the Adoption of the GCE Data Toolbox for MATLAB. In: LTER Databits - Information Management Newsletter of the Long Term Ecological Research Network: Spring 2013. LTER Network, Albuquerque, NM. (http://databits.lternet.edu/spring-2013)
Gries, C., Sheldon, W.M. Jr., Fountain, T., Sebranek, C., Miller, M. and Tilak, S. 2013. Integrating Open Source Data Turbine with the GCE Data Toolbox for MATLAB. In: LTER Databits - Information Management Newsletter of the Long Term Ecological Research Network: Spring 2013. LTER Network, Albuquerque, NM. (http://databits.lternet.edu/spring-2013)
Gries, C., Porter, J., Ruddell, B., Servilla, M., Sheldon, W. and Walsh, J. 2013b. NIS data workflows best practices, ver. 0.2. Long Term Ecological Research Network, Albuquerque, NM. (http://im.lternet.edu/sites/im.lternet.edu/files/NISdataworkflowsbestpractices0.2.pdf)
Gross, Katherine L. and Catherine E. Pake. 1995. Final report of the Ecological Society of America Committee on the Future of Long-term Ecological Data (FLED). Volume I: Text of the Report. The Ecological Society of America, Washington, D.C.
Michener, William K. 2000. Metadata. Pages 92-116 in: Ecological Data - Design, Management and Processing. Michener, William K. and James W. Brunt, eds. Blackwell Science Ltd., Oxford, England.
Servilla, M., Brunt, J., San Gil, I. and Costa, D. 2006. PASTA: A Network-level Architecture Design for Automating the Creation of Synthetic Products in the LTER Network. Ecological Informatics. (http://feon.wdfiles.com/local--files/start/LTERPASTADataModel.pdf)
Sheldon, W.M. Jr., Chamblee, J.F. and Cary, R. 2013. Automating Data Harvests with the GCE Data Toolbox. In: LTER Databits - Information Management Newsletter of the Long Term Ecological Research Network, Fall 2013 issue. LTER Network, Albuquerque, NM. (http://databits.lternet.edu/fall-2013)
Sheldon, W.M. Jr. 2011b. Mining Long-term Data from the Global Historical Climatology Network. In: LTER Databits - Information Management Newsletter of the Long Term Ecological Research Network, Fall 2011 issue. LTER Network, Albuquerque, NM. (http://databits.lternet.edu/fall-2011)
Sheldon, W.M. Jr. 2011a. Putting It Out There – Making the Transition to Open Source Software Development. In: LTER Databits - Information Management Newsletter of the Long Term Ecological Research Network, Spring 2011 issue. LTER Network, Albuquerque, New Mexico. (http://databits.lternet.edu/spring-2011)
Sheldon, W.M. Jr. 2008. Dynamic, Rule-based Quality Control Framework for Real-time Sensor Data. Pages 145-150 in: Gries, C. and Jones, M.B. (editors). Proceedings of the Environmental Information Management Conference 2008 (EIM 2008): Sensor Networks. Albuquerque, New Mexico. (http://gce-lter.marsci.uga.edu/public/files/pubs/wsheldon_dynamic_qc_eimc2008_final.pdf)
Sheldon, W.M. 2006. Mining and Integrating Data from ClimDB and USGS using the GCE Data Toolbox. In: LTER Databits - Information Management Newsletter of the Long Term Ecological Research Network, Spring 2006 issue. LTER Network, Albuquerque, NM. (http://databits.lternet.edu/spring-2006)
Sheldon, W.M. 2005. GCE Data Search Engine: A Client-side Application for Metadata-based Data Discovery and Integration. DataBits: an electronic newsletter for Information Managers, Spring 2005 issue. Long Term Ecological Research Network, Albuquerque, NM. (http://databits.lternet.edu/spring-2005)
Sheldon, W.M. 2002. GCE Data Toolbox for Matlab® - Platform-independent tools for metadata-driven semantic data processing and analysis. In: LTER Databits - Information Management Newsletter of the Long Term Ecological Research Network, Fall 2002 issue. LTER Network, Albuquerque, NM. (http://databits.lternet.edu/fall-2002)
Sheldon, W.M. 2001. A Standard for Creating Dynamic, Self-documenting Tabular Data Sets Using Matlab®. In: LTER Databits - Information Management Newsletter of the Long Term Ecological Research Network, Spring 2001 issue. LTER Network, Albuquerque, NM. (http://databits.lternet.edu/spring-2001)