GCE Data Toolbox for Matlab® -- Platform-independent tools for metadata-driven semantic data processing and analysis

Issue: Fall 2002

- Wade Sheldon, Georgia Coastal Ecosystems LTER (GCE)

In the Spring 2001 issue of Databits I described the initial development of the GCE Data Toolbox, an integrated set of Matlab functions for dynamic analysis and documentation of tabular data sets stored in a standardized data structure format (Sheldon, 2001; see References). Development of these tools has continued steadily since then, and a suite of graphical user interface (GUI) applications was recently added to provide convenient access to most of the capabilities of the toolbox for users unfamiliar with the Matlab programming language. These GUI applications use standard menus, graphical controls, and dialog boxes for input and are compatible with any operating system that supports the Matlab environment, including Microsoft Windows (9x, NT, 2000, XP), Linux, Solaris, and Mac OS X. Complete descriptions of the functions and screenshots of the GUI applications are now available on the GCE website (http://gce-lter.marsci.uga.edu/lter/research/tools/data_toolbox.htm), so the remainder of this article will focus on innovations in metadata-driven semantic data processing, potential applications of this technology, and plans to add support for metadata stored in Ecological Metadata Language 2 format (see http://knb.ecoinformatics.org/software/eml/).

Semantic Data Processing

The guiding design philosophy of the GCE Data Toolbox is that ecological metadata should remain inextricably linked to the data set it describes, and that the metadata should be used to determine which operations are appropriate for any given data value based on the type of information it represents. I refer to this philosophy as metadata-driven semantic data processing.

One of the most important roles that semantic processing technology can serve is to protect the validity of data and calculations throughout all processing steps. For example, GCE Data Toolbox functions query column descriptor metadata for operations that include generating formatting instructions for numerical display and export, determining which statistical procedures are appropriate when aggregating and summarizing data sets, confirming column compatibility when merging and joining multiple structures, and validating new entries made in the data editor. This semantic approach greatly reduces the potential for data contamination compared to the standard spreadsheet and database analyses often used by ecologists, because those programs either freely coerce values between disparate formats or protect values based only on gross data format characteristics. For instance, a relational database query combining temperature data in °C with data in °F would not generate an error as long as the column data types are compatible; in contrast, these columns could not be joined or merged by GCE Data Toolbox functions unless the units were first standardized.
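
The temperature example can be illustrated with a short sketch. It is written in Python rather than Matlab and is not toolbox code: the `Column` class, the `merge_columns` and `fahrenheit_to_celsius` functions, and the unit labels `degC` and `degF` are all hypothetical names chosen for illustration. It only shows the general idea of consulting unit metadata before combining columns.

```python
from dataclasses import dataclass


@dataclass
class Column:
    """A data column bundled with the descriptors that travel with it."""
    name: str
    units: str
    values: list


def merge_columns(a: Column, b: Column) -> Column:
    """Append b to a, refusing the merge when the descriptors disagree."""
    if a.name != b.name:
        raise ValueError(f"column names differ: {a.name!r} vs {b.name!r}")
    if a.units != b.units:
        raise ValueError(f"units differ for {a.name!r}: {a.units!r} vs {b.units!r}")
    return Column(a.name, a.units, a.values + b.values)


def fahrenheit_to_celsius(col: Column) -> Column:
    """Standardize a Fahrenheit column to Celsius so it can be merged."""
    if col.units != "degF":
        raise ValueError(f"expected degF, got {col.units!r}")
    converted = [(v - 32.0) * 5.0 / 9.0 for v in col.values]
    return Column(col.name, "degC", converted)


# Hypothetical air temperature columns from two sources.
site_a = Column("Temp_Air", "degC", [20.0, 21.5])
site_b = Column("Temp_Air", "degF", [68.0, 212.0])

# A direct merge is rejected because the units do not match.
try:
    merge_columns(site_a, site_b)
    merge_rejected = False
except ValueError:
    merge_rejected = True

# After the units are standardized the same merge succeeds.
merged = merge_columns(site_a, fahrenheit_to_celsius(site_b))
```

A check based only on storage type would accept both columns, since both hold floating-point numbers; the units descriptor is what makes the mismatch detectable.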

Semantic processing based on metadata also supports intelligent application automation. Many GCE Data Toolbox functions and GUI applications automatically identify candidate data columns without user intervention, based on data storage type, variable type, and numerical type metadata descriptors and column units. Examples include date/time and geographic coordinate inter-conversions and column unit conversions. This approach is also used in the data mapping application to identify georeference columns by column name and variable type and then automatically project between coordinate systems, when necessary, based on column units. Automatic unit conversion capability will also be added to relational join and merge functions in the near future, further simplifying the creation of synthetic data sets without loss of validity.
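
A minimal Python sketch of this kind of candidate selection follows. It is not toolbox code: the descriptor keys (`storage`, `variable`, `numeric`, `units`), their values, and the `find_columns` helper are assumptions made for illustration, and the toolbox's actual descriptor vocabulary may differ.

```python
def find_columns(columns, **criteria):
    """Return names of columns whose descriptors match every criterion."""
    return [
        col["name"]
        for col in columns
        if all(col.get(key) == wanted for key, wanted in criteria.items())
    ]


# Hypothetical column descriptors for a small data set.
columns = [
    {"name": "Station", "storage": "string", "variable": "nominal",
     "numeric": "none", "units": "none"},
    {"name": "Latitude", "storage": "float", "variable": "coordinate",
     "numeric": "continuous", "units": "degrees"},
    {"name": "Longitude", "storage": "float", "variable": "coordinate",
     "numeric": "continuous", "units": "degrees"},
    {"name": "Salinity", "storage": "float", "variable": "data",
     "numeric": "continuous", "units": "PSU"},
    {"name": "Count", "storage": "integer", "variable": "data",
     "numeric": "discrete", "units": "count"},
]

# Candidate georeference columns, chosen without user intervention.
georeference = find_columns(columns, variable="coordinate", units="degrees")

# Candidate columns for statistics that assume continuous measurements.
continuous_data = find_columns(columns, variable="data", numeric="continuous")
```

Because the selection depends only on descriptors, the same call works on any data set that carries them, which is what allows a generic command to behave specifically.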

An important prerequisite for semantic processing, of course, is that metadata information remain synchronized with the data it describes. This is a major challenge when data is processed using disparate programs, but is accomplished automatically and transparently by GCE Data Toolbox functions. All data structure changes are logged by date to a history field, preserving the full context of all data processing. Column descriptors are also dynamically updated each time metadata is displayed or exported, and value codes are automatically generated and documented in the metadata when text fields or QA/QC flags are encoded as numerals for export in formats that don't support mixed alphanumeric characters (e.g. Matlab numerical matrices). Toolbox functions also add ancillary processing information to relevant metadata fields when appropriate, such as documentation of equations used for automatic unit conversions in the Calculations field of the Data section. These steps ensure that metadata remains relevant and useful regardless of how many updates or transformations have been performed on a given data set.
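
The value-code and history behavior described above might be sketched as follows, in Python rather than Matlab. The dictionary layout, the `ValueCodes` metadata key, and the `encode_text_column` function are hypothetical and are not the toolbox's actual structure or API; the point is only that the encoding, its documentation, and the history entry are produced in a single step so they cannot drift apart.

```python
from datetime import datetime


def encode_text_column(dataset, column):
    """Replace text values with integer codes and document the mapping."""
    values = dataset["data"][column]
    # Assign codes 1..n to the distinct text values in sorted order.
    codes = {text: number
             for number, text in enumerate(sorted(set(values)), start=1)}
    dataset["data"][column] = [codes[v] for v in values]
    # Record the mapping in the metadata alongside the data it describes.
    dataset["metadata"].setdefault("ValueCodes", {})[column] = ", ".join(
        f"{number} = {text}" for text, number in codes.items()
    )
    # Log the change with a timestamp so the processing context is preserved.
    dataset["history"].append(
        (datetime.now().isoformat(timespec="seconds"),
         f"encoded text column {column} as numeric codes")
    )
    return dataset


# Hypothetical data set with a text QA/QC flag column.
dataset = {
    "data": {"Flag_Temp": ["ok", "questionable", "ok", "invalid"]},
    "metadata": {},
    "history": [],
}
encode_text_column(dataset, "Flag_Temp")
```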

Potential Applications

The GCE Data Toolbox was primarily developed to process and package GCE-LTER data for automated analysis and distribution, but many other uses of this technology are possible. Metadata-based semantic mediation permits highly specific processing based on simple generic commands, making these tools ideal for many automated batch processing tasks. The dual user interface design (command-line and GUI) and platform-independence of the Matlab language also provide broad compatibility and flexibility.

Generic data import filters are provided for parsing delimited ASCII files and for loading arrays and matrices stored in Matlab binary files, and data can also be imported directly from relational database tables, views and stored procedures via SQL queries (requires the optional Matlab Database Toolbox). Metadata can be imported from tokenized headers on ASCII files along with the data or manually entered into a GUI editor. The toolbox also includes support for user-editable metadata templates, in which column descriptor and general metadata are defined in advance and then applied to newly imported data by matching column names, data types and units with template entries. Multiple data and metadata export formats are supported as well, including delimited ASCII, CSV, Matlab (both arrays and matrices), and table insert/update for SQL databases. This combination of import and export capabilities, metadata template support, and automated analyses should allow these tools to be used in a very wide range of data acquisition, processing and presentation applications.
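
The template idea can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the toolbox's implementation: the template layout and the `apply_template` function are hypothetical, and matching is done on column name only, whereas the toolbox is described as also matching data types and units.

```python
def apply_template(column_names, template):
    """Look up descriptors for imported columns by name.

    Returns the matched descriptors and a list of columns that had no
    template entry and therefore still need metadata from the user.
    """
    descriptors, unmatched = {}, []
    for name in column_names:
        if name in template:
            descriptors[name] = dict(template[name])  # copy, not a shared reference
        else:
            unmatched.append(name)
    return descriptors, unmatched


# Hypothetical template defined in advance for a recurring data source.
template = {
    "Temp_Air": {"units": "degC", "description": "Air temperature", "precision": 1},
    "Salinity": {"units": "PSU", "description": "Salinity", "precision": 2},
}

# Column names found in a newly imported file.
imported = ["Temp_Air", "Salinity", "Turbidity"]
descriptors, unmatched = apply_template(imported, template)
```

Reporting unmatched columns separately, rather than silently assigning defaults, keeps gaps in the metadata visible at import time.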

Recent enhancements to Matlab itself, such as seamless integration with Java classes and data types (introduced in version 6) and the addition of native XML and XSLT support, network data access, and timer objects (introduced in version 6.5), also open up many exciting possibilities. For example, the new Matlab 'urlread' and 'urlwrite' functions allow data to be retrieved from any data store that is accessible by URL, using standard HTTP GET and POST requests. Together, the urlwrite and timer functions allow fully functional, automated data harvesters to be programmed with a few lines of code once import scripts and metadata templates are defined for the data source. Just such a harvester was recently implemented for meteorological and hydrographic data from the USGS real-time monitoring station at Meridian, Georgia, and similar harvesters will soon be implemented for the NOAA NDBC buoy off Sapelo Island and USGS gauging station on the Altamaha River. The harvester imports and validates raw data, applies metadata via template, performs QA/QC flagging based on criteria specified in the metadata for each column, performs bulk English-to-metric unit conversions, automatically adds a serial date column calculated from individual date component columns, archives the raw and processed data, appends the new data to a cumulative data structure, removes any duplicate entries from overlapping harvests, and regenerates weekly and monthly plots of key parameters. All of these operations are implemented with a few generic toolbox commands, which can easily be edited and reused for other data sources.
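
Two of the harvester steps, QA/QC flagging from per-column criteria and removal of duplicates from overlapping harvests, are sketched below in Python. This is not the harvester's code: the function names, the row layout, the range limits, and the sample values are all invented for illustration, and retrieval, unit conversion, archiving, and plotting are omitted.

```python
def flag_out_of_range(rows, limits):
    """Add a 'flags' entry to each row listing columns outside their limits."""
    for row in rows:
        row["flags"] = sorted(
            name for name, (low, high) in limits.items()
            if not low <= row[name] <= high
        )
    return rows


def append_unique(cumulative, new_rows, key="time"):
    """Append new rows, skipping any whose key is already present."""
    seen = {row[key] for row in cumulative}
    for row in new_rows:
        if row[key] not in seen:
            cumulative.append(row)
            seen.add(row[key])
    return cumulative


# Hypothetical per-column limits, standing in for criteria stored in metadata.
limits = {"temp_air": (-30.0, 50.0), "wind_speed": (0.0, 75.0)}

# Two hypothetical harvests that overlap by one record.
first_harvest = [
    {"time": "2002-10-01 00:00", "temp_air": 21.3, "wind_speed": 3.1},
    {"time": "2002-10-01 01:00", "temp_air": 999.0, "wind_speed": 2.8},
]
second_harvest = [
    {"time": "2002-10-01 01:00", "temp_air": 999.0, "wind_speed": 2.8},
    {"time": "2002-10-01 02:00", "temp_air": 20.1, "wind_speed": -1.0},
]

cumulative = append_unique([], flag_out_of_range(first_harvest, limits))
cumulative = append_unique(cumulative, flag_out_of_range(second_harvest, limits))
```

Suspect values are flagged rather than deleted, so the cumulative record keeps every observation along with the reason it was questioned.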

Support for EML

Now that the LTER Information Managers have collectively agreed to support Ecological Metadata Language (EML) version 2 as a network-wide metadata exchange standard, a question that naturally follows is how EML support can be added to existing metadata-based tools such as the GCE Data Toolbox. The metadata standard used by the toolbox is nominally based on the content standard for non-geospatial metadata recommended in the 1995 Ecological Society of America Committee on the Future of Long-term Ecological Data report, as described in Michener, 2000 (see References). Primary data descriptor fields, such as column name, units, description, data storage type, variable type, numerical type and precision, are stored in dedicated structure fields and managed along with the data columns themselves. The remainder of the metadata is stored in a parsable three-column array of section names, field names, and field values, which can be searched and updated by toolbox functions, manually edited in a GUI application, and formatted for display using a simple style language.
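
The three-column arrangement of section names, field names, and field values can be pictured with a small Python sketch. The `get_field` and `set_field` helpers and the sample entries are hypothetical, and a Python list of lists stands in for the Matlab array; only the Calculations field of the Data section is named in this article.

```python
def get_field(metadata, section, field):
    """Return the value stored for a section/field pair, or None if absent."""
    for row_section, row_field, value in metadata:
        if (row_section, row_field) == (section, field):
            return value
    return None


def set_field(metadata, section, field, value):
    """Update an existing section/field row, or append a new one."""
    for row in metadata:
        if row[0] == section and row[1] == field:
            row[2] = value
            return metadata
    metadata.append([section, field, value])
    return metadata


# Each row is [section name, field name, field value].
metadata = [
    ["Dataset", "Title", "Example climate data"],
    ["Data", "Calculations", "none"],
]

# Update an existing field, then add a field that was not present.
set_field(metadata, "Data", "Calculations", "Temp_Air: degF to degC, (x - 32) * 5/9")
set_field(metadata, "Study", "Site", "Example site")
```

The flat layout keeps searching and updating simple, at the cost of the nesting that a hierarchical format can express.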

This metadata scheme is flexible and extensible, but differs from EML 2 in two important ways: 1) it has a much flatter hierarchy of fields, designed to store child elements primarily as preformatted blocks of text; and 2) it stores some elements, such as person names and geographic information, at lower granularity. Both of these differences stem from the fact that the toolbox metadata standard is optimized for final storage and formatting of metadata derived from another primary source, such as a relational database management system, with basic support for parsing and searching to support updates and meshing metadata from multiple structures. In contrast, EML 2 is designed to store a much wider range of metadata information of varying complexity in a more modular fashion.

With these differences in mind, the natural first step towards providing EML 2 support is to develop an import filter that parses EML documents, extracts the column descriptors, and formats the remainder of the metadata to support a simpler schema based on sections and fields as described above. Now that Matlab natively supports XML and XSLT, this can be accomplished by writing XSLT templates that convert EML documents to tokenized text headers already supported by the ASCII import filter. Work on this approach is already planned, and will coincide with efforts to develop EML support for GCE metadata stored in the GCE metadata database. Plans to add support for exporting EML from the GCE Data Structures are less certain, but will be explored as these other technologies are implemented. For the time being, experimental XML metadata export functions have been developed which may be extended to provide limited EML export support in the future.
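
To make the extraction step concrete, the sketch below pulls column descriptors out of a small XML fragment and flattens them into text lines. Several assumptions apply: it uses Python's standard ElementTree parser rather than the XSLT templates proposed above, the fragment is a simplified, namespace-free imitation of an EML attribute list rather than a valid EML document, and the `column|name|units|description` line layout is invented for illustration and is not the header syntax read by the toolbox's ASCII import filter.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative fragment modeled loosely on an EML attribute list.
EML_FRAGMENT = """
<attributeList>
  <attribute>
    <attributeName>Temp_Air</attributeName>
    <attributeDefinition>Air temperature</attributeDefinition>
    <measurementScale>
      <interval><unit><standardUnit>celsius</standardUnit></unit></interval>
    </measurementScale>
  </attribute>
  <attribute>
    <attributeName>Station</attributeName>
    <attributeDefinition>Station code</attributeDefinition>
  </attribute>
</attributeList>
"""


def column_header_lines(eml_text):
    """Flatten each attribute element into one delimited header line."""
    root = ET.fromstring(eml_text.strip())
    lines = []
    for attribute in root.iter("attribute"):
        name = attribute.findtext("attributeName", default="")
        # Units may be nested at any depth; fall back to "none" when absent.
        units = attribute.findtext(".//standardUnit", default="none")
        description = attribute.findtext("attributeDefinition", default="")
        lines.append(f"column|{name}|{units}|{description}")
    return lines


header = column_header_lines(EML_FRAGMENT)
```

The nested measurement scale collapses to a single units value here, which mirrors the loss of hierarchy involved in mapping EML onto a flatter section-and-field schema.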

Conclusion

The GCE Data Toolbox for Matlab has proven extremely useful in all phases of data acquisition, processing, analysis, presentation and distribution at the GCE LTER site during the past two years. With the addition of easy-to-use GUI applications, program documentation, and planned support for standard EML metadata, hopefully other sites will be able to benefit from these efforts as well.

References

Michener, William K.  2000.  Metadata.  Pages 92-116 in: Ecological Data - Design, Management and Processing.  Michener, William K. and James W. Brunt, eds.  Blackwell Science Ltd., Oxford, England.

Sheldon, W.M.  2001.  A Standard for Creating Dynamic, Self-documenting Tabular Data Sets Using Matlab®.  DataBits: An electronic newsletter for Information Managers. (http://intranet.lternet.edu/archives/documents/Newsletters/DataBits/01spring/)