A Standard for Creating Dynamic, Self-documenting Tabular Data Sets Using Matlab®

Issue: Spring 2001

-Wade Sheldon, Georgia Coastal Ecosystems (GCE)

One of the biggest challenges I have faced setting up the GCE data management system is providing support for online data analysis and flexible formatting for all data sets. When I asked our site scientists to identify features that would enhance their ability to use shared project data, the most common responses were:

  1. Online plotting to preview data
  2. Support for sub-sampling data sets (i.e. downloading only portions of interest)
  3. Multiple file format options to minimize post-download processing

Clearly, developing effective protocols for storing dynamic, computer-readable data sets needed to be a high priority in our information system.

One approach to this problem would be to store all data in a relational database management system (RDBMS) and provide plotting, querying and formatting capabilities through server- and client-side web applications. While this is certainly an effective approach for large homogeneous data sets, like those from our core monitoring efforts, it has a number of drawbacks for managing data from individual studies. Some common study elements, such as replication and repeated measures, are difficult to accommodate in relational models. Also, the expertise and administration required to maintain highly diverse data sets (i.e. "wide" databases) in an RDBMS and assist users with queries might become a burden for data management staff (Porter, 2000).

Given these limitations, we decided to develop a custom software solution to process, store, analyze, display and format tabular data sets from research studies. The software and storage specification were developed using Matlab® (The Mathworks Inc., http://www.mathworks.com), a commercial, multi-platform programming language for numerical analysis and data visualization. Matlab is a dominant programming language in oceanographic research and is used by many GCE investigators, so we felt this tool provided the best potential for long-term code support and collateral usage. In addition, the availability of add-on function libraries ("Toolboxes") for accessing Matlab programs from web forms, connecting to databases, and displaying geocoded data on map projections will allow us to meet all our analysis and display needs with a common set of tools.

The initial results of this effort are now complete and in use at our site, as described in our information system guide (see references). The primary components are:

  1. A standard for storing fully-documented tabular data sets as Matlab data structures
  2. A set of functions constituting the 'GCE Data Toolbox'

Data structures are multi-dimensional arrays organized into named fields, each of which can store data of any size and complexity (unlike conventional database fields). This capability was exploited to encapsulate variable amounts of structured metadata inside a single structure field, allowing the metadata to be tightly coupled to the data set. Other fields contain column descriptor information, such as the names, descriptions, units, precisions, data storage types, logical variable types (i.e. domains), and numerical characteristics of each data column. The data set itself is stored as an array with each column containing either floating-point, integer or string (mixed alphanumeric) values. A matching string matrix for storing QA/QC flags is also supported, allowing flexible handling of flag information separately from the data values.
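The layout described above can be sketched roughly as follows. This is a Python illustration only, not the Toolbox's Matlab code: the class name `TabularDataSet`, every field name, the type labels and the example values are assumptions made for this sketch. It shows one record holding the metadata, parallel column descriptor fields, the data columns and a matching set of QA/QC flags, plus a simple check that they stay in step.

```python
from dataclasses import dataclass, field


@dataclass
class TabularDataSet:
    """Illustrative stand-in for the Matlab structure; all field names are assumptions."""
    metadata: dict       # structured metadata, e.g. {'Dataset': {'Title': ...}}
    name: list           # column names
    description: list    # column descriptions
    units: list          # column units
    precision: list      # decimal places per column
    datatype: list       # storage type: 'float', 'integer' or 'string'
    variabletype: list   # logical type (domain), e.g. 'data' or 'code'
    numbertype: list     # numerical characteristic, e.g. 'continuous' or 'angular'
    values: list         # the data array: one list of values per column
    flags: list = field(default_factory=list)  # optional QA/QC flags, same shape as values

    def validate(self):
        """Return a list of problems; an empty list means descriptors and data agree."""
        problems = []
        ncols = len(self.name)
        for fname in ('description', 'units', 'precision', 'datatype',
                      'variabletype', 'numbertype', 'values'):
            n = len(getattr(self, fname))
            if n != ncols:
                problems.append(f"{fname}: {n} entries for {ncols} columns")
        if len({len(col) for col in self.values}) > 1:
            problems.append("values: columns differ in length")
        if self.flags and [len(f) for f in self.flags] != [len(c) for c in self.values]:
            problems.append("flags: shape does not match values")
        return problems


# Made-up example values, for illustration only
ds = TabularDataSet(
    metadata={'Dataset': {'Title': 'Example marsh survey'}},
    name=['Site', 'Stem_Count', 'Wind_Dir'],
    description=['Site code', 'Stems per quadrat', 'Wind direction'],
    units=['none', 'count', 'degrees'],
    precision=[0, 0, 1],
    datatype=['string', 'integer', 'float'],
    variabletype=['code', 'data', 'data'],
    numbertype=['none', 'discrete', 'angular'],
    values=[['A', 'B'], [12, 7], [350.0, 10.0]],
    flags=[['', ''], ['', 'Q'], ['', '']],
)
print(ds.validate())  # an empty list: descriptors, data and flags agree
```

Keeping the descriptors as parallel lists mirrors the "one field per descriptor" layout described above; a list of per-column records would work equally well in a sketch like this.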

The functions in the GCE Data Toolbox provide a layer of abstraction for users, allowing data structures to be created, manipulated, analyzed, and exported using simple commands without knowing anything about their actual composition. The tools are also 'data aware', in that they use the metadata information stored in the structure to validate the data set and apply a semantic approach when performing statistical analyses. For example, the column statistics function will calculate a median but not a mean on an integer column, and will compute a vector average rather than an arithmetic average if a floating-point column has a numeric type of 'angular'. Another powerful feature of these tools is that they transparently record processing history information with each function call. They also update the column descriptor metadata fields, so that the documentation generated when the metadata is parsed reflects the actual composition of the data structure at that time. Providing automatic linkages between data and metadata to maintain the quality and validity of data sets has been a major design goal of the GCE Data Toolbox.
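The 'data aware' statistics rule can be illustrated with a minimal sketch. It is written in Python for illustration and is not the Toolbox's column statistics function: the name `column_stats`, its arguments and the type labels are assumptions. The choice to omit the median for angular columns is also an assumption of this sketch, since the article does not say how that case is handled.

```python
import math
import statistics


def column_stats(values, datatype, numbertype):
    """Choose summary statistics from the column descriptors (hypothetical helper)."""
    stats = {'count': len(values)}
    if datatype == 'string':
        return stats  # no numeric summaries for text columns
    if datatype == 'integer':
        stats['median'] = statistics.median(values)
        return stats  # median only: no mean is reported for integer columns
    if numbertype == 'angular':
        # Vector average: sum the unit vectors, then convert the resultant
        # direction back to degrees in the range 0-360.
        s = sum(math.sin(math.radians(v)) for v in values)
        c = sum(math.cos(math.radians(v)) for v in values)
        stats['mean'] = math.degrees(math.atan2(s, c)) % 360
        return stats  # assumption of this sketch: median omitted for angles
    stats['median'] = statistics.median(values)
    stats['mean'] = statistics.fmean(values)  # ordinary arithmetic average
    return stats


wind = column_stats([350.0, 10.0], 'float', 'angular')
stems = column_stats([1, 2, 2, 7], 'integer', 'discrete')
```

For the two made-up wind directions, 350 and 10 degrees, the vector average points almost exactly north, whereas an arithmetic average would report 180 degrees, the opposite direction. That is the kind of error the metadata-driven approach is meant to prevent.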

Various analytical functions have also been developed to provide basic database functionality for data structures. These functions support column selection, multi-column bi-directional sorting, natural language multi-column queries to select rows, and multi-column aggregation (with statistical analysis in each aggregate for specified columns). This latter capability has already proven to be a valuable research tool, allowing us to sort and aggregate large plant monitoring data sets by various categorical variables, quickly producing summary statistics for various levels of detail in the study. Work is now underway to develop WWW and Matlab GUI interfaces to allow non-Matlab users to analyze data sets online as well as offline.
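As a rough illustration of these database-style operations, the sketch below implements column selection, multi-column sorting, row queries and multi-column aggregation in Python. It is not the Toolbox's Matlab code. All function names, the dictionary layout and the plant records are invented for this example, and a plain predicate function stands in for the natural language query syntax, which is not reproduced here. Each call also appends a note to a `history` list, echoing the processing-history idea described earlier.

```python
import statistics
from collections import defaultdict


def _derive(ds, rows, note):
    """Return a new data set carrying the old history plus one new entry."""
    return {'rows': rows, 'history': ds['history'] + [note]}


def select_columns(ds, cols):
    rows = [{c: r[c] for c in cols} for r in ds['rows']]
    return _derive(ds, rows, f"selected columns {cols}")


def query(ds, predicate, label):
    rows = [r for r in ds['rows'] if predicate(r)]
    return _derive(ds, rows, f"query '{label}' kept {len(rows)} of {len(ds['rows'])} rows")


def sort_rows(ds, keys):
    """keys is a list of (column, 'asc' or 'desc') pairs, highest priority first."""
    rows = list(ds['rows'])
    # Apply the lowest-priority key first; the sort is stable, so later
    # passes on higher-priority keys preserve the earlier ordering within ties.
    for col, direction in reversed(keys):
        rows.sort(key=lambda r: r[col], reverse=(direction == 'desc'))
    return _derive(ds, rows, f"sorted by {keys}")


def aggregate(ds, group_cols, stat_cols):
    """Group rows by the categorical columns and summarize the listed columns."""
    groups = defaultdict(list)
    for r in ds['rows']:
        groups[tuple(r[c] for c in group_cols)].append(r)
    rows = []
    for key in sorted(groups):
        rec = dict(zip(group_cols, key))
        rec['n'] = len(groups[key])
        for c in stat_cols:
            vals = [m[c] for m in groups[key]]
            rec[c + '_mean'] = statistics.fmean(vals)
            rec[c + '_min'] = min(vals)
            rec[c + '_max'] = max(vals)
        rows.append(rec)
    return _derive(ds, rows, f"aggregated by {group_cols}")


# Made-up plant monitoring records, for illustration only
plants = {'history': ['created'], 'rows': [
    {'site': 'A', 'zone': 'creekbank', 'height_cm': 120.0},
    {'site': 'A', 'zone': 'creekbank', 'height_cm': 100.0},
    {'site': 'A', 'zone': 'midmarsh', 'height_cm': 40.0},
    {'site': 'B', 'zone': 'creekbank', 'height_cm': 90.0},
]}
tall = query(plants, lambda r: r['height_cm'] >= 90, 'height_cm >= 90')
ordered = sort_rows(plants, [('site', 'desc'), ('height_cm', 'asc')])
summary = aggregate(plants, ['site', 'zone'], ['height_cm'])
```

Here `summary['rows']` holds one record per site and zone combination, with the row count and the mean, minimum and maximum height. Aggregating on `['site']` alone would give the coarser level of detail.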

Developing custom software to manage data is certainly not appropriate for every situation, but it does offer unique opportunities to solve scientific problems not easily addressed using business-oriented tools. I will report back on our progress as we continue to develop this technology and put it to task at our site.

References

GCE LTER Information System Guide: http://gce-lter.marsci.uga.edu/lter/research/guide/gce-is.htm

Porter, J.H. 2000. Scientific Databases. In: W.K. Michener and J.W. Brunt (Editors), Ecological Data - Design, Management and Processing. Methods in Ecology. Blackwell Science Ltd., London, pp. 48-69.