Putting EML to Work: The PTAH Project

Issue: Spring 2006

- John Porter (VCR)

Ecological Metadata Language (EML) is a major step forward in the management and exchange of ecological metadata. By providing a consistent structure for metadata elements, it permits access to those elements by programs as well as people. Thus far, LTER efforts have focused primarily on producing EML metadata. The "Metacat" system uses EML documents to create a cross-site data catalog, but there have been few other applications (such as Kepler) that use EML documents. Here I discuss several new applications that exploit EML to aid statistical analyses.

The basic steps followed by researchers in analyzing their data have been relatively unchanged since the advent of statistical programs on personal computers in the 1980s. Based on whatever (minimal) metadata is available, a researcher uses an editor to create a statistical program capable of reading the data (Fig. 1).

In the case of EML documents, a stylesheet is used to translate the underlying document into a human-readable display, but otherwise the process is almost entirely the same as in the past.

However, with the advent of EML containing attribute-level data, a new option is available: the direct generation of a statistical program that handles the routine tasks of reading the data, adding labels and missing-value statements, and running basic statistical analyses (Fig. 2). This allows researchers to "shortcut" the laborious process of creating the basic statistical program and instead focus on the creative aspects of the analysis, by adding further commands to the program written by the stylesheet. The EML document also provides the researcher with the details on methodology etc. that are critical to understanding appropriate use of the data.
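The generation step can be illustrated with a minimal sketch. It is written in Python rather than as an XSLT stylesheet, so it is not the project's own code, only an analogy to what a stylesheet does. The EML fragment is a hypothetical, heavily simplified and un-namespaced example (the file name, entity name and attributes are invented), and `eml_to_r` is an illustrative helper name. The element names follow the EML dataTable, physical and attributeList structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified EML fragment; real documents are namespaced and far larger.
SAMPLE_EML = """
<dataTable>
  <entityName>hypothetical_met_data</entityName>
  <physical>
    <objectName>met_data.csv</objectName>
    <dataFormat>
      <textFormat>
        <numHeaderLines>1</numHeaderLines>
        <simpleDelimited>
          <fieldDelimiter>,</fieldDelimiter>
        </simpleDelimited>
      </textFormat>
    </dataFormat>
  </physical>
  <attributeList>
    <attribute>
      <attributeName>station</attributeName>
      <attributeDefinition>Station code</attributeDefinition>
    </attribute>
    <attribute>
      <attributeName>air_temp</attributeName>
      <attributeDefinition>Air temperature (degrees C)</attributeDefinition>
      <missingValueCode><code>-999</code></missingValueCode>
    </attribute>
  </attributeList>
</dataTable>
"""


def eml_to_r(eml_text):
    """Build a small R program from attribute-level metadata (illustrative only)."""
    table = ET.fromstring(eml_text)

    # Physical description: file name, header lines to skip, field delimiter.
    file_name = table.findtext("physical/objectName")
    text_format = table.find("physical/dataFormat/textFormat")
    header_lines = int(text_format.findtext("numHeaderLines", default="0"))
    delimiter = text_format.findtext("simpleDelimited/fieldDelimiter", default=",")

    # Attribute list: column names, definitions (as comments), missing-value codes.
    names, na_codes, comments = [], [], []
    for attr in table.findall("attributeList/attribute"):
        name = attr.findtext("attributeName")
        names.append(name)
        comments.append(f"# {name}: {attr.findtext('attributeDefinition', default='')}")
        code = attr.findtext("missingValueCode/code")
        if code is not None and code not in na_codes:
            na_codes.append(code)

    quoted_names = ", ".join(f'"{n}"' for n in names)
    quoted_na = ", ".join(f'"{c}"' for c in ["NA"] + na_codes)
    lines = comments + [
        f'dat <- read.table("{file_name}", sep="{delimiter}", skip={header_lines},',
        f"                  col.names=c({quoted_names}),",
        f"                  na.strings=c({quoted_na}))",
        "summary(dat)",
    ]
    return "\n".join(lines)


r_program = eml_to_r(SAMPLE_EML)
print(r_program)
```

The generated R text is only a starting point: the researcher would still adjust the file path and then append the analysis-specific commands.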

The PTAH (Processing Techniques for Automated Harmonization, also the Egyptian god of creation) project has as its goal the creation of tools that use EML to support ecological research, including the needed stylesheets and supporting programs. The first step has been to create stylesheets that translate EML documents into statistical programs for the Statistical Analysis System (SAS), the Statistical Package for the Social Sciences (SPSS) and "R" statistical packages. These stylesheets use the information in the dataTable, attributeList and physical modules of EML as the raw data for creating statistical programs. Through the use of the stylesheets, the time to prepare a statistical program for a dataset with dozens of variables can be reduced to as little as 5 minutes. The stylesheets are available for download on the LTERNET CVS web site (http://cvs.lternet.edu). To aid researchers who may be unfamiliar with XML stylesheets and templates and the programs for processing them, a web site at http://www.vcrlter.virginia.edu/data/eml2 provides a web interface for the

Currently 73% of LTER sites produce at least some EML documents with attribute-level metadata, but overall only 31% of EML documents contain attribute-level data (D. Costa, pers. comm. 3/31/06). As more EML metadata is enhanced to achieve levels 3-6 as outlined in the "EML Best Practices" (http://cvs.lternet.edu/cgi-bin/viewcvs.cgi/emlbestpractices/emlbestpractices-1.0/emlbestpractices_oct2004.doc), the utility of this approach will increase. However, even now over 1,200 datasets contain the information needed to allow semi-automated processing using this approach.

The prototype PTAH system still requires that the user edit the resulting statistical program to provide information that is not available in the EML document, such as where the data file(s) are located on the user's system. As protocols for automated, authenticated access to the raw data are developed, this step may be eliminated. Currently, the user may also need to deal with issues related to data formatting (e.g., SAS prefers missing values as character strings, even in numeric fields, but 'R' may crash if you try to read character data from a numeric field). Development of some pre-processing tools may be required to make the system more robust. Additionally, the current PTAH implementation focuses entirely on data in text files, but could be extended to use the SQL query capabilities supported in EML.
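One possible shape for such a pre-processing tool is sketched below in Python. This is an assumption about what the tool might look like, not part of the current PTAH implementation: the function name `normalize_missing`, the sample data and the set of missing-value codes are all hypothetical. It rewrites character-string missing values found in numeric columns to a single replacement code, so that a package expecting numeric fields can read the file.

```python
import csv
import io


def normalize_missing(in_file, out_file, numeric_columns, missing_codes, replacement="NA"):
    """Copy a comma-delimited file, replacing missing-value codes in numeric columns.

    Returns the number of values replaced. Hypothetical helper for illustration.
    """
    reader = csv.DictReader(in_file)
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    replaced = 0
    for row in reader:
        for column in numeric_columns:
            # Short rows yield None for absent fields; treat those as empty strings.
            value = (row[column] or "").strip()
            if value in missing_codes:
                row[column] = replacement
                replaced += 1
        writer.writerow(row)
    return replaced


# Invented sample data: "missing" and "." stand in for character missing values.
raw = io.StringIO("station,air_temp\nA,21.5\nB,missing\nC,.\n")
cleaned = io.StringIO()
count = normalize_missing(raw, cleaned, ["air_temp"], {"missing", ".", ""})
print(cleaned.getvalue())
```

In practice the list of numeric columns and their missing-value codes could be drawn from the attribute-level metadata in the EML document rather than supplied by hand.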