EML Harvesting II: Preparing Site Metadata and Harvest Lists

Issue: Spring 2005

Feature Articles

- Wade Sheldon (GCE)

INTRODUCTION

In EML Harvesting I (DataBits, Fall 2004), Duane Costa described the features and operation of the new Metacat Harvester software and associated harvesting service developed at the LTER Network Office (LNO). This new harvesting service provides LTER sites with a simple and practical means of synchronizing metadata documents with the LNO/KNB Metacat server, and by extension the broader Metacat and EcoGrid networks. This Harvester specification has also recently been adapted by Chris Lindsley and Tim Rhyne at the Oak Ridge National Laboratory (ORNL) to support automatic harvesting of EML metadata and transformation to FGDC Biological Data Profile (BDP) format for inclusion in the NBII Metadata Clearinghouse.

Thanks to these dual efforts, LTER sites can now participate in multiple metadata search networks by:

  1. Providing valid EML documents (compatible with Metacat) on a publicly-accessible WWW server
  2. Creating an XML harvest list document containing the WWW URL for each EML document
  3. Registering the URL of the harvest list with LNO and scheduling harvests (and optionally registering for NBII participation)
  4. Managing document "revision" numbers and corresponding EML "packageIDs" to control the harvest and Metacat synchronization process as site EML documents are added or revised

These steps are described in the remainder of this article.

1) Providing EML for Metacat

The KNB Metacat data repository system is designed to archive XML-based metadata documents regardless of their schema, and Harvester is similarly schema-agnostic. The only nominal requirements are that documents conform to XML structure rules (i.e. are well formed), and are valid according to the referenced schema. In the specific case of EML, this means document contents must conform to the EML 2.0.0 or 2.0.1 schema rules as documented at http://knb.ecoinformatics.org/software/eml/.

However, the following guidelines taken from the EML Best Practices for LTER Sites document must also be followed in order to support automatic document harvesting and synchronization with Metacat:

  1. EML document ids and revision numbers:

    "packageId" attributes for EML contributed to the KNB Metacat should be formed as follows:

    <eml:eml packageId="knb-lter-[site].[dataset number].[revision]" system="knb" ...>  (e.g. packageId="knb-lter-gce.187.4")

  2. Access Control:

Metacat access control format conforms to the LDAP distinguishedName for an individual, as in “uid=FLS,o=LTER,dc=ecoinformatics,dc=org” (where "FLS" stands for "Fictitious LTER Site"). Access elements for documents contributed to the KNB Metacat should be formed as follows:

<access authSystem="knb" order="allowFirst" scope="document"> 
<allow>
<principal>uid=FLS,o=lter,dc=ecoinformatics,dc=org</principal>
<permission>all</permission>
</allow>
<allow>
<principal>public</principal>
<permission>read</permission>
</allow>
</access>

Specific access control rules can also be included for any individuals registered in the KNB LDAP server, such as the site IM or contributing PI; however, LNO has established an alias account for each site (based on the three letter site acronym, e.g. uid=GCE) to ensure consistent ownership of LTER metadata stored in the KNB Metacat independent of personnel changes over time. Site IMs can contact Duane Costa <dcosta@lternet.edu> to obtain or reset the password on their site alias account.
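
The packageId convention in guideline 1 lends itself to a simple automated check before documents are posted for harvesting. The following is a minimal Python sketch, not part of Metacat or Harvester; the function name and the `PackageId` type are hypothetical, and the pattern assumes three-letter lowercase site acronyms, in line with the site alias accounts described above.

```python
import re
from typing import NamedTuple

# Pattern for the convention knb-lter-[site].[dataset number].[revision].
# Assumption: the site acronym is three lowercase letters (e.g. "gce").
PACKAGE_ID_RE = re.compile(r"knb-lter-([a-z]{3})\.(\d+)\.(\d+)")


class PackageId(NamedTuple):
    site: str
    identifier: int
    revision: int


def parse_package_id(package_id: str) -> PackageId:
    """Split a packageId into site, dataset number, and revision."""
    match = PACKAGE_ID_RE.fullmatch(package_id)
    if match is None:
        raise ValueError(f"not an LTER-style packageId: {package_id!r}")
    site, identifier, revision = match.groups()
    return PackageId(site, int(identifier), int(revision))


# The example packageId from guideline 1
print(parse_package_id("knb-lter-gce.187.4"))
```

A check of this kind can be run over every EML document at a site to catch malformed ids before a scheduled harvest.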

2) Creating a Harvest List Document

The Metacat Harvester operates by periodically downloading and parsing an XML-based "harvest list" document containing URLs for all EML documents available at the harvest site. This site-managed harvest list is therefore the key to participating in the Metacat EML harvesting system, much like the legacy LTER DTOC cataloging system.

The structure of the harvest list is fairly simple, as illustrated in the following example and fully described on the new LTER IM Mentoring web page. A "<document>" element is required for each EML document to be harvested, listing a unique numeric identifier, numeric revision number, and WWW URL. As in the example, URLs for both static EML documents and web applications or scripts with query string parameters can be included as appropriate. Note that any XML reserved characters in URLs, such as "&", "<" and apostrophes, must be "escaped" using the predefined XML entity references "&amp;", "&lt;" and "&apos;", respectively.

<?xml version="1.0" encoding="UTF-8"?>
<hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList">
<!-- first EML document -->
<document>
<docid>
<scope>knb-lter-gce</scope>
<identifier>1</identifier>
<revision>7</revision>
</docid>
<documentType>eml://ecoinformatics.org/eml-2.0.1</documentType>
<!-- static EML document URL -->
<documentURL>http://gce-lter.marsci.uga.edu/lter/datasets/eml/knb-lter-gce_1_7.xml</documentURL>
</document>
<!-- second EML document -->
<document>
<docid>
<scope>knb-lter-gce</scope>
<identifier>2</identifier>
<revision>6</revision>
</docid>
<documentType>eml://ecoinformatics.org/eml-2.0.1</documentType>
<!-- dynamically-generated EML document URL -->
<documentURL>
http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp?detail=full&amp;missing=NaN&amp;metacat=yes&amp;dataset=2
</documentURL>
</document>
<!-- additional EML document elements... -->
</hrv:harvestList>

Although the harvest list structure looks somewhat verbose, most of the document is composed of static markup. The only variable portions are the two numeric fields (<identifier> and <revision>) and the URL itself (<documentURL>). The harvest list can therefore be generated very easily by cutting and pasting in a text or XML editor, or using simple string handling procedures in any scripting language.
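
As an illustration of the string-handling approach, the following Python sketch builds a harvest list from a sequence of (identifier, revision, URL) tuples. It is one possible approach rather than the method used at GCE or LNO; the function and template names are hypothetical, and the document type is assumed to be EML 2.0.1 for every entry. The standard-library `escape` function handles "&", "<" and ">", and apostrophes are added explicitly.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Static markup for one <document> entry; only the placeholders vary.
DOCUMENT_TEMPLATE = """<document>
<docid>
<scope>{scope}</scope>
<identifier>{identifier}</identifier>
<revision>{revision}</revision>
</docid>
<documentType>eml://ecoinformatics.org/eml-2.0.1</documentType>
<documentURL>{url}</documentURL>
</document>"""


def build_harvest_list(scope, documents):
    """Return harvest list XML for (identifier, revision, url) tuples."""
    parts = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList">',
    ]
    for identifier, revision, url in documents:
        parts.append(DOCUMENT_TEMPLATE.format(
            scope=escape(scope),
            identifier=int(identifier),
            revision=int(revision),
            # escape() covers &, < and >; apostrophes are mapped explicitly
            url=escape(url, {"'": "&apos;"}),
        ))
    parts.append("</hrv:harvestList>")
    return "\n".join(parts)


harvest_xml = build_harvest_list("knb-lter-gce", [
    (1, 7, "http://gce-lter.marsci.uga.edu/lter/datasets/eml/knb-lter-gce_1_7.xml"),
    (2, 6, "http://gce-lter.marsci.uga.edu/lter/asp/db/send_eml.asp"
           "?detail=full&missing=NaN&metacat=yes&dataset=2"),
])

# Parse the result back to confirm that it is well formed
root = ET.fromstring(harvest_xml.encode("utf-8"))
print(len(root.findall("document")), "documents listed")
```

Parsing the generated text before publishing it is a cheap safeguard, since an unescaped "&" in a single URL would make the whole harvest list unreadable.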

3) Scheduling Harvests

After sites have posted valid EML documents to a WWW server and have constructed a corresponding harvest list, the URL for the harvest list and harvesting frequency must be registered at LNO as described in EML Harvesting I (DataBits Fall 2004). The ideal harvesting schedule for a site will depend on how often the site typically adds or updates data sets and metadata, and makes corresponding changes to revision numbers in the harvest list. Monthly or weekly harvests are probably reasonable for most sites. More frequent harvests could also be requested, because only new or updated documents are retrieved and system resources are therefore not needlessly taxed on either end.

In order to register for NBII harvesting, site IMs should individually contact Inigo San Gil <isangil@lternet.edu> or Tim Rhyne <rhynebt@ornl.gov> for assistance; however, this policy may change in the future because Inigo will be investigating the possibility of NBII harvesting metadata directly from the LNO/KNB Metacat server. After sites are scheduled for harvesting, NBII personnel will follow up with the site IM to request general information for creation of a "Clearinghouse Node" description for their LTER site. For sites that wish to further advertise their data holdings, NBII can also publish their metadata in the FGDC clearinghouse (also called the National Spatial Data Infrastructure [NSDI]) and the new Geospatial One-Stop (GOS) on request.

The same harvest list URL can be registered for Metacat, NBII, NSDI and GOS participation, or separate URLs can be registered to specifically tailor the metadata documents synchronized with each system. At GCE, for example, we provide complete harvest lists for Metacat, NSDI and GOS synchronization, but generate a reduced harvest list containing only URLs for metadata from biologically-oriented studies (based on research theme) for NBII. These harvest lists are dynamically generated from a single web application (http://gce-lter.marsci.uga.edu/lter/asp/db/eml_harvest_doc.asp), using a query string parameter to distinguish among synchronization targets (i.e. hostname=metacat for Metacat, hostname=nbii for NBII and hostname=geospatial for NSDI and GOS); however, multiple static harvest list documents could also be produced to accomplish the same task.

4) Managing EML Harvests

As indicated in the Providing EML for Metacat section, Metacat and by extension Harvester rely on numerical data set ids and revision numbers for document management and synchronization. When Harvester encounters a new <identifier> or changed <revision> in a harvest list, the corresponding EML document will be downloaded and inserted into Metacat; consequently, sites can only control metadata harvests by managing these identifiers and revisions. Although this sounds straightforward, theoretical and practical issues concerning data and metadata versioning have been hotly debated in LTER for many years and versioning practices vary extensively across site information systems. Even for sites that do number and version data sets, work-arounds may be required for maximum interoperability with Harvester and Metacat.
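
The selection rule described above can be pictured with a short Python sketch. This is an illustration of the rule as stated in this article, not the actual Harvester code; the function name and the dictionary-based inputs are hypothetical. It treats any difference in revision as a change, and does not address whether the real system requires revisions to increase.

```python
def documents_to_harvest(harvest_list, stored_revisions):
    """Return (identifier, revision, reason) for documents to retrieve.

    harvest_list:     identifier -> revision, as listed by the site
    stored_revisions: identifier -> revision already held in the repository
    """
    pending = []
    for identifier, revision in sorted(harvest_list.items()):
        if identifier not in stored_revisions:
            pending.append((identifier, revision, "new"))
        elif revision != stored_revisions[identifier]:
            pending.append((identifier, revision, "changed"))
        # Unchanged documents are skipped, so repeat harvests stay cheap
    return pending


# Hypothetical state: data set 1 is unchanged, 2 was revised, 3 was added
site_list = {1: 7, 2: 6, 3: 1}
repository = {1: 7, 2: 5}
for identifier, revision, reason in documents_to_harvest(site_list, repository):
    print(identifier, revision, reason)
```

The sketch also shows why a site that regenerates an EML document without changing its revision number will not see the update in Metacat: the entry looks unchanged and is skipped.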

At GCE we use sequential numeric ids as alternative identifiers for all data sets in our metadata database and we maintain explicit major and minor version numbers to track changes in data and metadata content since original release. Despite this apparently idyllic situation for Metacat compatibility, we had to devise a complex work-around for generating EML revisions to accommodate changes in EML implementation independent of metadata contents. For instance, we made several changes to our EML implementation in February-March 2004 in response to feedback from NCEAS developers and to improve display of our documents using the default Metacat XSL style sheets. Further changes were prompted by the EML Best Practices working group meeting in May 2004. Each of these changes required a revision change in order to trigger re-harvesting of the updated EML documents despite the fact that the underlying data and metadata contents themselves had not changed.

The best strategy for supporting Metacat versioning (as well as the KNB authentication system) in EML documents will likely vary according to the technology used to generate the EML documents themselves. Sites that plan to manage static documents may have to manually update and synchronize revision numbers between EML documents and the harvest list or develop scripting or XSLT approaches to propagate version changes. Sites that manage metadata in an RDBMS and generate EML documents and the harvest list programmatically may choose to add EML revision tracking to their systems, or just periodically increment revision numbers to force updates in Metacat. At GCE we have taken this process a step further and chosen to differentially generate EML optimized for Metacat. Document URLs in the dynamically-generated GCE harvest list contain an additional query string parameter "&metacat=yes", which instructs our web application to include Metacat-specific packageIds and revisions, appropriate access control elements for the KNB system, and alternative data table URLs designed to stream data for publicly-accessible data sets (i.e. after transparently logging access in our data use tracking system). Many versioning strategies and work-arounds are possible and LTER Information Managers are encouraged to discuss potential approaches with LNO, the EML Best Practices working group, and other IMs as they develop support for EML harvesting at their site.
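
For sites that manage static documents, a small script can flag cases where the revision in the harvest list has drifted from the packageId in the corresponding EML document. The following Python sketch is one way to do this under stated assumptions: the function names and sample data are hypothetical, the EML text is assumed to have been retrieved already, and the expected packageId is assumed to be the harvest list scope, identifier and revision joined with periods.

```python
import xml.etree.ElementTree as ET


def harvest_list_ids(harvest_xml):
    """Map each documentURL in a harvest list to 'scope.identifier.revision'."""
    root = ET.fromstring(harvest_xml)
    ids = {}
    for doc in root.iter("document"):
        docid = doc.find("docid")
        parts = [docid.findtext(tag).strip()
                 for tag in ("scope", "identifier", "revision")]
        ids[doc.findtext("documentURL").strip()] = ".".join(parts)
    return ids


def find_mismatches(harvest_xml, eml_by_url):
    """Return (url, expected, actual) where the EML packageId differs.

    eml_by_url maps each URL to EML text that has already been retrieved;
    a missing document is reported with actual=None.
    """
    mismatches = []
    for url, expected in harvest_list_ids(harvest_xml).items():
        eml_text = eml_by_url.get(url)
        actual = None
        if eml_text is not None:
            actual = ET.fromstring(eml_text).get("packageId")
        if actual != expected:
            mismatches.append((url, expected, actual))
    return mismatches


# Hypothetical sample data: the harvest list says revision 7, the EML says 6
HARVEST = """<hrv:harvestList xmlns:hrv="eml://ecoinformatics.org/harvestList">
<document>
<docid><scope>knb-lter-gce</scope><identifier>1</identifier><revision>7</revision></docid>
<documentType>eml://ecoinformatics.org/eml-2.0.1</documentType>
<documentURL>http://example.org/eml/knb-lter-gce_1.xml</documentURL>
</document>
</hrv:harvestList>"""
EML = ('<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.0.1" '
       'packageId="knb-lter-gce.1.6" system="knb"/>')

for url, expected, actual in find_mismatches(
        HARVEST, {"http://example.org/eml/knb-lter-gce_1.xml": EML}):
    print(f"{url}: harvest list says {expected}, EML says {actual}")
```

Running a check like this before each scheduled harvest gives a site early warning of documents whose revisions need to be synchronized.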

Note that NBII completely replaces all metadata records with the current ones during each harvest; therefore, versioning issues are not critical for NBII, NSDI, and GOS harvests. Sites can also request a special off-schedule harvest if major changes are made to their EML implementation or documents. Consequently, managing harvests for NBII, NSDI and GOS participation is considerably simpler than for Metacat.

CONCLUDING REMARKS

The Metacat Harvester fills an important technology gap that has prevented many LTER sites from participating in the Metacat repository system. Although the KNB Morpho program is a powerful metadata entry and management tool that works directly with Metacat, the lack of built-in support for metadata content re-use and integration with existing site information systems has precluded its use at most LTER sites. Other systems capable of synchronizing metadata with Metacat (e.g. CAP-LTER Xanthoria) have also not been adopted by most sites for various technical reasons.

The technological neutrality of the Metacat Harvester is particularly beneficial from a site perspective, because it supports participation regardless of IT architecture or EML generation approach and will accommodate transitions in technology over time. For example, a site just beginning to generate EML metadata can maintain a static harvest list and update the list in a text or XML editor as each new document is created, whereas a site developing more automated approaches can generate a dynamic harvest list using any web application framework. Similarly, URLs for both static and dynamic EML documents can be included in a single harvest list, allowing sites to develop dynamic EML-generation capabilities in stages without affecting participation in metadata search networks.

Contributing LTER EML documents to the LNO/KNB Metacat will help towards accomplishing a major goal identified by the LTER NIS Advisory Committee, by providing integrated data searching across the LTER Network based on structured metadata. It will also allow LTER sites to leverage tools and technologies being developed by KNB and SEEK built on Metacat, such as the EcoGrid and Kepler work-flow analysis software. The ability to synchronize metadata with the NBII Clearinghouse with no additional effort is also a tremendous benefit to both the LTER and NBII networks, and will support discovery and use of LTER data by an even wider audience in the scientific community.

ACKNOWLEDGEMENTS

I would like to thank David Blankman (formerly at LNO), Duane Costa (LNO), Matt Jones (NCEAS), and Chris Lindsley (ORNL) for their collaboration and help developing support for EML harvesting at GCE, which led to this article. I would also like to thank Tim Rhyne (ORNL) for providing additional information on NBII and for his editorial advice.