
Targeted searches with EML and LTER Controlled Vocabulary

Issue: Fall 2012

Margaret O'Brien (SBC)

This essay illustrates one way in which two structures, EML datasets and our SKOS vocabulary, can be used together right now to further improve a user’s experience when looking for LTER data. The examples here could be implemented in our current catalog; they do not require PASTA. Definitions for terms can be found at the end of this essay.

 

Background

One of the reasons we adopted EML as the format for LTER metadata was that it was structured. The metadata structure (XML path) where a term occurs carries information in addition to the term itself, so searches can take advantage of that structure by accessing specific metadata components such as dataset/abstract, or dataset/creator. The LTER data catalog’s queries have used EML’s paths for many years. In 2011, the LTER Controlled Vocabulary project began structuring our search terms in a format called SKOS, arranging terms into hierarchies with relationships such as synonymy, and the network catalog now uses the SKOS structure to drive an auto-filled form for searches by term. But the current network catalog does not yet take full advantage of these structures.
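As a minimal sketch of what a path-aware search means, the Python below checks whether a term occurs under a given EML path. The sample document is a hypothetical, simplified, namespace-free stand-in for real EML, and the helper name term_in_path is an illustration only; it is not how the network catalog is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for an EML document (no namespaces).
EML = """
<eml>
  <dataset>
    <title>Kelp net primary production, 2002-2011</title>
    <abstract><para>Seasonal biomass and growth of giant kelp.</para></abstract>
    <project>
      <abstract><para>Long-term study of primary production in kelp forests.</para></abstract>
    </project>
  </dataset>
</eml>
"""

def term_in_path(root, path, term):
    """Return True if `term` occurs (case-insensitively) in any text under `path`."""
    needle = term.lower()
    for element in root.findall(path):
        text = " ".join(element.itertext()).lower()
        if needle in text:
            return True
    return False

root = ET.fromstring(EML)
in_title = term_in_path(root, "dataset/title", "primary production")
in_abstract = term_in_path(root, "dataset/abstract", "primary production")
in_project = term_in_path(root, "dataset/project/abstract", "primary production")
print(in_title, in_abstract, in_project)
```

In this sample the term appears in the title and in the project abstract but not in the dataset abstract, so the same term gives different answers depending on which path is searched.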

 

The Core Areas Search Challenge

Our audience needs the ability to “find data for our core research areas”. But our research is interdisciplinary by nature, so a single dataset is often related to several research areas. To further complicate matters, some terms, like ‘primary production’, are both a measurement (i.e., areal uptake of carbon over time) and a topic of study. So there could be two ways to interpret the request “show me data for primary production”. As illustrated in the figure below, the user might want either A) data reporting production rates, or alternatively, B) data related to research on primary production. Obviously, two different queries should be offered, and the concept ‘related’ is crucial to one of them. Fortunately, with structured EML and a structured vocabulary, it is already possible to build these.

Figure 1. Example of two searches, where each is targeted at a specific type of data request.


The two query types take advantage of different features of the SKOS vocabulary, and search different parts of the EML. By designing distinct queries that are clearly labeled and have appropriate search parameters, the possible uses of the same term (‘primary production’) can be clearer to the user. A system such as this separates the catalog’s responsibility from the data’s. The data package does not need to ‘know’ what research projects might use it, but the system does. The EML content is the responsibility of individual sites, scientists, and information managers, while the Vocabulary (as part of the ‘catalog’) is the responsibility of the Network.
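The two query types can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the vocabulary fragment, the two sample datasets, the synonym 'primary productivity', and in particular the choice of EML paths for each query type, since the contents of Figure 1 are not reproduced in this text.

```python
# Hypothetical vocabulary fragment: synonyms and related terms for one concept.
VOCAB = {
    "primary production": {
        "synonyms": ["primary productivity"],  # assumed synonym
        "related": ["carbon", "NPP"],          # related terms from the Definitions
    }
}

# Hypothetical datasets, keyed by the EML path each value would come from.
DATASETS = [
    {"id": "ds1",
     "dataset/title": "Net primary production of giant kelp",
     "dataset/keywordSet/keyword": "primary production",
     "dataset/abstract": "Areal rates of carbon uptake by kelp."},
    {"id": "ds2",
     "dataset/title": "Carbon dioxide concentrations at a reef mooring",
     "dataset/keywordSet/keyword": "carbon dioxide",
     "dataset/abstract": "Hourly pCO2 measurements."},
]

# Assumed path choices; the real ones would follow Figure 1.
PATHS_A = ["dataset/title", "dataset/keywordSet/keyword"]
PATHS_B = ["dataset/keywordSet/keyword", "dataset/abstract", "dataset/project/abstract"]

def search(datasets, terms, paths):
    """Return ids of datasets where any term occurs in any of the given paths."""
    needles = [t.lower() for t in terms]
    hits = []
    for ds in datasets:
        text = " ".join(ds.get(p, "") for p in paths).lower()
        if any(n in text for n in needles):
            hits.append(ds["id"])
    return hits

def query_a(term):
    """Type A: data reporting the measurement; expand with synonyms only."""
    entry = VOCAB[term]
    return search(DATASETS, [term] + entry["synonyms"], PATHS_A)

def query_b(term):
    """Type B: data related to the topic; expand with synonyms and related terms."""
    entry = VOCAB[term]
    return search(DATASETS, [term] + entry["synonyms"] + entry["related"], PATHS_B)

print(query_a("primary production"))
print(query_b("primary production"))
```

The carbon dioxide dataset carries no 'primary production' keyword, yet query type B still finds it because the vocabulary supplies the link through the related term 'carbon'. Query type A returns only the dataset that reports production itself.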

 

Requirements

To achieve the desired results, we need two things:

  1. A vocabulary that makes all the proper linkages and contains the expected terms to be used for all LTER data. It will be particularly important to make connections between related terms.
  2. Data that are described explicitly and carefully. The EML path 'dataset/abstract' must describe only the data, with other details about the scientific project that generated it placed in their appropriate locations, for example, ‘dataset/project/abstract’. Keywords should apply to data only, and not to the projects that use them. For example, if a dataset contains carbon dioxide measurements, it should not have the keyword ‘primary production’. The linkage between ‘carbon dioxide’ and ‘primary production’ is taken care of by the Vocabulary.

 

Limitations

What these searches cannot do:

  1. They do not group together datasets that are related to a specific site-based or network-based research project. That could be accomplished with queries, but ones different from the examples above.
  2. They cannot make inferences about the appropriateness of data for a particular use. For that functionality, we need more sophisticated knowledge models such as an ontology.
  3. These example queries are still based on simple string matches in the EML. So any dataset that uses the term ‘primary production’ in a searched field will be returned (e.g., the phrase “data describe the transect for our primary production study”), and would be a false positive for query type A. To reduce those false positives, we would need a more complex annotation system between the EML and the catalog.
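The string-match limitation in item 3 is easy to reproduce. The sketch below uses two hypothetical records and a hypothetical helper, string_match. One abstract reports production rates, and the other only mentions the term in passing, using the phrase quoted above.

```python
def string_match(records, term, fields):
    """Return ids of records whose searched fields contain `term` as a substring."""
    needle = term.lower()
    return [
        rec["id"]
        for rec in records
        if any(needle in rec.get(f, "").lower() for f in fields)
    ]

# Hypothetical records for illustration only.
RECORDS = [
    {"id": "rates",
     "abstract": "Daily primary production rates measured by oxygen evolution."},
    {"id": "transect",
     "abstract": "Data describe the transect for our primary production study."},
]

hits = string_match(RECORDS, "primary production", ["abstract"])
print(hits)
```

Both records are returned. A substring test cannot tell that the second abstract describes a transect rather than production rates, which is why it would be a false positive for query type A.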

 

Conclusion

Designing a few targeted queries is not a major or sophisticated change to the current catalog. It can be accomplished with the Controlled Vocabulary as it stands now, and can be applied to either a Metacat back end or the developing PASTA API. Currently, we have only one term-based search form in the catalog. It appears to be of query type A, and it is generally parameterized that way. However, it returns results that are closer to the expected results of query type B. This may be due in large part to inappropriate keyword use in datasets. As with many uses of EML, a complete analysis may indicate that EML paths other than those listed in Figure 1 should be considered.

 

Definitions

EML Path: the location of a metadata-item in the EML document, e.g., ‘/eml/dataset/title’ is the XPath to the data package’s title.

LTER Controlled Vocabulary: a set of terms structured into SKOS. The vocabulary can be browsed here: http://vocab.lternet.edu (Porter, 2010, 2011)

Synonym: a term in the LTER controlled vocabulary that can be used in place of another term. For example, ‘nitrate’ and ‘NO3’ are synonyms.

Related terms: terms in the LTER controlled vocabulary that are not synonyms, but that could be included to expand a search. Some groups of related terms include ‘carbon, NPP, primary production’, or ‘nutrient flux, nitrate’.
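To show how synonyms and related terms might be read from a SKOS file, here is a small Python sketch. It assumes that synonyms are recorded as skos:altLabel and related terms as skos:related, which is a common SKOS convention but is not confirmed here for the LTER vocabulary export. The concept URIs are placeholders, and load_vocabulary is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical SKOS fragment; the example.org URIs are placeholders.
SKOS_XML = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <skos:Concept rdf:about="http://example.org/vocab/nitrate">
    <skos:prefLabel>nitrate</skos:prefLabel>
    <skos:altLabel>NO3</skos:altLabel>
    <skos:related rdf:resource="http://example.org/vocab/nutrient-flux"/>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/vocab/nutrient-flux">
    <skos:prefLabel>nutrient flux</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>
"""

def load_vocabulary(xml_text):
    """Map each preferred label to its synonyms and related preferred labels."""
    root = ET.fromstring(xml_text)
    concepts = {}
    for concept in root.findall(f"{{{SKOS}}}Concept"):
        uri = concept.get(f"{{{RDF}}}about")
        concepts[uri] = {
            "label": concept.findtext(f"{{{SKOS}}}prefLabel"),
            "synonyms": [e.text for e in concept.findall(f"{{{SKOS}}}altLabel")],
            "related_uris": [e.get(f"{{{RDF}}}resource")
                             for e in concept.findall(f"{{{SKOS}}}related")],
        }
    # Resolve related URIs to labels so searches can expand on plain strings.
    return {
        entry["label"]: {
            "synonyms": entry["synonyms"],
            "related": [concepts[u]["label"]
                        for u in entry["related_uris"] if u in concepts],
        }
        for entry in concepts.values()
    }

vocab = load_vocabulary(SKOS_XML)
print(vocab["nitrate"])
```

A search for ‘nitrate’ could then be expanded with ‘NO3’ when exact equivalents are wanted, or additionally with ‘nutrient flux’ when related data are wanted.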

 

References

Porter, J. 2011. Managing Controlled Vocabularies with "TemaTres". Databits, Spring 2011. http://databits.lternet.edu/spring-2010/controlled-vocabulary-lter-datasets

Porter, J. 2010. A Controlled Vocabulary for LTER Datasets. Databits, Spring 2010. http://databits.lternet.edu/spring-2010/controlled-vocabulary-lter-datasets