A Controlled Vocabulary for LTER Datasets

Issue: Spring 2010

John Porter (VCR)

Currently, most keywords used to characterize datasets at most LTER sites are uncontrolled, meaning that they are selected entirely by the data creator. One of the challenges facing LTER and external researchers in discovering data from LTER sites is the inconsistent application of keywords. A researcher interested in carbon dioxide measurements must search on both "Carbon Dioxide" and "CO2." Moreover, the existing set of keywords is highly diverse. For example, in a 2006 survey of EML documents in the LTER Data Catalog, over half (1,616 of 3,206) of the keywords were used in only a single dataset, and only 104 (3%) of the keywords were used at five or more different LTER sites (Porter 2006).

To address this problem, in 2005 the LTER Information Management Committee established an ad hoc "Controlled Vocabulary Working Group" and charged it with studying the problem and proposing solutions. To that end the group compiled and analyzed keywords found in LTER datasets and documents, and identified external lexicographic resources, such as controlled vocabularies, thesauri and ontologies, that might be applied to the problem (Porter 2006). Initially the working group attempted to identify existing resources, such as the National Biological Information Infrastructure (NBII) Thesaurus, that LTER might be able to adopt wholesale. Unfortunately, using widely used LTER keywords as a metric, none of the external resources proved to be suitable: too many keywords commonly used in LTER datasets were absent from the existing lexicographic resources. So, starting in 2008, the working group focused on developing an LTER-specific controlled vocabulary, ultimately identifying a list of ~600 keywords that were either used by two or more LTER sites or found in one of the external resources (NBII Thesaurus and Global Change Master Directory Keyword List), and that conformed to the recommendations of the international standard for controlled vocabularies (NISO 2005). This draft list was then circulated to members of the LTER Information Management Committee for suggested additions and deletions, which were then voted upon (Porter 2009). The final list consists of 640 keywords (http://intranet.lternet.edu/im/files/im/LTER_Keywords_V0.9.xls).
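The selection rule described above (used by two or more sites, or found in an external resource) can be sketched in Python. This is only an illustration, not the working group's actual procedure: the function name, the data layout, and the sample keywords are hypothetical, and the check against the NISO recommendations is left out.

```python
from collections import defaultdict


def select_candidates(site_keywords, external_terms, min_sites=2):
    """Return keywords used by at least `min_sites` sites or found externally.

    site_keywords: dict mapping a site code to an iterable of keywords
                   (a hypothetical layout chosen for this sketch).
    external_terms: iterable of terms drawn from external resources.
    """
    # Record which sites use each keyword, ignoring case and stray spaces.
    sites_using = defaultdict(set)
    for site, keywords in site_keywords.items():
        for keyword in keywords:
            sites_using[keyword.strip().lower()].add(site)

    external = {term.strip().lower() for term in external_terms}

    return sorted(
        keyword
        for keyword, sites in sites_using.items()
        if len(sites) >= min_sites or keyword in external
    )


# Made-up sample data, for illustration only.
site_keywords = {
    "VCR": ["salt marsh", "biomass", "tides"],
    "AND": ["biomass", "streams"],
    "KNZ": ["grassland", "biomass"],
}
external_terms = ["Streams", "Grassland"]

candidates = select_candidates(site_keywords, external_terms)
print(candidates)
```

In this toy data, "biomass" qualifies because three sites use it, while "streams" and "grassland" qualify only because they appear in the external term list.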

The final list was presented to the Information Management Committee (IMC) at the 2009 All-Scientists' Meeting. The sense of the meeting was that the keyword list was sufficiently evolved to form the basis of an LTER Controlled Vocabulary, but that adoption of an official LTER controlled vocabulary was beyond the powers of the IMC, and that a system of procedures needed to be developed for managing LTER-specific lexicographic resources.

Earlier this year the LTER Information Management Executive Committee requested guidance from the LTER Executive Board regarding:

  • Should there be an official LTER dataset keyword list that sites would be encouraged to integrate into their datasets?
  • Who should determine what the contents of a keyword list should be, and who should manage revisions to the list?
  • What resources might be available for creating tools and databases that will help sites integrate the keywords into their datasets, and help data users discover relevant datasets?
  • Are there additional steps that are needed to further improve the discoverability of LTER datasets so that they have the maximum value in promoting scientific research?

The general response was positive. In early 2010 the LTER Executive Board endorsed the use of the list by LTER sites and committed to helping locate domain scientists to work with the Information Management Committee on future activities. Duane Costa of the LTER Network Office has already been working on tools and databases to support access to the list via web services.

Next steps for the process include:

  1. Getting the keywords integrated into existing and future LTER metadata. Some of this may be automated, because the synonym ring created as the list was compiled includes the forms of words actually found in LTER metadata. However, some additions will necessarily be manual. This process should be supported by tools that suggest possible keywords based on free-text searches of the metadata, and by type-ahead drop-down lists similar to the one now used on the LTER Metacat.
  2. Creating taxonomies that provide browsable and searchable structures for use in LTER data catalogs.
    More than one taxonomy (a polytaxonomy) will be needed. For example, we might have one taxonomy for ecosystems (e.g., forest, stream, and grassland) and another for ecological processes (e.g., productivity with net and gross productivity as sub-categories). Each of these taxonomies will include the keywords from the list, so that they can be linked to the datasets. Steps in the creation of the taxonomies include:
    1. Identifying the taxonomies to be created (e.g., ecosystems, processes and objects measured)
    2. Examining existing lexicographic resources (NBII Thesaurus, GCMD) to see if they contain structures that we can adopt
    3. Developing the taxonomies, ensuring that each of the keywords falls into at least one of the taxonomies and adding modifiers to the keywords to help prevent them from being ambiguous (e.g., "head" can have both hydrological [pressure] and anatomical uses).
  3. Developing software tools that will use the taxonomies for browsing and searching.
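As a rough sketch of how the first two steps might fit together, the Python below uses a synonym ring to map keyword variants found in metadata onto preferred terms, sets aside unmatched keywords for manual review, and looks up which taxonomies contain a term. The data structures, function names, and all terms are assumptions made for illustration; they are not taken from the LTER keyword list or from any existing LTER tool.

```python
# Hypothetical synonym ring: preferred keyword -> variant forms in metadata.
SYNONYM_RING = {
    "carbon dioxide": ["carbon dioxide", "CO2", "carbon-dioxide"],
    "stream": ["stream", "streams", "creek"],
}

# Two toy taxonomies: broader term -> list of narrower terms.
TAXONOMIES = {
    "ecosystems": {"forest": [], "stream": [], "grassland": []},
    "ecological processes": {
        "productivity": ["net productivity", "gross productivity"],
    },
}


def build_variant_index(ring):
    """Map each lower-cased variant to its preferred keyword."""
    index = {}
    for preferred, variants in ring.items():
        for variant in variants:
            index[variant.lower()] = preferred
    return index


def normalize(raw_keywords, index):
    """Split raw keywords into preferred terms and leftovers for manual review."""
    matched, unmatched = set(), []
    for raw in raw_keywords:
        preferred = index.get(raw.strip().lower())
        if preferred is None:
            unmatched.append(raw)
        else:
            matched.add(preferred)
    return sorted(matched), unmatched


def taxonomies_containing(keyword, taxonomies):
    """List taxonomies in which a keyword is a broader or narrower term."""
    found = []
    for name, tree in taxonomies.items():
        for parent, children in tree.items():
            if keyword == parent or keyword in children:
                found.append(name)
                break
    return found


index = build_variant_index(SYNONYM_RING)
matched, unmatched = normalize(["CO2", "Streams", "eddy flux"], index)
print(matched)
print(unmatched)
print(taxonomies_containing("stream", TAXONOMIES))
```

Here "CO2" and "Streams" resolve automatically to preferred terms, while "eddy flux" has no entry in the ring and is returned for manual handling, mirroring the split between automated and manual integration described in step 1.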

These developments will require active participation by LTER Information Managers and ecological researchers to ensure that the resulting products serve the ecological research community well. Throughout the process and into the future, the keyword list and taxonomies will need to be revised and improved. However, before we can improve them, we need to create them!

References:

National Biological Information Infrastructure (NBII) Thesaurus. http://www.nbii.gov/portal/server.pt/community/biocomplexity_thesaurus/578

NISO. 2005. Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies. ANSI/NISO Z39.19. http://www.niso.org/kst/reports/standards?step=2&gid=&project_key=7cc9b5...

Olsen, L.M., G. Major, K. Shein, J. Scialdone, R. Vogel, S. Leicester, H. Weir, S. Ritz, T. Stevens, M. Meaux, C. Solomon, R. Bilodeau, M. Holland, T. Northcutt, R. A. Restrepo. 2007. NASA/Global Change Master Directory (GCMD) Earth Science Keywords. Version 6.0.0.0.0. http://gcmd.nasa.gov/Resources/valids/archives/keyword_list.html

Porter, J. H. 2006. Improving Data Queries through use of a Controlled Vocabulary. LTER Databits Spring 2006. http://intranet.lternet.edu/archives/documents/Newsletters/DataBits/06sp...

Porter, J.H. 2009. Developing a Controlled Vocabulary for LTER Data. LTER Databits Fall 2009. http://databits.lternet.edu/node/70