Improving Data Queries through use of a Controlled Vocabulary

Spring 2006

- John Porter (Virginia Coast Reserve LTER)

Currently, the keywords used to characterize datasets at most LTER sites are uncontrolled, meaning that they are selected entirely by the data creator. One of the challenges facing LTER and external researchers in discovering data from LTER sites is the inconsistent application of keywords. A researcher interested in carbon dioxide measurements must search on both "Carbon Dioxide" and "CO2." Moreover, a search on "Gases" would not find either of them.
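To make the problem concrete, here is a minimal sketch of how a controlled vocabulary could widen a search so that "Gases" also finds datasets tagged "CO2" or "Carbon Dioxide." The thesaurus entries, dataset names, and the functions expand_query and search are all hypothetical, invented for illustration. They are not drawn from any existing LTER resource or tool.

```python
# Hypothetical mini-thesaurus; entries are illustrative assumptions only.
THESAURUS = {
    "carbon dioxide": {"synonyms": {"co2"}},
    "co2": {"synonyms": {"carbon dioxide"}},
    "gases": {"narrower": {"carbon dioxide", "co2"}},
}

# Hypothetical datasets and their uncontrolled keywords.
DATASETS = {
    "flux_tower": ["CO2", "Temperature"],
    "soil_respiration": ["Carbon Dioxide"],
    "tide_gauge": ["Water Level"],
}


def expand_query(term, thesaurus=THESAURUS):
    """Return the lower-cased term plus its synonyms and narrower terms."""
    key = term.strip().lower()
    entry = thesaurus.get(key, {})
    expanded = {key}
    expanded |= entry.get("synonyms", set())
    expanded |= entry.get("narrower", set())
    return expanded


def search(term, datasets=DATASETS, thesaurus=THESAURUS):
    """Return sorted names of datasets whose keywords match any expanded term."""
    wanted = expand_query(term, thesaurus)
    return sorted(
        name
        for name, keywords in datasets.items()
        if wanted & {k.lower() for k in keywords}
    )


if __name__ == "__main__":
    print(search("Gases"))
    print(search("CO2"))
```

With these made-up entries, a search on either "Gases" or "CO2" returns both the flux_tower and soil_respiration datasets, while a term missing from the thesaurus falls back to a plain keyword match.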

The existing set of words and multi-word terms is highly diverse. For example, in the EML documents comprising the LTER Data Catalog, over half of the key terms (1,616 of 3,206) are used in only a single instance. Only 104 of the terms are used at 5 or more different LTER sites. The situation is similar for other lists of words:

Source                Number of terms   Used at 5 or more sites   Frequently used words (uses)
-------------------   ---------------   -----------------------   ------------------------------------------
EML Keywords          3,206             104                       LTER (1,002), Temperature (701)
EML Titles            2,825             213                       And (768), Data (482), LTER (378)
EML Attributes        6,318             436                       The (4,207), Data (1,621), Carbon (328)
DTOC Keywords         2,774             103                       ARC (1,645), Temperature (732)
Bibliography Titles   13,538            1,855                     Of (12,611), Forest (2,050)

To help improve this situation, a working group met at the 2005 LTER Information Managers' meeting to develop a plan for improving the searchability of LTER data. The plan revolves around identifying existing controlled vocabularies, thesauri and ontologies that could be exploited to provide richer content for searching LTER data. The working group came up with a three-part plan:

  1. Information gathering: Accumulate and analyze lists of words and terms used by LTER researchers. Combine these lists and identify a set of "important" terms that can be used to test the richness of existing resources. Words from existing site-specific controlled vocabularies will also be gathered for use in the testing phase. This list-gathering phase of the plan is largely complete thanks to the efforts of Duane Costa (EML lists), James Brunt (Bibliographic list) and John Porter (Data Table-of-Contents/DTOC list), and the lists are posted on the web site. These individual lists were then combined to produce a consolidated listing of 21,153 words or terms along with:
    • Number of lists on which it appeared (range 1-5)
    • Number of sites and uses from each list (EML Title, Keyword and Attribute; Bibliography; and DTOC)
    • Maximum and minimum number of sites using it within a list (0-24)
    • Maximum and minimum number of uses within a list (0-12,611)
    • Is it a multi-word term?
    This consolidated listing is now available via the LTER Metacat as package knb-lter-vcr.147.1, and is free for use by other ecoinformatics groups interested in analyzing LTER content. During a videoconferencing session in April, a sub-working group chaired by John Walsh and Barbara Benson was charged with the development of one or more (shorter!) lists where words or terms are rated in terms of their "importance."
  2. Testing: The goal of this step is to use the list(s) of "important" words and words from site-specific controlled vocabularies to test the utility of existing lexicographic resources such as controlled vocabularies, thesauri and ontologies. These resources will be rated based on the number of "important" words that are found in a given resource, along with measures of how "rich" the information returned by that resource is, such as the number of more general terms, more specific terms or related terms. During the April videoconference, a sub-working group chaired by Inigo San Gil was charged with deciding:
    1. What should the content of a "report" from a test session include?
    2. Which resources should be evaluated?
    3. How should the testing be conducted?
    When the work of this subgroup is completed, we should have the information needed to make decisions about which lexicographic resources are likely to be most useful.
  3. Development: Once the lexicographic resources (existing controlled vocabularies, thesauri and ontologies) have been evaluated, one or more will be selected for use by LTER. This may involve negotiating formal Memoranda of Understanding (MOUs) with the resource creators or working with them to enrich their content to support LTER searches. Additionally, tools that use these richer information resources need to be developed, so that users searching for data will have access to improved search and browse tools. During the April videoconference, the LTER Network Office agreed to take the lead on developing prototype applications that are capable of using a wide array of lexicographic resources. However, initially they will be tested using a smaller subset of resources while the information gathering and testing phases are completed.
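The consolidation described in the information-gathering step can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used to build the consolidated listing. The input layout, the consolidate function, the case-folding of terms, and the sample counts are all hypothetical. It also assumes that a list lacking a term counts as zero sites and zero uses, which is one way to read the 0 lower bound in the ranges quoted above.

```python
def consolidate(lists):
    """Merge per-source term lists into one summary record per term.

    `lists` maps a source name to {term: (n_sites, n_uses)}.
    Terms are compared case-insensitively (an assumption).
    """
    merged = {}
    for source, terms in lists.items():
        for term, (n_sites, n_uses) in terms.items():
            key = term.strip().lower()
            record = merged.setdefault(key, {"per_list": {}})
            sites, uses = record["per_list"].get(source, (0, 0))
            # Case variants within one source are folded together: uses add
            # up, but site counts may overlap, so keep the larger one.
            record["per_list"][source] = (max(sites, n_sites), uses + n_uses)

    for key, record in merged.items():
        per_list = record["per_list"]
        # A source that lacks the term contributes 0 (assumption).
        sites = [per_list.get(s, (0, 0))[0] for s in lists]
        uses = [per_list.get(s, (0, 0))[1] for s in lists]
        record["n_lists"] = len(per_list)
        record["max_sites"], record["min_sites"] = max(sites), min(sites)
        record["max_uses"], record["min_uses"] = max(uses), min(uses)
        record["multi_word"] = len(key.split()) > 1
    return merged


# Made-up counts for illustration; not taken from the real lists.
SAMPLE = {
    "EML Keyword": {"Temperature": (5, 40), "carbon dioxide": (2, 6)},
    "DTOC": {"temperature": (4, 55)},
    "Bibliography": {"Forest": (3, 90)},
}

if __name__ == "__main__":
    for term, rec in sorted(consolidate(SAMPLE).items()):
        print(term, rec["n_lists"], rec["max_sites"], rec["max_uses"],
              rec["multi_word"])
```

Keeping the per-list counts alongside the summary fields lets a later step filter on either, for example selecting terms that appear on several lists and are used at five or more sites.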

The sub-working groups for list evaluation (identifying "important" terms), testing against existing lexicographic resources, and development will work on their respective tasks over the next several months. Information managers and others interested in participating in the sub-working groups should contact the sub-group leaders.
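As one possible starting point for the testing sub-group, the rating described in the testing step (how many "important" terms a resource contains, and how many more general, more specific or related terms it returns) might be sketched like this. The rate_resource function, the resource layout, and the sample terms are hypothetical. The actual report content and test procedure were still to be decided by the sub-group.

```python
RELATION_KINDS = ("broader", "narrower", "related")


def rate_resource(important_terms, resource):
    """Score one vocabulary resource against a list of important terms.

    `resource` maps a term to a dict with optional "broader", "narrower"
    and "related" lists. Matching is case-insensitive (an assumption).
    """
    lookup = {term.lower(): rels for term, rels in resource.items()}
    found = [t for t in important_terms if t.lower() in lookup]

    # Count how many relations of each kind the found terms return.
    relations = {kind: 0 for kind in RELATION_KINDS}
    for term in found:
        for kind in RELATION_KINDS:
            relations[kind] += len(lookup[term.lower()].get(kind, []))

    total = len(important_terms)
    return {
        "found": len(found),
        "coverage": len(found) / total if total else 0.0,
        "relations": relations,
        "mean_relations": sum(relations.values()) / len(found) if found else 0.0,
    }


# Made-up inputs for illustration only.
IMPORTANT = ["Temperature", "Carbon Dioxide", "Salt Marsh", "Nitrogen"]
RESOURCE = {
    "temperature": {
        "broader": ["climate"],
        "narrower": ["air temperature", "soil temperature"],
    },
    "carbon dioxide": {"broader": ["gases"], "related": ["carbon"]},
}

if __name__ == "__main__":
    print(rate_resource(IMPORTANT, RESOURCE))
```

Reporting coverage and richness as separate numbers, rather than one combined score, keeps it visible when a resource contains many terms but says little about each of them.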