
Navigating Semantic Approaches: From Keywords to Ontologies

Spring 2006

- Deana Pennington - LTER Network Office and SEEK Project

Ontologies! Controlled vocabularies! Data dictionaries! These and a multitude of other terms are coming into widespread use as we grapple with semantic methods for clarifying the meaning of words used to describe computational resources. How ironic that a field whose goal is semantic clarification is itself littered with unclear terms! In actuality, the terms have precise meanings, and well-understood implications. The goal of this article is to provide a conceptual framework for understanding these different semantic approaches, introduce the approach being used within the Science Environment for Ecological Knowledge (SEEK) project, and suggest some opportunities for leveraging different approaches ongoing within LTER.

Knowledge representation (KR) is a very broad field. In its most general sense, it is simply methods for external representation of the things that we know internally. There are things that exist (physical and non-physical/abstract); we represent them in different ways in order to be able to talk about them. Natural language is a form of knowledge representation - we assign words to represent things that exist in the world. Mathematics, and physical or computational models, are ways to represent knowledge. The same thing can be represented many different ways, and the choice of representation will affect the ways in which one can talk about and/or reason about things. All representations are necessarily imprecise and inaccurate because the only completely precise and accurate representation of a thing is the thing itself. The best representation depends on the objective.

Here, we are primarily interested in a technology view of KR, where the goal is to provide automated reasoning about the semantic compatibility of resources (data, computational models, etc.). For example, using John Porter's example (this issue), we would like the system to determine through automated reasoning that there are semantic relationships among the terms "CO2", "carbon dioxide", and "gases", and to perform different tasks based on the degree to which those relationships are specified. Choices among representation methods are primarily choices about to what extent, and how, the relationships are specified, and there is a trade-off between the degree of expressiveness and automated reasoning capability. For example, natural language is very expressive, but does not lend itself to any kind of automated reasoning.

Common methods used to achieve semantic clarification are shown in Table 1, along with their characteristics. First, we recognize that a thing (physical or abstract) must be represented by some kind of symbol. A set of like things forms a concept, which we can define and may represent with multiple symbols. For instance, all carbon dioxide molecules through time constitute a set of like things (the molecules themselves) that we perceive as a single concept, that we can explicitly define, and that we represent with multiple terms (CO2, carbon dioxide). Synonyms are terms that represent the same concept. Any additional structure that we impose on a group of terms comes from defining other relationships between terms. Classification is the process of organizing a group of concepts into a subsumption hierarchy, in which the relationship between two concepts takes the form of broader and narrower terms (e.g. CO2 "isa" gas, where "isa" is the conventional way of referring to this relationship in KR). Properties are the defining characteristics of each concept; assigning one requires linking two concepts with a "has" relationship (e.g. CO2 hasProperty odorless, where "odorless" is its own defined concept).

Classification occurs through property assignment, so these two approaches go together even when property assignment is tacit rather than explicit. For instance, the concept "gas" has certain properties by which you determine which things belong to that concept. The concept CO2 is a subset of the concept gas, and as such all things represented by the concept CO2 inherit the properties of the concept gas; those properties are then further restricted by additional properties of CO2 that not all gases possess. Hence, classification is a process of organizing concepts by properties, whether or not those properties are made explicit. Lastly, other kinds of relationships between concepts can be stated.

However, logical reasoning is based on the mathematics of set theory, so automated reasoning engines typically operate on "isa" and "has" relationships. This seems straightforward, but in practice, structuring concepts so that logical reasoning can be optimized is a challenge. There are common pitfalls in organizing subsumption hierarchies that can negatively affect reasoning capability (perhaps the subject of a future article).
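As a concrete, purely illustrative sketch of "isa" and "has" relationships, the following Python fragment shows how a narrower concept inherits the properties of its broader concept through the subsumption hierarchy. The concept and property names are invented for the example, not drawn from any actual ontology.

```python
# Hypothetical sketch: a concept with an "isa" link to a broader concept
# and a set of "has" (property) links.
class Concept:
    def __init__(self, name, parent=None, properties=None):
        self.name = name
        self.parent = parent                      # "isa" link to a broader concept
        self.properties = set(properties or [])   # "has" links

    def all_properties(self):
        # A concept inherits every property of its broader concepts.
        inherited = self.parent.all_properties() if self.parent else set()
        return inherited | self.properties

    def isa(self, other):
        # True if `other` subsumes this concept in the hierarchy.
        c = self
        while c is not None:
            if c is other:
                return True
            c = c.parent
        return False

gas = Concept("gas", properties={"compressible"})
co2 = Concept("CO2", parent=gas, properties={"odorless"})

print(co2.isa(gas))          # True: CO2 "isa" gas
print(co2.all_properties())  # includes "compressible", inherited from gas
```

Note that the restriction described above falls out automatically: CO2 carries every property of gas plus its own, which is exactly what makes it a subset.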

Table 1. Methods of semantic clarification and attributes of each. Parentheses () indicate an attribute is sometimes incorporated into the given method, but not required.

Method                | Definition | Synonyms | Classification (isa) | Properties (has) | Other relations
Dictionary            | X          |          |                      |                  |
Controlled vocabulary | (X)        | X        |                      |                  |
Thesaurus             | X          | X        |                      |                  |
Taxonomy              | (X)        | X        | X                    | (X)              | X
Ontology              | X          | X        | X                    | X                | X

A key issue in the choice of semantic method is the level of automatic functionality provided (Table 2). Any added semantic information, at any level, will enable better resource discovery. Simply assigning keywords to resources (often called annotating) is much more efficient than searching through the entire text of the resource. Providing a defined list of keywords (a data dictionary) clarifies what terms may be searched and what those terms explicitly mean. A controlled vocabulary limits the terms that may be used. A common use of controlled vocabularies is to avoid synonyms, so that a search on a single term should yield all relevant resources (e.g. either CO2 or carbon dioxide would be provided as a keyword, but not both). A thesaurus allows synonyms and specifies the link between them. In addition, a thesaurus usually links words that are related but not synonyms, and sometimes links antonyms. Searching for resources via a thesaurus returns those annotated to the term you searched for and those annotated to any related terms. For instance, a search on carbon dioxide would return datasets annotated with the keyword CO2, even if you didn't specifically request that term. A taxonomy adds a classification hierarchy, so a search can include narrower terms. For instance, a search on the term "gases" including narrower terms would return resources annotated to the term CO2 even though CO2 was not specifically requested. An ontology provides similar functionality, but additionally one can search by properties; for instance, a search on "gases" that have the property "odorless" would return a resource annotated as CO2.
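The progressively broader search behaviors described above can be sketched in a few lines. The synonym table, taxonomy, and dataset annotations below are hypothetical stand-ins, assuming only that a thesaurus maps terms to a preferred form and a taxonomy supplies narrower terms.

```python
# Hypothetical vocabulary tables; no real thesaurus or taxonomy is assumed.
synonyms = {"carbon dioxide": "CO2", "CO2": "CO2"}  # thesaurus: term -> preferred term
narrower = {"gases": ["CO2", "O2"]}                 # taxonomy: term -> narrower terms

# Invented resources, each annotated with keywords.
datasets = {"lake_flux_2005": ["CO2"], "met_station": ["O2"]}

def expand(term):
    """Expand a query term to its preferred form plus all narrower terms."""
    preferred = synonyms.get(term, term)
    terms = {preferred}
    for child in narrower.get(preferred, []):
        terms |= expand(child)
    return terms

def search(term):
    """Return every dataset annotated with any term in the expansion."""
    wanted = expand(term)
    return [name for name, keys in datasets.items() if wanted & set(keys)]

print(search("carbon dioxide"))  # thesaurus resolves the synonym to CO2
print(search("gases"))           # taxonomy pulls in narrower terms CO2 and O2
```

With only the synonym table, a search on "carbon dioxide" already finds the CO2-annotated dataset; adding the taxonomy lets a search on "gases" find both datasets without naming either gas explicitly.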

Table 2. Level of automated functionality enabled by each method of semantic clarification.

Functionality    | Keywords | Dictionary | Controlled vocabulary | Thesaurus | Taxonomy | Ontology
Discovery        | X        | X          | X                     | X         | X        | X
Integration      |          |            |                       | X         | X        | X
Working analysis |          |            |                       |           |          | X

In addition to resource discovery, semantic clarification can aid in automating data integration. For instance, if two datasets have the same format and structure, but one dataset is annotated CO2 and another carbon dioxide, then a system linked to a thesaurus, taxonomy, or ontology could automatically join the two datasets into one. Or, if one would like an integrated dataset of all information on dissolved gases in a given lake, a search on "gases" linked to a taxonomy or ontology would return a dataset annotated with the concept CO2 that could then be automatically integrated with other datasets, depending on other system criteria.
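A minimal sketch of this kind of synonym-aware integration, assuming two structurally identical datasets whose keywords resolve to the same concept. The term table and the data themselves are invented for illustration.

```python
# Hypothetical thesaurus mapping each term to a preferred concept name.
preferred_term = {"carbon dioxide": "CO2", "CO2": "CO2"}

# Two invented datasets with the same structure but different keyword annotations.
dataset_a = {"keyword": "CO2",
             "rows": [("2005-06-01", 380.1), ("2005-06-02", 381.4)]}
dataset_b = {"keyword": "carbon dioxide",
             "rows": [("2005-06-03", 379.8)]}

def integrate(*datasets):
    """Merge rows from datasets whose keywords resolve to the same concept."""
    merged = {}
    for ds in datasets:
        concept = preferred_term.get(ds["keyword"], ds["keyword"])
        merged.setdefault(concept, []).extend(ds["rows"])
    return merged

combined = integrate(dataset_a, dataset_b)
print(len(combined["CO2"]))  # 3 rows joined under a single concept
```

A real system would also have to reconcile units, methods, and structure, but the core move is the same: resolve differing annotations to one concept before joining.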

Ontologies combined with automated reasoning can enable a broad array of more sophisticated functionality. Rather than attempt to describe that functionality in general terms, it is likely to be more useful to describe one specific example that illustrates the level of functionality that could be obtained. In SEEK, we have a biomass example that we have used repeatedly to ground our understanding of the practicalities of ontology usage (Figure 1). Given two datasets, one that contains information about plant species, cover area, and height, and a second that contains information about plant species and biomass, we "know" that there is a relationship between these two datasets (at least, those of us who work with plant biomass do). The reason we know that is that we have a conceptual model in our heads that can easily make the inference that we can calculate plant volume from area and height information, and that plant volume is related in context-specific ways to biomass. If we know that volume equals area times height, and we know the function to transform plant volume to biomass, then we can integrate these two datasets manually. However, if we have 1000 datasets of each and no devoted graduate research assistant to whom to delegate this task, this is not a desirable approach.

The general approach to automating this task duplicates our own reasoning. We need to formally encode a conceptual model in a language that the computer can understand (an ontology), specify how each concept in our datasets fits into that conceptual model, and find the algorithms on the system (tools) that can transform between those concepts (Figure 1). This is equivalent to following the path from AREA and HGT in the first dataset to the corresponding concepts in the ontology, following those to the Calculate Volume tool, using the output from that as input to the Calculate Biomass tool, then integrating the result with the second dataset (Figure 1).
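The chain of inferences just described can be caricatured as a small planner that applies a tool whenever the concepts it consumes are available, stopping once the goal concept has been derived. The tool registry, the column-to-concept mapping, and the allometric factor below are all invented for illustration; a real system would draw these from annotated tools and an ontology.

```python
# Hypothetical tools, registered by the concepts they consume and produce.
tools = {
    "calc_volume":  {"in": {"area", "height"}, "out": "volume",
                     "fn": lambda r: r["area"] * r["height"]},
    "calc_biomass": {"in": {"volume"}, "out": "biomass",
                     "fn": lambda r: 0.8 * r["volume"]},  # made-up allometric factor
}

def derive(record, goal):
    """Apply tools until the goal concept is present in the record."""
    record = dict(record)
    progress = True
    while goal not in record and progress:
        progress = False
        for tool in tools.values():
            if tool["out"] not in record and tool["in"] <= record.keys():
                record[tool["out"]] = tool["fn"](record)
                progress = True
    return record

row = {"species": "Larrea tridentata", "area": 2.0, "height": 0.5}
print(derive(row, "biomass")["biomass"])  # volume = 1.0, biomass = 0.8
```

The planner finds the two-step path (area, height) -> volume -> biomass on its own; the point of the ontology is to supply the concept annotations that make such chaining possible across many datasets and tools.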

Figure 1. Illustration of the use of ontologies to automatically integrate annotated datasets by way of annotated tools, the use of which can be logically inferred by a reasoning engine.

This fairly straightforward task explodes into complex details when implemented on the system. Logically, the system has to make two connections before figuring out the path: it must recognize that it needs plant volume to get to biomass, and that it can get plant volume from the area and height of plant species. It must infer a couple of dozen automatic steps to accomplish that, including locating each concept from the datasets in the ontology, figuring out how those concepts are related, determining which tools are related to the same concepts, and determining which tool(s) will make the correct transformation. What if there are another dozen columns not shown here? The system must locate all of those and determine any and all relationships between all column concepts, even if they aren't relevant to the immediate question (although the system can allow the user to specify which columns are needed, often all columns are integrated in order to keep the extra information).

The five explicit concepts used by the resources in question (data and tools) can be represented by different terms and organized in different ways. For instance, volume, area, and height are clear terms with clear mathematical relationships, but should we have a separate concept for plant volume and relate that to biomass, or do we force the system to figure out that it is plant volume by recognizing (somehow) that the area and height columns are related to the species name column, which contains only plant species? What is the right trade-off between ontology complexity and system complexity? Do we have a separate section of the ontology specific to plant measurements and link those to generic measurement terms? What are the implications of our choices for other applications that we have not yet thought about? This is a single example of functionality that could be provided by an ontology; there are many others that use the ontology in different ways, all of which bring their own set of complexities into the game.

To address these complexities, SEEK established a KR team made up of a number of people who participate at different times in different ways, but whose regular participants have been Mark Schildhauer (NCEAS), Josh Madin (NCEAS), Shawn Bowers (UC Davis), Ferdinando Villa (U Vermont), Sergei Krivov (U Vermont), and myself. Determining an appropriate generic framework from which to begin populating an ecological ontology has occupied many grueling hours of discussion by the team. Progress is slow and includes hours of abstract, philosophical discussion. For example, we recently spent two days discussing what, exactly, an observation is, how it relates to columns in a dataset, and what the philosophically correct way is to represent it in an observation ontology that will allow re-use of the data in ways not necessarily intended by the data collector. Is the entire row a single related observation or is each column an independent observation, and what are the implications of that decision for automated reasoning? What about columns that hang together, like columns designating blocks and replicates of a field experiment? Are those spatial concepts, experimental concepts, or some kind of hybrid, and how would you represent that in a generic ontology?

Few analogous ontologies have been developed and applied in other disciplines. Prior experience in medical and biological domains has been limited mostly to the integration of synonyms and hierarchical concepts. For instance, the Gene Ontology, which has been highly effective, is limited to expressing relatively few categories of concepts using very few kinds of relationships. Ecology, in contrast, is a science about a multitude of relationships between many different kinds of concepts, and any ontology that will be useful for the applications we have will necessarily be complex.

Ontology development in ecology is going to be a long, complex process. In the meantime, simpler, quicker methods such as data dictionaries and controlled vocabularies have an important role to play. They can greatly enhance our ability to discover and make sense of relevant information. They can inform ontology development by providing lists of relevant terms with which to populate the ontology once a framework is in place. They can provide paths by which resources could be (semi-)automatically annotated to ontologies. Conversely, once ontologies are in place, they can be used to inform the simpler methods. For instance, the decision to add a term to a controlled vocabulary could be informed by displaying terms that might be related. The challenge of leveraging these different approaches is one of navigating the different temporal scales of development such that we can clearly envision future linkages and work independently towards a collective goal. Many opportunities for collaboration exist, but many barriers to working across scale, institution, organization, and culture exist as well. The challenge of the human dimension may well be more difficult to overcome than any of the technical challenges.