DataONE to enable semantic searches for LTER NPP data
Margaret O'Brien (SBC)
Obstacles finding complex NPP data
The study of long-term patterns in primary production is one of the core research areas at every LTER site, and a typically reported measurement is “net primary production” (NPP), i.e., the amount of new organic material produced during a time interval. Each LTER site’s measurements of NPP are determined locally and depend on the organisms being studied, e.g., their sizes, growth rates and community composition, and so methods vary widely in scope (organism, community, ecosystem) and scale (temporal and spatial). Methods also have different assumptions and limitations. Without knowing these details, it can be difficult to ascertain whether any two groups of NPP values are comparable. Scientists conducting synthesis projects using LTER data need a) to accurately find data sets containing NPP with the appropriate dimensions, and b) to learn enough about the methods in different studies to evaluate compatibility with their needs (Figure 1).
To enhance all data searches, the LTER information managers designed a SKOS-based controlled vocabulary. Through its use of “narrower terms” and “synonyms”, this has helped to refine search results in the Network catalog. But because the vocabulary’s scope is very broad, the NPP-related terms number fewer than 20 and there are essentially no terms related to field methods. A data collection as complex as LTER NPP data would be better served by an ontological system having strong semantic relationships and expressivity. The development of such a system is not possible with our current resources. Additionally, the issues surrounding description of NPP data are not unique to LTER, and the most complete, robust solution will be developed by a collaboration of data scientists and informatics specialists from many communities.
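The way “narrower terms” and “synonyms” refine a search can be illustrated with a minimal sketch. The vocabulary fragment below is hypothetical and is not taken from the actual LTER controlled vocabulary; the names `Concept`, `VOCAB` and `expand_query` are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Concept:
    """A SKOS-like concept: one preferred label, synonyms, narrower terms."""
    pref_label: str
    alt_labels: List[str] = field(default_factory=list)  # synonyms
    narrower: List[str] = field(default_factory=list)    # narrower terms


# Hypothetical vocabulary fragment (illustrative, not the real LTER terms).
VOCAB: Dict[str, Concept] = {
    "primary production": Concept(
        "primary production",
        alt_labels=["primary productivity"],
        narrower=["net primary production", "gross primary production"],
    ),
    "net primary production": Concept(
        "net primary production",
        alt_labels=["NPP"],
        narrower=["aboveground net primary production"],
    ),
    "gross primary production": Concept(
        "gross primary production", alt_labels=["GPP"]
    ),
    "aboveground net primary production": Concept(
        "aboveground net primary production", alt_labels=["ANPP"]
    ),
}


def expand_query(term: str, vocab: Dict[str, Concept] = VOCAB) -> Set[str]:
    """Return the term, its synonyms, and all narrower terms (transitively)."""
    expanded: Set[str] = set()
    visited: Set[str] = set()
    stack = [term]
    while stack:
        label = stack.pop()
        if label in visited:
            continue
        visited.add(label)
        expanded.add(label)
        concept = vocab.get(label)
        if concept is None:
            continue  # unknown term: search on the literal text only
        expanded.update(concept.alt_labels)
        stack.extend(concept.narrower)
    return expanded


if __name__ == "__main__":
    print(sorted(expand_query("net primary production")))
```

Note what the sketch cannot express: nothing in it says how, where, or at what scale a value was measured, which is the kind of relationship an ontology would add.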
Figure 1. Methods for measuring NPP at LTER sites may have significant differences in temporal and spatial scales. Left, a chamber for measuring in situ NPP in a benthic algal community at the Santa Barbara Coastal LTER. Right, satellite image of NPP from the BigFoot site, Harvard Forest (image downloaded from ORNL DAAC).
The DataONE project began in 2009, and its large group of scientists and software engineers now coordinates hundreds of thousands of datasets from a diverse group of member nodes. LTER joined in 2012, contributing about 9000 records. DataONE recently entered Phase II, and its mission now includes solving specific problems for the earth science community. Its investigators are well aware of the difficulties that scientists face during data discovery, and plan cyberinfrastructure that will incorporate innovative and high-value features - among them, semantic technologies to enable precise data discovery and recall with measurement searches.
To develop ontological solutions for data discovery, DataONE must begin with a “use case” - a sample problem that is constrained in scope but complex enough to present a variety of potential obstacles, and that has a large corpus of data with rich metadata. LTER’s diverse primary production data is an ideal case for developing a semantic search system.
Our range of measurements, biomes, and methodologies, comprising approximately 2500 datasets, will ensure multiple benefits:
- DataONE is able to compare the effectiveness of an array of approaches
- LTER has a solution for one of its more complex data types
- The modeling patterns employed will be extensive enough to accommodate future work in other scientific domains
Broadly speaking, two major new components are needed for this semantic system: the ontology itself (also called a “knowledge model”), and the annotations to link datasets to concepts in the ontology. Many concomitant issues have been identified, particularly related to involvement with the scientific community, including versioning, ownership, and the design and usability of web interfaces. All new technology will build on pre-existing community practices for ontology and annotation formats and management. A working product is expected early in 2016, with a session planned for the 2015 LTER ASM to gather feedback on progress to date.
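As a rough illustration of how these two components fit together, the sketch below keeps a tiny class hierarchy (the “ontology”) separate from a list of annotations that link dataset attributes to concepts. All concept names, dataset identifiers and attribute names are hypothetical, and a production system would presumably rely on established ontology and annotation formats rather than Python dictionaries.

```python
from typing import Dict, List, NamedTuple, Optional

# Component 1: the ontology, reduced here to child -> parent links.
# Concept names are hypothetical placeholders.
SUBCLASS_OF: Dict[str, str] = {
    "AbovegroundNPP": "NetPrimaryProduction",
    "BenthicAlgalNPP": "NetPrimaryProduction",
    "NetPrimaryProduction": "CarbonFlux",
    "NetEcosystemExchange": "CarbonFlux",
}


class Annotation(NamedTuple):
    """Component 2: a link from one dataset attribute to one concept."""
    dataset_id: str
    attribute: str
    concept: str


# Hypothetical annotations; identifiers are invented for illustration.
ANNOTATIONS: List[Annotation] = [
    Annotation("example-dataset-1", "npp_chamber", "BenthicAlgalNPP"),
    Annotation("example-dataset-2", "anpp", "AbovegroundNPP"),
    Annotation("example-dataset-3", "nee", "NetEcosystemExchange"),
]


def is_a(concept: Optional[str], ancestor: str) -> bool:
    """True if `concept` equals `ancestor` or descends from it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False


def find_datasets(concept: str) -> List[str]:
    """Datasets with any attribute annotated as `concept` or a subclass."""
    return sorted(
        {a.dataset_id for a in ANNOTATIONS if is_a(a.concept, concept)}
    )


if __name__ == "__main__":
    print(find_datasets("NetPrimaryProduction"))
```

Keeping the annotations apart from the ontology means either one can be versioned and corrected without rewriting the other, which bears on the versioning and ownership issues mentioned above.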
A group evaluated the current landscape of knowledge modeling for NPP data and ascertained that there were no pre-existing comprehensive models for this scientific domain. And so, DataONE’s work will serve not just its own and LTER’s needs, but the greater NPP community as well. DataONE has already dedicated a developer with an extensive background in coding for semantic systems. A computer science postdoc works on text and unit-matching algorithms. An LTER information manager acts as a coordinator, along with informatics scientists who have created other ontological frameworks. Under this group’s direction, two graduate students working at UC Santa Barbara/NCEAS will work on various aspects of assembly and annotation.
One important activity will be an “annotation experiment” to compare precision and recall for datasets with no annotation (i.e., the current text search), with manual annotation, with fully automated annotation (based on text- and unit-matching algorithms), and with semi-automated annotation (automated, with additional choice/verification by users).
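For readers less familiar with the two metrics: precision is the fraction of returned datasets that are relevant, and recall is the fraction of relevant datasets that are returned. A minimal sketch of the calculation follows. The function name `precision_recall` and the dataset identifiers are invented for illustration and do not represent experimental results.

```python
from typing import Iterable, Tuple


def precision_recall(
    retrieved: Iterable[str], relevant: Iterable[str]
) -> Tuple[float, float]:
    """Compute (precision, recall) for one search.

    retrieved: dataset ids returned by the search
    relevant:  dataset ids a domain expert judged to be correct answers
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up example: 4 relevant datasets; the search returns 4 datasets,
    # of which 2 are relevant.
    relevant = {"d1", "d2", "d3", "d4"}
    retrieved = {"d1", "d2", "d5", "d6"}
    p, r = precision_recall(retrieved, relevant)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Running the same calculation over the result sets from each annotation approach, against a common expert-judged set of relevant datasets, would give directly comparable numbers.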
Extensions, relationship to other DataONE efforts
In addition to semantic search, DataONE also is implementing a system to enhance reproducibility by storing and indexing provenance trace information. The use case for this provenance work is the Multi-scale Synthesis and Terrestrial Model Intercomparison Project (MsTMIP), a comparison of carbon flux model results and observations, whose overall goal is to provide feedback to the terrestrial biospheric modeling community to improve the diagnosis and attribution of carbon sources and sinks. MsTMIP’s central measurement is Net Ecosystem Exchange (NEE), and MsTMIP has identified contributing measurements which are comparable to many LTER measurements, e.g., NPP.
MsTMIP’s needs differ from LTER’s. Because the MsTMIP project is concerned with model results, its needs are primarily to track data provenance, and data discovery issues are secondary. However, scientists using MsTMIP data will need to: a) know which MsTMIP model produced the dataset, and b) find appropriate benchmark data for model evaluation. Developing semantic structures for carbon flux should meet the discovery needs of both projects.
An initial focus on NPP data serves an important and timely need for the LTER, while setting the stage for future work. For example, continuing with the same projects, an extension could model and annotate the MsTMIP models themselves, which would also provide general examples for handling and modeling additional features of provenance. Another possibility is to improve discovery for data related to other carbon-cycling processes, e.g., oceanic carbonate system parameters for the study of ocean acidification. That work would almost certainly involve a broad community that intersects the LTER. Alternatively, we could tackle measurements in another LTER core research area. Nutrient flux is timely to consider because the Science Council plans to address this topic in upcoming synthesis projects. Semantic web technologies are an active area of development with high potential. But we know from our own experience that our data are too diverse to tackle all at once. We will be best served by breaking them into manageable chunks, and leveraging the work of those whose missions complement our own.