
Opening the Data Vault with the Drupal Ecological Information Management System

Issue: Fall 2013

Inigo San Gil (MCM), Kristin Vanderbilt (SEV), Corinna Gries (NTL), Jeanine McGann (UNM), Marshall White (LNO), Eda Melendez-Colom (LUQ), Aaron Stephenson (NTL), Jim Laundre (ARC), Hap Garritt (PIE), Ken Ramsey (JRN), Philip Tarrant (CAP), Ryan Raub (CAP), David Julian (CAP), Chau Chin Lin (TFRI), David Blankman (ILTER), Atzimba Lopez (Mex-LTER), Cristina Takacs-Vesbach (MCM), Palantir.net and Dave Reid.

This article covers two innovative features of the Drupal Ecological Information Management System (DEIMS): Faceted Searches and the Data Explorer.

Introduction

LTER sites have been charged with managing and archiving their data for long-term public use since the inception of the NSF LTER program. Because LTER research is site based, long-term observations are highly optimized for a particular ecosystem, and most datasets become reusable only when collection methods and sampling contexts are well documented. Hence, a large variety of datasets is currently curated at LTER sites, and a search for particular data can become a search for the proverbial needle in a haystack. With more than 6000 datasets published by the LTER sites, availability of data is no longer the problem; the problem now is optimizing the data search. DEIMS provides several approaches to a more successful data search, along with tools for initial data exploration.

DEIMS is a powerful tool for managing most information products associated with an LTER site, field station, or research lab. Significantly, it includes a web-based metadata editor for describing datasets and a module for generating EML-, BDP-, and ISO-compliant metadata.
In addition, DEIMS includes web-based publication management, research project information, people associated with the site, images, and other information as needed. All of these interlinked resources benefit from secondary relations formed using common keywords that an authorized DEIMS user selects from the LTER controlled vocabulary (Porter, 2013), the LTER Core Areas vocabulary, or other site-centric keyword families. The DEIMS site information management team completes the information curation process using the DEIMS workbench. Search results are displayed in DEIMS in various ways using Drupal’s (Dries, 2001) powerful ability to create custom listing pages (or views) of data.

Thus, based on Drupal’s (Dries, 2001) inherent capability of linking different information concepts, a data consumer can find datasets produced by a person directly on that person's profile page, linked to a research site on the page describing that site, linked to a specific research project on the page illustrating that project, or in connection with a journal publication. In addition to these different ways of accessing information on a typical DEIMS site, a faceted search has been implemented in the latest DEIMS version.

DEIMS Data Discovery

In past Databits issues (San Gil, Spring 2013 and Fall 2011; Gries et al., Spring 2010) we have written about DEIMS' approach to data discovery. Instead of a single Google-like interface to data discovery, DEIMS offers a variety of pathways to discover data (San Gil et al., 2010). A data consumer can search for data by a person's name (from the person profile page or data catalog interface), by association with a journal publication, or by particular location set or temporal range. Now, there is a new and even more specific DEIMS feature for the LTER data consumer -- faceted searches, the ability to narrow initial results from a broad search.

What is a faceted search?

Wikipedia defines faceted search as follows (Tunkelang, 2009): ‘Faceted search, also called faceted navigation or faceted browsing, is a technique for accessing information organized according to a classification system, allowing users to explore a collection of information by applying multiple filters. A faceted classification system classifies each information element along multiple explicit dimensions, enabling the classifications to be accessed and ordered in multiple ways rather than in a single, pre-determined, taxonomic order.’ Hence, faceted searching allows us to refine initial, broad, Google-like searches by grouping the search results under narrow sub-categories. This narrowing helps find the appropriate set of results among hundreds of relevant results from an initial search. Most people have used a similar feature on websites such as Google Shopping or a variety of retail sites; e.g., an initial search for a ‘refrigerator’ yields a long list of results that can be narrowed further by selecting price, brand, color, and a number of other filtering options, making the search experience better.

DEIMS has a number of facets or filters to improve your search experience as well. You can narrow the search for data sets by the lead principal investigator, by location, by duration, or through the tags used to classify the dataset.
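To make the idea concrete, here is a minimal Python sketch of faceting over an in-memory catalog. It is not DEIMS code (DEIMS is built on Drupal); the records, field names, and values are invented for illustration, and we assume each record is a simple dictionary.

```python
from collections import Counter

# Hypothetical catalog records; names and values are invented for illustration.
DATASETS = [
    {"title": "Stream discharge", "investigator": "A. Rivera",
     "location": "Site A", "tags": ["hydrology", "streams"]},
    {"title": "Soil temperature", "investigator": "B. Chen",
     "location": "Site B", "tags": ["soils", "climate"]},
    {"title": "Stream chemistry", "investigator": "A. Rivera",
     "location": "Site A", "tags": ["hydrology", "chemistry"]},
    {"title": "Air temperature", "investigator": "B. Chen",
     "location": "Site A", "tags": ["climate"]},
]


def facet_counts(records, field):
    """Count how many records fall under each value of one facet."""
    counts = Counter()
    for record in records:
        value = record[field]
        # A facet may be single-valued (investigator) or multi-valued (tags).
        counts.update(value if isinstance(value, list) else [value])
    return dict(counts)


def apply_facets(records, selected):
    """Keep only the records that match every selected facet value."""
    def matches(record, field, wanted):
        value = record[field]
        return wanted in value if isinstance(value, list) else value == wanted

    return [r for r in records
            if all(matches(r, f, w) for f, w in selected.items())]


# Narrow by two facets at once, then recompute the remaining tag counts.
narrowed = apply_facets(DATASETS, {"investigator": "A. Rivera", "tags": "hydrology"})
print([r["title"] for r in narrowed])
print(facet_counts(narrowed, "tags"))
```

Recomputing the counts after each selection is what lets a facet block show only the values that still apply to the current result set.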

The screenshot below shows the initial search interface for datasets -- by default a user sees the Google-like search box, and a constantly adapting listing of results matching the current search criteria:

DEIMS Faceted Search initial screen

A data consumer may start the initial search using the box, perhaps narrowing the result set to a fraction of all the dataset catalog holdings. As an example, let’s look at the initial result set using the search term “grassland warming”. The text search box provided by DEIMS implements the most common simple searches; e.g., this one would search for ‘grassland’ or ‘warming’ and provide a result set based on the index ranking from the combined terms. DEIMS adds the faceted search power: in the screenshot below, all applicable facets that narrow the results further appear in the blocks on the right-hand side:

Faceted Search initial resultset and facets
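The "either term" behavior of the search box can be sketched in a few lines of Python. This is only an illustration of OR matching with a naive score (the number of distinct query terms a record contains); the ranking actually used by the DEIMS search index is not described here and is likely more refined. The document titles and texts are invented.

```python
def rank_by_terms(query, documents):
    """Return (score, title) pairs for documents matching any query term."""
    terms = set(query.lower().split())
    results = []
    for title, text in documents.items():
        words = set(text.lower().split())
        score = len(terms & words)  # number of distinct query terms present
        if score > 0:
            results.append((score, title))
    # Highest score first; ties are broken alphabetically by title.
    return sorted(results, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical catalog text, invented for illustration.
DOCS = {
    "Warming chambers": "grassland warming experiment plant cover",
    "Grassland birds": "grassland bird census",
    "Lake ice": "lake ice duration records",
}

ranked = rank_by_terms("grassland warming", DOCS)
print(ranked)
```

A record matching both terms ranks above a record matching only one, and a record matching neither is dropped.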

Data consumers can further filter the results by choosing one of the many dataset owners, or by selecting a dataset timespan (duration). Another result-narrowing facet is the site-specific thematic keywords. A useful narrowing criterion is provided by the LTER Core Areas facet, the first block on the right in the figure above. If a data user browses the above page using a mobile device, these blocks are rendered at the bottom of the page instead of to the side.

By default, data facets are powered by the DEIMS relational database indexes. However, they can be configured to consume an index created by Apache Solr (Apache Foundation, 2010). Dataset facets can be extended with a variety of existing widgets, including graphs, slides, and tag clouds. The DEIMS faceted search feature extends the Drupal Facet API contributed module in conjunction with the Context contributed module. In addition to the documentation provided on the respective module pages, the documentation pages for the extension explain how to extend DEIMS' out-of-the-box capabilities. Further set-up documentation and help on how to configure the faceted search are accessible through the DEIMS project page.
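For readers curious about what a facet request against a Solr index looks like, the sketch below assembles one using Solr's standard `q`, `fq`, `facet`, and `facet.field` request parameters. In a DEIMS site the Drupal modules build such requests themselves; the base URL, the core name `deims`, and the field names `core_area` and `owner` are assumptions made up for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_facet_query(base_url, text, facet_fields, filters=None):
    """Assemble a Solr select URL that asks for facet counts."""
    params = [("q", text), ("wt", "json"), ("facet", "true")]
    # facet.field may be repeated, once per facet to be counted.
    params += [("facet.field", field) for field in facet_fields]
    # Each selected facet value becomes a filter query (fq).
    # Values containing double quotes would need escaping; omitted here.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and field names, for illustration only.
url = build_facet_query(
    "http://localhost:8983/solr/deims/select",
    "grassland warming",
    ["core_area", "owner"],
    {"core_area": "Disturbance Patterns"},
)
parsed = parse_qs(urlsplit(url).query)
print(url)
print(parsed)
```

The free-text search stays in `q`, while each facet selection is added as a separate `fq`, so selections can be added or removed without touching the original search terms.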

Once a dataset of interest has been found, a full data package is offered to the data consumer. However, the data consumer is often interested in only a subset of the data. DEIMS offers the data consumer the ability to explore the data and further narrow, dissect, and subset the original package.

Exploring LTER data with DEIMS - the Data Explorer

What is the Data Explorer?

The Data Explorer is a DEIMS feature that enables a user to connect to a relational database to expose and query its data holdings.

How do we use the Data Explorer?

The Data Explorer (DE) creates a query system for relational databases in DEIMS. When the DE module is configured, a data consumer may access a query page for each of the datasets catalogued and described in DEIMS. Each DE query page has two parts. The top part allows the data consumer to select which fields to query, effectively subsetting a data package. The lower part of the query page allows filtering by value ranges or thresholds on each of the data table columns.

For example, a hydrological table may have a date column, a temperature column, a discharge column, a conductivity column, and quality flag columns. The data consumer may just want to download the date and discharge values, which is done using the top part of the DE interface. In addition, the data consumer may want to narrow the data to a range of dates, set a temperature threshold, or exclude rows that have been tagged with quality control codes qualifying the validity of the overall measurements. This is done in the lower part of the Data Explorer. After submitting the query, the data consumer can either preview the results or download them as a comma-delimited file.
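Conceptually, the two parts of the query page map onto the two halves of a SQL statement: the checked fields become the SELECT list and the filters become the WHERE clause. The Python sketch below illustrates that mapping. It is not the DE module's code; the function name, table name, and column names are hypothetical, and the `?` placeholders follow the style used by Python's `sqlite3` driver (other drivers use different placeholder styles).

```python
ALLOWED_OPERATORS = {"=", "<", "<=", ">", ">=", "BETWEEN"}


def build_query(table, known_columns, selected, filters):
    """Build a parameterized SELECT from chosen columns and filters.

    filters is a list of (column, operator, value) tuples; BETWEEN takes
    a (low, high) pair as its value. The table name is assumed trusted.
    """
    # Identifiers cannot be bound as parameters, so check them against the
    # table's known column list before placing them in the SQL text.
    for column in list(selected) + [f[0] for f in filters]:
        if column not in known_columns:
            raise ValueError(f"unknown column: {column}")

    clauses, params = [], []
    for column, operator, value in filters:
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {operator}")
        if operator == "BETWEEN":
            clauses.append(f"{column} BETWEEN ? AND ?")
            params.extend(value)
        else:
            clauses.append(f"{column} {operator} ?")
            params.append(value)

    sql = f"SELECT {', '.join(selected)} FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# Hypothetical column names for the hydrological table described above.
COLUMNS = ["date_time", "water_temp", "discharge", "conductivity", "quality_flag"]
sql, params = build_query(
    "hydro",
    COLUMNS,
    selected=["date_time", "discharge"],
    filters=[("water_temp", ">=", 0.5), ("quality_flag", "=", "GOOD")],
)
print(sql)
print(params)
```

Keeping the filter values as bound parameters, rather than pasting them into the SQL text, is the usual way to keep user input from altering the query itself.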

The next series of examples will illustrate this process using data collected by the McMurdo Dry Valleys Stream Team at the Delta Stream, led by Diane McKnight. The particular DEIMS setup may vary a bit depending on look and feel customizations or other improvements, but should essentially resemble what is shown here.

We start with the DE Dashboard, a page that is accessed through the main menu (this example URL is the DEIMS default location http://example.com/data-explorer-dashboard):

Data Explorer Dashboard

Data exploration is initiated by clicking on the link “Explore DELTA HYDRO” in the right-hand column; the link leads to the Data Explorer query page, which is divided into two parts. The upper part allows us to select which of the columns we want to query and download data from.

Data Explorer selecting fields

Note that only the first two choices are checked because in this example we are only interested in the date that the discharge (stream flowrate) was measured and the discharge value itself. The rest of the data columns remain unchecked. Furthermore, in this example, we want to filter by a range of values. The lower part of the DEIMS DE query page allows us to create customized ranges to narrow the final results even further:

Filtering, narrowing data using DE

Using the datetime range filter, we have selected four years’ worth of data -- from November 11, 2007 to the same day in 2011. We also filtered for high-confidence values (tagged as “GOOD”: most accurate within 10%) and limited the results to discharge values at or below twenty liters per second. We could have added any other ranges to the rest of the column filters. It is important to use the checkbox to the left of the filter we want to apply.
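For readers who think in SQL, the selection and filters just described correspond roughly to the query in the following Python sketch, run here against an in-memory SQLite table. The table name, column names, the "FAIR" quality code, and all sample rows are invented for illustration; they are not the actual Delta Stream schema or measurements, and this is not the query the DE module generates.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE delta_hydro "
    "(date_time TEXT, discharge REAL, water_temp REAL, quality TEXT)"
)
# Made-up sample rows; each comment says how the filters treat the row.
conn.executemany(
    "INSERT INTO delta_hydro VALUES (?, ?, ?, ?)",
    [
        ("2007-11-10 12:00", 5.0, 1.2, "GOOD"),   # before the date range
        ("2008-01-05 12:00", 12.5, 2.0, "GOOD"),  # kept
        ("2009-12-20 12:00", 35.0, 3.1, "GOOD"),  # discharge above 20 L/s
        ("2010-01-15 12:00", 8.0, 1.8, "FAIR"),   # not tagged GOOD
        ("2011-01-02 12:00", 20.0, 2.4, "GOOD"),  # kept (exactly at the limit)
    ],
)

# Only two columns are selected; three filters are applied together.
# ISO-style timestamps stored as text compare correctly as strings.
rows = conn.execute(
    """
    SELECT date_time, discharge
    FROM delta_hydro
    WHERE date_time BETWEEN ? AND ?
      AND quality = ?
      AND discharge <= ?
    ORDER BY date_time
    """,
    ("2007-11-11 00:00", "2011-11-11 23:59", "GOOD", 20.0),
).fetchall()

# Write the result as comma-delimited text, as a download would.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["date_time", "discharge"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Unchecking a filter in the query page corresponds to dropping one of the conditions from the WHERE clause.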

Finally, there are the “Web Preview” and “Download” buttons at the bottom of the query page. Be patient with the “Download” for large datasets. Depending on the query, you may be requesting gigabytes worth of data!

To conclude this example, here is what a “Web Preview” of the results would look like for the example we are illustrating:

Preview query results with the DEIMS Data Explorer

DEIMS DE requires adopters to transfer file-based data into a relational database. Currently DEIMS DE connects with Oracle's MySQL, the MySQL fork MariaDB, and Postgres, but it can be extended to connect to Oracle R-series databases and other flavors (SQLite, Microsoft SQL Server, etc.). Originally, the DEIMS group wanted to task the contractors with the development of a similar query directly against comma-delimited files and spreadsheets; however, funding constraints forced the group to limit the DE feature to relational databases. We will pursue funding to extend these features. Documentation on how to configure the DE is available at the DEIMS project pages. You can always contact the authors of this article for help or assistance in this or any other DEIMS matter.

The DEIMS Data Explorer module is inspired by the original North Temperate Lakes LTER Data Catalog, a custom development led by Barbara Benson and transferred to the current NTL DEIMS Data Catalog (based on Drupal 6, and developed by Preston Alexander and others). Supplemental funding to NTL was allocated to the DEIMS project specifically to expand the original DEIMS Data Catalog module. The main developer, Dave Reid, was tasked with the generalization of the module with the goal of making the DE accessible to any LTER site and beyond.

Concluding remarks and next steps

In this article, we have covered the DEIMS grassroots efforts in expanding access to LTER data vaults. DEIMS exposes data holdings using state-of-the-art discovery methods, including faceted searches. Another feature we have covered is the ability to perform data subsetting, avoiding unnecessary downloads of massive datasets. DEIMS exports the metadata and data contents using the LTER-adopted metadata specification -- the Ecological Metadata Language, with PASTA-ready compliance. In addition, DEIMS offers its metadata holdings using the Biological Data Profile specification, a profile of the Content Standard for Digital Geospatial Metadata used by US federal agencies. International visitors may use DEIMS' ability to export metadata formatted using the International Organization for Standardization (ISO) standards 19115, 19109 and 19110, expressed as an extension of the ISO standard 19139 XML implementation.

DEIMS development is ongoing. We are in the midst of adding charts and graphs to the DE outputs, an effort conducted in collaboration with our international partners at ILTER, specifically the IM committee chaired by David Blankman. Also, we will produce a road map for indexing metadata using Apache Solr, a technology already in use by Tai-Bif (Shao et al., 2013) and DEIMS instances at the Taiwan Forestry Research Institute. We are also exploring adding more community-developed widgets and styles to the DEIMS search facets.

We will seek funding to foster the growing DEIMS community, an essential component to sustain the current momentum of standardization within LTER sites. Specifically, we need to nurture DEIMS training in both the use and management of facets. It has been a relatively long time since we have conducted training. We also need funding to continue developing the system and addressing a long list of feature requests which were unaddressed in the most recent development sprint (March through August). DEIMS' presence at professional meetings and conferences is vital for community adoption.

Just like the White House and thousands of others, we contribute back to the Drupal community: some of the extensions sponsored by the NSF-funded LTER DEIMS are already in use by other projects. The co-author list of this article would run into the hundreds if we were to include the developers who have contributed to DEIMS either by contributing to the Drupal core or to any of the 80+ Drupal contributed modules that DEIMS leverages. We invite all LTER sites to actively participate in DEIMS, as well as any person or group interested in being a part of a common solution to all information management. The group effort is precisely the main strength of DEIMS: a unified approach and solution to information management for sites, stations, and research projects.

Citations:

Apache Foundation. "Apache Solr". Accessed December 2013 http://lucene.apache.org/solr

Dries, B. "The Drupal Content Management System". 2001. Accessed December 2013 at http://drupal.org

Gries, C.; San Gil, I.; Vanderbilt, K.; and Garritt, H. 2010. Drupal developments at the LTER Network. Databits, Spring 2010. http://databits.lternet.edu/spring-2010/drupal-developments-lter-network.

Porter, J. LTER Controlled Vocabulary Working Group planned. Databits, Spring 2013. http://databits.lternet.edu/spring-2013/lter-controlled-vocabulary-workshop-planned

San Gil, I. 2011. The Drupal Ecological Information Management System (DEIMS): Recent Progress and Upcoming Challenges for a Grassroots Project. Databits, Fall 2011.
http://databits.lternet.edu/fall-2011/drupal-ecological-information-management-system-deims-recent-progress-and-upcoming-challen.

San Gil, I. 2013. The New Drupal Ecological Information Management System (DEIMS). Databits, Spring 2013. http://databits.lternet.edu/spring-2013/new-drupal-ecological-information-management-system.

San Gil, I.; White, M.; Melendez-Colom, E.; and Vanderbilt. K. 2010. Case Studies of Ecological Integrative Information Systems: The Luquillo and Sevilleta Information Management Systems. Communications in Computer and Information Science 108:18-35. DOI: 10.1007/978-3-642-16552-8_3. http://www.springerlink.com/content/j183x10588574846/.

Shao, K.T., Lai, K.C, Lin, Y. C., Chen, L. S., Li, H.Y., Hsu, C.H., Lee, H., Hsu H. W. and Mai G. S. "Experience and Strategy of Biodiversity Data Integration in Taiwan" Data Science Journal 12 (2013): 27

Tunkelang, D. 2009. Faceted Search. Synthesis Lectures on Information Concepts, Retrieval, and Services, Vol. 1, No. 1, Pages 1-80 (doi: 10.2200/S00190ED1V01Y200904ICR005). As cited in Wikipedia.