
GCE Data Search Engine: A Client-side Application for Metadata-based Data Discovery and Integration

Issue: Spring 2005

- Wade Sheldon (GCE)

INTRODUCTION

As the number and diversity of data sets in the GCE Data Catalog have grown over the past five years, GCE investigators have found it increasingly tedious to find and download all project data relevant to their particular questions or analyses. Over the past year we have also significantly expanded the scope of the GCE Data Portal web site to include more ancillary near-real-time and historic data sets relevant to the GCE site; however, most portal data sets were not LTER-funded and are therefore not included in the GCE Data Catalog, requiring investigators to find and download data files using an entirely different web interface. Consequently, the information management effort required to help users locate and integrate data for their research projects has been steadily increasing, limiting the resources available for other IM activities. A more comprehensive end-user solution for data discovery, access and integration was clearly needed.

SEARCH STRATEGY

The first challenge was to identify the basic search strategy to use, including the metadata source and content to target. At GCE, we primarily store metadata for core project data in a normalized relational database management system (RDBMS), which supports very comprehensive metadata-based searches using SQL. However, metadata for hundreds of ancillary and provisional data sets are not currently managed in this database and would therefore not be searchable. In order to support searching of all GCE data sets, we chose instead to initially target the structured metadata and data storage standard we developed in the first year of our project and continue to use for primary data processing, distribution and archival (i.e. the GCE Data Structure specification). This standard, based on MATLAB® structure arrays, combines parseable documentation metadata, attribute metadata, data table and QA/QC information into a single computer-readable data package (1,2). The metadata content stored in these data structures is a complete implementation of ESA's FLED standard as described by Michener (3), the same comprehensive standard on which our RDBMS and much of EML is based.

In order to efficiently search for information stored in data structures, which are typically archived as MATLAB binary files on a computer file system, we first developed a comprehensive file indexing application. This application evaluates data structures in any number of directories and subdirectories and generates an optimized search index structure containing complete file details and searchable metadata. We initially chose the following metadata content to index; however, the application logic is generic and can be modified to index any available content (a sketch of a representative index record follows the list):

  • General Metadata
    • Title (text field)
    • Abstract (text field)
    • Keywords (text array)
    • Data Set Themes and Core Areas (text field)
    • Methods (text field)
    • Study Descriptors (text field)
    • Authors (text field)
    • Taxonomic Names (text array)
  • Temporal Coverage
    • Study Begin Date (serial date)
    • Study End Date (serial date)
    • Public Release Date (serial date)
  • Spatial Coverage
    • Study Sites (text array)
    • West Bounding Longitude, decimal degrees (floating-point number)
    • East Bounding Longitude, decimal degrees (floating-point number)
    • South Bounding Latitude, decimal degrees (floating-point number)
    • North Bounding Latitude, decimal degrees (floating-point number)
  • Data Table Attributes
    • Attribute Names (text array)
    • Attribute Units (text array)
    • Attribute Variable/Semantic Types (text array)
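
To make the index layout concrete, the sketch below shows roughly what a single index record covering the fields above could look like as a MATLAB structure. All field names and values here are illustrative assumptions rather than the actual GCE Data Toolbox index format.

  % Hypothetical sketch of one search index record as a MATLAB structure
  % (field names and values are for illustration only)
  rec.path       = 'D:\gce_data\hydrography';        % indexed directory
  rec.filename   = 'example_ctd_cast.mat';           % archived GCE Data Structure file
  rec.title      = 'Example CTD survey data set';    % general metadata (text fields)
  rec.abstract   = 'Hydrographic profiles collected along an estuarine transect';
  rec.keywords   = {'ctd','salinity','temperature'}; % text arrays stored as cell arrays
  rec.authors    = 'Smith, J.';
  rec.date_begin = datenum('01-Jan-2001');           % temporal coverage (serial dates)
  rec.date_end   = datenum('31-Dec-2001');
  rec.sites      = {'site_A','site_B'};              % spatial coverage
  rec.w_lon      = -81.50;                           % bounding box, decimal degrees
  rec.e_lon      = -81.10;
  rec.s_lat      = 31.30;
  rec.n_lat      = 31.60;
  rec.columns    = {'Date','Salinity','Temperature'};   % data table attributes
  rec.units      = {'serial day','PSU','degrees C'};
  rec.var_types  = {'datetime','data','data'};

  index(1) = rec;   % a search index is then simply an array of such records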

For most GCE data sets, all targeted content can be parsed directly from corresponding metadata fields; however, we quickly realized that spatial and temporal coverage metadata for some classes of data sets are often incomplete. For example, data sets for many hydrographic studies only include study site metadata for the primary cruise transect of interest but not the various marsh-oriented sites also intersected by the cruise track. In addition, data sets produced by investigators for their own use (another target for this technology) often contain detailed temporal and spatial information in the data table but no corresponding coverage metadata at all. Consequently, we also included data mining logic to augment geographic and temporal coverage metadata during indexing (i.e. using attribute descriptors to identify date/time and geographic coordinate data, perform any necessary transformations, and then run detailed temporal and geospatial analyses and GCE geographic database lookups to populate metadata fields).
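
As a simplified illustration of this augmentation step, the fragment below derives begin/end dates and a geographic bounding box directly from date and coordinate columns when the corresponding coverage metadata are missing. Variable names are assumptions, and the unit transformations and geographic database lookups mentioned above are omitted.

  % Sketch: derive missing coverage metadata from the data table itself
  % (hypothetical variable names; transformations and site lookups omitted)
  dates = datenum(date_strings);     % convert text date/time values to serial dates
  rec.date_begin = min(dates);       % temporal coverage from the observed range
  rec.date_end   = max(dates);

  rec.w_lon = min(lon);              % bounding box from observed coordinates
  rec.e_lon = max(lon);              % (lon/lat assumed to be decimal degrees,
  rec.s_lat = min(lat);              %  with west longitudes negative)
  rec.n_lat = max(lat);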

In conjunction with the data indexing engine, we also developed a flexible search application for querying indices based on parsing delimited lists of criteria (e.g. Keywords = ctd, PAR; DateStart = 01-Jan-2001; DateEnd = 01-Jan-2002; Columns = Salinity; ...) and returning file details and descriptions for all matching data sets. Various text comparison options are supported (i.e. contains, starts with, exact match), but these options are currently set in a configuration file on a per-index-field basis to simplify the query syntax. Negative search criteria can also be specified for any text field or text array field to filter out corresponding matches (e.g. 'Keywords = ctd, -PAR' matches all data sets containing the keyword 'ctd' but excludes any matches that also include 'PAR'). Additional search options can be specified to force case sensitive text comparisons, set bounding box comparison type (fully contained or intersecting), and set overall query type (all criteria matched, any criteria matched). Logic is also included to support compound-field comparisons (e.g. 'Column+Units = Salinity PSU').
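
The fragment below parses a delimited criteria string of the kind shown above into field/value pairs and separates negative criteria from positive ones. It is a simplified illustration of the query syntax, not the actual GCE search code, and the structure field names are assumptions.

  % Sketch: parse a delimited criteria list into a query structure
  % (simplified illustration only -- not the production parser)
  criteria = 'Keywords = ctd, -PAR; DateStart = 01-Jan-2001; DateEnd = 01-Jan-2002; Columns = Salinity';

  query = struct('field',{},'include',{},'exclude',{});
  terms = strsplit(criteria, ';');                    % split individual criteria
  for n = 1:length(terms)
     parts = strsplit(terms{n}, '=');                 % separate field name and value list
     vals  = strtrim(strsplit(parts{2}, ','));        % split the delimited value list
     neg   = strncmp(vals, '-', 1);                   % leading '-' flags negative criteria
     query(n).field   = strtrim(parts{1});
     query(n).include = vals(~neg);
     query(n).exclude = regexprep(vals(neg), '^-', '');
  end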

In designing these applications, we tried to strike a balance between high specificity (fast development, higher performance, poor reusability) and high generality (slow development, lower performance, higher reusability). We chose metadata content, query syntax and indexing strategies optimal for current GCE needs, but used generalized approaches that could easily be adapted to other tasks in the future. As a result, we were able to achieve very high performance despite MATLAB's relative lack of text processing sophistication. For example, fully indexing 370 complex data sets (i.e. over 300 megabytes of data) takes approximately 90 seconds on a modern PC, and complex searches on this index can be executed in 0.05-0.25 seconds, providing instantaneous results for users.

USER INTERFACE DESIGN

Figure 1. GCE Data Search Engine interface

After completing the data indexing and search applications, we developed a comprehensive graphical user interface (GUI) search engine application for building and managing search indices, defining queries, and managing result sets from searches (Figure 1). Several different prototype designs were considered, drawing inspiration from the ongoing LTER Metacat Query Interface design process and the NBII Mercury search interface, but in the end a multi-paned, single-form design was chosen to simplify the initial implementation.

The top pane includes a scrolling list of all indexed paths and the number of data sets each contains, with buttons for adding and removing paths and refreshing the index to remove deleted files, add new files and re-index modified files in all listed directories. Below the path list, various GUI controls are also included for entering search criteria and search options to create a query, plus the main 'Search' button for executing the search. Geographic bounding box coordinates can be entered manually, or selected by dragging a box on an interactive map of the GCE study area.

The middle pane contains a scrolling list of all successful queries, along with buttons to manage this list. Prior queries can be reloaded from this list at any time to fill in search criteria fields, allowing users to build up standard queries which they can modify or re-execute against new or updated indices. Query logging can also be disabled and this pane can be hidden to simplify the form and make more room on screen for search results.

The bottom pane contains a cumulative list of all data sets returned from various queries the user has performed. The general location (local or web, see below), accession id, title and study date range are displayed for each data set. Double clicking on any record loads the data set and displays complete formatted metadata in one of several user-selectable styles. The upper button panel in this pane can be used to manage the result set (i.e. sort, select, clear records), and the lower button panel loads the selected data sets into various GCE data analysis applications for editing, visualization (plotting, mapping), statistical analysis, and other operations.

In addition to the main form controls, menu options are also provided to set various user options and to perform batch operations on selected data sets, including copying, exporting in various text and MATLAB formats, and merging to create a composite data set (i.e. by performing a metadata-based union, matching columns by name, data type and units, and merging metadata contents and QA/QC flags). Users can also save complete "workspace" files to disk (i.e. containing the search index, query history, dialog selections and result set), allowing them to persist individual search sessions and reload them for instant start up. A default workspace file is also saved on program exit, so users can pick up exactly where they left off without needing to create or load a new search index.
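
As a rough illustration of the metadata-based union used for merging, the fragment below matches columns between two data sets by name and units before concatenating their values. Variable names are hypothetical, and the data type checks, metadata merging and QA/QC flag handling performed by the real toolbox are omitted.

  % Sketch: match columns of two data sets by name and units before merging
  % (hypothetical names; type checks, metadata and flag merging omitted)
  merged = struct('name',{},'units',{},'values',{});
  for n = 1:length(ds1.name)
     m = find(strcmp(ds2.name, ds1.name{n}) & strcmp(ds2.units, ds1.units{n}));
     if ~isempty(m)
        merged(end+1).name = ds1.name{n};
        merged(end).units  = ds1.units{n};
        merged(end).values = [ds1.values{n} ; ds2.values{m(1)}];   % row-wise union
     end
  end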

TESTING AND DEVELOPMENT

Input from various GCE investigators and technical staff was solicited throughout the design and development process, and a number of refinements were made based on this feedback. For example, users requested the ability to search for data sets that contain a specified date, so support was added for both date-range and contains-date queries (selected using a drop-down menu on the search form). Feedback from GIS experts at UGA was also used to refine the logic for intersecting-type bounding box searches in order to eliminate spurious matches to un-sampled interior regions in multi-site survey data sets (i.e. data sets with large overall bounding boxes). Early user testing was also instrumental in the debugging process, particularly for cross-platform file system issues.
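
One way to picture the refined spatial matching is sketched below: in addition to the basic bounding box tests, an intersecting-type match requires at least one indexed study coordinate to fall inside the query box, so that large multi-site bounding boxes do not match queries on their un-sampled interiors. This is a simplified guess at the logic rather than the actual implementation.

  % Sketch: spatial matching tests (simplified guess at the refined logic)
  % query box [qw qe qs qn], data set box [dw de ds dn], site coords slon/slat
  contained  = dw >= qw & de <= qe & ds >= qs & dn <= qn;    % box fully contained
  intersects = dw <= qe & de >= qw & ds <= qn & dn >= qs;    % boxes overlap at all
  has_site   = any(slon >= qw & slon <= qe & slat >= qs & slat <= qn);
  match      = contained | (intersects & has_site);          % suppress spurious matches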

After initial testing was complete and the documentation was finished, the new applications were added to the complete suite of MATLAB-based data processing and analysis programs developed at GCE (i.e. GCE Data Toolbox for MATLAB), and a series of beta versions were posted on the private GCE web site for project member access in Fall 2004. In March 2005, a compiled version of the enhanced toolbox was also released for public access (http://gce-lter.marsci.uga.edu/lter/research/tools/toolbox_download.htm).

INTEGRATION INTO THE GCE INFORMATION SYSTEM

As stated in the introduction, the main objective in developing the GCE Data Search Engine was to enable users to seamlessly search all GCE data holdings and assemble data sets of interest for analyses. However, the file-based nature of the search system we developed requires that all target data sets be available on a local or network-accessible file system for indexing. In order to accommodate this requirement, automated routines were developed to regularly archive all data sets in the GCE Data Catalog and GCE Data Portal, along with provisional monitoring data, in Zip format and upload them to the private GCE web site for single-point download. We also create and distribute CDs to GCE members on request, containing all data sets and the latest version of the GCE Data Toolbox software.

A more elegant solution was also identified shortly after the first beta release of the software. Recent versions of MATLAB include support for network file access via HTTP and FTP, so a procedure was developed to substitute local file paths in search indices with the corresponding HTTP URLs on the GCE web site. Code was then added to the GCE Data Search Engine to support web-based files in search indices (including user registration, transparent downloads, and web cache management), and menu options were added to let users download and incorporate pre-generated indices of all publicly accessible GCE data sets from a stable URL on the GCE web site. This capability dramatically increased the utility of the application, allowing users to simultaneously search for data stored anywhere on their local file system as well as on the GCE web site, and then assemble, transform and analyze these data in one integrated environment. It has also provided GCE users with powerful batch processing capabilities that previously required custom MATLAB scripting.
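
A minimal sketch of the transparent download and web cache behavior, assuming a hypothetical index record with an HTTP URL field: the file is fetched with MATLAB's urlwrite function only when it is not already present in a local cache directory, then loaded like any local file. User registration handling is omitted.

  % Sketch: transparent download of a web-based index entry to a local cache
  % (the rec.url field and cache location are assumptions)
  cachedir = fullfile(tempdir, 'gce_search_cache');      % local web cache directory
  if ~exist(cachedir, 'dir')
     mkdir(cachedir);
  end
  [pth, fn, ext] = fileparts(rec.url);                   % derive a cache file name
  cachefile = fullfile(cachedir, [fn ext]);
  if ~exist(cachefile, 'file')
     urlwrite(rec.url, cachefile);                       % fetch over HTTP when not cached
  end
  vars = load(cachefile);                                % then load as a local MATLAB file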

This software has also proven useful in other areas of the GCE Information System. For example, data indexing routines were used to develop automated software for generating data set summary and detail web pages (based on standard HTML templates) for automatically-harvested data posted on the GCE Data Portal web site. We can now provide detailed, user-friendly web pages for data dissemination (and visualization) with no database or web application server overhead. With this technology, a single computer with MATLAB and Apache could function as a stand-alone data harvesting platform and high-demand data distribution server.
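
The page-generation step can be pictured as simple token substitution of index metadata into a standard HTML template, roughly as sketched below. The template tokens and file names are assumptions for illustration.

  % Sketch: generate a static data set summary page from an HTML template
  % (template tokens and file names are hypothetical)
  html = fileread('dataset_template.html');              % template with placeholder tokens
  html = strrep(html, '[[TITLE]]',     rec.title);
  html = strrep(html, '[[ABSTRACT]]',  rec.abstract);
  html = strrep(html, '[[DATERANGE]]', [datestr(rec.date_begin) ' to ' datestr(rec.date_end)]);

  fid = fopen('example_dataset_summary.html', 'w');      % write the generated page
  fprintf(fid, '%s', html);
  fclose(fid);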

FUTURE PLANS

This software has already proven very useful for a number of data synthesis projects at GCE, and we will continue to refine it and improve its functionality. We also intend to apply the lessons learned to develop improved web-based query interfaces for the GCE Data Catalog, as well as standardized LTER data query interfaces.

REFERENCES

  1. Sheldon, W.M. 2001. A Standard for Creating Dynamic, Self-documenting Tabular Data Sets Using MATLAB. DataBits: An electronic newsletter for Information Managers, Spring 2001 issue. (http://intranet.lternet.edu/archives/documents/Newsletters/DataBits/01spring/)
  2. Sheldon, W.M. 2002. GCE Data Toolbox for MATLAB -- Platform-independent tools for metadata-driven semantic data processing and analysis. DataBits: an electronic newsletter for Information Managers, Fall 2002 issue. (http://intranet.lternet.edu/archives/documents/Newsletters/DataBits/02fall/)
  3. Michener, William K. 2000. Metadata. Pages 92-116 in: Ecological Data - Design, Management and Processing. Michener, William K. and James W. Brunt, eds. Blackwell Science Ltd., Oxford, England.