Skip to Content

Mining and Integrating Data from ClimDB and USGS using the GCE Data Toolbox

Printer-friendly versionPrinter-friendly version
Issue: 
Spring 2006

- Wade Sheldon (GCE)

Introduction

Climate and hydrographic data are critically important for most long-term ecology studies, so integrated climate and hydrography databases such as ClimDB/HydroDB and the USGS National Water Information System represent major resources for ecologists. Both ClimDB/HydroDB and USGS also have web-based query interfaces, which provide convenient access to data and basic plots from many monitoring stations across the country. These centralized databases therefore significantly aid users with the first two phases of any data synthesis project: data discovery and data access.

Data synthesis doesn't stop with the initial download, though, and many users I've worked with quickly become frustrated performing the remaining steps that are typically required. For example, common follow-up tasks include parsing and importing data into spreadsheets or analytical software, assigning or harmonizing attribute names and units, and integrating data from multiple stations for comparative analysis. Automating these operations is highly desirable, but usually requires custom programming and is not practical for most researchers. Consequently some students and researchers avoid data synthesis all together, viewing it as either too tedious or difficult, while others request help with synthesis tasks from information management staff, adding to their workload.

As I've described in several prior Data Bits articles (1,2,3), at GCE we have developed a flexible environment for metadata-based data transformation, quality control, analysis and integration using the multi-platform MATLAB programming language (i.e. GCE Data Toolbox). This software was also used to develop an automated USGS data harvesting service for HydroDB that contributes near-real-time hydrographic data on behalf of 10 LTER sites to the ClimDB/HydroDB database on a weekly basis (4). In the remainder of this article I describe new data mining features recently added to this toolbox that allow users to interactively retrieve data from any station in ClimDB/HydroDB or the USGS NWIS (using MATLAB 6.5 or higher), and then easily transform and integrate these data sets to perform synthesis on their own.

 

Mining ClimDB/HydroDB

fig 1. ClimDB/HydroDB data mining dialog
fig 1. ClimDB/HydroDB data mining dialog
(click on image for larger version)

Retrieving data using the ClimDB/HydroDB web application is a multi-step process, starting with site selection and user registration, followed by station and parameter selection, then specification of download options. In order to automate this process on behalf of users a complete "variables" report is initially retrieved from the ClimDB web server in XML format. This report is then parsed into delimited text format using XSLT and cached as a data table to provide an up-to-date list of stations and available parameters, which can be updated at any time on demand. This parameter table is used to populate controls on a GUI form (fig 1), which also prompts for and caches user name and affiliation information to pass to ClimDB along with each data request.

When the user selects a site using the drop-down menu at the top of the dialog, all available stations for that site are automatically displayed along with the temporal coverage for each. Pressing the "Select Station" button (or double clicking on the entry) copies the relevant site and station information to the data request form and fills in the complete date range, which the user can then fine-tune to cover only the date range of interest. Pressing the "Proceed" button sends a series of requests to the ClimDB web server via HTTP POST commands (with progress displayed after each step), until the requested file is transmitted and saved locally. The retrieved file is then parsed and transformed into a GCE Toolbox-compatible file format and loaded into the toolbox data set editor window for analysis, transformation and display (see http://gce-lter.marsci.uga.edu/lter/research/tools/toolbox_applications.htm). At present, daily data are requested for all available parameters, relying on GCE toolbox functions for data sub-setting and temporal re-sampling, but other options may be added in the future.

On the surface, the work flow just described only represents a convenient way to query ClimDB/HydroDB using MATLAB; however, several important transformations occur automatically during the data import step that can significantly streamline synthesis. For example, the GCE toolbox supports the use of user-defined metadata templates, which can be applied whenever data are imported from a particular source. These templates contain boilerplate documentation metadata as well as a table of attribute descriptors and QA/QC flagging rules for every variable that may be present in the source data (i.e. matched by field name). A default ClimDB template is included with the toolbox distribution, but users can edit this template in a GUI dialog and specify their own site-specific names and descriptions for ClimDB attributes (as well as additional QA/QC criteria), causing ClimDB attributes to be "aliased" automatically each time new data are downloaded. In addition, any flags assigned to values in the ClimDB/HydroDB database are automatically converted to GCE toolbox flag arrays, allowing users to take full advantage of numerous flag-handling options supported by toolbox functions (e.g. selective or comprehensive removal of flagged values, display of flags above values on data plots, interactive graphical flag editing using the mouse, and automatic documentation of flagged values in derived data sets resulting from aggregation, temporal re-sampling and bin-averaging operations, with optional flagging of derived parameters based on number of flagged values in the corresponding aggregate or bin).

Mining USGS

 fig 2. USGS NWIS data mining dialog

fig 2. USGS NWIS data mining dialog
(click on image for larger version)

A very similar dialog is also available for retrieving data from any station in the USGS NWIS database (fig. 2). When the user selects a state or territory using the drop-down menu, a list of all available stations is displayed, along with short descriptions from USGS. Pressing the "Select Station" button (or double-clicking on the station entry) loads that station ID into the request form. Unlike ClimDB/HydroDB, USGS does not support querying by specific date range, so users are prompted for the number of past days of data to request. Users can request recent provisional (near-real-time, high frequency) data, recent daily data, or finalized daily data, and requests are automatically routed to the correct server and application on the USGS server. USGS limits provisional near-real time data requests to approximately 30 days and recent daily data to 720 days, so longer requests will only return the maximum allowed data records for the respective data type. In addition, finalized data are typically released mid-way through the subsequent year, so there is typically a gap of 6 to 18 months at the end of finalized data sets.

As with ClimDB/HydroDB data, users can create or modify the included metadata template to alias data columns, and any USGS-assigned Q/C flags will be converted to GCE toolbox flag arrays. Since USGS data are generally provided in English units, which are rarely acceptable for scientific work, automatic English-to-Metric unit conversions are also performed by the import filter. Both the table of unit conversions and English-Metric unit mappings can be customized by end-users using GUI dialogs, which are accessible from menus in the data set editor window. 

Transforming and Integrating Data

As already mentioned, several transformations are automatically applied each time data are mined from either ClimDB/HydroDB or USGS. Some additional transformations that can be applied interactively after data are imported into the GCE toolbox include:

  • Additional unit conversions and column aliasing
  • Deletion and re-ordering of columns
  • Creation of calculated data columns
  • Creation of additional date/time columns (e.g. YearDay from Date, Serial day from date components)
  • Data sub-setting or filtering based on values in any number of columns or based on mathematical expressions

Data from multiple stations can be integrated by performing pair-wise joins between data sets based on related columns (most commonly serial date or date component columns in this case), returning join columns and selected data columns from both sources in the final derived data set. In order to facilitate joining multiple data sets via a series of pair-wise joins (i.e. without first saving all data sets individually to disk), in-memory data sets can be copied to the results pane of the GCE Search Engine dialog (http://gce-lter.marsci.uga.edu/lter/research/tools/toolbox_search.htm). Using this dialog, two data sets can be selected for joining, and then the resulting derived data set will automatically be added to the list of available data sets, permitting additional join operations to be performed (in addition to other operations, such as plotting, mapping and viewing metadata contents).

A distinct benefit of using the GCE Data Toolbox for post-processing and synthesis is that all actions performed during data importing and all transformations performed by the user are automatically logged to the data set metadata, allowing the complete lineage of the data set to be viewed at any point in the process (in addition to general documentation and full attribute metadata). The user can also choose to export data in CSV or another delimited text format or copy columns to the base MATLAB workspace at any point to analyze the data using other tools or their own custom algorithms.

Application in Automated Workflows

The screen shots and descriptions in this article have emphasized interactive use of the GCE Toolbox based on GUI dialogs, which I believe would probably appeal to most potential users; however, all operations described are based on documented command-line functions that are well suited to use in automated workflow and serial batch processing scenarios. For example, data from any USGS station can be retrieved using a single call to the "fetch_usgs" function, which is described below (and displayed at the MATLAB command prompt by typing "help fetch_usgs"):

Fetch data from the USGS web site for the specified station and time period

syntax: [s,msg] = fetch_usgs(stationid,datatype,days,template,pn,fn)

inputs:
stationid = string listing USGS station to collect (e.g. '022035975')
datatype = type of data to return
'daily' = daily summary (default)
'realtime' = real-time data
'archive' = reviewed, archived data
days = number of days of data to request (default = 720 for 'daily', 31 for 'realtime', 3650 for 'archive')
template = metadata template to use (default = 'USGS_Generic')
pn = path to use for raw data tile (default = pwd)
fn = filename to use for raw data (tab-delimited text;
default ='usgs_[stationid]_[datatype]_yyyymmdd_hhmm.txt'
baseurl = base URL for the USGS water data web site (default = 'http://waterdata.usgs.gov/nwis/')
baseurl2 = fall-back base URL (default = 'http://nwis.waterdata.usgs.gov/usa/nwis/')

outputs:
s = GCE Data Structure containing the requested data set
msg = text of any error messages
...

Additional function calls can then be made to perform all the other transformations and integration steps described above. Command-line functions can also be combined in MATLAB scripts and other functions, and then executed on a timed basis (using MATLAB timer objects) to produce highly automated work flows that are triggered without user intervention. 

Reversing the Flow: Contributing Data to ClimDB/HydroDB

In addition to retrieving data from ClimDB/HydroDB, dialogs and functions are also provided for generating harvest files for uploading new or revised data to the database. The column aliasing described above is reversed to assign accepted ClimDB/HydroDB parameter names to columns (i.e. based on entries in a user-editable column mapping table). Units are also automatically converted to standard ClimDB/HydroDB equivalents, and high-frequency data are automatically re-sampled to daily values prior to export (i.e. based on statistical aggregation using date component columns, which are auto-generated from serial date columns when necessary). Using these tools, for example, an LTER Information Manager could retrieve data from any USGS station, or import a file downloaded from the NOAA National Climate Data Center web site, and generate a valid ClimDB/HydroDB harvest file with a few mouse clicks. 

Conclusion

I believe the tools described in this article, and in more detail on the GCE web site (http://gce-lter.marsci.uga.edu/lter/research/tools/data_toolbox.htm), provide a practical means for acquiring data from the ClimDB/HydroDB and USGS NWIS databases and integrating them in real time. Data from other sources, such as NOAA's NCDC web site and user-contributed data from delimited text or MATLAB files can also be imported for analysis and integration alongside long-term monitoring data from these databases.

Although these tools are freely available on the GCE web site and will run on Microsoft Windows, Linux/Unix and Mac OS/x platforms, it should be emphasized that they do require the commercial MATLAB software package in order to run, which may be a limiting factor for some potential users.

References

  1. Sheldon, W.M. 2005. GCE Data Search Engine: A Client-side Application for Metadata-based Data Discovery and Integration. DataBits: an electronic newsletter for Information Managers. LTER Network.
  2. Sheldon, W.M. 2002. GCE Data Toolbox for MatlabĀ® -- Platform-independent tools for metadata-driven semantic data processing and analysis. DataBits: an electronic newsletter for Information Managers, Fall 2002 issue. Long Term Ecological Research Network, Albuquerque, NM.
  3. Sheldon, W.M. 2001. A Standard for Creating Dynamic, Self-documenting Tabular Data Sets Using MatlabĀ®. DataBits: An electronic newsletter for Information Managers. Long Term Ecological Research Network, Albuquerque, NM.
  4. Henshaw, D., Sheldon, W.M. and Vanderbilt, K. 2003. Introducing the Climate and Hydrology Web Harvester System. LTER Network News, Vol. 16 No.2 Fall 2003.