
Mining Long-term Data from the Global Historical Climatology Network

Issue: Fall 2011

Wade Sheldon (GCE)

Introduction

Long-term climate data are critically important for climate change research, but they are also needed to parameterize ecological models and to provide context for interpreting research study findings. Consequently, climate data are among the most frequently requested data products from LTER sites, a fact that was a prime motivating factor in the development of the LTER ClimDB database from 1997 to 2002 (Henshaw et al., 2006). However, direct climate measurements made at the Georgia Coastal Ecosystems LTER site (GCE) are currently fairly limited, both geographically and temporally, because our monitoring program only began in 2001. Therefore, to put results from GCE studies into broader historic and geographic context and to support LTER cross-site synthesis projects, we rely on climate data collected near the GCE domain by an array of long-term National Weather Service stations operated under the Cooperative Observer Program (NWS-COOP, http://www.nws.noaa.gov/om/coop/).

Data from NWS-COOP stations are distributed through the NOAA National Climatic Data Center (NCDC, http://www.ncdc.noaa.gov/oa/ncdc.html), so we have periodically requested data from NCDC for these ancillary weather stations to supplement GCE data. This process was greatly simplified in 2009 when we developed fully automated NCDC data mining support for the GCE Data Toolbox for MATLAB software (Sheldon, 2002). Functions in this software package proxy all interaction with the NCDC web site to retrieve and parse daily summary data for any NWS COOP station, then execute a workflow to convert English units to metric, perform QA/QC checks, and apply both documentation and attribute metadata to produce a fully documented tabular data set ready for analysis or synthesis. Unfortunately, this entire process ground to a halt in April 2011 when NOAA announced that it was abandoning the traditional COOP/Daily data forms, meaning that daily summary data sets would not be available from the existing web application beyond December 2010. We clearly needed to find a new source for NWS-COOP data. 


Goodbye COOP/Daily, Hello GHCN-D


When NCDC announced the termination of the COOP/Daily forms (memo), they stated that these data products were being replaced with Global Historical Climatology Network Daily (GHCN-D) data that would be freely available through a revised Climate Data Online (CDO) system. After seven months of limbo, the new NCDC CDO system was officially unveiled in November 2011 and is now available for use (http://www.ncdc.noaa.gov/cdo-web/search). The new system is a major improvement in both appearance and usability, but the revised web interface relies on client-side JavaScript interaction for selecting stations and parameters, and data download links are only exposed in email responses to data requests. This new architecture therefore precludes agent-based data mining using MATLAB or other software incapable of intercepting email messages. So in terms of restoring access to NWS-COOP data for GCE, this new system is two steps forward, one step back.

Rather than returning to interactive web-based data requests for all of our stations, I searched for another way to access GHCN-D data that is more amenable to data mining. I discovered that GHCN-D data files are also available from NCDC via anonymous FTP, albeit in formats only a 1960s Fortran programmer could love (ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/). A single file containing the full period of record for each GHCN station (including all NWS-COOP stations) is available in the "all" subdirectory of the FTP site, as a text file with a ".dly" extension. For example, our NWS station on Sapelo Island (COOP ID 097808) is available as USC00097808.dly. Therefore, just knowing the COOP station ID is sufficient to programmatically form an FTP URL that will retrieve all available data for a station.
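For example, a minimal MATLAB sketch of forming the URL and retrieving a station file might look like the following (the "USC00" prefix applies to U.S. COOP stations; other GHCN networks use different prefixes):

    % Form the GHCN-D FTP URL from a COOP station id and download the file
    coop_id = '097808';                      % Sapelo Island, GA
    fn  = ['USC00',coop_id,'.dly'];          % GHCN-D station file name
    url = ['ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/all/',fn];
    urlwrite(url,fn);                        % retrieve to the working directory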

GHCN-D Format - The Bad, the Really Bad, and the Ugly

NOAA has never been known for distributing easy-to-parse data files, but the GHCN-D files set a new standard for user-unfriendliness (fig. 1). Each data row begins with a 21-character label containing the station id, year, month, and a 4-character parameter code, all concatenated without spaces. Following this label are 31 repeating groups of 8-character fields, one group for each potential day in a calendar month, with each group containing a signed integer value of up to 4 digits (yes, integer) and 3 separate 1-character qualifier codes. Months with fewer than 31 days are padded with -9999 missing value codes and no qualifiers. In order to store any type of numeric value in a small integer field, many parameters are represented in unusual units of measurement (e.g. tenths of a degree Celsius, hundredths of a millimeter). In all, 189 distinct parameters are possible after compound, coded parameter types are expanded (e.g. SN*# is minimum soil temperature, where * is a 1-digit ground cover code and # is a 1-digit depth code). Working with these files clearly requires custom programming, advanced data integration methodology, and a good sense of humor.
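To make the layout concrete, here is a minimal MATLAB sketch (not the toolbox's actual parser) that decodes a single record using basic string indexing, assuming the fixed column positions implied above: an 11-character station id, 4-character year, 2-character month, and 4-character parameter code, followed by the 31 day groups.

    % Decode one fixed-width GHCN-D record (one station/month/parameter per line)
    fid = fopen('USC00097808.dly','r');
    rec = fgetl(fid);                        % read a single record
    station = rec(1:11);                     % station id (e.g. USC00097808)
    yr    = str2double(rec(12:15));          % year
    mo    = str2double(rec(16:17));          % month
    param = rec(18:21);                      % parameter code (e.g. TMAX)
    for day = 1:31
       pos = 22 + (day-1)*8;                 % start of this day's 8-char group
       val = str2double(rec(pos:pos+4));     % integer value (-9999 = missing)
       mflag = rec(pos+5);                   % qualifier code 1
       qflag = rec(pos+6);                   % qualifier code 2
       sflag = rec(pos+7);                   % qualifier code 3
       % ... in practice, store each day's values rather than overwriting them
    end
    fclose(fid);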

 


figure 1. Screen capture of a GHCN-D data file. A 21-character label is followed by 31 8-character fields, each containing an integer value and 3 qualifiers for the corresponding day of the month.



Developing a GHCN-D Parser

The first requirement for using this new data source was developing a MATLAB-based parser to support importing GHCN-D files into the GCE Data Toolbox. After trying several different approaches, the simplest strategy proved to be generating an intermediate text file in which the raw records are refactored as a normalized data table. The normalized table contains separate columns for station, date (including day), parameter code, and 3 distinct flag columns (fig. 2). In addition to simplifying loading the file into MATLAB, this step provided an opportunity to filter out null values for invalid dates a priori (e.g. February 31) and to add programmatic support for date range filtering, replacing functionality lost with the NCDC data request form (i.e. its date range selector fields).
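For illustration, here is a sketch of emitting the normalized rows for one decoded record, reusing the variables from the previous sketch and a hypothetical tab-delimited output file (the actual intermediate format used by the toolbox differs in details such as the header tokens):

    % Write normalized rows (station, date, parameter, value, 3 flags),
    % skipping padded values for days that do not exist in the month
    fout = fopen('ghcnd_normalized.txt','a');
    for day = 1:eomday(yr,mo)                % eomday = days in this month/year
       pos = 22 + (day-1)*8;
       val = str2double(rec(pos:pos+4));
       flags = rec(pos+5:pos+7);
       flags(flags == ' ') = '~';            % tildes represent null flag strings
       fprintf(fout,'%s\t%04d-%02d-%02d\t%s\t%d\t%c\t%c\t%c\n', ...
          station,yr,mo,day,param,val,flags(1),flags(2),flags(3));
    end
    fclose(fout);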

 


figure 2. Screen capture of a normalized intermediate file generated by the MATLAB-based data parser. The header fields contain attribute metadata tokens that facilitate importing the file into the GCE Data Toolbox software, and tilde characters (~) are used to represent null strings to ensure efficient parsing.



After getting over the initial parsing hurdle, the next challenge was de-normalizing the derived table to generate a conventional ("wide") tabular data set and converting values into appropriate numeric formats and standard units. I began by building a table of all 189 potential parameters that could be present in a GHCN-D file, obtaining parameter codes, definitions, and units from the "readme" file in the FTP directory. I then added columns for attribute metadata descriptors for the GCE Data Toolbox (data type, variable type, numeric type, and precision), along with original units, target units, and a unit conversion multiplier (where appropriate) for each parameter. The last step was to import this table into the GCE Data Toolbox to serve as a reference data set for the import filter function.
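A hypothetical miniature of this reference table as a MATLAB cell array (the codes and units shown are real GHCN-D parameters, but the layout is illustrative rather than the toolbox's actual structure):

    % Parameter reference table: code, definition, original units, target
    % units, and unit conversion multiplier (the full table covers 189 codes)
    params = { ...
       'TMAX','Maximum temperature','0.1 degC','degC',0.1; ...
       'TMIN','Minimum temperature','0.1 degC','degC',0.1; ...
       'PRCP','Precipitation',      '0.1 mm',  'mm',  0.1; ...
       'SNOW','Snowfall',           'mm',      'mm',  1.0};
    idx  = find(strcmp('TMAX',params(:,1))); % look up a parameter by code
    mult = params{idx,5};                    % multiplier for unit conversion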

To generate a conventional tabular data set, the intermediate data file is loaded and queried using the GCE Data Toolbox to generate a list of all unique parameter codes present. The import filter then serially subsets data records by parameter, looks up parameter characteristics in the reference data set, applies the attribute metadata descriptors (converting units as specified), and joins the data records together by date. The result is a derived tabular data set with columns for station id and date, plus paired value and qualifier flag columns for each parameter in the original GHCN-D file (e.g. TMAX, Flag_TMAX, TMIN, Flag_TMIN, etc.). Values in these columns are converted to the appropriate data type (e.g. floating-point, exponential, integer) for the corresponding National Institute of Standards and Technology (NIST) SI units. To complete the GCE Data Toolbox data structure, the provider-defined qualifiers are converted to intrinsic QA/QC flags, meshing them with any flags assigned by QA/QC rules defined in the metadata template and evaluated during processing.
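Conceptually, the de-normalization is a pivot operation. A simplified sketch, assuming cell arrays dates and codes and a numeric vector vals loaded from the intermediate file, plus the params lookup table sketched above (flag columns are omitted for brevity):

    % Pivot normalized records into a wide table: one row per date, one
    % value column per parameter, applying the unit conversion multiplier
    udates = unique(dates);                  % output rows (unique dates)
    ucodes = unique(codes);                  % output columns (parameters)
    wide = nan(numel(udates),numel(ucodes));
    for k = 1:numel(ucodes)
       sel = strcmp(codes,ucodes{k});        % records for this parameter
       mult = params{strcmp(ucodes{k},params(:,1)),5};  % assumes code is in table
       [~,row] = ismember(dates(sel),udates);           % map records to rows
       wide(row,k) = vals(sel) * mult;       % place converted values
    end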

This entire workflow is now encapsulated in a single import function in the GCE Data Toolbox software (imp_ncdc_ghcnd.m), allowing any GHCN-D file to be loaded and transformed in a single step. A second function (fetch_ncdc_ghcnd.m) was also developed to handle the FTP request and file download, and then call the import function to parse the data. Together, these functions restore the capacity to mine data for any NCDC climate station over the Internet using the GCE Data Toolbox software (fig. 3). These new functions are now available on request and will be included in the next release of the GCE Data Toolbox software in December 2011 or January 2012. Copies of the parameter metadata table (in spreadsheet form) and examples of raw and normalized intermediate files are also available for those wishing to develop their own parsing solutions for this resource.
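For example, retrieving and importing a record for the Sapelo Island station might look like the following; the argument lists shown are assumptions for illustration, so consult the function help text for the actual signatures:

    % Hypothetical usage; actual argument lists may differ
    data = fetch_ncdc_ghcnd('097808');          % download and import via FTP
    data = imp_ncdc_ghcnd('USC00097808.dly');   % import an already-downloaded file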

 


figure 3. Screen capture of the GCE Data Toolbox import form for retrieving GHCN-D data from the NOAA NCDC FTP site.


 

Conclusion

A wealth of environmental data are available on the Internet from U.S. and international monitoring programs; however, mining these data remains a significant challenge. Even when programs are written to automate this complex process, subtle changes to data formats, web interfaces, firewall rules, and access protocols can render those programs obsolete overnight. Nevertheless, the efficiency gained by transitioning from human-mediated to computer-mediated data mining can be tremendous, justifying the ongoing effort. LTER can also take a lesson from the new NCDC Climate Data Online web site: a slick web interface is a worthwhile goal, but it should not come at the expense of broad data access for the scientific community.


References

Henshaw, D.L., Sheldon, W.M., Remillard, S.M. and Kotwica, K. 2006. ClimDB/HydroDB: A web harvester and data warehouse approach to building a cross-site climate and hydrology database. Proceedings of the 7th International Conference on Hydroscience and Engineering (ICHE 2006). Michael Piasecki and College of Engineering, Drexel University, Philadelphia, USA. (http://hdl.handle.net/1860/1434)

Sheldon, W.M. 2002. GCE Data Toolbox for MATLAB. Georgia Coastal Ecosystems Long Term Ecological Research Program. (https://gce-svn.marsci.uga.edu/trac/GCE_Toolbox/)