A New Way to Use PASTA for Synthesis: Results from the Second VEG-DB Workshop
Emery Boose (HFR), Fox Peterson (AND), Suzanne Remillard (AND), Don Henshaw (AND), Mark Harmon (AND)
At the recent VEG-DB workshop at the Sevilleta National Wildlife Refuge, April 30 to May 2, 2013, we addressed the question of how the new PASTA infrastructure might be utilized to support synthetic studies of long-term LTER vegetation data. These data have the potential to shed light on some interesting scientific questions, including: How does NPP depend on water potential? What is the role of mortality in NPP? How is productivity related to disturbance? Are scaling relationships between biomass and density constant across different taxa? The task of assembling and synthesizing LTER vegetation data from individual websites in order to answer questions like these is formidable and beyond the reach of most individuals or working groups. But a synthesis engine (VEG-E) that builds on the site and network investment in PASTA might bring the analysis of such questions within reach (see Figure).
In the workshop we identified two technical challenges to using PASTA for this purpose. The first is the heterogeneity of site data, which more often than not vary in critical details (structure, variable names, units) between sites and even within a site. This is especially true for biological data. For example, at SBC, growth in giant kelp (Macrocystis) is measured in g/m²/day and biomass is calculated as a function of height and frond number. At PIE, growth of marsh grass (Spartina) is measured by the change in blade number across seasons and biomass is calculated as a function of percent cover and blade number. And at AND, the biomass of Douglas-fir (Pseudotsuga) and other tree species is calculated as a function of DBH using various allometric equations, while ANPP is calculated as the change in biomass plus mortality.
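To make the AND example concrete, here is a minimal Python sketch of that style of calculation. It assumes a log-log allometric form, ln(biomass) = b0 + b1 * ln(DBH), which is a common choice but only one of the "various allometric equations" mentioned above. The function names, the coefficients b0 and b1, and the per-year normalization are illustrative assumptions, not the equations actually used at AND.

```python
import math


def biomass_from_dbh(dbh_cm, b0, b1):
    """Estimate biomass from DBH with an assumed log-log allometric form.

    ln(biomass) = b0 + b1 * ln(dbh_cm); b0 and b1 are hypothetical,
    species-specific coefficients supplied by the caller.
    """
    if dbh_cm <= 0:
        raise ValueError("DBH must be positive")
    return math.exp(b0 + b1 * math.log(dbh_cm))


def anpp(biomass_start, biomass_end, mortality, years):
    """ANPP as the change in live biomass plus mortality, per year.

    biomass_start and biomass_end are live biomass at the two censuses;
    mortality is the biomass of plants that died during the interval.
    """
    if years <= 0:
        raise ValueError("interval length must be positive")
    return ((biomass_end - biomass_start) + mortality) / years
```

Other sites would substitute their own biomass functions (for example, height and frond number at SBC), which is exactly the heterogeneity that makes cross-site synthesis difficult.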
Our solution to this problem requires participating sites to prepare their data in one of several prescribed formats, depending on the biome and the level of measurement (e.g. individual plant, plot, species), and to include critical derived variables such as biomass. This represents extra work for the site. However, it greatly simplifies downstream processing, leading to a simpler and faster design for VEG-E.
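A site could check a file against a prescribed format before submitting it. The Python sketch below shows one way to do this for a CSV header. The formats themselves were not specified at the workshop, so the levels, column names, and units in PRESCRIBED are purely hypothetical placeholders.

```python
import csv
import io

# Hypothetical prescribed formats, keyed by level of measurement.
# The real variable names and units are still to be specified.
PRESCRIBED = {
    "plot": ["site", "plot_id", "year", "species", "biomass_g_m2"],
    "individual": ["site", "plot_id", "plant_id", "year", "species",
                   "dbh_cm", "biomass_kg"],
}


def missing_columns(csv_text, level):
    """Return the prescribed columns that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip().lower() for name in header}
    return [col for col in PRESCRIBED[level] if col not in present]
```

An empty result means the header satisfies the assumed format; a non-empty result names the variables (such as a derived biomass column) that the site still needs to supply.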
The second challenge has to do with PASTA itself. As PASTA becomes fully populated it will contain a wealth of long-term vegetation data. However, identifying the best datasets for synthesis will be non-trivial. For example, some sites (lumpers) may submit new data and corrections as updates of the same data package, while other sites (splitters) may submit new data as different data packages. Sites may have multiple vegetation studies and may have preferences about which data packages to contribute to VEG-E. Sites may also change their minds over time.
Our solution to this problem requires participating sites to post a harvest list of the datasets in PASTA that they would like to contribute to VEG-E. This list would be harvested at regular intervals (perhaps monthly) and used by VEG-E as a guide for which datasets to retrieve from PASTA. The list would be updated by the site whenever a new data package for VEG-E is submitted to PASTA or the site decides to change which data packages to submit to VEG-E.
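The harvest step described above can be sketched in Python as follows. The harvest list format was not defined at the workshop, so this sketch assumes a plain-text list with one data package identifier per line and "#" for comments. The function names are hypothetical, and no actual retrieval from PASTA is shown.

```python
def parse_harvest_list(text):
    """Parse an assumed one-identifier-per-line harvest list.

    Blank lines and anything after '#' are ignored; duplicates are
    dropped while preserving the order in which the site listed them.
    """
    package_ids = []
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if entry and entry not in package_ids:
            package_ids.append(entry)
    return package_ids


def plan_harvest(listed, already_held):
    """Compare a site's current list with what VEG-E already holds.

    Returns (to_fetch, to_drop): packages newly listed by the site, and
    packages held by VEG-E that the site no longer lists.
    """
    to_fetch = [pid for pid in listed if pid not in already_held]
    to_drop = sorted(set(already_held) - set(listed))
    return to_fetch, to_drop
```

Run at each harvest interval, a comparison like this would cover both lumpers (a listed identifier changes when a package is revised) and splitters (new identifiers appear in the list), and would also handle a site that withdraws a package.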
The design of VEG-E itself was not considered in detail at the workshop. However, it might be fairly simple, consisting (for example) of a back-end relational database and a front-end user interface. Though the number of individual records will be large, the data are well defined, the number of tables is limited, and many of the desired operations (e.g. aggregation and subsetting) are performed quite efficiently by database software. Wherever possible we would propose to use tools developed by others, e.g. for taxonomic reconciliation or graphing. In addition to a harvest process for retrieving site harvest lists and PASTA data packages, VEG-E would also include an archiving process for submitting snapshots of itself back to PASTA, perhaps on an annual basis, as a long-term record of VEG-E activities. Over time VEG-E could be extended to retrieve related data from other sources (e.g. climate data from ClimDB or plot information from SiteDB) using web services.
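As an illustration of how little code aggregation needs once data share a common structure, the Python sketch below loads plot-level records into an in-memory SQLite database and averages biomass by site and year. The table name, columns, and choice of SQLite are assumptions made for the example; the actual VEG-E schema and database software have not been decided.

```python
import sqlite3


def mean_biomass_by_site_year(rows):
    """Average plot biomass per site and year using a relational back end.

    rows is an iterable of (site, plot_id, year, biomass) tuples in a
    hypothetical common format.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE plot_biomass ("
            "site TEXT, plot_id TEXT, year INTEGER, biomass REAL)"
        )
        conn.executemany(
            "INSERT INTO plot_biomass VALUES (?, ?, ?, ?)", rows
        )
        # Aggregation and ordering are delegated to the database engine.
        return conn.execute(
            "SELECT site, year, AVG(biomass) FROM plot_biomass "
            "GROUP BY site, year ORDER BY site, year"
        ).fetchall()
    finally:
        conn.close()
```

Subsetting would be a WHERE clause on the same table, which is why a conventional relational design seems adequate even for a large number of records.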
The VEG-E interface would include built-in tools for aggregating, subsetting, downloading, and graphing data and for generating simple statistical measures. Individuals could design their own workflows to download and analyze data from VEG-E, and users who prefer to do their own analyses could download the entire contents of VEG-E. By providing one-stop shopping for long-term LTER vegetation data, VEG-E would enable researchers to see at a glance what others are doing across the network, provide impetus for participating sites to keep their vegetation data complete and up-to-date, and provide a platform for synthesis of LTER vegetation data for the entire scientific community.
Though the focus of our workshop was on vegetation data, we believe the VEG-E model is inherently generic and could be adapted easily for other areas of interest, including climate, hydrology, stream chemistry, and soils. It might even be possible to design a single application engine that would serve multiple disciplines, utilizing PASTA and some additional data preparation at the site level.
Next steps for the project include working out details of the VEG-E design, specifying the structure (variables, variable names, and units) for site data for different biomes and different levels of measurement (e.g. individual plants, plots, species), preparing and submitting to PASTA data from 8 or 9 sites that will serve as early adopters, and developing a working prototype. Not to mention looking for funding.