Data Package Inventory Tracking: Slicing and Dicing with SQL
M. Gastil-Buhl (MCR), Margaret O'Brien (SBC)
Like all LTER sites, MCR and SBC need an inventory of their datasets. We use inventories to:
- Build tables for annual reports
- Highlight time-series datasets that are ready to update
- Keep site leadership abreast of IM-team activities
- Plan activities for an assistant
This used to be a paper folder of Excel printouts, distilled from colorful grids maintained with markers on a whiteboard. To share status with our team, we maintained HTML pages by hand. That system worked well enough for one person actively working on only a few datasets from a small inventory, but it soon became clear the information belonged in a database. Coincidentally, at MCR we were loading metadata into Metabase at the same time we were preparing our inventory of datasets for PASTA, and both tasks involved tracking many aspects of each dataset. Patterns emerged that suggested a database design which could replace our manually built web pages and report tables with script-generated content. So in the summer of 2012 we loaded our content into tables and gradually let a design evolve to meet real, practical needs at both sites.
The database models from GCE LTER, which we loosely refer to as "Metabase", contain a place for just about any item of information an Information Management System could imagine. So we were surprised to discover that one thing was missing: a system to track the maintenance status of all datasets in a site's inventory. We seized this rare opportunity to contribute something back to GCE and added a schema within our Metabase to house the inventory tables and views. A few pre-defined queries make common summaries accessible over the web (Figure 1).
Figure 1. Web page query builder.
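The idea of inventory tables plus views feeding pre-defined web queries can be sketched in miniature. All table, column, and view names below are invented for illustration; they are not the actual Metabase inventory schema.

```python
import sqlite3

# Illustrative sketch only: names are invented, not the real schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dataset (
    dataset_id  TEXT PRIMARY KEY,
    title       TEXT NOT NULL,
    site        TEXT NOT NULL           -- e.g. 'MCR' or 'SBC'
);
CREATE TABLE dataset_status (
    dataset_id  TEXT REFERENCES dataset(dataset_id),
    status      TEXT NOT NULL,          -- e.g. 'backlog', 'published'
    status_date TEXT NOT NULL           -- ISO date the status was set
);
-- A view gives the web scripts one stable query surface: the most
-- recent status row per dataset.
CREATE VIEW vw_current_status AS
SELECT d.dataset_id, d.title, d.site, s.status, s.status_date AS as_of
FROM dataset d JOIN dataset_status s USING (dataset_id)
WHERE s.status_date = (SELECT MAX(s2.status_date)
                       FROM dataset_status s2
                       WHERE s2.dataset_id = d.dataset_id);
""")
con.execute("INSERT INTO dataset VALUES ('knb-lter-mcr.1', 'Coral survey', 'MCR')")
con.execute("INSERT INTO dataset_status VALUES ('knb-lter-mcr.1', 'backlog', '2012-06-01')")
con.execute("INSERT INTO dataset_status VALUES ('knb-lter-mcr.1', 'published', '2012-09-15')")
row = con.execute(
    "SELECT status FROM vw_current_status WHERE dataset_id = 'knb-lter-mcr.1'"
).fetchone()
print(row[0])  # → published
```

A "pre-defined query" on the web page is then just a SELECT against the view, so the page never needs to know how the underlying tables store status history.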
Now that we have database-driven inventories of our data packages' status, we find ourselves using them nearly every day. Designing and populating the database has made us more aware of the variety of data we handle, their specific maintenance needs, and the details required to track the data package life cycle (Figure 2). And for the first time, we have an efficient way to sort our datasets by all the categories created by the network, our site, or ourselves, such as "signature" and "core".
Figure 2. Package status cycle at SBC and MCR.
While designing the package management database, we needed to accommodate both SBC's and MCR's management styles. We had also heard interest from some other sites in possibly using this database, so we knew the model had to stay general enough for wider adoption (Figure 3). We defined the stages in creating and maintaining datasets as a mutually exclusive, exhaustive set, granular enough to be meaningful for different audiences and purposes. The data tables are not site-specific. The SQL views allow for some customization, and the web script that uses those views is customized further, e.g., by making hyperlinks specific to a site. The SQL views thus provide a crucial layer of abstraction between applications and the database tables.
Figure 3. Entity-Relationship Diagram of data package management database schema.
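The layering described above can be sketched as follows. The stage names, table layout, and URL pattern are hypothetical, chosen only to show how a lookup table can enforce a mutually exclusive, exhaustive stage set while a view insulates the web script from the tables.

```python
import sqlite3

# Hypothetical sketch: stage names and columns are invented, not the
# actual MCR/SBC schema.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
-- Lookup table enforces the mutually exclusive, exhaustive stage set:
-- every package has exactly one stage, drawn from this list.
CREATE TABLE stage (
    stage_name TEXT PRIMARY KEY
);
INSERT INTO stage VALUES ('planned'), ('backlog'), ('in_progress'),
                         ('published'), ('update_due');
CREATE TABLE package (
    package_id TEXT PRIMARY KEY,
    site       TEXT NOT NULL,
    stage      TEXT NOT NULL REFERENCES stage(stage_name)
);
-- Abstraction layer: applications query the view, never the base
-- tables, so the tables can change without breaking web scripts.
CREATE VIEW vw_site_packages AS
SELECT package_id, site, stage FROM package;
""")
con.execute("INSERT INTO package VALUES ('knb-lter-sbc.17', 'SBC', 'backlog')")

# The web script adds site-specific presentation (e.g. hyperlinks)
# on top of the site-neutral view; this URL pattern is invented.
for pid, site, stage in con.execute(
        "SELECT * FROM vw_site_packages WHERE site = 'SBC'"):
    link = f"https://{site.lower()}.example.org/data/{pid}"
    print(pid, stage, link)
```

Under this arrangement, adopting the schema at another site means writing a small site-specific web script against the shared views, not altering the tables.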
We have already used the database-driven queries to generate inventories for our Information Management Plans, and we expect them to streamline production of tables for reports and proposals. In addition to replacing those manual tasks, we also offer views tailored to specific audiences, where the user selects from a drop-down menu. For example, a site scientist or reviewer may want to know which datasets are "student thesis data". Our site leadership asks which datasets are "backlogged" or how many have been updated in the network catalog since a given date (Figures 4, 5).
Figure 4. Inventory query result showing datasets with status "backlog".
Figure 5. Inventory query result showing categories (only top few shown here).
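The audience-specific queries behind the drop-down menus are essentially canned, parameterized SELECTs. A minimal sketch, with invented package IDs, categories, and column names:

```python
import sqlite3

# Illustrative only: identifiers and dates below are made up.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE package (
    package_id  TEXT PRIMARY KEY,
    category    TEXT,                  -- e.g. 'student thesis data'
    status      TEXT,                  -- e.g. 'backlog'
    last_update TEXT                   -- date last revised in the catalog
);
INSERT INTO package VALUES
    ('mcr.1',  'signature',           'published', '2013-01-10'),
    ('mcr.2',  'student thesis data', 'backlog',   '2012-05-02'),
    ('sbc.17', 'core',                'published', '2012-11-30');
""")

def packages_updated_since(conn, date_iso):
    """The leadership question: which packages changed since a given date?
    The date arrives as a bound parameter, never pasted into the SQL."""
    return [r[0] for r in conn.execute(
        "SELECT package_id FROM package WHERE last_update >= ? "
        "ORDER BY package_id", (date_iso,))]

print(packages_updated_since(con, "2012-11-01"))  # → ['mcr.1', 'sbc.17']
```

The drop-down menu simply chooses which canned query to run and supplies its parameter, so new audience views can be added without touching the schema.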