Metabase Adoption by SBC and MCR
Margaret O'Brien (SBC), M. Gastil-Buhl (MCR)
Metabase is a comprehensive relational database designed in Microsoft SQL Server, which has been in production at GCE LTER for many years. It already has all the features needed for housing and exporting EML metadata, and work is ongoing to integrate EML with other tools at GCE, most notably the GCE Matlab Toolbox (Sheldon, et al. 2012). Two LTER sites co-located at UC Santa Barbara, Santa Barbara Coastal and Moorea Coral Reef (SBC and MCR, respectively) required a robust relational database to underlie many features of their information management systems (IMS), including web pages for people, research projects and dataset views, and chose Metabase for this (Gastil-Buhl, 2010). We began our adoption process by porting Metabase to the open source relational database management system PostgreSQL in 2011, and the first exports have been from the research project tables as lter-project XML (O'Brien, 2011). Our next milestone is export of EML datasets. This article provides an overview of our recent progress.
Each site has its own Metabase instance in PostgreSQL, and the GCE core model was installed into three PostgreSQL schemas: 'metabase2', 'research_project' and 'resources', which are analogous to two of the SQL Server databases in GCE Metabase (Metabase and Submissions, Sheldon, 2011). The views used by GCE were also ported, and serve as an abstraction layer between the model and export scripts. In many cases we will use Metabase views as designed. However, in some cases we will create additional views, especially when use of PostgreSQL XML type is indicated. We created a separate schema for these views, called 'trees'. The research_project schema was populated first, and we are now working mainly on datasets, in the metabase2 schema.
Datasets use many more tables than do research projects. Both MCR and SBC have rich, high quality EML metadata, and routinely use most dataset module elements. Our strategy was to use this already-structured metadata to populate Metabase. Across multiple static EML documents, content is not constrained, allowing some fields to be irregular. An opportunity to regularize this content arises during upload to Metabase. Within a relational database, constraint and regularization of content is built in, and in a normalized database, the parent tables must be populated first.
As of this writing, all dataset-level content for both sites is populated, and we are progressing with the entity-level metadata. A general pattern for table-population has emerged:
- Ascertain the population-order for a group of tables with a schema analyzer such as SchemaSpy.
- Using XSLT, extract the subset of EML content needed for the table(s) from the entire corpus of EML documents to flat text files.
- Upload content to a generic 'scratch' table and regularize content as appropriate.
- Add primary and foreign keys to the scratch table to ensure constraints are met.
- Use SQL to copy content from the scratch table to the respective metabase tables.
Our scripts for extracting database table content from EML are reusable to some extent. With continued use, we have been able to factor some components to make them more generic.
Collaborative Cross-site Design
We used existing GCE Metabase content as a guide, and as much as possible populated SBC/MCR tables in the same way. In some cases, however, the model did not quite fit our local concept of a dataset feature. This is understandable; Metabase has been in production since 2001 with multiple uses at GCE, and evolved to suit the needs of that site. We described each issue we encountered, and have categorized our approach to solving them:
- Solving the incompatibility is essential to continuing, and our solution to the issue IS NOT backward compatible: we wrote up a proposed change and contacted Metabase's author. These issues were most important to discuss with GCE before proceeding.
- Solving the incompatibility is essential to continuing, and our solution to the issue IS backward compatible: we proceeded with the best solution for SBC and MCR, wrote up our solution and planned to share with GCE at a future date. Possibly, our solution would be more broadly applicable, but we thought it more considerate and efficient to work out potential cross-site solutions in a more formal manner.
- Non-essential changes, whether backward-compatible or not: we made notes, planned to bring it up later, and moved on! These instances are just alternative ways of modeling the same information.
NSF leadership has encouraged us to leverage existing tools within the network, and the GCE suite of tools has a long track record of continual upgrades to meet increasing expectations, making it an ideal candidate. In our work, we have invested extra time to carefully design code enhancements that can be merged back into the shared model. In developing additional export scripts, a major focus is also on reusability by other sites. Additionally, we are taking advantage of the SQL/XML features which are now available in PostgreSQL 9, but which were not available in SQL Server in 2001 (when Metabase was originally written). GCE has expressed interest in porting these enhancements back into SQL Server, and we hope they meet the high standards exhibited by that site.
Gastil-Buhl, M., et al., 2010, Metadata database models and EML creation, poster at IMC annual meeting. http://im.lternet.edu/node/694
O'Brien, M. 2011. The Santa Barbara Coastal implementation of projectDB using Metabase. http://databits.lternet.edu/fall-2011/santa-barbara-coastal-sbc-lters-implementation-projectdb-using-metabase
Sheldon, W.2011, GCE Metabasew Database Schema, https://gce-lter.marsci.uga.edu/public/app/resources.asp?type=document&category=informatics&theme=Database%20schemas