Wade Sheldon (GCE), John Carpenter (GCE)
Two LTER workshops were convened in 2008-2009 to plan and develop ProjectDB, a cross-site research project description language and database. The first workshop brought together a diverse group of LTER Information Managers to define the scope and use cases for the database (Walsh and Downing, 2008; O'Brien, 2009). A second workshop was convened in April 2009, where a smaller group developed an XML schema for storing the targeted project information (lter-project-2.1.0, based on eml-project) and prototyped XQuery-based web services using eXist, a native XML database system (Gries et al., 2009; Sheldon, 2009). The ProjectDB effort was very effective and serves as a model for collaborative software design in LTER; however, design is not the end-point in the software development process. This article describes taking ProjectDB to the next level at the Georgia Coastal Ecosystems LTER site (GCE), by putting the schema and database into production and integrating it with the rest of our information system.
Planning for Implementation
The first decision we had to make was how the concept of a "research project" should be applied within our site organizational scheme. For example, should every funded (or published) activity of every project investigator (P.I.) and graduate student be considered a separate research project, or should we only document major focus areas of the larger GCE research program as projects? After discussing the possibilities with the GCE leadership, we settled on a simple approach: any research activity that is specifically named in an NSF proposal or supplement and that is addressing a stated research question from a GCE proposal will be considered a research project. We felt that tying projects to questions and elements of proposals would best serve our need to track progress on all funded activities, and would provide NSF and reviewers the clearest alignment between proposed activities and research products. This approach also mirrors how we have traditionally assembled annual NSF reports, thereby simplifying constructing project descriptions retrospectively (see below) and allowing us to use ProjectDB to organize report contents moving forward.
Next we had to decide what specific information to include in project descriptions. The ProjectDB schema, like its parent EML, is highly flexible and can store a wide variety of content at varying levels of detail (O’Brien et al., 2009). For our initial implementation we chose to focus on descriptive abstract text and links to GCE personnel, research questions, publications and datasets as the highest priority content. Overall geographic and temporal coverage and keywords were also considered critical for search purposes. We additionally wanted to include supporting imagery (e.g. graphs, maps) and links to information outside the GCE information system to allow investigators to showcase their study designs and results informally but effectively. Management information (e.g. project review and approvals), fine-grained geographic information and cross-links to related research projects were deemed less important for this initial implementation, though, and were deferred to a later release.
XML vs. Relational Content Models
Our next decision was how to implement the back-end content model for the database. The ProjectDB design focused solely on XML technology (XML schema, XQuery, XSLT, eXist), based on broad computer platform and language support and to leverage prior work on the EML schema and pre-built functionality in the eXist XML database (e.g. REST web services and document management). However, like many LTER sites we manage all GCE metadata and research information in a relational database management system (RDBMS). While there are distinct advantages to both document-centric (XML) and record-centric (RDBMS) approaches to content management, bridging between these two models can be complex and has ramifications for how dynamic web content is generated and searched. For example, XML documents are typically displayed and manipulated using XSLT (or XQuery), whereas server-side web applications or client-side AJAX applications are more often used to render and search relational database contents on the web.
After much consideration, we chose to extend our existing RDBMS (SQL Server) to manage research project description information. This allowed us to link projects to existing database content, avoid record duplication, take advantage of referential integrity, and simplify development of data entry forms (see below). Following the philosophy of the lter-project schema, we developed a generalized SQL model for storing an unlimited amount of descriptive text at different scopes, as well as unlimited reports and associated material. Controlled vocabularies for document scope and material type, as well as explicit “DisplayOrder” fields for text sections, allow content to be nested appropriately in the XML document and tagged using attributes to support differential HTML generation and styling using XSLT (e.g. image tags versus anchor tags). Junction tables are then used to define many-to-many relationships between research projects and GCE questions, personnel, data sets, publications, archived files, geographic locations, taxonomic database records and external web resources. The entity-relationship diagram of the database is shown in figure 1 and also available on the GCE web site (http://gce-lter.marsci.uga.edu/public/app/resource_details.asp?id=376).
(Click on image to enlarge)
Figure 1. Entity-relationship diagram of the research project schema in the GCE_Submissions database.
We chose to use XML technology, though, for developing the remainder of our system to leverage and contribute to the broader ProjectDB effort. To accomplish this SQL-XML hybridization, we developed database code and server-side ASP web applications to dynamically generate lter-project XML from SQL Server using the same approach and code library used to generate dataset EML. In essence, we designed SQL views to generate XML fragments dynamically for each major element of the document, employing conditional logic (or subqueries and unions) where necessary to support complex-type elements. These views are then queried in an ASP script, and the fragments are assembled in document order and streamed to the client to create an XML document. SQL functions are used in views to escape illegal characters that may appear in paragraph contents (e.g. <,>,&), and an IIS code-page directive is used to encode text as UTF-8 prior to streaming for XML parser compatibility. Many other approaches can be used to achieve similar functionality, including native XML support in SQL Server and other major RDBMS software, but this approach has proven effective and easy to manage through years of use and refinement at GCE.
As mentioned above, a significant advantage of storing content in a relational database is ease of developing form-based data entry interfaces. Although XML-based form generators are emerging (e.g. XForms), relational database tools have matured over decades and many options are available for rapid development of client-server and web form interfaces. We chose to use Microsoft Access to develop an initial set of client-server forms, in order to take advantage of its rich set of data-bound controls and excellent SQL Server support. Based on past experience, we anticipated that only GCE IM staff and project lead P.I.s would initially enter and update project descriptions, so the platform and software limitations imposed by Access were an acceptable trade-off for its rapid application development capabilities. In the future we plan to deploy web-based forms for end-user updates, as we do for bibliographic citations, document archiving and other user contributions on our web site. Screen shots of representative tabs on the Access-based forms are shown in figure 2.
(Click on images to enlarge)
Figure 2. Microsoft Access user interface forms used to enter and update information in the SQL Server database.
We initially populated the database using content from annual NSF progress reports for the first 9 years of the GCE project. Project titles, associated personnel, descriptive text and figures were extracted from Microsoft Word and Adobe Acrobat files and pasted into Access form fields to create project records. Time spans and geographic coverage were derived from IM knowledge, history of report entries (i.e. which years the text appeared), and proposal details. Input from lead project P.I.s and other investigators was used to associate projects with research questions, data sets and publications. The initial population of the database took about two weeks for 75 project descriptions. After filling in as much information as possible, a request was sent out to all GCE participants to review their project entries and submit edits and updates to the IM office. Although tedious, this exercise was successful and resulted in a database that was immediately useful and represented the complete history of GCE research.
Tying Everything Together
In order to support the XQuery-based web services and search form developed by the ProjectDB working group, we deployed an eXist database instance within our information system. (Note that the LTER Network Office does not currently operate a production version of eXist, so this was a necessary step). We hardened the database for production use following best practices described in the documentation (e.g. disabling public write on collections, restricting access to administrative interfaces and reverse-proxying /exist/rest and /exist/xmlrpc to Apache virtual directories). We then developed a Python script to retrieve ASP-generated XML documents and upload them to eXist using HTTP PUT in order to synchronize content between databases on a regular basis. For web display of project descriptions, we augmented the XSLT stylesheets developed for ProjectDB to import a second XSLT that creates the GCE web scaffolding and imports navigation menus via AJAX. In addition, a second "plain" stylesheet without web scaffolding and navigation was developed for printing research project descriptions or loading contents into word processing programs for content re-use in reports and proposals and to simplify end-user content revision.
The GCE implementation of the ProjectDB search form and display pages are online at http://gce-lter2.marsci.uga.edu/exist/rest/db/projects/util/xquery/getProjectsQueryForm.xql?xslUrl=http://gce-lter2.marsci.uga.edu/exist/rest/db/projects/util/xslt/gceQueryForm.xsl. In addition, conventional ASP web application pages are available for viewing a list of all ongoing and completed projects (http://gce-lter.marsci.uga.edu/public/research/projects.asp) and listing research projects addressing each GCE question (e.g. http://gce-lter.marsci.uga.edu/public/research/gce2_q1.asp). For the latter two examples, summary information is queried from the SQL database for display on these pages using ASP, but the ultimate document links return XML with an XSLT declaration to produce identical output to the XQuery results serialized by eXist.
We initially developed and deployed the GCE ProjectDB during preparations for our October 2009 site review, and it proved invaluable for that purpose. Coupled with new dynamic web pages describing GCE research questions (see http://gce-lter.marsci.uga.edu/public/research/research.htm), both GCE investigators and reviewers could easily determine how specific research elements were contributing to our overall mission. Dynamic hyperlinks between research projects and personnel pages, data set metadata, and other database content also provide more ways to navigate to related content on the GCE web site. As a cautionary note, though, this database provided reviewers with a birds-eye view of our productivity (or lack thereof) on specific project elements. However, this increased transparency and project traceability will be crucial to identify holes and weak areas that need attention as we prepare for our renewal proposal.
In the year since the review we have found ProjectDB to be a very effective tool in our information system. This database provides an essential link between research activities, people, and products just as envisioned during the LTER planning workshop. Although keeping this database current will require ongoing effort as new projects, data sets and publications are added, the ability to leverage this database to generate text for annual reports and for reviewing working group progress more than offsets this work. We have uploaded our site-specific modifications to XQueries and XSLT stylesheets to the ProjectDB development instance of eXist at LNO, and look forward to continued collaboration on developing this cross-site resource.
Gries, C., Bohm, S., O’Brien, M., Porter, J., Sheldon, W. and Walsh, J. 2009. Project Database for LTER Sites. LTER Information Management web site (http://intranet.lternet.edu/im/project/LTERProjectDatabase).
O’Brien, M., Bohm, S., Porter, J. Gries, C., Sheldon, W. and Walsh, J. 2009. The LTERProject XML Schema. LTER Information Management web site (http://intranet.lternet.edu/im/project/LTERProjectDatabase/documentation...).
O'Brien, M. 2009. Lessons learned from the projectDB workshops. LTER Databits – Information Management Newsletter of the Long Term Ecological Research Network, Spring 2009 (http://databits.lternet.edu/spring-2009/lessons-learned-projectdb-workshops).
Sheldon, W. 2009. Getting Started with eXist and XQuery. LTER Databits – Information Management Newsletter of the Long Term Ecological Research Network, Spring 2009 (http://databits.lternet.edu/spring-2009/getting-started-exist-and-xquery).
Walsh, J. and Downing, J. 2008. PROJECTDB – Planning and Development of a Collaborative Programming Effort. LTER Databits – Information Management Newsletter of the Long Term Ecological Research Network, Fall 2008 (http://databits.lternet.edu/fall-2008/projectdb-).