
Drupal developments in the LTER Network

Issue: Spring 2010

Corinna Gries (NTL), Inigo San Gil (NBII/LNO), Kristin Vanderbilt (SEV), and Hap Garritt (PIE)

Introduction

As we all know, maintaining a website that was created with less than perfect design and coding principles is tedious and time consuming. General updates may have to be made to every page, adding a new section can be cumbersome, quickly adding a news item requires HTML knowledge, and rearranging things in response to new insights into how people would like to navigate is almost impossible. This is where content management systems (CMS) come in. With a strict separation of content, organization, layout and design, they overcome the above-mentioned obstacles and enable a more dynamic and responsive approach to website maintenance.

Most content management systems on the market do a good job of providing these features, but some go further and are really hybrids between a content management system and a development framework, which is what the average LTER site web application needs. While most CMS can handle personnel profiles, publications, calendars, and images in addition to static content, the development framework aspect of the system enables handling of more specialized content types, such as project descriptions and EML metadata. In the open source realm, the most widely used CMS are Joomla, WordPress, Plone and Drupal. These are all perfectly valid content management systems that offer similar functionality, with Plone and Drupal providing the highest degree of flexibility. There are plenty of web-accessible discussions about why some folks prefer one CMS over another. In these discussions, which parallel the eternal, somewhat tedious, but always passionate Apple vs. Microsoft debates, you will find arguments for and against each particular CMS. Some of these arguments gain community traction and are based on tangible evidence, but plenty of unfounded claims prevent a conclusive analysis.

Several reasons led us to choose Drupal over Plone: Plone is programmed in Python versus PHP for Drupal, and Plone requires a more sophisticated hosting environment than Drupal. Because of the flexibility both systems provide, both have steep learning curves, but Plone is considered more complicated than Drupal in the initial site setup, while content maintenance is more user friendly in Plone (Murrain et al. 2009). However, Drupal 7 is expected to provide a more intuitive user interface for the content administrator. One of Drupal's strongest points is its so-called taxonomies, or keywords. Every piece of information can be tagged with a keyword and displayed in groups based on these keywords, which allows great flexibility in accessing information. A very simple example is used on the LTER IM website, where two different views of site-based information are provided: one by subject area (e.g., site bytes for 2008) and the other by site name (e.g., all site bytes and more for Andrews). Searches for information can also be made more successful this way. In another example from the LTER IM website, a person profile can be tagged with the same keywords that are used in the knowledgebase articles, making people show up in search results as experts in a certain field.
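To make the keyword mechanism concrete, here is a minimal sketch against the Drupal 6 API that lists every item (every Drupal "node", a term explained below) carrying a given taxonomy term, regardless of its content type. The term ID is hypothetical; real IDs are assigned when terms are created.

  <?php
  // List everything tagged with one taxonomy term, across all
  // content types. The term ID (42) is a hypothetical example.
  $result = taxonomy_select_nodes(array(42), 'or', 0, FALSE);
  while ($node = db_fetch_object($result)) {
    // A person profile and a knowledgebase article tagged with the
    // same keyword both surface in this one result list.
    print l($node->title, 'node/' . $node->nid) . "<br />\n";
  }

This cross-cutting retrieval is exactly what makes the "people as experts" search behave the way it does.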

The data model in Drupal

The most popular Drupal instances use MySQL to store all their content. At this point, it may be convenient to learn a bit of the Drupal community jargon. One of Drupal's claims to fame (and its motto) is "Community Plumbing", meaning community development and support, with well over half a million sites powered by Drupal. Although Drupal lingo is somewhat obscure to the uninitiated, we will try to map familiar information management concepts to the Drupal language, to better understand the documentation created by the Drupalistas.

The basic Drupal unit is the 'node': a single record, or an entry in a table of the database. A page, a story, an article, a blog post, a photo, etc. are all examples of nodes. Nodes are entries in so-called Content Types, the Drupal categorization of information. You can think of "content" as "information" and "types" as "tables" in the back end database. A better, broader definition of a content type is a container for a specific category of information. Nodes stored in content types can be linked in one-to-many or many-to-many fashion. For example, a person may have authored several data sets, and a data set may have many contributors. A research site description, latitude/longitude, elevation and datum would constitute a 'node' of the content type 'research site'. Several research sites can then be linked to a project description or a data set.
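For readers who like to see the moving parts, here is a sketch (Drupal 6 API) of how one 'research site' node could be created programmatically. The field names are hypothetical; in a real installation they are whatever fields were defined for the content type.

  <?php
  // Create one node of the content type 'research_site'.
  $node = new stdClass();
  $node->type  = 'research_site';
  $node->title = 'Example research site';
  $node->uid   = 1;  // author: the administrator account
  $node->field_latitude[0]['value']  = 45.0;   // decimal degrees
  $node->field_longitude[0]['value'] = -90.0;
  $node->field_elevation[0]['value'] = 500;    // meters
  $node->field_datum[0]['value']     = 'WGS84';
  node_save($node);  // writes the node and its content_type_* rows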

This clearly is relational database design, and it can be directly inspected in the underlying MySQL database. Because the back end is an RDBMS, the data can be manipulated outside of Drupal via third party administrative applications or custom code. Access through third party applications is rarely used by the Drupal aficionado, but it is useful if you want to load information in bulk from a different database.

Another appealing aspect of Drupal is that all you need to manage it is a web browser -- any web browser: Safari, Firefox or Internet Explorer. Drupal functionality (like that of the other CMS mentioned) can be classified into a modest yet powerful core and a vast set of extensions called "modules". The core functionality is maintained and updated by the original developer team (Dries Buytaert, 2010), while the extended functionality is provided by a large community of developers through a unified portal, conforming to a set of development guidelines, principles and open source philosophy. This dual character (core and extensions) resembles the Linux development model. All these custom extensions (modules and themes) are offered through the Drupal portal at http://drupal.org/project.

Currently, our LTER Drupal group uses many extensions to the core and has developed custom content types for personnel, research sites, research projects, and dataset metadata, including data table and attribute types. We also benefit from well-tuned modules developed by the Drupal community that manage bibliographies, image galleries and videos.

Using extended functionality, it is possible to provide XML (and, with a bit of work, EML, FGDC and Darwin Core; see also the developments by the U.S. Geoscience Information Network for ISO) as well as PDFs, Excel spreadsheets, Word documents, tabular views, charts and the like. Our group makes extensive use of the powerful Views module, which allows us to offer the content in many user friendly and intuitive layouts. The Views module is essentially a GUI for building SQL queries, coupled with the final web layout. All views are, of course, covered by the same database, security, logs, user management, LDAP connectivity, SSL encryption, and so on.
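For the curious, the query behind a bare-bones "list all data sets" view boils down to something like the following sketch (the table name follows the content-type conventions described in the next section; a real view adds filtering, paging and theming on top):

  <?php
  // Roughly the SQL a simple "data set list" view assembles
  // from point-and-click configuration (Drupal 6 API).
  $result = db_query(
    "SELECT n.nid, n.title
     FROM {node} n
     INNER JOIN {content_type_data_set} ds
       ON ds.nid = n.nid AND ds.vid = n.vid
     WHERE n.status = 1
     ORDER BY n.title");
  while ($row = db_fetch_object($result)) {
    print l($row->title, 'node/' . $row->nid) . "<br />\n";
  }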

In summary, content types are specialized repositories of content. Creating a custom content type in Drupal is fairly simple and includes an associated, configurable input form to capture and edit the information. The simple web form used to create content types triggers a Drupal process that builds the underlying database tables to hold the information. In other words, we have developed a very simple EML, personnel, bibliography, media and project editor using the Drupal data model. The following figure shows a small subset of the metadata data model in Drupal. (Right-click and select View Image to see the full resolution.)


Figure 1. A simplified Drupal data model diagram, constrained to the metadata-related tables.

Most of the core tables have been omitted, except for the central table, "node", in the red area. The tables shown here are some of those related to the management of ecological metadata. Five colors denote different categories of metadata: light green for the personnel directory, pink for the basic metadata details, teal for the information pertaining to the structure of the tabular data, and yellow for the basic georeferences. Variable or attribute information is located in the orange area.

The main tables in the diagram above have a "content_type" prefix. Add "data_set", "person", "research_site", "variable" or "data_file" to the prefix, and you have the main tables for the information categories: dataset, personnel, location, attribute (or variable) and file structure. The tables with the prefix "content_field" can be of two types, but generally contain one-to-many relations. One type holds fields with multiple values, such as dates. A dataset may have several date ranges associated with it, so the table "content_field_beg_end_date" holds the multiple values, avoiding complete de-normalization of the main table. Here the key "delta" records the position within the multiple relation; that is, if a dataset has three date ranges, delta takes values from 0 to 2. The other kind of "content_field" table is the referential table.
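As an illustration of the delta mechanism, here is a sketch of a query that pulls every date range for one dataset. The column names are our reading of the diagram and may differ slightly in an actual installation.

  <?php
  // Fetch every begin/end date pair attached to one data set node.
  // 'delta' orders the multiple values (0, 1, 2, ...).
  $nid = 123;  // hypothetical data set node ID
  $result = db_query(
    "SELECT delta, field_beg_end_date_value, field_beg_end_date_value2
     FROM {content_field_beg_end_date}
     WHERE nid = %d
     ORDER BY delta", $nid);
  while ($row = db_fetch_object($result)) {
    print "Range $row->delta: $row->field_beg_end_date_value"
        . " to $row->field_beg_end_date_value2\n";
  }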

Experiences with migrating existing site information

NTL: We are still in the very early stages of migrating our website into Drupal. However, we have already set up two websites that access the same underlying database, one for NTL LTER and one for the Microbial Observatory, because the overlap in people, publications, projects, and datasets necessitates this. This approach allows us to add websites for the funding cycle of a project and then fold the information into the NTL website when funding ends. We received the content type definitions developed at the LTER Network Office (LNO) and were able to import them seamlessly into our application. Although we already have most of the information required to populate these content types, we originally used a slightly different data model. Queries are currently being developed to migrate the data into the new structure, and the basic data model will have to be extended to accommodate everything we need to replicate the functionality of our original website.

LUQ: Luquillo's migration is nearly complete. Many new views are offered to the user. All the content is tied together, minimizing the risk of running into a page without further suggestions or links. Content is linked at the record or node level and powered by a custom controlled vocabulary, whose adequacy is being tested against the goal of unearthing related information (discovery functionality).

SEV: The new Drupal website will soon be unveiled. The old SEV website was implemented in PostNuke, a CMS that never gained much popularity. Because the SEV content was already in the MySQL backend of PostNuke, Marshall White (LNO) was able to migrate much of it into the Drupal MySQL database with relative ease. The Drupal website also incorporates an EML editor, which thrills SEV IM Vanderbilt to pieces. Inigo San Gil (LNO) wrote a script that parsed all SEV EML files into five content types created to hold metadata (a sketch of the approach follows the list below):

  • Data Set - contains discovery metadata (through Level 3)
  • Data File Structure - details about the data-containing entity (header lines, file type)
  • Variable - captures information about variables (EML attributes)
  • Research Site - captures information about plots or research sites
  • Project - higher level project information, which can encompass several data sets

All of the content from the EML will have to be reviewed to ensure that the parser didn't miss anything, and because EML files often contain peculiar formatting that depends on how the tags were used. Because each of the content types is essentially a form, new metadata can be entered via the website and stored in MySQL. A style sheet will be created to display the metadata as a text file or PDF for user download.
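The parsing script itself is not reproduced here, but its general shape can be sketched in a few lines of PHP, assuming SimpleXML, a drastically simplified EML document, and a hypothetical CCK field for the abstract (the file name is likewise a hypothetical example):

  <?php
  // Turn one EML file into a 'data_set' node. Real EML is far
  // richer; this grabs only the title and abstract.
  $eml = simplexml_load_file('knb-lter-sev.1.1.xml');
  $node = new stdClass();
  $node->type  = 'data_set';
  $node->title = (string) $eml->dataset->title;
  $node->uid   = 1;
  // field_abstract: a hypothetical CCK text field on the type.
  $node->field_abstract[0]['value'] = (string) $eml->dataset->abstract->para;
  node_save($node);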

PIE and ARC: The sites hosted at the Marine Biological Laboratory (MBL) are next to start the migration process. The IMC allocated some modest funds to train the information managers through the development of their new information management systems. This process is expected to start sometime during the summer of 2010. Other sites have shown interest in these developments, including CDR, BNZ, CWT and KNZ, and have not ruled out adopting this enabling data model. Generally, the process of adoption involves several steps:

  1. Identifying the types of information managed by the site. Typical categories include projects, metadata, data sets, data (tabular data, shape files, etc.), personnel profiles, publications, news, stories, field protocols, photos, videos, and research locations.
  2. Installing Drupal and several modules.
  3. Importing the predefined content types, which generates the data tables in the database along with the input forms.
  4. Selecting a design theme.
  5. Customizing everything and migrating information.

Synergistic efforts

Two groups outside of LTER have adopted our content models: the University of Michigan Field Station and the Wind Energy group at the Oak Ridge National Laboratory. Several others are interested in learning from our experience: the National Phenology Network, the US Virtual Herbarium, the National Biological Information Infrastructure, Spain SLTER, FinSLTER, Israel LTER, and Taiwan LTER. Other groups active in eco-informatics have already embraced Drupal as their development framework and are publishing modules for others to use: Encyclopedia of Life modules for biodiversity information management, U.S. Geoscience Information Network tools for metadata management in ISO, and VitalSigns modules for citizen science applications. The Biodiversity Informatics Group at the MBL is creating the software behind the Encyclopedia of Life (EOL, http://eol.org), a single portal providing information on all 1.9 million known species. The infrastructure seamlessly aggregates data from thousands of sites into species pages in the Encyclopedia, using novel informatics tools to capture, organize, and reshape knowledge about biodiversity. The group collaborates with data providers; the information is then indexed and recombined for expert and non-expert users alike, using aggregation technology to bring together different data elements from remote sites. All of the code related to the EOL is open source and available via Google Code ( http://code.google.com/p/eol-website/ ) or GitHub ( http://github.com/eol/eol ). LifeDesks is an Encyclopedia of Life product, also developed by the Biodiversity Informatics Group at MBL, which uses Drupal as its platform (http://www.lifedesks.org/modules/).

Concerns and opportunities

Here is a list of common concerns we have heard. Some are Drupal-specific and need to be carefully addressed; others, while valid, apply to any type of information management system. We address here most of the concerns expressed by the folks who discussed this Drupal-based IMS with us, noting which issues apply to all systems.

Some view the use of Drupal as "putting all your eggs in the same basket" or "locking ourselves into one system". This is a valid concern; however, it is also a major opportunity, easing collaboration among sites and providing the basics for websites managing environmental information beyond the LTER network "out of the box". Although the system provides enormous functionality, the data are still in a database that can be accessed, managed and used by any other system. And the system provides tight integration, giving a complete view of all aspects of information management, from the details to the large picture, always connected and contextualized.

Because Drupal is a 'framework' that defines certain aspects of program interfacing, all the development being done is transferable. That is true not only for content types as described earlier, but also for modules containing programmed functionality and for so-called themes. Themes determine layout and design, i.e., the look and feel of a site. Many Drupal themes from professional graphic designers are available for free download and can be modified to fit particular needs or preferences. In addition to the plethora of modules already available, custom functionality for LTER sites can be developed and widely shared. For instance, at NTL we intend to program a query application for data access based on the stored metadata (the same functionality that we currently have as a Java program). Any site that uses the same metadata content type (or an extension thereof) and stores its actual research data in the same MySQL database Drupal is using will be able to deploy that module.
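To give a flavor of how little scaffolding such a transferable module requires, here is a minimal Drupal 6 module skeleton. The module name, path and page content are hypothetical; a real query application would hang its logic off callbacks like these.

  <?php
  // Skeleton of a Drupal 6 module (file: lter_query.module).
  // hook_menu() registers a page at the path /data-query.
  function lter_query_menu() {
    $items['data-query'] = array(
      'title' => 'Query research data',
      'page callback' => 'lter_query_page',
      'access arguments' => array('access content'),
      'type' => MENU_NORMAL_ITEM,
    );
    return $items;
  }

  // Page callback: a real implementation would read the stored
  // metadata content types and build the data query from them.
  function lter_query_page() {
    return t('Data query application goes here.');
  }

A matching lter_query.info file declares the module to Drupal; any site running the same content types could drop in both files and enable the module from the administration pages.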

Security

The issue of security plagues every computing application, even the most secure systems in the nation. Drupal does not seem to be more vulnerable than any other system. Zero-day threats related to PHP, Java, Apache or JavaScript make any system, including Drupal, vulnerable. However, following the normal Drupal upkeep processes -- new releases of the core and active policing of the extensions -- prevents many malicious attacks.

Some computer scientists think less of PHP. And they might be right. But PHP has come a long way and does well what it is intended to do. Facebook, Flickr and Wikipedia are among the web powerhouses fueled by PHP; as for Drupal, add The Onion, MTV UK, the World Bank, the White House and the Economist to the list. As mentioned above, developing a new content type in Drupal basically means developing a data entry form, and the process is fairly simple and fast. Therefore, at NTL we will be exploring this option for research data entry, hopefully reducing the number of spreadsheets that have to be uploaded to the database. Drupal supports this with sophisticated user access management, which gives us fine-grained control over who may use each entry application.
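A sketch of what that fine-grained access control looks like in Drupal 6 code (the module name, permission string and form builder are all hypothetical; roles are granted the permission through the administration pages):

  <?php
  // Declare a custom permission (in, e.g., lter_entry.module).
  function lter_entry_perm() {
    return array('enter research data');
  }

  // Guard the entry page with the permission before rendering it.
  function lter_entry_page() {
    if (!user_access('enter research data')) {
      drupal_access_denied();  // 403 for roles without the permission
      return;
    }
    // lter_entry_form: a hypothetical form builder for the entry form.
    return drupal_get_form('lter_entry_form');
  }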

Rapid evolution

As we write this, a team of Drupalistas is working on purging the last 100 critical bugs from the next major release, Drupal 7. Some are concerned because Drupal 7 is not backward compatible. Drupal 7 addresses a number of usability issues and adds new features, including the integration into the Drupal core of critical extensions such as CCK and some 70 others. Our group relies heavily on the Content Construction Kit (CCK) extension. While some applications will break, we will eventually embrace the evolution. Staying with older, deprecated systems eventually prevents innovation and fosters security risks. Of course, this is not a Drupal-specific issue but an IT issue. It just so happens that in our specialty niche, advances in the fundamental technologies severely condition any mid-term plan (three years or more). Software projects that do not keep abreast of concurrent advances risk becoming irrelevant by the time their products are deployed.

Resources

Attachment: drupalDataModeFlLite.png (113.07 KB)