
Developing a Drupal "website-IMS" for Luquillo LTER while learning Drupal

Issue: Spring 2010

Eda Melendez-Colom (LUQ)

LUQ's new Drupal website, under construction at http://gorilla.ites.upr.edu/, is the answer that LUQ IM had searched for over the years in order to have an interactive, database-driven web site. What we were not aware of at the beginning of this undertaking was that the same system serving as a website would also hold all LUQ metadata and data. As a content management system as well as a "content management framework" (CMF) (http://drupal.org/getting-started/before/overview), Drupal has all the characteristics LUQ needs to finally develop its long-anticipated and promised "information management common framework" (http://gorilla.ites.upr.edu/node/definition-common-management-framework). In addition, the LNO staff has added functionality to the Drupal system that will facilitate the entry of metadata into the system and the generation of EML packages from it.

In other words, LUQ will have an integrated system that will serve as a website, a file system, a relational database management system (provided by MySQL), a metadata and data repository, and an EML package generator, in which all the information is, or has the potential to be, interrelated. I like to call this system a "website-IMS" for short.

The two major and innovative characteristics of this new system are that it serves both as a website and as an Information Management System (IMS), and that it has the potential to interconnect all its information. A graphical representation of the latter would show sets of keywords joining and interconnecting all the information in the system. Such a diagram would clearly depict the central role that keywords (Taxonomies, in Drupal terms) play in this system: taxonomies determine the website-IMS's capability to ultimately connect all types of information (Content Types, in Drupal terms). Nevertheless, the simplicity of such a representation would belie the complexity of implementing the system.

First, in Drupal every individual piece of information becomes an entry in a database (a node, in Drupal terms). The soon-to-be-old LUQ web site contains almost 4GB of data, metadata, photos, presentations and descriptive HTML files, setting aside its remote sensing documents. All this information is displayed statically by the current web server using HTML. Every single document of the old web site will become a node in Drupal; it might be a story, a page, a blog entry, one of the custom LTER content types designed by Inigo San Gil and Marsh White at the LNO (Data Set, Data File Structure, Research Site, Research Project, Institution Link, or Variable, i.e., EML Attribute), a Biblio, or a person. This alone gives an idea of the complexity of migrating LUQ's website to Drupal.

Second, all nodes must be assigned a set of keywords. The complexity of this process goes beyond entering a set of values in a table. The following list shows each content type and its relation to the set of keywords, or Taxonomy:

  • Data Set - Set assigned by the owner scientist; one data set is related to one or many Data Set keywords (Data Set keywords, a Drupal Taxonomy)
  • Data File Structure - Indirect: receives keywords from the Data Sets it is related to in the "Data Set" content type (Data Set keywords, a Drupal Taxonomy)
  • Variable - Each becomes a keyword itself, related to one or many Data Files (Variable label keywords, a Drupal Content Type)
  • Research Site - Each becomes a keyword itself, related to one or many Data Sets; each Data Set is related to one or more Research Sites (Research Site Title keywords, a Drupal Content Type)
  • Research Project - Each becomes a keyword itself, related to one or many Data Sets; a Data Set is related to only one Research Project (Research Project Title keywords, a Drupal Content Type)
  • Publications - Extracted from the set of keywords in the publication; old publications (from the 1980s and some from the 1990s) did not have keywords assigned (Publication keywords)
  • People - Extracted from the information entered in the LNO personnel database profile for each person (several sets of keyword types: habitat, organism, etc.)

The assignment and implementation of the Data Set keywords in Drupal was completed during my 3-week "micro-sabbatical" at the LNO early this year. The process had started around 12 months before that, when the sets of keywords for each of the almost 150 LUQ data sets were extracted from the old-format metadata forms. The last part of this process represented a structured, coordinated team effort that lasted almost 9 months.

The following are the steps taken and the iterations of each step to complete this process.

  • Extract the set of keywords assigned in the metadata by scientists - repeated every time the list was edited or created
  • Keep a list of the keywords assigned to each data set identifier - repeated every time keywords were edited or corrected
  • Create a relational database table of the keywords to do QC on the list
    • Eliminate typos that produce two or more instances of the same keyword, including undesired spaces and capitalization - performed three times or more
    • Decide which version of a keyword to use when terms are grammatically different but represent the same concept (e.g., rainfall and precipitation)
  • Build a hierarchy for the set such that the main list can be narrowed to a maximum of ten terms, each still related to its child keywords, the children of those children, etc. - performed by the LUQ Principal Investigator three times
  • Revise the existing taxonomy, adding, collapsing or deleting terms - three times by one of the LUQ IMC members; once by three other LUQ scientists
  • Export the hierarchy of keywords into a specially structured Excel spreadsheet that can be imported into Drupal - as many times as there was a new version of the Taxonomy (done when the updates included additions of terms only; otherwise, corrections were made directly in Drupal)
  • Import the taxonomy into Drupal (after installing the Drupal module that allows this)
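The QC step above (eliminating spacing and capitalization variants of the same keyword) can be sketched as a small script. This is a minimal illustration, not the actual tooling used; the keyword values are invented.

```python
# Minimal sketch of keyword QC: normalize spacing and capitalization,
# then collapse variants that reduce to the same term.

def normalize(keyword):
    """Trim stray spaces and standardize capitalization."""
    return " ".join(keyword.split()).lower()

def dedupe(keywords):
    """Return the sorted list of distinct normalized keywords."""
    seen = {}
    for kw in keywords:
        seen.setdefault(normalize(kw), kw)  # remember first spelling seen
    return sorted(seen)

raw = ["Rainfall", "rainfall ", "  RAINFALL", "Litterfall", "litterfall"]
print(dedupe(raw))  # ['litterfall', 'rainfall']
```

Grammatically different terms for the same concept (e.g., rainfall vs. precipitation) still require a human decision; a script can only flag exact-match collisions.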

The keyword assignment for Data File Structures is achieved by including the related data files in the Data Set content type, adding the Data File Structure as a field in that node.

For Research Sites, Research Projects, and Variables the effort is trivial, since each actually becomes a keyword of the data sets it is assigned to.

The real challenge is presented by the set of variable labels. There are many cases where the same variable has different labels; for example, the variable "year" can be labeled "YEAR", "year of measurement", and many others. One format should be selected, and all data files with the same column or variable should be edited to use the selected label. The Drupal-based system makes it easier to create views that spot redundant variables so they can be merged.
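The merging step can be pictured as a lookup from observed labels to one agreed label. The mapping below is hypothetical, invented for illustration only:

```python
# Hypothetical label-canonicalization table; not LUQ's actual variable list.
CANONICAL = {
    "YEAR": "year",
    "year of measurement": "year",
    "yr": "year",
}

def canonical_label(label):
    """Return the agreed label for a variable, or the label itself if unmapped."""
    return CANONICAL.get(label.strip(), label.strip())

print(canonical_label("year of measurement"))  # year
```

Once every data file uses the canonical label, a single Variable node can serve as the keyword connecting all the files that share that column.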

As for the sets of keywords related to publications, ways must be developed to assign keywords to old references that lack them. The same applies to the People keywords, since some people have no profile in the LNO personnel database.

Next Steps

There are three sets of keywords (Data Sets, People, and Publications) that need to be synchronized. The Data Set Drupal taxonomy will be the model for the other two. We expect each set to have some keywords in common with the other two, plus a subset exclusive to itself.
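Comparing the three sets reduces to plain set operations; a sketch with invented terms:

```python
# Toy comparison of the three keyword sets (terms invented for illustration).
dataset_kw = {"rainfall", "litterfall", "hurricane", "streams"}
people_kw = {"rainfall", "soils", "hurricane"}
publication_kw = {"rainfall", "hurricane", "decomposition"}

shared = dataset_kw & people_kw & publication_kw        # common to all three
dataset_only = dataset_kw - people_kw - publication_kw  # exclusive to Data Sets

print(sorted(shared))        # ['hurricane', 'rainfall']
print(sorted(dataset_only))  # ['litterfall', 'streams']
```

The shared terms are candidates for a single synchronized vocabulary; the exclusive subsets show where each taxonomy still needs its own terms.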

The Research Projects and Research Sites are being developed in such a way that all nodes of those content types are standardized. Once entered into the system, they are defined as fields in the Data Set content type, so this type of information will be standard across all data sets.

In the Data Set nodes, the Research Projects are configured as what Drupal calls a "Node Referrer": as soon as the corresponding Research Project is entered in the system listing a specific data set among its "Related data sets", the selected fields of the Research Project show automatically in the "View" of the corresponding Data Set node. Node Referrer is a mechanism that can be used to implement many-to-many relationships in Drupal.

The Research Site is configured as a "Node Reference" field in the Data Set nodes. This means that the selection of the sites related to a data set is made while creating the data set, and the default View of the data set displays them when it is saved. Each data set comprises data gathered at one or more sites; thus, a data set is related to many research sites or locations (a one-to-many relation).
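The two relationship styles can be modeled with ordinary dictionaries. All names below are hypothetical, invented to illustrate the structure, not actual LUQ nodes:

```python
# "Node Referrer" style: the Research Project node lists its related data sets,
# and the reverse (data set -> project) view is derived for display.
project_datasets = {
    "Canopy Trimming Experiment": ["CTE litterfall", "CTE soil moisture"],
    "Long-Term Rainfall": ["Rain gauge records"],
}

dataset_project = {}
for project, dsets in project_datasets.items():
    for ds in dsets:
        dataset_project[ds] = project  # each data set belongs to one project

# "Node Reference" style: a plain field on the Data Set node pointing at its sites.
dataset_sites = {
    "CTE litterfall": ["El Verde", "Bisley"],
}

print(dataset_project["CTE soil moisture"])  # Canopy Trimming Experiment
```

The derived reverse lookup is the essence of the Node Referrer: the relationship is authored on the project side but displayed on the data set side.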

Standardizing the variables (units, attributes, names, date formats), which serve as keywords for the Data File Structures, will be a detailed manual process requiring the collaboration of PIs and information managers. If this standardization is not performed, the functionality of displaying related data files when searching for specific files or data sets will not be as effective. Redundancy in the use of variables at the site level hinders effective integration across data sets and across sites.

All these steps concern the synchronization and standardization of keywords and Taxonomy within the system only. There are other levels of synchronization that could and should be pursued to foster integration with outside sources. For instance, LUQ scientists are developing similar taxonomies in other, non-LTER projects that share many scientific themes and data with the LUQ LTER data. A common taxonomy will not only make integration and comparison of data easier, but will eventually simplify the LUQ scientific community's job of generating documentation and other data-related documents. This functionality will further benefit if, in addition, we synchronize our set of keywords with the Keyword Vocabulary developed by our LTER Network of IMs.

Furthermore, the LTER Network has developed a unit dictionary that we may want to incorporate into the system to streamline the process of documentation and prepare the Luquillo data for future integration using PASTA-driven mechanisms.

Closing Remarks

The complexity of migrating the LUQ web site and IMS into Drupal is due more to the complexity of our LTER system than to Drupal itself. The complexity really lies in the sum of all the standards, best practices and guidelines, which may well be a reflection of the complexity of the science we are trying to document. At this moment, I do not know of a better system to host the kind of system LUQ needs to complete its common information management framework. After all, Drupal is not only a content management system but a content management framework as well.

References:

1. The Drupal overview - http://drupal.org/getting-started/before/overview.

2. Definition of an "information management common framework" - http://gorilla.ites.upr.edu/node/definition-common-management-framework
