Considerations for making your geospatial data discoverable through the LTER metadata catalog.
Inigo San Gil (MCM, LNO)
Currently, some LTER geospatial data are not discoverable through the LTER Metadata Catalog. The native metadata formats that many in the GIS community use to document spatial data differ from the Ecological Metadata Language (EML) required by the LTER Metadata Catalog. In many cases, this native metadata format is the Esri version of the Federal Geographic Data Committee (FGDC) products. As a result, users exploring the LTER Metadata Catalog may conclude that many sites do not work with GIS data, which does not accurately represent the spatial data resources available at LTER sites.
The following article highlights considerations and background information to help sites integrate their spatial metadata into the existing EML-based LTER Metadata Catalog. The article also provides a brief history of the Esri to EML metadata crosswalk, background on Esri’s approach to metadata, details on how the crosswalk works (including the customization required), and future directions given the changing metadata picture, along with costs and benefits. We hope that the knowledge shared here will aid future development of the transformation, and in particular interoperability among the diverse information management platforms.
Several individuals have worked to develop a crosswalk between FGDC products and EML. The latest group has modified an existing program to take Esri-formatted FGDC metadata and transform it into an EML document that complies with the EML data structure and meets LTER metadata best practices. The evolution of the crosswalk (also known as the transformation) has been complicated by version changes within Esri products, in the EML specification, and in the FGDC standards. Note that the transformation described in this article applies to converting existing metadata records that are stored within Esri ArcCatalog, records produced with Metavist, or records from similar tools geared towards the FGDC metadata specifications. All the existing metadata records need to be in XML-formatted files. The process for creating documentation for spatial data is summarized in another DataBits article in this issue: Preparing Spatial Data and Associated Metadata for the GeoNIS.
The transformation is implemented by mapping the information placeholders (tags) that exist in the Esri specification to their corresponding tags within the EML specification. This implementation is stored in an XSLT file, also known as a stylesheet. XSLT is a limited programming language that uses XML to create directives that map, move and manipulate content within an XML file.
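The idea of a tag-to-tag mapping can be illustrated with a minimal sketch. The example below is written in Python rather than XSLT and is not the ESRI2EML stylesheet itself. The two mappings in TAG_MAP, the function names, and the sample record are assumptions made for illustration, and the output is a simplified fragment rather than schema-valid EML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced mapping table (an assumption for illustration):
# FGDC (CSDGM) source path -> simplified EML target path.
TAG_MAP = {
    "idinfo/citation/citeinfo/title": "dataset/title",
    "idinfo/descript/abstract": "dataset/abstract/para",
}


def ensure_path(root, path):
    """Return the element at `path` below `root`, creating missing steps."""
    node = root
    for tag in path.split("/"):
        child = node.find(tag)
        if child is None:
            child = ET.SubElement(node, tag)
        node = child
    return node


def crosswalk(fgdc_xml):
    """Copy mapped content from an FGDC record into a simplified EML-like tree."""
    source = ET.fromstring(fgdc_xml)
    target = ET.Element("eml")  # simplified root, no namespaces or attributes
    for src_path, dst_path in TAG_MAP.items():
        src = source.find(src_path)
        # Skip tags that are missing or empty in the source record.
        if src is None or not (src.text or "").strip():
            continue
        ensure_path(target, dst_path).text = src.text.strip()
    return target


# Made-up sample record using FGDC short names.
SAMPLE_FGDC = """
<metadata>
  <idinfo>
    <citation><citeinfo><title>Stream network</title></citeinfo></citation>
    <descript><abstract>Digitized stream lines.</abstract></descript>
  </idinfo>
</metadata>
"""

if __name__ == "__main__":
    print(ET.tostring(crosswalk(SAMPLE_FGDC), encoding="unicode"))
```

A real crosswalk also has to handle repeated elements, attributes, namespaces, and content that has no direct counterpart in the target specification, which is where most of the effort described in this article went.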
Brief History and Status of the Esri to EML metadata crosswalk
The pre-2005 version of the transformation (ESRI2EML) stylesheet was developed at the Central Arizona (CAP) LTER site, and used by some other sites to prepare EML documents. However, it was designed for a particular LTER site, and other sites had difficulty modifying the stylesheet to meet their needs. In 2005/06 an opportunity emerged to work on a suite of tools to make FGDC-based metadata more interoperable with the EML-based catalog. The opportunity was framed by a cooperative agreement between LTER and the now-defunct USGS National Biological Information Infrastructure (NBII). Most of the goals of this cooperative agreement targeted leveraging efforts between the programs that would foster interoperability among the scientific communities. One such effort focused on improving the XSLT-based transformation of EML records into FGDC-compliant metadata. The possible mappings and correspondences between the XML implementation of the FGDC metadata and the EML XML schema, version 2.0.1, were carefully studied. The details of the XML produced by LTER (San Gil et al., 2011) and of the FGDC records hosted at the NBII metadata clearinghouse were also studied. This in-depth knowledge of both specifications was used to create or enhance the reverse transformation, resulting in an enhanced ESRI2EML stylesheet. The base mapping between formats was expanded to cover more flavors of the FGDC-related products, including the Esri backend specification and the Biological Data Profile. The revamped version of this stylesheet was posted in the project pages section of the LTER Information Managers website. The resulting ESRI2EML products helped a handful of LTER sites to produce EML-based geospatial metadata.
Other developments affected the stability of the ESRI2EML crosswalk. Back in 2005, the Federal agencies continued pursuing a transition to the North American Profile (NAP) of the ISO-backed standards. Specifically, the USGS was adopting ISO 19115 and its XML implementation, ISO 19139. At the time, Sharon Shin was coordinating the practical steps to finalize the UMLs that would give final shape to the North American Profile, and there is no end in sight for the completed transition from FGDC to ISO. Considering the wealth of existing geospatial and other metadata records across the US Federal agencies, nobody expected a smooth overnight transition. Why is this related to the ESRI2EML crosswalk? At this time Esri started integrating metadata workflows that were ISO compliant. Versions 9.x of ArcGIS had ISO-compliant options, with the FGDC standards remaining core to the metadata operations. Version 10.0 of ArcGIS (released in 2010), Esri's software package, brought significant changes to the core of its metadata management tools. The ISO standards are now at the core of its metadata management. These changes rendered the previous ESRI2EML crosswalk and workflow partially obsolete.
The ESRI2EML work was placed on the back burner, with Theresa Valentine (AND) making a push to improve the crosswalk, fixing some bugs and filling gaps. LTER found a renewed interest in geospatial data on several fronts, including sociology, land use change, and projects such as Maps and Locals, which made an impact at the ASM2009. At the same time, EML released a new version (EML 2.1) with no changes in the geospatial sections, but with new constraints that forced a small rewrite of that end of the correspondences. A working group of Information Managers gathered at the LNO in 2010 to revise the LTER EML best practices, and made great strides in providing guidance and recommendations on documenting spatial data. The metadata changes in ArcGIS 10.0 came as a surprise to many in the GIS community. The XML schema, the editing environment, and even the metadata were all changed and, as noted above, were based in large part on the ISO standard. This created some critical momentum to improve the stylesheet and to document procedures to help sites.
Some background on the Esri approach to metadata
ArcGIS, a product of Esri (Environmental Systems Research Institute, Inc.), consists of a suite of tools for working with GIS content, including desktop, server, and web-based applications. Most LTER sites have access to this suite of tools through connections with universities that hold higher education site licenses. ArcGIS also includes an integrated metadata management system accessed primarily through ArcCatalog (the data management component of ArcGIS).
The team examined XML metadata records that are stored in ArcCatalog for a window into how Esri handles metadata. To summarize: Esri's approach to metadata in the pre-version 10 flavor was like FGDC on steroids. Esri's XML contains the same general tags and structure as the FGDC standard. However, Esri added a wealth of tags to accommodate metadata that was deemed important for proper data flow using their products and data structure. Esri needs to tag datasets with unique identifiers that enable proper manipulation in databases. Esri also added sets of tags that are critical to geospatial functionality, some of which may have been missed by the Content Standard for Digital Geospatial Metadata (CSDGM). The CSDGM is the actual name of the government-sponsored metadata representation commonly referred to as FGDC. Its suite of profiles and extensions was the preferred implementation prior to the NAP of ISO 19115.
At version 10.0 of Esri's products, the XML representation of the metadata increased in volume, and the underlying structure changed. The FGDC format was dropped in favor of a shortened Esri standard that was critical for implementing a new search function within the software, along with an expanded metadata editing system based on ISO standards. A patch was developed that allowed FGDC documents to be imported and converted to the ArcGIS Metadata format, along with a stylesheet to export the ArcGIS format to FGDC format.
In addition to all the FGDC tags and Esri's own fields, there are ISO-like fields that appear in Esri's XML files. In many cases, the new ISO XML tags (or fields) duplicate an information placeholder that the FGDC side already offers. For example, the information about the data "distributor", which is covered at length by the FGDC, is now duplicated in the Esri backend by the ISO branch that stores "distributor" information. To illustrate this example, refer to Figure 1, where the information groups (tags) related to "Distribution" information are highlighted. Expanding the respective "distribution" placeholders shows the parallelism, which creates redundancies in the text. Many more tags are duplicated by virtue of merging two synergistic XML specifications such as FGDC's and ISO 19139. Several are visible upon inspection of the figure, and a sample pseudo-XML schema is available at the URL given in the Figure 1 description below.
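The distributor redundancy can be checked for in a given record with a short Python sketch like the one below. The two element paths are assumptions based on the FGDC short names and on Esri's ISO-style element names, and they should be verified against your own ArcCatalog exports. The function names and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed locations of the distributor organization name in each branch.
# XML tag names are case-sensitive, so "distinfo" and "distInfo" are distinct.
FGDC_DISTRIBUTOR = "distinfo/distrib/cntinfo/cntorgp/cntorg"
ISO_DISTRIBUTOR = "distInfo/distributor/distorCont/rpOrgName"


def distributor_values(xml_text):
    """Collect non-empty distributor names from the FGDC and ISO branches."""
    root = ET.fromstring(xml_text)
    found = {}
    for label, path in (("fgdc", FGDC_DISTRIBUTOR), ("iso", ISO_DISTRIBUTOR)):
        values = [(el.text or "").strip() for el in root.findall(path)]
        found[label] = [v for v in values if v]
    return found


def is_duplicated(found):
    """True when the same distributor name appears in both branches."""
    return bool(set(found["fgdc"]) & set(found["iso"]))


# Made-up hybrid record that carries the distributor in both branches.
SAMPLE_RECORD = """
<metadata>
  <distinfo>
    <distrib><cntinfo><cntorgp><cntorg>Example LTER Site</cntorg></cntorgp></cntinfo></distrib>
  </distinfo>
  <distInfo>
    <distributor><distorCont><rpOrgName>Example LTER Site</rpOrgName></distorCont></distributor>
  </distInfo>
</metadata>
"""

if __name__ == "__main__":
    values = distributor_values(SAMPLE_RECORD)
    print(values, is_duplicated(values))
```

A check of this kind only reports where the two branches overlap. Deciding which branch a crosswalk should read from is a separate design choice.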
Figure 1. A reduced version of the hierarchical representation of the Esri 10 schema and section from the Content Standard for Digital Geospatial Metadata Workbook
Figure 1 is a screenshot of the Esri ArcGIS Version 10.0 schema, as distilled from an Andrews Forest LTER XML metadata record instance. Esri does not make an XML schema for its metadata available for distribution, so the sample was created by filling in a sample metadata document using all the possible entries and exporting the resulting XML file to standard XML tools. The diagram is a visual complement to the high-level description that follows. The XML root element is "metadata", which is also the root element of the FGDC schema. The top group of XML tags, surrounded by a green background, corresponds to official XML tags from the FGDC XML schema, while the small group of tags with an orange background was added by Esri. These tags were present in versions prior to ArcGIS 10.0. The bottom group of XML tags, surrounded by aqua blue, is borrowed from the ISO 19139 XML schema. A detailed file is available for download at: http://databits.lternet.edu/sites/databits.lternet.edu/files/esri10_xsd_RENAME_EXTENSION_TO_XSD.txt .
At first, Esri’s metadata strategy of merging two synergistic standards may seem like a dangerous proposition. Accommodating both standards in this fashion nicely reflects the transition from FGDC/CSDGM/BDP to ISO, but there are clear drawbacks. One such drawback is the redundancy in where information is stored. The Esri metadata team was consulted for their insights, without success. Our conclusions about their merging strategy are derived from our own analysis and the use of Esri's tools. Esri is focusing its efforts at the application layer. When you use ArcCatalog, you may choose an ISO view (or skin) for manipulating metadata (the default) or an FGDC skin. Both the ISO and FGDC skins have a similar look, but a different set of information is gathered depending on the form used. The targets of these different forms may be the corresponding XML tags, in either FGDC or ISO format. Likewise, you can export the records in both ISO-compliant and FGDC-compliant formats. In all, Esri assumes that few or no users will be handling the raw XML. Esri treats this XML as the vehicle to manipulate and transform some metadata in the backend while, at the same time, complying with the requirements of one of its most important clients, the US government.
Details on the ESRI10toEML2.1.0 crosswalk creation process
The LTER community needs EML-backed metadata to account for geospatial data in the metadata catalog, and many sites are using Esri GIS products to document their spatial data. The XML representations in ArcCatalog vary by use (ISO or FGDC), and may contain purely FGDC tags, purely ISO tags, or a hybrid of both. Keep in mind that future versions of Esri products may gradually deprecate the FGDC skin and replace it with the ISO skin, as the Federal agencies continue the slow transition to ISO-backed standards.
Given time and budgetary constraints, we set out to tackle improvements to the ESRI2EML crosswalk. We started with the aspects of the crosswalk that focus on records manipulated through the FGDC skin of Esri. These metadata records include all the legacy (pre-Esri 10) geospatial metadata documents, the documents produced with the ArcCatalog-Esri 10 FGDC skin, and the documents that were in an FGDC format (non-Esri). We worked for about a week improving the crosswalk, including a one-day site visit in which we devoted the day exclusively to finding and correcting bugs and problems associated with the existing crosswalk. Better documentation was also planned as part of the effort. Guiding resources are discussed in this DataBits issue. We used both the XMLSpy and oXygen XML editors, as well as ArcCatalog, to improve the crosswalk. For validation we used mainly the XMLSpy tools, but also the ecoinformatics parser, which performs some extra validation checks. It is worth noting that the previous version of this crosswalk mapped Esri 9 records to EML 2.0.1, and that the newer release of EML (2.1) allows no empty XML tags. Since the Esri and FGDC tools are very lax, many records lack critical content. Because of the new EML constraints, the checks for content had to be tightened quite a bit.
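The kind of content check that the empty-tag constraint calls for can be sketched in Python as follows. This is an illustration under stated assumptions, not the logic of the actual stylesheet: the function name prune_empty and the sample fragment are hypothetical, and the fragment is a simplified stand-in for an EML dataset section.

```python
import xml.etree.ElementTree as ET


def prune_empty(element):
    """Remove descendants that have no text, attributes, or children.

    Works bottom-up, so a parent that becomes empty after its children
    are removed is removed as well. Returns the number of elements removed.
    """
    removed = 0
    for child in list(element):  # copy, since we modify while iterating
        removed += prune_empty(child)
        has_text = bool((child.text or "").strip())
        if not has_text and len(child) == 0 and not child.attrib:
            element.remove(child)
            removed += 1
    return removed


# Made-up fragment in which only the title carries content.
SAMPLE_FRAGMENT = """
<dataset>
  <title>Soil map</title>
  <abstract><para></para></abstract>
  <creator><organizationName>   </organizationName></creator>
  <pubDate/>
</dataset>
"""

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_FRAGMENT)
    count = prune_empty(root)
    print(count, ET.tostring(root, encoding="unicode"))
```

Pruning removes the empty tags, but it does not supply the missing content. If a pruned element is one that EML requires, the record still has to be completed at the source before it will validate.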
The resulting stylesheet (ESRI10toEML2.1) was tested during the recent GeoNIS working group held in Boulder. During one morning, participants volunteered to test the crosswalk on their site data. The exercise was very instructive, as we came across unforeseen uses that made the work challenging. A workflow was developed to integrate the new stylesheet with ArcCatalog, so that you would have the option to export metadata as FGDC, ISO or EML. However, since ArcCatalog lacks an EML skin linked to the backend XML product, and our crosswalk does not address the bulk of the ISO tags/fields, the results were disappointing. The largest problem was that some information entered into ArcCatalog was dropped during the export to FGDC step. The export to EML directly from ArcCatalog (without going to FGDC first) resulted in many unmapped tags and needs significant work. Each potential process lost data that was needed in the final EML documentation. The best practices document (http://im.lternet.edu/project/GIS_document) describes the workflow, which has several steps and may still miss some critical metadata. The current best option is to export to FGDC format, and then transform to EML. The workflow could be simplified with a modified stylesheet that goes directly from ArcCatalog to EML.
Ideally, the stylesheet should be improved to account for all observed uses of Esri metadata. In practice, however, this is not cost effective. Both Esri's backend standards and EML are likely to change. The next version of ArcGIS (10.1) is scheduled for release in the next quarter. While significant changes to metadata aren’t expected, a version change always brings unexpected results. The main goal of the stylesheet is to aid with the conversion of Esri-formatted metadata into EML, and the best practices document is intended to guide the user through the process. It is worthwhile to explore some options for the future.
- Provide a crosswalk from the ISO fields of Esri 10 to EML. The payoff is sizable. For one thing, it would avoid possible metadata leakage from Esri 10 encoded metadata. Many users may not want to read any guidelines, and will simply choose the "Export as EML" option, unaware of the pitfalls of its limited coverage. Also, there is some chance that we would use the ISO2EML transform in other contexts.
- Perform more debugging iterations to improve the quality of the metadata products. No matter how much effort we put into the crosswalk, there is always one more bug, or one more improvement. A list documenting those issues would be good for those who want to keep improving the crosswalk. It would be beneficial to prioritize the fixes, as some may be the effect of local practices, and it’s important to keep the stylesheet as generic as possible so that many organizations could use it.
- Improve the documentation and guidelines. The crosswalk is only as good as its documentation. Neither ISO nor Esri enforces mandatory metadata content. Mandatory fields are suggested in the interface, but the editing tools do not prevent you from saving and closing records that do not comply with the mandatory rules. EML is stricter, and the user has the right to know the challenges.
- There are some LTER sites and other organizations that use non-Esri tools to develop metadata for their spatial data. It would be important to identify the users and their tools, and make sure that they can transform their metadata into EML.
San Gil, I; Vanderbilt, K. V. and Harrington, S. A.