
Fall 2007


Dear Reader,

As the diverse nature of this issue illustrates, DataBits continues to be a platform for exchanging wide-ranging ideas and resources for strategically and practically improving the practice of scientific Information Management.

This issue features an article on career development in the field of Information Management by Karen Baker, a longtime LTER Information Manager who also conducts theoretical research in this field.

We also hear critical opinions on the pros, cons, and progress of networking in LTER, and learn the status of a new and welcome network project.

The "Editorial" section has been renamed "Commentary" in order to capture the experience of community participants and to suggest that observations and opinions presented are open to dialogue.

We also added a new section - "Good Tools and Programs" - for sharing experiences with third party tools and utilities, which starts off with three very handy contributions.

Finally, this issue concludes with discussions, suggestions for further reading, and a calendar, by no means complete, of Information Management related meetings and conferences.

Featured Articles


Auditing LTER Data Access

- Mark Servilla and James Brunt (LNO)

Authors' note - The information in this article was first presented in May 2007 as a Request for Comments (RFC) sent to the LTER Information Managers. This article incorporates some of the remarks received in response to the RFC, and the authors would like to thank those who provided comments.

Introduction

The LTER Network has invested considerable time, effort, and funding into the collection of scientific data. Access to and use of these data are formalized through the end user's acceptance of the LTER Network Data Access Policy, Data Access Requirements, and the General Data Use Agreement, which were approved by the LTER Network Coordinating Committee on 6 April 2005. These policies and agreements are motivated by the need to document the flow of data from the LTER Network out to the community in order to validate the broader impacts of the LTER program. As such, the LTER Network has adopted a "standard" for data access and use that now needs to be implemented in both local and Network-wide computing infrastructure. This standard, in simple terms, requires that the end user register basic identifying information, including name, affiliation, email address, and full contact information, in an LTER Network registry. Further, acknowledgment and acceptance of either the General Public Use Agreement or any Restricted Data Use Agreement applied to a data set, and a statement of the intended use of the LTER data, will be recorded prior to the release of any LTER data.

The following article briefly describes a proposed Network-level architecture that can be adopted by individual sites for conformance to the LTER Data Access Policy. This architecture requires minimal effort on the part of the site, and can be implemented for all or just a subset of the site's data holdings. It is only a "proof of concept" and should not be considered the final design.
Background

The LTER Network makes data available through two primary venues:

  1. Each LTER site maintains an independently managed website and provides access to static or dynamically streamed data through a URL
  2. Direct referencing of LTER data through network-based links (e.g., URL or database connection) that are described in an Ecological Metadata Language (EML) document hosted by one or more Metacat XML database management systems

In either case, access to LTER data is often just a "hyperlink" away. For site-based data access, registration processes (where they exist at all) and the information collected vary between LTER sites, and often require the end user to re-register when accessing new or different data (see the LTER Network Data Access Policy Revision: Report and Recommendations). Notification of a data access event may or may not be furnished to the data owner/provider, and the end user will likely never be provided with the original data owner/provider's contact information for citation purposes. In the case of data access through links in an EML document, the end user is provided site data without any rigorous identification process. If the site performs local event logging when the EML data access takes place, the site is only capable of recording the network address of the computer being used by the end user and, again, may not provide event notifications to the data owner/provider. In some cases, the end user is redirected to an external web page that must be navigated further to reach data. Such efforts to mitigate unrestricted data access often prove fatal to the operation of automated processes. Unfortunately, current site-based or EML/Metacat approaches do not fully meet the requirements of a network-wide LTER Data Access Policy.


Network-wide implementation of the LTER Data Access Policy demands three functional requirements:

  1. End user registration for collecting nominal information for entry into a user registry, along with a statement of the intended use for data, and an acceptance acknowledgment of the General Public Use Agreement. End user registration is assumed to be a one-time step that occurs either at the user's convenience prior to any attempt to access data or when the system invokes the registration process because a non-registered user attempts to access data.

  2. End user identification to verify user registration and policy acceptance for all data requests. End user credentials must be available to compare against information contained within the user registry. Once verified, a client-side token (e.g., cookie) may be used to automatically identify the end user for future data requests. Strong authentication, such as user verification through a third-party authority, is not assumed for compliance to the LTER Data Access Policy.

  3. Data access event logging for reporting purposes (see the sketch following this list). All data access events should be recorded in an audit log that includes the identification of the end user, identification of the data accessed, and a date/time stamp of the event. The system portal should provide an interface for the data owner/provider such that events can be queried, viewed, and reports generated. In addition, the system should provide real-time or near real-time notifications to the data owner/provider at the time of a data access event. Similarly, the system should also provide pertinent contact information of the data owner/provider to the end user for compliance to the General Use Agreement when data is accessed.
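The following is a minimal sketch, in PHP, of how the logging and notification requirement might be met. PHP is used here purely for illustration; the table name (das_audit_log), column names, and message text are hypothetical, and the actual DAS implementation language and schema are not specified here.

<?php
// Minimal sketch of the audit logging and notification requirement.
// All names below (das_audit_log and its columns) are hypothetical.
function log_data_access(PDO $db, $userId, $datasetId, $providerEmail, $userEmail)
{
    // 1. Record who accessed which data set, and when (NOW() assumes a MySQL backend).
    $stmt = $db->prepare(
        'INSERT INTO das_audit_log (user_id, dataset_id, access_time)
         VALUES (:user, :dataset, NOW())');
    $stmt->execute(array(':user' => $userId, ':dataset' => $datasetId));

    // 2. Notify the data owner/provider of the access event.
    mail($providerEmail, "LTER data access: $datasetId",
         "Data set $datasetId was downloaded by registered user $userId.");

    // 3. Send the end user the provider contact information and a pointer
    //    to the applicable data use agreement.
    mail($userEmail, "LTER data use information: $datasetId",
         "Please acknowledge the data provider; see the General Data Use Agreement.");
}
?>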

Approach

The LTER NIS development team has identified a general model for a Network-wide LTER Data Access Policy implementation strategy called the Data Access Server (DAS). The DAS model proposes a centralized NIS service that would perform all necessary policy actions, including the pass-through of LTER site data, on behalf of the site. The pass-through process would rely on the replacement of the URL that references site data with a "proxy" URL that points instead to the DAS hosted by the LNO. The purpose of the DAS is to validate the user's credentials, thus confirming their compliance with the LTER Data Access Policy, before allowing access to any site data. This approach requires the site to register their data URL with the DAS so that a one-to-one correspondence between the data URL and the proxy URL is declared within the server registry. The proxy URL is used in lieu of the actual data URL within any LTER metadata document (including EML) that is published for public viewing. When an end user wishes to download data by selecting the online distribution URL in the metadata document, they would be directed to the DAS first and have their credentials validated before a data stream is returned on the site's behalf. If the end user has not registered at this point, they would be directed to the appropriate registration interface. If they have already registered and there exists a token (e.g., cookie) on their workstation, they would be provided the data without restriction. Otherwise, the end user would be directed to a log-in interface prior to receiving any data. Figure 1 presents an overview of the DAS model network-level architecture.


Figure 1. Conceptual view of the LTER Data Access Server network architecture.

Any download event invoked by the end user will be logged into an audit record for reporting purposes. At this point, the DAS would send an email notification to the end user with the data owner/provider's contact information, the General Use Agreement, and any special Restricted Data Use Agreement for the specific data set that was downloaded. In addition, the DAS will also send a notification to the data owner/provider of the data download event, along with the end user's contact information and the name of the downloaded data set. The DAS model assumes that the site's data URL provides direct access to the data stream through the HTTP protocol. Further restrictions on data access can be achieved if the site only allows a specified network address or address range to connect to their data source - in this case, the network address for the LNO DAS.

A user interface specifically for registering data URLs would facilitate URL mappings (i.e., from the data URL to the proxy URL). URL registry management tasks would include adding, deleting, or modifying such mappings. This approach is similar to that used by globally unique and persistent identifiers, such as Life Science Identifiers (LSID) and Digital Object Identifiers (DOI), where URL resolution occurs through an independent service. In this case, however, the DAS would incorporate user identification and audit logging/notification, processes necessary for compliance to the LTER Data Access Policy. Provisions for managing query string parameters appended to the proxy URL would be necessary to support more dynamic data systems that provide content filtering. The URL registry can also be used to report non-functioning data URLs by periodically testing each URL for connectivity.
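As an illustration only, the sketch below shows one way the mapping and pass-through could work, assuming PHP and an in-memory registry; the example data URL and DAS base URL are hypothetical, and a production DAS would add the user identification and audit logging steps described above.

<?php
// Illustrative sketch of the proxy URL mechanism (not the actual DAS code).
// Register a site data URL and return the corresponding proxy URL.
function register_data_url(array &$registry, $dataUrl, $dasBase)
{
    $key = md5($dataUrl);              // index derived from the original data URL
    $registry[$key] = $dataUrl;        // one-to-one mapping held in the registry
    return $dasBase . '?id=' . $key;   // proxy URL to publish in EML documents
}

// Resolve a proxy request: look up the data URL and pass the stream through.
function serve_proxy_request(array $registry, $key)
{
    if (!isset($registry[$key])) {
        header('HTTP/1.1 404 Not Found');
        return;
    }
    // (User identification and audit logging would happen here.)
    header('Content-Type: text/plain');   // MIME type selected for the end user
    readfile($registry[$key]);            // stream the site's data on its behalf
                                          // (requires allow_url_fopen for remote URLs)
}

// Example: publish a proxy URL for a hypothetical site data URL.
$registry = array();
echo register_data_url($registry,
    'http://www.example-lter-site.org/data/met_station_1.csv',  // hypothetical data URL
    'http://das.lternet.edu/data');                              // hypothetical DAS base URL
?>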

Functionally, the DAS model would have to support a distributed architecture to ensure high performance and fail-over. In addition, the DAS architecture would be expanded to deliver both character-based and binary data formats. Flexible MIME (Multipurpose Internet Mail Extensions) type selection will allow the end user to select how data are to be displayed or opened in a specific application, such as Microsoft Excel.
Proof of Concept

The NIS development team has deployed a minimal proof-of-concept DAS that uses data made available for the EcoTrends Project prototype web application to demonstrate the use of a proxy URL in place of a data URL. In this case, all URLs displayed within the web application and in the EML metadata documents found within a test Metacat have been replaced with proxy URLs. The proxy URL (Figure 2) points to the DAS and contains an MD5 hash value created from the original data URL (Figure 3) that is used as an index to the site's data URL.

Figure 2. Example proxy URL.

Figure 3. Original data URL.

If the end user is correctly identified through either an initial log-in (Figure 4) or by an existing browser cookie previously loaded by the DAS, the DAS opens a file stream from the site data URL and passes it on to the end user's web browser as if the original site data URL were accessed. New users may register through a typical "forms" web page (Figure 5) that saves their contact information and data use intent statement in a local database. A production system, however, would likely utilize a modified version of the current LDAP user registry. A simple list of data access events (Figure 6) is available through the DAS web site. More detailed user information (Figure 7) is obtained by selecting the user's name in the list. A notification process has not been implemented in this proof-of-concept.

Figure 4. DAS log-in page.

Figure 5. DAS registration form.

Figure 6. DAS audit list.

Figure 7. DAS user information.

Advantages

The DAS model does not require sites to participate or change their current practice of providing direct access to their data. It is a model that may be utilized at the site's convenience, perhaps addressing sensitive or high-profile data first.

The DAS model is not tightly coupled to EML, the Metacat, or any other subsystem, and therefore, it can be used at both the site and Network-level. Figure 1 shows that the proxy URL can be used through links embedded in the EML metadata document residing in Metacat, as part of a data link reference in a separate web application, such as the EcoTrends Project, or simply as a data link provided in an email message.

Since the DAS would run as a centralized service (potentially distributed) at the LTER Network Office, tools and enhancements based on the DAS model would be available to all participating sites, including data access reports that can be perused directly by NSF officials. This can be an effective method for standing groups like the Information Manager Executive Committee or the LTER Executive Board to analyze LTER data access through a single interface.

The DAS model fits nicely within the current LTER LDAP user registry used by the LTER Metacat for user identification. Other Metacat sites (and their users) would not have to conform to the LTER Data Access Policy, but their users would have to register with the DAS before being allowed access to LTER data.

Disadvantages


The current DAS proof-of-concept relies on the use of HTTP cookies for identifying registered users. Cookies, when enabled by the web browser, are sent automatically with each client request to the web server (in this case, the DAS). Any other application (e.g., Kepler or MatLab) that could not send a cookie would automatically fail the user identification process and not be allowed access to LTER data. A more robust method for providing generic identification would have to be developed.

The DAS model requires sites to change their data access URLs within their EML documents and/or any data references that would be bound by the LTER Data Access Policy.

A new registration interface would be required to collect the necessary Data Access Policy information. This would require users to submit new information into the DAS, even those who are already registered in the LTER LDAP.

Conclusion


The DAS model is one method for sites to easily conform to the LTER Data Access Policy. A fully functioning DAS implementation is expected sometime during 2008. The LTER NIS Development Team welcomes all comments and suggestions for improving this model, and anticipates working closely with beta-sites to evaluate and test the DAS model.

The Ecological Metadata Language Milestones, Community Work Force, and Change

- Inigo San Gil (LNO), Karen Baker (PAL and CCE)

We recently welcomed an important milestone in the Ecological Metadata Language (EML) saga. Today, all the LTER sites have submitted some metadata records in EML. These records are re-distributed through a number of servers, central clearinghouses, and one-stop web portals. Over half of the LTER sites have finished standardizing all their metadata to a designated level. Finally, over half of the sites offer rich metadata content that describes in detail the structure and content of the data containers. For example, a rich description may specify the number of columns, column labels, units and codes used, protocols followed for a measurement, and the number of header lines in a data file. In summary, a rich metadata record may have content for all the critical components that enable machine parsers to read, interpret, and manipulate the data effectively and appropriately.

EML Status Evolution

Figure 1. Some aspects of the changes in EML adoption over time (beginning of 2005 until the summer of 2007).

Figure 1 shows the adoption of EML by the LTER sites over time. On the vertical axis (Y-axis) is the number of LTER sites adopting EML; on the horizontal axis is the time span of the study, a little over two and a half years. The blue curve reflects the number of sites in possession of at least one EML document at a given time. The curve in green shows the number of sites at a given time that have EML documents harvested into the central Metacat servers. In red, we have the sites that have EML documents with rich content. As you know, there are 27 LTER sites including the LTER Network Office; thus, two of the three categories depicted in this graph reached the milestone in August 2007.

Do not be fooled: reaching a milestone does not mean that the metadata work is finished. But this is a start, and a good one. It may be interpreted as an indicator of a community's understanding of the value of standards development for data sharing and integration, as well as for exploring the processes of creating standards and facilitating articulation of critical data issues (Millerand and Bowker, forthcoming; Baker and Millerand, 2007). The first prototypes of network-level data synthesis revealed gaps, inaccuracies, and disparities in the content of some EML documents present in Metacat. You may wonder why, and the short answer is that most of the legacy metadata has not undergone a rigorous quality assurance and control (QA/QC) process. There are some notable exceptions. The LTER information managers (IMs) can champion the process of making sure that the EML content is accurate, best-practices compliant, and, when possible, adapted to community content standards such as custom units, consensus variable (attribute) descriptions, controlled vocabularies, and the like.

Let us elaborate on the need to enrich and control the EML content. EML documents are a critical supporting component for the work of data integration at the LTER sites as well as data synthesis at the sites and the Network. For instance, the Network is developing the Provenance Aware Synthesis Tracking Architecture (PASTA), a synthesis model sanctioned by the Network Information System Advisory Committee (NISAC). For EML to support this kind of synthesis functionality, it is not sufficient to have EML content at what is known as the EML "attribute level" or "level 5". Any EML document harvested by Metacat does pass a schema-compliance test. However, this test can only verify the existence of certain information placeholders and, in some instances, check certain content formats (e.g., content must be a real number, or follow a specific date format). The test cannot verify the accuracy of the content declared therein. That verification remains, to date, very much a manual, human process that can be quite labor intensive. What we have before us is an intensive, time-consuming task that, generally speaking, involves multiple personnel at each LTER site. Working within the community standards framework and making use of the EML enactment phase as part of an iterative design process, information managers from four different LTER sites are designing augmentations and developing prototypes that will inform further development of EML and help open up the dialogue about quality control and assurance. These four IMs, Barrie Collins, Eda Melendez, Karen Baker, and Sabine Grabner, have been an LTER EML community workforce, on the one hand eliciting data description categories and devising mechanisms to better represent the data, and on the other hand bridging or augmenting the EML standard. This work in essence bridges multiple spheres: the knowledge of the data taker, the data integration needs of the local user, and the data synthesis needs of the public user (Millerand and Baker, forthcoming). In addition, novel quality assurance tools were presented at the 2007 IM meeting by a group from ILTER-Taiwan.

Barrie Collins from Coweeta opened a dialogue with all the Coweeta dataset owners to ensure that the metadata meet a quality standard that gives the data a life span much longer than any of ours. Coweeta's researchers understand the importance of quality metadata and cooperate fully with Barrie to document every last detail of each data table associated with a given project or dataset. In this process, it was discovered that some of the associated data files are missing, perhaps forever, calling into question the usefulness of the study itself. As Barrie points out, each Coweeta professional has a contract with the Coweeta LTER site by which the site provides the researcher the means and tools to conduct the study. In return, each professional is committed to provide a fully documented dataset, including data files, methodology description, abstract, standardized geo-references, temporal scope, contact information, and an extensive description of the measurements made, codes and units used, and the like.

Eda Melendez is also reviewing the project documentation for Luquillo's data sets. In this process, Eda looks for a one-to-one correspondence between the descriptions and the actual data file structures (file names, number of columns, headers, column descriptors, contents of the data files, missing values, and the like). This labor-intensive process finds scores of discrepancies that are resolved as the work progresses. Fixes are propagated from the web to a database and onto the corresponding EML documents.

As part of a long-term design effort synergizing with community development of standards, Karen Baker led a dictionary working group coordinating with a long-lived controlled vocabulary working group led by John Porter and with contemporary ontology approaches led by Peter McCartney and Deanna Pennington (Databits Baker, Pennington and Porter). The 'living dictionary' effort targets the usefulness of dictionaries as a mechanism for sites to contribute to site-network co-development. A unit registry prototype was designed and developed by an LTER community working group over five months, culminating in a demonstration at the LTER Information Manager Meeting in 2005 (Databits Baker et al). The prototype provided an opportunity to focus on three separate critical issues:

  1. Establish the need for the development of a unit registry as central to data integration
  2. Explore mechanisms to create a dynamic dictionary that enables ongoing community contributions
  3. Understand the processes involved in creating site-network interfaces in order to create alternatives to traditional centralized models

For instance, an enactment model focuses on an interdependent design-development-deployment approach (Millerand and Baker, forthcoming). The unit registry effort has been re-energized as the Unit Task Force under the site-network leadership of Tod Ackerman (SGS) and Inigo San Gil (LNO/NBII). In addition, the Dictionary Working Group is now turning to its next steps: the design and development of attribute and qualifier dictionaries.

Interestingly enough, the new LTER sites face this challenging quality process differently. Sabine Grabner from Moorea Coral Reef, and Karen Baker, Mason Kortz, and James Conners, a team at the Palmer Station and California Current Ecosystem sites, designed their data management systems with metadata-driven integration and synthesis in mind. The MCR, PAL, and CCE data management systems guide researchers through in-depth data documentation. That does not mean these systems are flawless; ultimately, it is up to the researcher to ensure that pertinent and valid information is entered.

In contrast to the new sites' setup and frame of mind, metadata-driven data analysis systems were an afterthought at the inception of most of the LTER sites. At that time, storage ranked high among the concerns of the researcher. Most of the mature LTER sites have standardized metadata whose original content and quality did not have the role or weight of today's metadata. Some sites are now facing the quality-check revision process, but many other sites will need to start such a process in order to successfully contribute both to network-wide synthesis projects and to enabling data integration at the site. The process has many upsides: it will get people re-acquainted with important old datasets; data will sometimes be revived from the brink of the end of their usable life; and the IM will be empowered with new approaches to data handling. In enabling site-network communications (Melendez REF from recent IMC), the network is empowered as an integrative node in a network of engaged sites, so that the full potential of networks interplaying with agency partnerships can avoid dysfunction and breakdown. Processes that support articulation and innovation at all locations along the path of data flow are key to developing new practices that can support cross-project, cross-site data analysis.

The LTER Network Office will work with the LTER Information Management Committee working groups to assist in evaluating the quality of the EML harvested to date. Also, the data manager package, part of the new Metacat distribution, will enable us to test the metadata as a vehicle for automating ingestion of the associated datasets. Should the data ingestion fail, we will be able to pinpoint specific deficiencies of the metadata in EML form. Finally, we will be able to place new evolution curves on the EML evolution graph, which will serve as guidance through the milestones on this road map to network synthesis.

Professional Learning Opportunities: Conferences, Meetings, and Mindsets

- Karen Baker (PAL and CCE)

The role of information management (IM) covers a broad range of elements including work with data and technology, classification and programming, semantics and standards, as well as design and articulation. This work is at times called data management, information management, informatics and/or 'the glue' that holds together an LTER site. As schools of informatics, information management, and information science begin to emerge at major universities, there is also a broad but uncoordinated range of activities available to inform a professional about the work of information management. Though database and system administration classes abound in the form of evening or extension classes, targeted technical classes do not address the multiple interdependent facets of IM work in a scientifically situated, scholarly, or timely manner.

The LTER Information Manager Committee (IMC) is recognized as valuable for fostering social and communicative ties that enhance exchange of conceptual and technical information. The annual meeting creates a highly effective informal learning opportunity. Indeed, this year's meeting emphasized professional development by arranging four working groups, two poster & demonstration sessions, and a panel presentation. Rather than focusing solely on reports or services, these presented a mix of practical experience and theory as well as the technical. The IMC meeting is made valuable by our shared theme, an interest in the data ecology of a site biome. This provides a common frame for participant interests in terms of size and types of data as well as in terms of location in the dataset flow. Conferences are a complementary venue for information exchange and professional development. They nurture and spur the multiple mindsets - reflective, analytical, cross-project, collaborative, and action-oriented - required for IM work. Over the years, a great variety of conferences relevant to the work of IM have developed and continue to develop. To help sort through the abundance of overlapping interests, three annual conferences that emphasize interdisciplinarity and work with digital artifacts are summarized below. The 3rd annual Digital Curation Conference (DCC) this year joins established conferences such as the 41st annual Hawaii International Conference on System Sciences (HICSS) and the more than 60-year-old American Society for Information Science and Technology (ASIST) annual meetings. These conferences are similar in that they do not focus on a particular discipline or technical application but on a broad theme such as work with data (DCC), information (ASIST), and systems (HICSS). Indeed, these meetings complement the IMC meeting by exposing us to new sizes and types of data as well as to activity at other locations in the dataset flow.

Conference forums provide unique opportunities for listening to panels, attending seminars and workshops, and for synthesizing as well as identifying and reflecting upon our contributions in the realm of informatics. ASIST, HICSS and DCC also all have calls for papers, posters, and panel presentations the year before the conference. In the interim, submissions are reviewed and published so that upon arrival at a conference the proceedings volume is available. Such conferences may help us keep pace with rapidly unfolding dimensions of digital data and of technology development. The types of expertise and inspiration required for technical and liaison work with data today may often be found at such conferences.

Conferences may help ensure that LTER IMC members remain proactive, informed, and interested. Preparing a traditional paper (~10 pages) or a poster with an associated short paper (< 3 pages) is an important part of conceptual and synthetic work. As expressed earlier (Baker, Databits Fall 2006): "The LTER IMC has some flexibility - even somewhat of a mandate - to explore new approaches and types of venues for information exchange and professional growth." The IMC has explored and carried out a number of professional activities in past years: the Eco Informatics set of papers presented in 1996, the Data & IM in the Ecological Sciences Workshop in 1998 (DIMES; http://intranet.lternet.edu/archives/documents/reports/Data-and-informat...), and the set of twelve papers written for the Systemics, Cybernetics and Informatics (SCI) conference in 2002 (http://intranet.lternet.edu/committees/information_management/sci_2002). The IMC is well situated to consider joining and creating special interest groups (SIGs) or to present a panel on Long Term Informatics. Conferences are a community resource that bring with them themed networking and information exchange as well as organizational infrastructure. Such meetings enable scientific scholarship as well as professional development and validation.

ASIS&T

NAME: American Society for Information Science and Technology
URL: http://www.asis.org/

The American Society for Information Science and Technology brings together those interested in a wide variety of information and technology topics. ASIST is a professional society that bridges the gap between the diverse needs of researchers, developers, and end users, and that focuses on the challenges associated with emerging technologies and applications ranging across the fields of library and information science, communication, networking technologies, and computer science. ASIS&T holds an annual conference and produces an annual proceedings. A themed Bulletin is published bimonthly. There are a series of special interest groups (SIGs; http://www.asis.org/AboutASIS/asis-sigs.html), including ones on Scientific and Technical Information Systems (STI), Knowledge Management (KM), Classification Research (CR), Human-Computer Interaction (HCI), Social Informatics (SI), Information Architecture (IA), Digital Libraries (DL), and History and Foundations of Information Science (HFIS).

HICSS

NAME: Hawaii International Conference on System Sciences
URL: http://www.hicss.org

The IEEE Computer Society and the ACM are among the sponsors of the annual Hawaii International Conference on System Sciences (HICSS). Many conferences focus on a specific discipline or subject. Although specialization is important, HICSS has chosen to become one of the few general-purpose conferences addressing issues in the areas of computer science, computer engineering, and information systems. The fundamental purpose of this conference is to provide a forum for the exchange of ideas, research results, development activities, and applications. HICSS brings together highly qualified interdisciplinary professionals in an interactive environment. An annual proceedings is published.

DCC

NAME: Digital Curation Conference
URL: http://www.dcc.ac.uk/events/#conferences

This is a UK-sponsored conference, initiated in 2005, promoting the curation and preservation of information to a range of stakeholder communities. The focus is on community awareness regarding the breadth of current research activity and on a shared commitment to collaborate on the development of tools, resources, and best practices within the UK and internationally. The DCC encourages collaboration and distributes information on related international curation and preservation initiatives, including the delivery of a series of curation and preservation seminars. The aim of the series is to disseminate specific research results and to demonstrate tools and resources developed by individual projects and initiatives.

Commentary


Kind Thoughts on Joint Projects of the LTER

- Barrie Collins (CWT)

Note: The thoughts and opinions expressed in the following article are offered as such and discussion, insights, and disagreement (especially disagreement) are welcomed. I offer these words as much as a philosophical program as anything. Errors and omissions are my responsibility only.

If we consider the Decade of Synthesis in the ongoing life of the LTER, I believe it would be safe to say that information managers in the Network have often been a strong representation of this ideal. Going back to the work of Michener and Boose and McCartney and Porter and Stafford, perhaps they wouldn't appreciate being termed war horses, but when we look at the leadership of these people (and many others) there is a record of commitment to connecting research, data, and ultimately science. Today, many of the ideals and indeed products they championed serve as a foundation for all of us. One only has to look at one's email account for a monthly reminder as ClimDB and HydroDB results make their way through. Certainly EML is another such example, built with support from the likes of Brunt and Porter and San Gil at the LNO and many of the site information managers (to recognize the efforts of McCartney or Sheldon, for example, is not to forget each and every one of our peers, all of whom have made their presence felt, sometimes painfully, surely a self-indicting statement). The point is that information management at the LTER level has lived the words so eloquently written and often casually received. It is one thing to speak of sending a man to the moon; it is another thing entirely to do so.

As I sat and pondered through these past years, there was an appreciation of the passion and the effort of my peers. Tangibly, I came to have opinions about the criteria for cross-site collaboration. Yes, collaboration is important, an easy and self-evident first step. But what makes a good collaboration? What are the criteria? Importantly, what is the LNO's role? Have we met our potential? And what about commitment?

One only has to look at the workings of the European Union (or one's own university) to appreciate the dangers of bureaucracy. We are involved in a bureaucracy that we information managers not only are part of, but help create. To sit at the table to discuss ontology or EML or controlled vocabularies is an exercise in a polite cacophony that leads me to somber and morose thoughts. Sometimes our desire to be all inclusive leads, I respectfully submit, to not very much at all.

Thus, a good cross-site collaboration has two key qualities. The first is a definition: I would personally define cross-site collaboration as a project that has value for a number, potentially all, of the individual sites in the Network. Further, this value may include the ability to effectively collect similar data sets across the network and connect the data and potentially the science. Second, once there is agreement on the need for a cross-site collaboration, there should be a central party that accepts ownership (this being where it seems the LNO would fit in some, but not all, cases). To his credit, Peter McCartney had a vision and stuck with it (commitment), and now we have EML, a product whose considerable positives (and negatives) act as an object lesson for us all. The point, really, is that a core group of adopters believed and supported and followed McCartney's lead, and this is something worth emulating and respecting.

I would also stress that it is not my belief that the LNO should be the lead in every situation. We all have our strengths and weaknesses and to these things we should play.

Coweeta LTER

Coweeta Lead PI Dr. Ted Gragson has demonstrated a history of support for cross-site collaboration. At Coweeta, we tend to identify a project of interest, agree on our responsibility, and then develop a product. There is an inherent danger that our delivered solution may not adequately address the needs of the cross-site team, but conversely, there is a product at the end that can be used as a basis for further discussion. We attempt to push the effort forward.

Example 1: AGTRANS - Agrarian Landscapes in Transition

http://sustainability.asu.edu/agtrans/
http://coweeta.ecology.uga.edu/agtrans/agtrans_intro.html

This interdisciplinary project traced the effects of the introduction, spread, and abandonment of agriculture at six U.S. long-term ecological research (LTER) sites, with cross comparisons in Mexico and France, using a variety of monitoring strategies, quantitative modeling, and comparative data. This project is a fine example of a loose collaboration. The participating organizations agreed on a path and a framework, and each individual site was responsible, for example, for creating land cover. Coweeta LTER expanded this project to include historic land cover based upon five year intervals (1986, 1991, 1996, 2001, 2006 is in process), giving us a twenty year record of land cover for the southern Appalachians.

Example 2: EcoTrends Socioeconomic Data Catalog

http://fire.lternet.edu/Trends/
http://coweeta.ecology.uga.edu/trends/catalog_trends_base2.php

Coweeta Lead PI Dr. Gragson again led a team at Coweeta LTER whose goals were very specific: to create a historic database of socioeconomic data across the LTER that would be accessible from a web-based client interface. Underlying this deliverable was an understanding of what we were to accomplish as part of a larger team, while allowing us the freedom to deliver on our platform. Clearly, the demographic solution we offer may not be the final form of this product, but our ability to work independently in the framework of the greater good of the LTER allowed us to deliver a product that, at least, serves as the basis for further development.

A Project: Internet Mapping Across the LTER

As I consider the (mis-named) concept of 'internet mapping', should each LTER site have its own individual GIS map service? Should we expect users to re-learn the conventions of each and every LTER site's GIS map service? I would answer no to both of these questions. For projects like an internet map service, and others, perhaps we might develop joint projects served and managed in one place, agreed in principle, with responsibility passed to one of our organizations.

Conclusion

In the framework of this loosely written (hopefully conversational) piece, I've attempted to offer some thoughts about the way in which we go about solving large information management projects. I've provided a couple of examples specific to Coweeta LTER.

Ultimately, the point is this: As we consider large efforts, it seems that we need to be able to transfer ownership, responsibility, to a group (perhaps supported by an advisory committee of LTER managers) and let them develop a solution, guided by a framework that all can agree on (and I'm not suggesting this hasn't been done before, as there are examples before us now). Once developed, we have not only a platform for further refinement, but perhaps more importantly, we have a platform for discussion.

Finally, I have personally been blown away by the commitment I see in IM. I've watched you accomplish so much as I've plowed along at Coweeta. Sometimes I think the only thing standing in our way is when we allow our search for consensus to paralyze our actions.

Announcement:

This issue is also being published with the Open Journal Systems (OJS) software, developed by the Public Knowledge Project and released under the GNU GPL. Since the current format of DataBits is showing its age (retro style rainbow banner, eek!) and there are more sophisticated ways than our copy-paste editing technique, we are test driving OJS to see whether:

  1. It meets our needs
  2. The learning curve is acceptable considering our rotating editorship

Please evaluate the OJS version of DataBits for its look and feel. A likely next step for the Spring 2008 issue would be to make use of OJS's article submission feature and customize the settings.

DataBits continues as a semi-annual electronic publication of the Long Term Ecological Research Network. It is designed to provide a timely, online resource for research information managers and to incorporate rotating coeditorship.

Availability is through web browsing as well as hardcopy output. The LTER mail list IMplus will receive DataBits publication notifications. Others may subscribe by sending email to majordomo@lternet.edu with two lines, "subscribe databits" and "end", as the message body. To communicate suggestions, articles, and/or interest in coediting, send email to databits-ed@lternet.edu.

----- Co-editors: Sabine Grabner (MCR), Wade Sheldon (GCE)

Good Tools And Programs


Adding Dynamic Web Site Content - the Easy Way

- John Porter - VCR LTER

The primary purpose of LTER web sites is a serious one: making information available to researchers and the general public in ways that they can find and use. However, there is still a place for visually attractive information sources that help keep a web site looking "new". Most sites want to avoid dynamic web content that requires high levels of day-to-day human interaction. Much to be preferred is content where the updates are automatic, or nearly so. Here are some suggestions for adding dynamic graphical content that requires minimal input.

Add the Weather:

The simplest way to add dynamic graphical content is to add a "weather sticker" from Wunderground.com. "Weather Stickers" are added by including a snippet of HTML code that accesses a frequently updated image file from the wunderground.com web site. For example, adding the lines:

<a href="http://www.wunderground.com/US/VA/Oyster.html?bannertypeclick=miniWeather04">
<img src="http://banners.wunderground.com/weathersticker/miniWeather04/language/www/US/VA/
Oyster.gif" border=0
alt="Click for Oyster, Virginia Forecast" height=50 width=150></a> 

adds a small weather sticker that links back to the full forecast on the Weather Underground web site:

Wunderground Sticker

You can find the appropriate sticker for your site by pulling up the weather forecast for your site using the normal search interface (yes, they even have weather for Antarctica!), then going down to the "Free Weather Stickers® for Your Homepage!" box and clicking. It will then come up with a whole page of "sticker" options - large, small, dynamic or static. For example the page for our site (Oyster, VA) is: http://www.wunderground.com/geo/BannerPromo/US/VA/Oyster.html . Click on the sticker you want, copy the HTML code and paste it into the appropriate place on your site, and you're done!

Add the News:

An easy way to add text-oriented content with news provided from the LTER Network Office is to use one of the many RSS (Really Simple Syndication) feeds they provide. The page: http://www.lternet.edu/rss/ contains information on the many RSS feeds the LNO provides. Feeds can be automatically displayed on a web site using a variety of JavaScript (http://ezinearticles.com/?How-To-Display-RSS-Feeds-on-Your-Website&id=21585) and PHP (http://www.rss-specifications.com/display-rss.htm) tools.
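For sites already running PHP 5, a minimal sketch along the following lines will pull a feed and print the most recent items as links; the feed address shown is illustrative only, so substitute one of the feeds listed on the LNO RSS page.

<?php
// Minimal sketch: display the latest items from an RSS feed using PHP 5's
// SimpleXML extension. The feed URL below is illustrative only.
$feedUrl = 'http://www.lternet.edu/rss/news.rss';   // substitute a real LNO feed
$rss = simplexml_load_file($feedUrl);               // fetch and parse the feed

if ($rss !== false) {
    echo "<ul>\n";
    $count = 0;
    foreach ($rss->channel->item as $item) {
        // Print each item title as a link back to the full story.
        printf("<li><a href=\"%s\">%s</a></li>\n",
               htmlspecialchars($item->link),
               htmlspecialchars($item->title));
        if (++$count >= 5) {   // show only the five most recent items
            break;
        }
    }
    echo "</ul>\n";
}
?>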

Add some Data:

If you are already generating automatically updated graphs of data, a small script using the ImageMagick "convert" tool (http://www.imagemagick.org/) to resize and combine images can produce a dynamic GIF file that will continuously cycle through the data. For example, the following UNIX shell script accesses images for the current month and year and combines them into "alldata.gif":

MONTH=`date '+%m'`
YEAR=`date '+%Y'`
/uva/bin/convert -loop 50 -delay 600 -size 200x150 \
/home/jhp7e/metgraphs/HOGI.$YEAR.$MONTH.TEMP.gif \
/home/jhp7e/metgraphs/HOGI.$YEAR.$MONTH.WINDS.gif \
/home/jhp7e/metgraphs/OYSM.$YEAR.$MONTH.WINDS.gif \
alldata.gif

This script is then scheduled to run several times per day using the Unix CRON or Windows Scheduler. The web page simply includes a standard <img> tag aimed at the alldata.gif file.

If you have periodically harvested webcam images, the same "convert" command can be used to shrink and stack the webcam images into a single dynamic GIF image:

/uva/bin/convert -delay 600 -size 140x95 \
broadwater/cam1_latest.jpg \
cobbfalcon2/cam1_latest.jpg \
broadwater/East_small_latest.jpg \
machipongo/northcreek.jpg \
all.gif

You can view the resulting dynamic image at: http://www.vcrlter.virginia.edu/wwwcam/all.gif.

Web-Based Data Visualization With JPGraph

- Mason Kortz (PAL and CCE)

With the increasing size and complexity of datasets available on the web, in-place data visualization is becoming more important. While most web-based visualization tools lack the capacity for actual data analysis, they are still very useful for finding datasets of interest, previewing data, and performing quality control. One tool for web-based data visualization is JPGraph, a PHP plotting library available as both a free download and a paid, professional version. The professional version is licensed for commercial applications and includes more support, bar code functionality, and windrose and odometer graph types. This article discusses the benefits and drawbacks of the free version of JPGraph.

JPGraph is a library of PHP classes that can be used to create many types of graphs, including line, bar, scatter, and error plots. Versions are available for both PHP 4 and PHP 5. PHP must be compiled with GD support enabled to support JPGraph. Further, in order for JPGraph to access TrueType fonts, PHP must be compiled with TrueType font support. This font support allows the use of superscripts, subscripts, and special characters such as symbols and Greek letters. Information on both these compilation options can be found in the PHP manual.

One of the advantages of JPGraph is that, being a PHP library, all execution is done on the server. Unlike Java, JavaScript, or Flash visualization tools, the output of JPGraph is not dependent on the client's software or configuration. JPGraph's class structure is powerful and grants a great deal of control over the content and appearance of graphs. Callback functions can be applied to all axes, allowing user-defined manipulation of data points where the built-in functions are not sufficient. JPGraph supports multiple Y-axes with independent scales and lets multiple plots be overlaid on a single graph, enabling, for example, line plots with error bars or scatter plots in multiple colors. JPGraph can handle fairly large datasets; graphs with over 10,000 data points can be plotted in just a few seconds. Finally, and perhaps most importantly, JPGraph's class structure is very well documented, and numerous examples are available both in the manual and on the website.
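To give a flavor of the library, the sketch below follows the pattern of the basic line-plot examples in the JPGraph manual; the include paths and data values are placeholders and will vary with your installation.

<?php
// Minimal JPGraph line plot, patterned after the basic examples in the
// JPGraph manual. Include paths depend on where the library is installed.
require_once('jpgraph/jpgraph.php');
require_once('jpgraph/jpgraph_line.php');

// Placeholder data; in practice these values would come from a database query.
$ydata = array(11, 13, 8, 12, 5, 1, 9, 13, 5, 7);

// Create a 400x300 pixel graph with a text X-scale and linear Y-scale.
$graph = new Graph(400, 300);
$graph->SetScale('textlin');
$graph->title->Set('Example data preview');

// Build the line plot from the data array and add it to the graph.
$lineplot = new LinePlot($ydata);
$graph->Add($lineplot);

// Render the image and send it directly to the browser.
$graph->Stroke();
?>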

JPGraph does have certain limitations. Because the output of JPGraph is an image, it does not have the capability for interaction that Java- or Flash-based graphs may have. Graphs can be associated with image maps, but this becomes impractical with even moderately large datasets. Date and time axes are handled as UNIX timestamps, which means dates prior to January 1st, 1970 must be handled via a user-defined callback function in order to plot correctly. Similarly, inverting axes requires a callback function. While these functions are not difficult to define, they do add to the overhead of displaying graphs with many data points. There is no support for graphs with multiple X-axes. In addition, getting the proper positioning for labels, legends, and titles is often a matter of trial and error. In terms of dataset size, JPGraph has its limits - graphs with over 100,000 data points can take over a minute to render and may cause the client browser to time out before completion.

Overall, JPGraph provides a free, straightforward, and well-documented option for online plotting. While it lacks the power of offline, commercial solutions, it is ideal for providing a quick visual window into a dataset without a great deal of setup overhead.

YUI: An Open-source JavaScript Library

- James Conners (CCE and PAL)

In developing web interfaces to data management systems, providing a rich suite of interactive functionality while maintaining accessibility is often a major factor in design. Limiting applications to server-side processing can result in awkward, loading-time dependent interfaces that discourage use. Incorporating client-side DHTML through JavaScript, even minimally, can substantially enhance a web application's interface. Developers often hesitate to do so, however, because of bad first experiences with JavaScript and what seems to be heterogeneous support for the language across browsers. It has been noted, however, that in actuality "Most cross-browser incompatibilities are based on differences in the underlying Document Object Model (DOM) exposed by the browser, rather than on the language itself" (Powers). This means that there are still differences that must be acknowledged and addressed. As a result, a number of open-source JavaScript libraries have been developed that provide a stable API-level environment for working with JavaScript. Well-known examples include Mochikit, Dojo, Prototype, and Yahoo's YUI. The last option on this far from exhaustive list is the one discussed here.

The YUI JavaScript library focuses on providing a core set of utilities and controls upon which many of the more advanced features are based. The core components address exactly the discouraging issue of cross-browser incompatibility. Even minimal interface enhancements, such as allowing form elements to be added or removed (perhaps in a form for filling in metadata of 1:N cardinality), improve usability but, depending on the implementation method, can break due to differences in DOM support. Graded Browser Support is Yahoo's treatment of the compatibility issues that result in JavaScript development. At the highest level of description, the notion of graded browser support divides the vast number of browsers into three categories and then bases support standards on these categories. The concept of Progressive Enhancement is a key component of graded browser support. Briefly, the idea focuses on providing access to core content while providing progressively more features to browsers that can support them. With this approach, the YUI library can be utilized for levels of interface enhancement from basic cross-browser safe DHTML to more advanced and visually appealing interactivity. Since the library is divided into components, there is usually a close correspondence between the amount of code you need to include and the functionality the page requires.

With open-source code libraries there is no accepted community standard for support or for documentation. One of the most appealing features of using the YUI libraries is the level at which both of these elements are addressed. As far as support goes, there is a great advantage in the fact that the code is developed and maintained by a team of industry professionals. Regarding documentation, the site (http://developer.yahoo.com/yui/) includes both a searchable API and usage examples. I've found these examples can expedite implementation of the library's components.

There are quite a few JavaScript libraries available, and it's not likely that any one of them will provide a complete suite for all DHTML development needs. Yahoo's YUI does very well, however, at providing a high level of quality across multiple facets, including features, conceptual outlining/framing, modularity, documentation, and support.

References:

Powers, Shelley. (2007). Learning JavaScript. Sebastopol, CA: O'Reilly Media, Inc.

Yahoo Developer Network (http://developer.yahoo.com/yui/)

Good Reads


An ontology for describing and synthesizing ecological observation data

- Margaret O'Brien (SBC)

Joshua Madin, Shawn Bowers, Mark Schildhauer, Serguei Krivov, Deanna Pennington, Ferdinando Villa, "An ontology for describing and synthesizing ecological observation data." Ecol. Informatics, 2007, doi:10.1016/j.ecoinf.2007.05.004

Considerable recent discussion has centered on synthetic projects that integrate data from small, focused studies into larger datasets to support powerful analyses. However, the scope and design of available data and the variety of their descriptions make integration highly labor intensive. Ontologies represent a flexible and powerful mechanism to capture the structure, content, semantic subtleties, and relationships among data variables. Through their incorporation into metadata descriptions (e.g., EML), ontologies create opportunities for eventual automated integration.

This paper presents the SEEK Extensible Observation Ontology (OBOE), which provides a framework for describing the semantics of generic scientific observations and measurements. OBOE's general approach allows extension with other domain-specific vocabularies, such as taxonomic references. It also enables the robust description of measurement units and can facilitate automatic conversions, such as micrograms nitrogen per liter to millimoles nitrogen per liter (i.e., a conversion requiring knowledge of an additional domain: the periodic table). OBOE's structure can also be used to suggest appropriate data aggregations and determine when a particular action is reasonable - an enormous assistance to synthetic data projects. The details of the OBOE ontology are illustrated by real-world examples, making the paper quite readable for those not in the informatics profession.
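As a worked illustration of that conversion (using the atomic weight of nitrogen, roughly 14.007 g/mol): 28 micrograms of nitrogen per liter is 28 / 14.007 ≈ 2.0 micromoles, or 0.002 millimoles, of nitrogen per liter - a conversion that can only be automated when both the units and the underlying chemistry are formally described.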

Note: This references an article in-press, and the complete bibliographic details are not yet available. The DOI is persistent, and the print version is expected later in 2007.

Managing Information, A Practical Guide

- Sabine Grabner (MCR)

Griffiths, D. M. 2006. Managing Information, A Practical Guide. Online: http://www.managing-information.org.uk/Managing information v5_1.pdf

"Managing Information" is a very easy and quick read practical guide on the topic. In our IM community we try to structure ecological data and related tasks on a daily basis. Getting caught in conceptual and technical details, we sometimes lose the big picture. This article contains important messages in the introductory summary that will guide you back on track to what our efforts are all about and what the context is. Keep reading if you are interested in the verbose explanations. Enjoy an absolutely non-technical read!

Place, Location, and Geographic Conventions

- Karen S. Baker (PAL & CCE) and Robert Thombley (CCE)

Chapman, A.D. and J. Wieczorek (eds). 2006. Guide to Best Practices for Georeferencing. Copenhagen: Global Biodiversity Information Facility. http://www.gbif.org/prog/digit/Georeferencing

When we are there, we have a sense of place. When sampling in the field, we observe and are part of the context within which measurements are recorded - although how to capture this experience in metadata is the subject of ongoing research. A georeferencing guide deals with one aspect of metadata - that of location as designated by geographic coordinates.

A ninety-page report on georeferencing best practices has been published by the Global Biodiversity Information Facility (GBIF) drawing from more than a half dozen other initiatives and illustrating that there is more to capturing location than a simple documentation of degrees and minutes for latitude and longitude. There's accuracy, reliability, and transparency in addition to place names, feature classes, and reference systems. A geographic dictionary, sometimes called a gazetteer, is an index of geographic names, but draws on all of these components to accurately portray a given location.

This particular guide contains useful context and geographic topic summaries as well as references to tools, online applications, and links to previous programs that focused on georeferencing.

Additionally, the guide is peppered with real world examples, with the latter half containing a recipe book of standard ways to georeference various locales and features.

Figuring on Insight through an Insightful Figure

- Lynn Yarmey (CCE and PAL)

Baker, K.S. and Millerand, F. 2007. Scientific Information Infrastructure Design: Information Environments and Knowledge Provinces. Proceedings of the American Society for Information Science and Technology ASIST 2007. http://cce.lternet.edu/docs/bibliography/033ccelter.pdf

From 'Scientific Information Infrastructure Design: Interdependent Provinces and Knowledge Environments', Figure 2 (pictured here) is so simple yet brings up so many questions, and furthermore seems to hold potentially weighty implications for our work as Information Managers. The paper itself offers condensed insight into different conceptual models of information and data flow, but the figure in particular encapsulates a number of topics and tensions.

Complex-Simple

Figure 2 is 'a multi-dimensional perspective with distinct data handling provinces...[it] provides a conceptual platform for pluralism.' The 'information infrastructure landscape' is defined as a 2D field with axes of size and complexity with distinct areas of uniqueness. Areas of porous and portentous overlap are made conspicuous for information management, data management and cyberinfrastructure. This layout serves not only to visually define the boundaries and ambiguities between these three fields but suggests, through the common framework, how, where, and why they interrelate and separate.

In looking at the commonalities and differences displayed in the figure, critical questions arise:

  • Why is information management limited to the 'small' end of the spectrum, i.e., what about information management makes it inapplicable or inaccessible to large datasets?
  • Does the figure in fact refer to dataset size and complexity, or possibly physical ecosystem or digital network size/complexity, or a combination?
  • IM and DM have been recognized on some level as roles; what new roles (if any) will come with cyberinfrastructure, and how do these roles fall within the LTER structure of networked sites?
  • Would adding a third dimension help, e.g., local vs. general users, or short-term product-based vs. long-term curation timeframes?

This too-brief paper touches only on disintermediation vs. intermediation issues, but many others come to mind for defining and developing such provinces. This figure encapsulates too much, but in so doing prompts much needed further discussion.

Calendar


Calendar for Fall 2007

October 19-24, 2007 - American Society for Information Science and Technology, Milwaukee, WI (http://asis.org/Conferences/AM07/)

December 11-13, 2007 - The 3rd International Digital Curation Conference, Washington DC (http://www.dcc.ac.uk/events/#conferences)

January 7-10, 2008 - Hawaii International Conference On System Sciences, Waikoloa, Big Island, Hawaii (http://www.hicss.org/)

May 7-8, 2008 - LTER Science Council Meeting, Baltimore, MD

July 9-11, 2008 - Scientific and Statistical Database Management, Hong Kong, China (http://www.ssdbm.org/)

August 3-8, 2008 - ESA Annual Meeting, Milwaukee, WI (http://www.esa.org/milwaukee/)

September 29 - October 2, 2008 - LTER IM Committee Meeting, Albuquerque, NM (http://intranet.lternet.edu/im)