Auditing LTER Data Access
- Mark Servilla and James Brunt (LNO)
Authors' note - The information in this article was first presented in May 2007 as a Request for Comments (RFC) sent to the LTER Information Managers. This article incorporates some of the feedback received on the RFC, and the authors would like to thank those who provided comments.
The LTER Network has invested considerable time, effort, and funding in the collection of scientific data. Access to and use of this data are formalized through the end user's acceptance of the LTER Network Data Access Policy, Data Access Requirements, and the General Data Use Agreement, which was approved by the LTER Network Coordinating Committee on 6 April 2005. These policies and agreements are motivated by the need to document the flow of data from the LTER Network out to the community in order to validate the broader impacts of the LTER program. As such, the LTER Network has adopted a "standard" for data access and use that now needs to be implemented in both local and Network-wide computing infrastructure. This standard, in simple terms, requires that the end user register basic identifying information, including name, affiliation, email address, and full contact information, in a registry of the LTER Network. Further, acknowledgment and acceptance of either the General Public Use Agreement or any Restricted Data Use Agreement applied to a data set, and a statement of the intended use of the LTER data, will be recorded prior to the release of any LTER data.
This article briefly describes a proposed Network-level architecture that individual sites can adopt to conform to the LTER Data Access Policy. The architecture requires minimal effort on the part of the site, and can be implemented for all or just a subset of the site's data holdings. It is only a "proof-of-concept" and should not be considered the final design.
The LTER Network makes data available through two primary venues:
- Site websites: each LTER site supports an independent website under its own management and provides access to static or dynamically streaming data through a URL
- Direct referencing of LTER data through network-based links (e.g., URL or database connection) that are described in an Ecological Metadata Language (EML) document hosted by one or more Metacat XML database management systems
In either case, access to LTER data is often just a "hyperlink" away. For site-based data access, registration processes and the information collected vary between LTER sites (where registration exists at all), and often require the end user to re-register when accessing new or different data (see the LTER Network Data Access Policy Revision: Report and Recommendations). Notification of a data access event may or may not be furnished to the data owner/provider, and the end user will likely never be given the original data owner/provider contact information for citation purposes. In the case of data access through links in an EML document, the end user is provided site data without any rigorous identification process. If the site performs local event logging when the EML data access takes place, the site is only capable of recording the network address of the computer being used by the end user and, again, may not provide event notifications to the data owner/provider. In some cases, the end user is redirected to a foreign web page that must be navigated further to reach the data. Such efforts to mitigate unrestricted data access often prove fatal to the operation of automated processes. Unfortunately, current site-based or EML/Metacat approaches do not fully meet the requirements of a Network-wide LTER Data Access Policy.
Network-wide implementation of the LTER Data Access Policy demands three functional requirements:
- End user registration for collecting nominal information for entry into a user registry, along with a statement of the intended use for data, and an acknowledgment of acceptance of the General Public Use Agreement. End user registration is assumed to be a one-time step that occurs either at the user's convenience prior to any attempt to access data or by the system invoking the registration process when a non-registered user attempts to access data.
- End user identification to verify user registration and policy acceptance for all data requests. End user credentials must be available to compare against information contained within the user registry. Once verified, a client-side token (e.g., cookie) may be used to automatically identify the end user for future data requests. Strong authentication, such as user verification through a third-party authority, is not assumed for compliance with the LTER Data Access Policy.
- Data access event logging for reporting purposes. All data access events should be recorded in an audit log that includes the identification of the end user, identification of the data accessed, and a date/time stamp of the event. The system portal should provide an interface for the data owner/provider such that events can be queried and viewed, and reports can be generated. In addition, the system should provide real-time or near real-time notifications to the data owner/provider at the time of a data access event. Similarly, the system should also provide pertinent contact information of the data owner/provider to the end user for compliance with the General Use Agreement when data is accessed.
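The three requirements above can be pictured as a small data model. The following Python sketch is illustrative only: the class names, field names, and in-memory storage are assumptions made for this example and are not part of any LTER system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RegisteredUser:
    """Registry entry: identifying information plus policy acceptance."""
    email: str
    name: str
    affiliation: str
    intended_use: str
    accepted_agreement: bool


@dataclass
class AccessEvent:
    """One audit-log row: who accessed what, and when."""
    user_email: str
    data_id: str
    timestamp: datetime


@dataclass
class AccessRegistry:
    """Hypothetical in-memory stand-in for the user registry and audit log."""
    users: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def register(self, user: RegisteredUser) -> None:
        # Requirement 1: one-time registration, refused without acceptance.
        if not user.accepted_agreement:
            raise ValueError("the data use agreement must be accepted")
        self.users[user.email] = user

    def is_registered(self, email: str) -> bool:
        # Requirement 2: identification is a lookup against the registry.
        return email in self.users

    def record_access(self, email: str, data_id: str) -> AccessEvent:
        # Requirement 3: every access by an identified user is logged.
        if not self.is_registered(email):
            raise PermissionError("user is not registered")
        event = AccessEvent(email, data_id, datetime.now(timezone.utc))
        self.audit_log.append(event)
        return event

    def events_for(self, data_id: str) -> list:
        # Simple query a data owner/provider might run for reporting.
        return [e for e in self.audit_log if e.data_id == data_id]
```

A production registry would presumably sit behind a database or directory service rather than Python dictionaries; the point here is only which pieces of information each requirement has to capture.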
The LTER NIS development team has identified a general model for a Network-wide LTER Data Access Policy implementation strategy called the Data Access Server (DAS). The DAS model proposes a centralized NIS service that would perform all necessary policy actions, including the pass-through of LTER site data, on behalf of the site. The pass-through process would rely on the replacement of the URL that references site data with a "proxy" URL that points instead to the DAS hosted by the LNO. The purpose of the DAS is to validate the end user's credentials, thus confirming compliance with the LTER Data Access Policy, before allowing access to any site data. This approach requires the site to register its data URL with the DAS so that a one-to-one correspondence between the data URL and the proxy URL is declared within the server registry. The proxy URL is used in lieu of the actual data URL within any LTER metadata document (including EML) that is published for public viewing. When an end user wishes to download data by selecting the online distribution URL in the metadata document, they would first be directed to the DAS to have their credentials validated before a data stream is returned on the site's behalf. If the end user has not registered at this point, they would be directed to the appropriate registration interface. If they have already registered and there exists a token (e.g., cookie) on their workstation, they would be provided the data without restriction. Otherwise, the end user would be directed to a log-in interface prior to receiving any data. Figure 1 presents an overview of the DAS model network-level architecture.
Figure 1. Conceptual view of the LTER Data Access Server network architecture.
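The three outcomes described for a request arriving at the DAS (registration, log-in, or data delivery) amount to a short decision procedure. The sketch below is a hypothetical rendering of that logic in Python; the function and enum names are invented for illustration, and how the DAS learns whether a visitor is registered is left as an input because the model does not specify it.

```python
from enum import Enum


class Action(Enum):
    REGISTER = "redirect to the registration interface"
    LOGIN = "redirect to the log-in interface"
    STREAM = "log the event and return the data stream"


def route_request(is_registered: bool, has_valid_token: bool) -> Action:
    """Decide what the DAS does with an incoming proxy URL request."""
    if not is_registered:
        # Unregistered users must register before any data is released.
        return Action.REGISTER
    if has_valid_token:
        # Registered user carrying a token (e.g., cookie): no further steps.
        return Action.STREAM
    # Registered, but no token on this workstation: ask for credentials.
    return Action.LOGIN
```

For example, `route_request(True, False)` yields `Action.LOGIN`, matching the case of a registered user visiting from a browser that holds no DAS cookie.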
Any download event invoked by the end user would be logged in an audit record for reporting purposes. At this point, the DAS would send an email notification to the end user with the data owner/provider's contact information, the General Use Agreement, and any special Restricted Data Use Agreement for the specific data set that was downloaded. In addition, the DAS would send a notification of the data download event to the data owner/provider, along with the end user's contact information and the name of the downloaded data set. The DAS model assumes that the site's data URL provides direct access to the data stream over HTTP. Further restrictions on data access can be achieved if the site only allows a specified network address or address range to connect to its data source - in this case, the network address for the LNO DAS.
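The two notifications described above could be assembled along the following lines. This is a minimal sketch using Python's standard `email.message.EmailMessage`; the sender address, function name, parameters, and message wording are all assumptions for illustration, and actually sending the messages (for example through an SMTP server) is left out.

```python
from email.message import EmailMessage

DAS_SENDER = "das@example.org"  # hypothetical sender address


def build_notifications(user_name, user_email, owner_name, owner_email,
                        dataset_name, agreements):
    """Return (message to end user, message to data owner/provider)."""
    to_user = EmailMessage()
    to_user["From"] = DAS_SENDER
    to_user["To"] = user_email
    to_user["Subject"] = f"Data download: {dataset_name}"
    # The end user receives owner contact details and the applicable agreements.
    to_user.set_content(
        f"Data owner/provider: {owner_name} <{owner_email}>\n\n"
        + "\n\n".join(agreements)
    )

    to_owner = EmailMessage()
    to_owner["From"] = DAS_SENDER
    to_owner["To"] = owner_email
    to_owner["Subject"] = f"Data access event: {dataset_name}"
    # The owner/provider learns who downloaded which data set.
    to_owner.set_content(
        f"Data set: {dataset_name}\n"
        f"Downloaded by: {user_name} <{user_email}>\n"
    )
    return to_user, to_owner
```

Building the messages separately from sending them keeps the notification content easy to inspect, and would let a deployment choose between immediate and batched (near real-time) delivery.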
A user interface specifically for registering data URLs would facilitate URL mappings (i.e., from the data URL to the proxy URL). URL registry management tasks would include adding, deleting, or modifying such mappings. This approach is similar to those used by globally unique and persistent identifiers, such as Life Science Identifiers (LSID) and Digital Object Identifiers (DOI), where resolution occurs through an independent service. In this case, however, the DAS would incorporate user identification and audit logging/notification, processes necessary for compliance with the LTER Data Access Policy. Provisions for managing query string parameters appended to the proxy URL would be necessary to support more dynamic data systems that provide content filtering. The URL registry can also be used to report non-functioning data URLs by periodically testing each URL for connectivity.
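A URL registry with these management tasks might look like the sketch below. The class and method names are hypothetical; query string parameters from the proxy URL are simply appended to those already present on the data URL, which is one possible policy among several. The connectivity test is passed in as a function so that the example stays independent of any particular network library.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


class UrlRegistry:
    """Hypothetical one-to-one mapping from a proxy key to a site data URL."""

    def __init__(self):
        self._mappings = {}

    def add(self, key, data_url):
        if key in self._mappings:
            raise KeyError(f"{key} is already registered")
        self._mappings[key] = data_url

    def modify(self, key, data_url):
        if key not in self._mappings:
            raise KeyError(f"{key} is not registered")
        self._mappings[key] = data_url

    def delete(self, key):
        del self._mappings[key]

    def resolve(self, key, query=""):
        """Return the data URL, carrying over the proxy URL's query string."""
        parts = urlsplit(self._mappings[key])
        params = parse_qsl(parts.query) + parse_qsl(query)
        return urlunsplit(parts._replace(query=urlencode(params)))

    def find_broken(self, is_reachable):
        """Report keys whose data URL fails the supplied connectivity test."""
        return sorted(key for key, url in self._mappings.items()
                      if not is_reachable(url))
```

In a periodic job, `is_reachable` could be a function that attempts an HTTP request with a timeout and returns `False` on failure; the keys returned by `find_broken` would then feed a report to the affected sites.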
Functionally, the DAS model would have to support a distributed architecture to ensure high performance and failover. In addition, the DAS architecture would be expanded to deliver both character-based and binary data formats. Flexible MIME (Multipurpose Internet Mail Extensions) type selection would allow the end user to choose how data are to be displayed or opened in a specific application, such as Microsoft Excel.
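As a rough illustration of MIME selection, the sketch below maps a user's presentation choice to HTTP response headers. The choice names and the function are hypothetical; the MIME types themselves are standard registered types. The filename is assumed to be safe to place inside a quoted header value.

```python
# Hypothetical presentation choices mapped to (MIME type, disposition).
PRESENTATIONS = {
    "browser": ("text/plain", "inline"),
    "excel": ("application/vnd.ms-excel", "attachment"),
    "download": ("application/octet-stream", "attachment"),
}


def response_headers(filename, presentation="browser"):
    """Build the headers that tell the browser how to handle the data."""
    mime_type, disposition = PRESENTATIONS[presentation]
    return {
        "Content-Type": mime_type,
        "Content-Disposition": f'{disposition}; filename="{filename}"',
    }
```

With `presentation="excel"`, a browser configured to hand `application/vnd.ms-excel` content to a spreadsheet application would typically open the data there, while `"download"` covers binary formats that should simply be saved.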
Proof of Concept
The NIS development team has deployed a minimal proof-of-concept DAS that uses data made available for the EcoTrends Project prototype web application to demonstrate the use of a proxy URL in place of a data URL. In this case, all URLs displayed within the web application and in the EML metadata documents found within a test Metacat have been replaced with proxy URLs. The proxy URL (Figure 2) points to the DAS and contains an MD5 hash value created from the original data URL (Figure 3) that is used as an index to the site's data URL.
Figure 2. Example proxy URL.
Figure 3. Original data URL.
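The hashing step can be sketched in a few lines of Python. The DAS base URL and the `id` parameter name below are placeholders invented for this example and do not reproduce the proxy URL form shown in Figure 2; only the use of an MD5 hash of the data URL as a lookup index follows the proof-of-concept description.

```python
import hashlib
from urllib.parse import parse_qs, urlsplit

DAS_BASE_URL = "http://das.example.org/das"  # hypothetical DAS address


def proxy_key(data_url):
    """MD5 hex digest of the data URL, used purely as a registry index."""
    return hashlib.md5(data_url.encode("utf-8")).hexdigest()


def make_proxy_url(data_url, registry):
    """Register the data URL under its hash and return the proxy URL."""
    key = proxy_key(data_url)
    registry[key] = data_url
    return f"{DAS_BASE_URL}?id={key}"


def lookup(proxy_url, registry):
    """Recover the site's data URL from the hash carried by the proxy URL."""
    key = parse_qs(urlsplit(proxy_url).query)["id"][0]
    return registry[key]
```

Here the hash serves as a compact, fixed-length key that hides the original data URL from the published metadata; it is not relied on as a security mechanism, since access control comes from the identification step performed by the DAS.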
If the end user is correctly identified through either an initial log-in (Figure 4) or by an existing browser cookie previously loaded by the DAS, the DAS opens a file stream from the site data URL and passes it on to the end user's web browser as if the original site data URL were accessed. New users may register through a typical "forms" web page (Figure 5) that saves their contact information and data use intent statement in a local database. A production system, however, would likely utilize a modified version of the current LDAP user registry. A simple list of data access events (Figure 6) is available through the DAS web site. More detailed user information (Figure 7) is obtained by selecting the user's name in the list. A notification process has not been implemented in this proof-of-concept.
Figure 4. DAS log-in page.
Figure 5. DAS registration form.
Figure 6. DAS audit list.
Figure 7. DAS user information.
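The pass-through step, in which the DAS opens a stream from the site data URL and hands it to the end user's browser, can be sketched as a chunked copy between two file-like objects. This is an illustrative assumption about how such a relay could be written, not a description of the proof-of-concept code; in practice the source might be the response object returned by `urllib.request.urlopen(data_url)` and the sink the web server's output stream.

```python
import io


def relay(source, sink, chunk_size=64 * 1024):
    """Copy bytes from source to sink in chunks; return the byte count."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            # An empty read signals the end of the site's data stream.
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# Demonstration with in-memory streams standing in for the network.
site_data = io.BytesIO(b"year,value\n2006,1.5\n")
to_browser = io.BytesIO()
bytes_sent = relay(site_data, to_browser)
```

Copying in fixed-size chunks keeps memory use bounded regardless of the size of the data set, and works equally for character-based and binary content.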
The DAS model does not require sites to participate or change their current practice of providing direct access to their data. It is a model that may be utilized at the site's convenience, perhaps addressing sensitive or high-profile data first.
The DAS model is not tightly coupled to EML, Metacat, or any other subsystem, and can therefore be used at both the site and Network levels. Figure 1 shows that the proxy URL can be used through links embedded in the EML metadata document residing in Metacat, as part of a data link reference in a separate web application, such as the EcoTrends Project, or simply as a data link provided in an email message.
Since the DAS would run as a centralized (potentially distributed) service at the LTER Network Office, tools and enhancements based on the DAS model would be available to all participating sites, including data access reports that can be perused directly by NSF officials. This can be an effective method for standing groups like the Information Manager Executive Committee or the LTER Executive Board to analyze LTER data access through a single interface.
The DAS model fits nicely within the current LTER LDAP user registry used by the LTER Metacat for user identification. Other Metacat sites (and their users) would not have to conform to the LTER Data Access Policy, but their users would have to register with the DAS before being allowed access to LTER data.
The current DAS proof-of-concept relies on the use of HTTP cookies for identifying registered users. Cookies, when enabled by the web browser, are sent automatically with each client request to the web server (in this case, the DAS). Any other application (e.g., Kepler or MATLAB) that cannot send a cookie would automatically fail the user identification process and not be allowed access to LTER data. A more robust method for providing generic identification would have to be identified.
The DAS model requires sites to change their data access URLs within their EML documents and/or any data references that would be bound by the LTER Data Access Policy.
A new registration interface would be required to collect the necessary Data Access Policy information. This would require users to submit new information into the DAS, even those who are already registered in the LTER LDAP.
The DAS model is one method for sites to easily conform to the LTER Data Access Policy. A fully functioning DAS implementation is expected sometime during 2008. The LTER NIS Development Team welcomes all comments and suggestions for improving this model, and anticipates working closely with beta-sites to evaluate and test the DAS model.