Skip to Content

Addressing Scaling Issues Associated with Data Access

Printer-friendly versionPrinter-friendly version
Issue: 
Fall 2010

James Conners (CCE, PAL)

Overview

One of the primary features of our local data system, DataZoo, is providing queryable access to data sets. During the last year, however, we began to face two main issues with some of our larger data sets. The data accessed from DataZoo are stored in a relational database. From a web form, queries can be built and submitted for any data set. At some point, our original implementation for these queries began to strain as the amount of data stored scaled upward. Because both the query against the data base and the retrieval of results were all being performed during a single request, some queries to our larger data sets were resulting in browser timeouts. In addition, in order to provide paging capabilities for previewing the data results the queries had to be buffered through application code to be able to get information like the number of records returned, causing queries to consume large portions of server memory and to affect performance. This article describes the approach we took for both addressing these immediate issues and planning for future system architecture development.

Design

The approach taken for the redesign of our data queries was to frame it as a local web service with two primary characteristics: A level of abstraction that generalized queries to data in different storage back- ends across our system and an asynchronous client-server communication protocol to eliminate web-server execution time limits. The idea was that this web service would provide a generic interface for querying and accessing tabularly represented data, utilizing a set of defined resources that characterized a suite of activities useful for asynchronous interaction throughout the data query workflow. Below is the description of that workflow, with each step also generally defining the set of resources making up the web service as a whole:

  1. Retrieve table and field information for selected data storage back-end
  2. Register query
  3. Intermittently check status of query, including data result information if available
  4. Access data result

Although steps 1-4 comprise the entire workflow as defined for client interaction, only steps 2-3 are coordinated within the server application through the use of a message table. The message table supports communication between the registration, query, status and access processes by providing state and summary information. Each process works relatively independently, retrieving decision logic from the message table and initiated by client requests, with the exception of the query, which is started by the registration process.

Use Case

The primary use case we used during development is described as follows. The client (a web browser, typically) retrieves table and field information from the info resources and uses it to build a form for a user to construct valid queries against a particular storage back-end. The query string is constructed from the form fields and sent to the registration resource, immediately receiving a status object with a unique identifier. The client then intermittently makes requests to the status resource using this ID for updates on the current state of processing, along with other details such as a descriptive message. If the query is successful, the next time the client requests the status of the query it will receive a message within an updated status along with additional information describing the data result, i.e. its location, size and expiration. The client can then retrieve the data using the access resource, which supports result retrieval with some optional parameters that specify range selection and specifics for field and file names.

Details

The following presents a technical description of the workflow from the service perspective, from steps 2 onward. When the service receives a request for the registration of a query (via the register resource), it creates a message with an initial state of “processing” and sends back a uniquely identified status object describing that message. It also initiates the query process, passing it the ID of the message to refer to. The query process begins execution using information stored in the message. At any time during this workflow, the client is able to access the status resource to retrieve the current state of the workflow. When the query is finished, or upon error, the process updates the message and exits. If the query was successful, a data result was written to disk as a CSV-formatted file. All information about the data result is obtained by the status resource using this file, precluding any data-result buffering with application code. Data results are then retrieved from the access resource. This resource is a layer of abstraction over the data results that provides added functionality like range specification, file name options and alternative field names. Data is always returned as CSV. Later, another web service was developed over the top of this access resource to provide re-formatting of the data result as Excel spreadsheets and other formats. Set in the configuration of this service is the time-to-live for these data results, currently set as 24 hours for our system. If either the status or access resources are accessed using a query ID after this period, the message is analyzed and appropriately set to expired and the client is notified. Below is a sequence diagram illustrating this interaction.

Data query sequence diagram

Figure 1. Sequence diagram showing query and response between application layers.

Architecture Implementation

The data access service was developed with the idea that each data source to be supported could be hooked in to the service through modularization. Each data source module is an abstraction of tabularly formatted data that could be queried using a generic syntax. The actual storage technology doesn’t need to represent the data this way, so long as the abstraction represents them this way. For example, an abstraction of a relational model could be a set of tabular views into the data accessible through the source’s module. For DataZoo the implementation was relatively simple, since the current back-end stores each dataset in a MySQL table. For another project we support, the module abstracted a set of XML files as a table, one record per file. In this way we are now able to query two relatively disparate data sources using a similar syntax and client-server protocol. Additional development was done in order to re-design an existing plotting web service to work with the data access service. The plot service now takes the unique identifier created during a query, along with a set of plotting parameters, and creates a graphic using the access resource of the service to retrieve the data result. It provides the capability within our architecture to create interfaces for querying and plotting data across multiple sources using a single client interface.

Summary

The development of the data access service has helped us move toward providing a common interface for interacting with data that require varying representative models. The asynchronous character of the service provides a stable client-server interaction independent of other application-specific constraints, like web server timeouts. And the stored data result simplifies post-query interactions like viewing, plotting, downloading and re-formatting. We look forward to hearing and learning from other sites’ experiences and approaches to scaling issues associated with data access.