
Reducing data complications while capturing data complexity

Spring 2011

Margaret O'Brien (SBC) and M. Gastil-Buhl (MCR)

The tasks of organizing data for publication are often lumped together as "data cleanup", which implies the process is one of labeling some columns and dealing with some bad values. In fact, outlining, clarifying and describing a single data product can require weeks of iteration and communication. Of particular interest to the LTER are time-series data, or data that could be organized into a time series. These datasets often comprise multiple tables, or contain fields that were merged and/or split over time. The longer the time series, the more likely it is that complications or subtle issues have been introduced.

First, a few definitions. Something that is complex is intricate and subtle, but it is also well organized and logically constructed. A thing that is complicated, in addition to being fundamentally intricate, is irregular, perverse and asymmetrical. “Complex” is the more formal and technical term (e.g., a mathematics problem), while something like a personal life can be “complicated”. We looked long and hard for a term for the process of un-complicating something. The closest we came was “explicate”. Other terms come to mind (e.g., untangle, explain), but these imply a simpler process. Explication implies that the subject or thing is more complicated or detailed.

Ecological data tends to be both complex and complicated.  Complications arise from unavoidable realities that cause missing data or inconsistencies. Instruments break. Cyclones wash away plots. Maintaining a regular structure from irregular input requires coding those complications as missing or flagged values. Quality controls identify complications in the data.  In an ideal hypothetical experiment, there are no complications.  Ecology is complex so naturally the data from ecological observations are complex.  Organisms are classified in hierarchical taxonomy.  Sampling is structured into points along transects within sub-sites within super-sites. Instrument calibrations at a single time point apply to continuously collected data. Capturing that complexity in a model allows manipulations of the data, and in our ideal hypothetical experiment, we still have all the complexity.
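To make "coding those complications" concrete, here is a minimal sketch of enforcing a regular daily structure on irregular observations. The missing-value code and the flag labels are our own illustrative assumptions, not a site standard:

```python
from datetime import date, timedelta

MISSING = -9999.0  # assumed missing-value sentinel; sites define their own

def regularize(observations, start, end):
    """Return one (date, value, flag) row per day in [start, end],
    coding gaps (broken instrument, washed-out plot) as missing."""
    rows = []
    d = start
    while d <= end:
        if d in observations:
            rows.append((d, observations[d], "ok"))
        else:
            rows.append((d, MISSING, "missing"))
        d += timedelta(days=1)
    return rows

# Irregular input: the March 2 observation was lost
obs = {date(2011, 3, 1): 12.4, date(2011, 3, 3): 11.9}
table = regularize(obs, date(2011, 3, 1), date(2011, 3, 3))
```

The regular structure is what downstream tools expect; the flag column preserves the knowledge of why a value is absent.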

Every data cleanup project is unique but they all have some basic activities in common, which we group into two types: data-explication and code-creation. "Data-explication" is iterative, labor-intensive and mostly manual. It includes extraction and organization of details, and usually its solutions are unique. Data-explication will always be required. Code-creation, on the other hand, is not always necessary; e.g., if an adequate database exists, data might be entered manually. But code can help immensely, and if it's well planned even ad hoc code can be reused or adapted. Different skills are required for the two types of activities: a good explicator will understand the data issues but may not be able to write adequate code (although s/he must be able to organize details with code in mind). Conversely, a professional programmer is unlikely to have scientific training, and so may not be able to ask the right questions to organize the data.

We think of the process as having three major steps in a cycle, usually with several iterations (Fig. 1).

Figure 1. The data cleanup process. Cyclical explication steps are in blue. Code-creation is in red, and where optional, the lines are dotted.

Step 1. Capture the researcher's knowledge. Usually, this starts with a group of data tables or other files. The researcher and/or field technicians supply information about the data such as the names of the measurements, the sampling methods, and the project's goal and design. Some basic info may be captured with a form, but this is usually followed by discussion.  Usually, the older the data are, the more effort this step requires. The asynchronous nature of communication between information managers and researchers (e.g., by email) also adds complications to this step.

Step 2. Scrutinize the actual data. Someone must examine the data closely, plotting or tabulating where necessary. S/he looks for inconsistencies in names, categories, data typing, units and precision, and also for shifts in the apparent scope, methodology or ranges of values. One form of scrutiny is to model the data as a relational database schema, and ensure that the data are clean enough to be uploaded into those tables. Some work can be automated as a workflow or at least a script, e.g., bounds checking, or aggregate counts to verify completeness.
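A minimal sketch of the kind of scrutiny script mentioned above (bounds checking plus aggregate counts), assuming hypothetical column names and plausible-range limits:

```python
from collections import Counter

# Illustrative rows; in practice these come from the submitted data files
rows = [
    {"site": "A", "date": "2011-03-01", "temp_c": 12.4},
    {"site": "A", "date": "2011-03-02", "temp_c": -85.0},  # suspect value
    {"site": "B", "date": "2011-03-01", "temp_c": 11.9},
]

# Assumed plausible range for this measurement; set per site and variable
BOUNDS = {"temp_c": (-40.0, 50.0)}

# Bounds checking: flag rows whose values fall outside the stated range
out_of_bounds = [r for r in rows
                 for col, (lo, hi) in BOUNDS.items()
                 if not lo <= r[col] <= hi]

# Aggregate counts per site, to compare against the sampling design
counts = Counter(r["site"] for r in rows)
```

Discrepancies surfaced here (the -85.0 reading, or a site with fewer rows than the design calls for) become the specific questions for the reconciliation step.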

Step 3. Reconcile the researcher's perception of the data with reality. Every researcher thinks his/her dataset is less complicated than it actually is. To reconcile, the IM/explicator must be able to ask very specific questions to clarify the discrepancies that became evident under scrutiny. The reconciliation step will result in some data being explained, but will also probably start a new iteration of the cycle.

After all aspects of the data have been explicated, we can move on to the easy part: standardize where possible, re-label the columns and deal with those bad values. If, after this is done, the data can be modeled into a relational database and the tables uploaded, then this is likely to be sufficient. Finally, when the data package is congruent and complete enough to support synthesis, we are done. In general, we've found that the entire process is much simpler when the goal is to add more data to a pre-existing type than when preparing a new dataset for publication.
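The relabeling-and-bad-values step can be sketched as a small mapping pass; the legacy column names and sentinel codes below are assumptions for illustration only:

```python
# Assumed legacy-to-standard column name mapping and bad-value codes
RENAME = {"TEMP": "temp_c", "DT": "date"}
BAD_CODES = {-9999.0, 999.9}

def clean_row(row):
    """Relabel columns and convert sentinel codes to an explicit None."""
    out = {}
    for old_name, value in row.items():
        new_name = RENAME.get(old_name, old_name)  # pass unknown names through
        out[new_name] = None if value in BAD_CODES else value
    return out

cleaned = clean_row({"TEMP": -9999.0, "DT": "2011-03-02"})
```

Because this step is mechanical once the mapping is agreed upon, it is the part of the cycle most worth automating and reusing across datasets.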

As data publishers, we've all seen users bypass our data catalogs and go straight to the researchers to learn more about a data set. The network also has the experience of EcoTrends, where one person manually scrutinized nearly all the incoming data, in much the same way as described here. The investment in data cleanup and clarification is particularly important where the goal is to integrate or synthesize. If our data are to be centrally available and in a usable state, then we cannot afford to skip steps or iterations in this process, nor should we underestimate the time it takes. However, it is probably impractical for every dataset to get this treatment, and priorities should be set which reflect our goals and offer the best return on investment.