Integrity Checks For Large Scale Databases
Editor's note: Kali Abel runs a small data management company based in Alaksa named Turnagain Consulting. As a fellow data afficianado we have invited Kali to submit content to this issue of Databits, both as an introduction as well as to provide the perspective of outside commentary.
I recently worked on a project that involved a large influx of data coming in from a 620,000 square mile site, derived from 900 different research work plans, 1500 researchers, and a preposterous number of samples and resulting data. Those data flooded in beginning 4 years ago and continue to come in steadily; all the while additional notes, instrument records, photos, coordinates, and corrections continue to be added. Although one could spend hours thinking about the technological processes that go into entering, storing, distributing, and describing such a large data set, perhaps the most interesting aspect of the project is the challenge of preserving data integrity when those data come in such large quantities.
For this particular project we designed and implemented various analyses that we termed "integrity checks". Because it was unrealistic to analyze each sample for accuracy due to the enormous size of the data set, it became important to find a way to screen for errors in a more efficient manner. Nearly all samples coming in were linked with a sample time and a coordinate location. Even when sample types varied, descriptive information of each sample was consistent. As can be common with field data, errors often came from transcription between field collection forms and data entry into a database form. This can mean something as simple as a latitude having two numbers reversed in the decimal degrees, or a time being recorded as AM instead of PM. To create robust data sets, screening for these errors and proposing corrections to the researchers was essential.
One such integrity check focused on reasonable velocities. For samples collected from a boat, a reasonable travel velocity was considered to be 20 miles per hour or less (given the information of the kinds of boats being employed for sampling). Using simple coding and a little bit of math, we were able to determine whether the time and distance between successive samples during a trip required a reasonable velocity. To do this we looked at the time between successive samples and the distance between coordinates using the Haversine formula (which calculates distances between two points on a sphere). All instances where velocities exceeded 20 mph between successive sample locations were flagged for further review. As a result, 498 samples were flagged out of a set run of nearly 100,000.
Additional integrity checks included an "instantaneous displacement" query which flagged all samples that had the samplers in two locations more than half of a mile apart at the same moment in time, and an average displacement query which flagged all samples that were over twice the standard deviation of average distance between all locations in a trip.
These integrity checks flagged numerous samples that could have otherwise been easily overlooked, potentially creating a domino effect of compounding errors as the data is searched and used by other researchers in their work. Integrity checks such as these allow large data sets to be analyzed for consistency more efficiently and more systematically than can often be done with large data sets. The design is often simple but the outcome allows for more interaction with research groups and a more robust and reliable data set.