Data Integrity, The Phrase I Don't Hear Enough

Fall 2013

Ryan Raub (CAP)

Data integrity is the maintenance and assurance of the accuracy and consistency of data throughout its life cycle. This aspect of data management is a key component of good archival system practices, and omitting it can have frightening consequences.

One of the scariest scenarios for a data archival system is corruption going unnoticed for a long period of time, perhaps longer than your backup recovery window. Imagine finding out that three years ago someone inadvertently opened a published data file and changed one value. You have now unintentionally distributed this erroneous data to an unknown number of users over the last three years. This is what we want to prevent, by adding assurances to the data you archive so that you can be sure it remains unchanged.

Data corruption can even originate in the file system or operating system, and it's up to your archival practices to discover any changes. You shouldn’t rely on the file system alone for preservation of archival data; file systems are not perfect and do have a small but nonzero rate of problems. As the volume of data increases, the number of these problems increases proportionally. Some file types have redundancy or checksums built into their format (e.g. zip or tar); however, these features are only intended to answer the question "is this a valid file?" and not "has this file changed?".
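To illustrate the difference between the two questions, here is a minimal Python sketch. It assumes a made-up file name (`data.csv`) and made-up contents, and uses the standard library's `zipfile` and `hashlib` modules. Both archives pass the format's own CRC check, yet only a hash comparison reveals that the second one is not the file that was originally published.

```python
import hashlib
import io
import zipfile


def make_zip(payload):
    """Build an in-memory zip archive holding one hypothetical file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("data.csv", payload)
    return buffer.getvalue()


def zip_is_valid(zip_bytes):
    """Answer "is this a valid file?" using the CRCs stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # testzip() returns None when every member passes its CRC check.
        return archive.testzip() is None


original = make_zip(b"site,value\nA,12.5\n")
edited = make_zip(b"site,value\nA,13.5\n")  # one value changed, then re-zipped

# Both archives are internally consistent...
print(zip_is_valid(original), zip_is_valid(edited))

# ...but comparing hashes shows the archive is not the one we started with.
print(hashlib.sha1(original).hexdigest() == hashlib.sha1(edited).hexdigest())
```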

A very common way of checking the contents of a file (or folder) for changes without duplicating the data is to compare a computed hash value (or checksum) of the data with a previously recorded hash value. There are several standard hashing functions, each with pros and cons, but for the scope of this article I’m going to recommend the SHA-1 hashing algorithm. These hashing functions always return a fixed-size output (for SHA-1, a 40-character hexadecimal string) given any input file, regardless of the file size. Additionally, two files of any size that differ by only one bit will, for all practical purposes, produce different outputs. So even the smallest change in a terabyte-sized file is noticeable.
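As a rough sketch of these two properties, the Python snippet below uses the standard `hashlib` module. The function names and sample data are illustrative assumptions, not part of any particular archive's tooling. The file version reads in chunks so that even very large files do not have to fit in memory.

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file, reading it one chunk at a time."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha1_of_bytes(data):
    """Return the SHA-1 hex digest of an in-memory byte string."""
    return hashlib.sha1(data).hexdigest()


# Hypothetical sample data; the altered copy differs by a single bit.
original = b"date,value\n2013-09-01,12.5\n"
altered = bytes([original[0] ^ 0x01]) + original[1:]

print(len(sha1_of_bytes(original)))                       # 40
print(sha1_of_bytes(original) != sha1_of_bytes(altered))  # True
```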

If you want to start data integrity checks for a small volume of files (less than one gigabyte), you can easily use a version control system like Git to store prior versions of files and their hashes. Git is a powerful tool and I would recommend it to anyone who wants a simple way to “keep tabs” on a directory. The limit of less than one gigabyte per directory (repository) is just a rule of thumb; Git doesn’t impose any hard limit. However, if you are beyond that, there are probably better ways to achieve this goal.
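Git is a natural fit here because it identifies every file's contents by a hash. The sketch below reproduces, in Python, how Git computes the ID of a file's contents (a "blob") in its default SHA-1 object format: it hashes a short header followed by the raw bytes. The function name `git_blob_id` is a hypothetical helper for illustration; in practice you would let Git do this for you.

```python
import hashlib


def git_blob_id(content):
    """Compute the ID Git assigns to file contents (SHA-1 object format)."""
    # Git hashes the header "blob <size in bytes>\0" followed by the contents.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The well-known ID of an empty file in any SHA-1 Git repository.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the ID depends only on the contents, any change to a tracked file produces a different ID, which is how Git can report that the file was modified.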

If you cannot split your data holdings into directories of less than one gigabyte, you can use some simple command-line tools to create a list of the files and their checksums (commonly referred to as a manifest). I’ve put together a simple Linux Bash script to generate and compare a list of checksums for a directory. This won't tell you what has changed in a file, only that the file has changed.
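The manifest idea can be sketched in a few lines of Python. This is not the Bash script mentioned above, only an illustration under assumed names (`build_manifest`, `compare_manifests`, and the sample file `site_a.csv` are all hypothetical). It records a SHA-1 hash per file and then reports which paths were added, removed, or changed between two runs.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file, reading it one chunk at a time."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its SHA-1 hex digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            manifest[os.path.relpath(full_path, root)] = sha1_of_file(full_path)
    return manifest


def compare_manifests(old, new):
    """Report which paths were added, removed, or changed between manifests."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(path for path in shared if old[path] != new[path]),
    }


if __name__ == "__main__":
    # Demonstration on a throwaway directory with one hypothetical data file.
    with tempfile.TemporaryDirectory() as root:
        data_file = os.path.join(root, "site_a.csv")
        with open(data_file, "w") as handle:
            handle.write("date,value\n2013-09-01,12.5\n")
        baseline = build_manifest(root)

        # Simulate an inadvertent edit of a single value.
        with open(data_file, "w") as handle:
            handle.write("date,value\n2013-09-01,13.5\n")
        print(compare_manifests(baseline, build_manifest(root)))
```

In real use the baseline manifest would be saved somewhere outside the directory it describes (for example as a JSON or text file), so that a change to the data cannot silently alter the manifest as well.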

As data volumes grow, you’ll need to scale your tools accordingly. However, the principles remain the same: periodically compute a hash of each file and see whether it has changed. You can even use hashing functions in databases to catch changes within tables.
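The same principle applied to a database table might look like the sketch below. Many database engines offer their own hashing functions; to stay self-contained, this example instead hashes the rows on the client side using Python's built-in `sqlite3` and `hashlib` modules. The table name, columns, and `table_digest` helper are assumptions made up for illustration.

```python
import hashlib
import sqlite3


def table_digest(connection, table, order_by):
    """Return a SHA-1 hex digest over all rows of a table, in a fixed order."""
    digest = hashlib.sha1()
    # The table and column names are assumed to be trusted identifiers here;
    # they are not safe to take from user input.
    query = f"SELECT * FROM {table} ORDER BY {order_by}"
    for row in connection.execute(query):
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical table with two observations, held in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observations (id INTEGER PRIMARY KEY, site TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [(1, "A", 12.5), (2, "B", 7.25)],
)

before = table_digest(conn, "observations", "id")
conn.execute("UPDATE observations SET value = 7.26 WHERE id = 2")
after = table_digest(conn, "observations", "id")

print(before != after)  # True
```

Reading the rows in a fixed order matters: without the `ORDER BY`, the same data could be returned in a different order and produce a different hash even though nothing changed.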

Other data archival systems, such as NASA's Planetary Data System (PDS), have standards that require the hash to be stored in the metadata for each dataset. They even go as far as to require data integrity checks to be run over their entire data holdings on a monthly basis to ensure that nothing gets altered. Granted, the PDS operates at a much larger scale than the LTER, but the goals are the same. Perhaps we should consider adopting storage of data file hash values as part of our data management best practices.