
The Future of Archiving (Research) Data

Issue: Spring 2014

Ryan Raub (CAP)

 

The biggest concern for long-term data archiving is preservation: how to continually store more and more. With an ever-growing volume of data, we need storage systems that can grow proportionally. Once data collections start to span multiple machines, the existing methods for distributing resources create unnecessary and ever-increasing management overhead.

Data storage systems have been shifting away from managing single sources and toward more organized, distributed file systems. Distributed models are inherently more complicated, but abstraction layers can hide some of this complexity and in turn give us more powerful tools to work with. These systems can also provide big advantages in durability, performance, and capacity, all of which are very desirable for data preservation.

Let's talk about some of the abstractions that get introduced. The first is a basic one: how do you identify a file? On your personal computer you can uniquely identify a file by its name and its location. In a distributed system, that same file exists on multiple computers in different places. What we can do instead is identify a file by its contents rather than by where it is. It is a little counterintuitive to identify a file by its contents because of its size, but we address this by using its contents to compute a (fixed-size) hash [1] as a derived, unique representation of it. Once we make the transition to working with files based on their hash values, we can identify and talk about a file without needing to know its actual location(s).
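To make the idea concrete, here is a minimal sketch (not taken from any particular archive system) of how such a content-derived identifier could be computed in Python with SHA-256; the file name is purely illustrative.

    import hashlib

    def content_id(path, chunk_size=1 << 20):
        """Derive a fixed-size identifier for a file from its contents alone."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in chunks so arbitrarily large files fit in constant memory.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Two bit-identical copies yield the same identifier, regardless of their
    # names or which machines they live on.
    print(content_id("2014_site_temperatures.csv"))  # hypothetical file name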

Now that you have this hash identifier for a file, how would you ask for the file itself? Popular existing methods of file transfer rely on a URI [2], which we can and should still use. But we can do better: instead, we make a request to a "matching service" (which runs on all of the nodes of the data network) that can tell us which computers have this file. There will be several sources for every file (redundancy is a requirement), and we can leverage this internal redundancy to aid performance by asking each source for a different part of the file at the same time. Once you collect all the pieces, you can recombine them and verify that the end result is what you requested by its hash value.
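As a rough sketch of how a client could use such a matching service, assume a hypothetical HTTP endpoint that returns the file's size and a list of source nodes; none of the URLs or field names below come from a real system.

    import hashlib
    import json
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    MATCHER = "http://matcher.example.org/locate"  # hypothetical matching service

    def find_sources(file_hash):
        """Ask the matching service which nodes hold this file and how big it is."""
        with urllib.request.urlopen(f"{MATCHER}?hash={file_hash}") as resp:
            return json.load(resp)  # assumed shape: {"size": ..., "nodes": [url, ...]}

    def fetch_range(node_url, start, end):
        """Download one byte range of the file from a single source node."""
        req = urllib.request.Request(node_url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    def fetch_and_verify(file_hash):
        info = find_sources(file_hash)
        size, nodes = info["size"], info["nodes"]
        # Give each source node a different byte range so they all serve a
        # part of the file at the same time.
        step = -(-size // len(nodes))  # ceiling division
        ranges = [(i * step, min((i + 1) * step, size) - 1) for i in range(len(nodes))]
        with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
            parts = list(pool.map(lambda p: fetch_range(p[0], *p[1]), zip(nodes, ranges)))
        data = b"".join(parts)
        # The identifier doubles as an integrity check on the reassembled file.
        if hashlib.sha256(data).hexdigest() != file_hash:
            raise ValueError("reassembled file does not match its identifier")
        return data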

Adding capacity to this system is as simple as adding new nodes [3] or expanding allocations on existing nodes. If we needed to provide someone with faster access to this data network, all we would need to do is create new nodes near their network and optionally set priorities so that the data relevant to them is replicated there.

This underlying distributed file system still needs to rely on search engines for discoverability; how else are users going to know that a file exists, let alone which file they want? This is an interesting and similarly complicated part of the system that is still being refined and developed today.

I would also like to comment on other popular reference systems; we are currently in the golden age of DOIs [4]. These provide a great authoritative source and a matchmaking resolution service that can be maintained. However, they do not provide any way to prove that what they directed you to is what the author actually intended to reference. With just a few more characters, a citation could include a hash value of the referenced file, uniquely identifying it and providing a method of verification. Another desirable by-product of this technique is that versioning becomes a non-issue: a particular file's hash will only ever identify that same file, and any change, no matter how small, yields a different hash.
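Verifying such a reference would take nothing more than the same hash computation shown earlier; in this sketch the cited value is a placeholder rather than a real hash, and the file name is made up.

    import hashlib

    # Hypothetical reference: a DOI plus the SHA-256 value of the exact file cited.
    cited_hash = "<sha-256 value published alongside the DOI>"  # placeholder

    def matches_citation(path, cited_hash):
        """Check that a retrieved file is exactly the file the author cited."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest() == cited_hash

    print(matches_citation("downloaded_dataset.csv", cited_hash))  # illustrative file name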

As an example of how this system could be applied to create a data network for the LTER: each site would provide one or more nodes to host its own site's data (given higher priority) and some amount of data from other sites (parity). As a whole, the network would benefit from having many geographically separated, resilient copies of every version of each file, managed automatically. The network improves its disaster-recovery potential as well as its overall performance, availability, and storage capacity.
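One minimal, hypothetical way a site's node might express those priorities in its configuration (the field and collection names are made up for illustration):

    # Hypothetical configuration for a single LTER site's node.
    node_config = {
        "node_id": "site-archive-01",      # illustrative name
        "capacity_gb": 8000,
        "replication_priorities": [
            {"collection": "own-site",    "priority": 1},  # this site's own data first
            {"collection": "other-sites", "priority": 2},  # then parity copies for the network
        ],
    }

    def replication_order(catalog, config):
        """Sort the network's file catalog so higher-priority collections fill this node first."""
        rank = {p["collection"]: p["priority"] for p in config["replication_priorities"]}
        fallback = max(rank.values()) + 1
        return sorted(catalog, key=lambda item: rank.get(item["collection"], fallback))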

These systems already exist, but their adoption in the research community has been slow. With increasing demands and growing amounts of research data, the time to adopt these types of systems is now, before we have serious issues. There are many more aspects of data archival that I don't have room to address in this article, such as file formats, standards, and compatibility. If you are interested in this topic and want to participate, feel free to contact me.






Footnotes:

1 - More about hashing: http://en.wikipedia.org/wiki/Secure_Hash_Algorithm

2 - Uniform Resource Identifier: http://en.wikipedia.org/wiki/Uniform_resource_identifier

3 - Node: A single computer that contributes computational, storage, and network resources by participating in a larger network of nodes.

4 - Digital Object Identifier: http://www.doi.org/