
Efficient Data Curation with the EML Congruence Checker

Issue: 
Fall 2013

Margaret O'Brien (SBC)

First envisioned at the ASM in 2009 and in production for a year now, the EML congruence checker is paying off. It’s something we can point to proudly as a Network: a tool that makes data curation more efficient and helps ensure higher quality at all sites.

A goal at SBC was to have all data packages uploaded to PASTA by the end of 2013, so I’ve spent much of December immersed in data, EML, and the data package congruence checker. Today, SBC has 156 data packages with 286 entities; 80% of the entities are EML “dataTable”. Many are legacy packages that were originally built with Morpho (before I started at SBC), and recently imported into Metabase (by me). I had anticipated that getting through them all would be a grueling slog. Instead, the most intense work took only about 5 days, and everything has made it through.

Having helped write the requirements, I knew that the congruence checker was up to the task. But what I didn't know until I immersed myself in the reports was how thoroughly and carefully Duane Costa had answered our requests for certain features. Duane has written the most submitter-friendly software that I could imagine.

Sometimes finding errors is like peeling an onion: you find one error, fix it, and then another one is exposed. So the IMC asked Duane for a feature: "during evaluate, don’t stop on the first error. Tell the user as much as possible about the data package before it stops". Duane made this work. For example, when you run in "evaluate" mode, you might learn that you have a) invalid EML, b) no dataset abstract or keywords, and c) the EML metadata lists 11 dataTable attributes, but 3 of the rows have only 10 fields. In typical checking schemes, the invalid EML would have stopped the process. After I’d fixed that problem, I’d then see the first short-row error. Then I’d have to process the package twice more to see the other two errors. But now with the congruence checker, you don’t have to peel the onion. Duane’s code does it for you, and has given us a terrific advantage.
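
The "don’t stop on the first error" idea can be illustrated with a short Python sketch. This is not Duane’s code or the checker’s actual interface. The function names, the plain-dictionary stand-in for parsed EML metadata, and the message wording are all assumptions made for illustration. Each check returns a list of problems, and the evaluate step gathers them all into one report.

```python
import csv
import io


def check_has_abstract(metadata):
    """Return a list of problems; an empty list means the check passed."""
    return [] if metadata.get("abstract") else ["dataset has no abstract"]


def check_has_keywords(metadata):
    """Hypothetical check: the dataset should list at least one keyword."""
    return [] if metadata.get("keywords") else ["dataset has no keywords"]


def check_field_counts(metadata, data_text):
    """Compare each data row's field count with the declared attribute count."""
    expected = len(metadata.get("attributes", []))
    problems = []
    rows = csv.reader(io.StringIO(data_text))
    for row_number, row in enumerate(rows, start=1):
        if len(row) != expected:
            problems.append(
                f"row {row_number}: expected {expected} fields, found {len(row)}"
            )
    return problems


def evaluate(metadata, data_text):
    """Run every check and collect all problems instead of stopping at the first."""
    report = []
    report += check_has_abstract(metadata)
    report += check_has_keywords(metadata)
    report += check_field_counts(metadata, data_text)
    return report


# Hypothetical package: the abstract is missing and the second row is short.
metadata = {
    "abstract": "",
    "keywords": ["kelp"],
    "attributes": ["site", "date", "count"],
}
data_text = "A,2013-01-01,4\nB,2013-01-02\n"

for problem in evaluate(metadata, data_text):
    print(problem)
```

Because no check raises an exception on failure, a single pass reports both the missing abstract and the short row, which is the behavior the IMC asked for.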

Yes, some of the error messages are still a little cryptic, but it’s the IMC’s job to come up with the most useful language for those, not Duane’s. And now that we have a framework, we can start the exciting part: validating ranges of measurements, or providing descriptive summary statistics of the data values themselves, something a data consumer will appreciate.

The checker is helping us reduce the cost of curating dataset updates. We can now hand off the task of evaluating datasets prior to submission to a part-time assistant, and have a high level of assurance that a dataset that passes “the checker” is structurally correct. This code was written as part of the EML suite of utilities, which means it can be used by anyone needing to proofread EML datasets, not just LTER.