Data Integration Experiences
James Connors (CCE, PAL)
Information Management for the CCE and PAL LTER sites has had the opportunity to work on a small set of projects dealing with data integration. These experiences have shaped our current data management workflow and data integration approaches. In general, these experiences have been limited to integrating data long after a research plan was designed and data collection carried out, so they may not apply to data collected under protocols designed from the outset to support integrated data products.
One of the first approaches to data integration that we implemented was based on simple database table joins. Within our data system interface, a user could select a dataset to begin with (the left-hand table) and another to be joined to it (the right-hand table). Dropdown selections let the user build the join criteria, specifying which fields in the two datasets would be matched. One of the major criticisms we received after this implementation was that it was difficult to use for anyone who didn't know the specifics of how the data were related (e.g., which fields served as shared indexes). Also, because the datasets stored in the system are combined across sampling studies, users needed to be aware of possible differences across studies: particularities of a specific sampling study may have affected the indexes, and these differences need to be understood to properly assess the quality of a join. In addition to these conceptual problems with the interface's design, which could probably have been addressed, there were issues related to storage types. The table joins ran into problems when the storage types of index fields varied across datasets (e.g., strings vs. numeric types). This table-joining utility was eventually removed from the system. As a first approach, though, it gave us insight into a number of specific issues related to integrating data, foremost that data integration is more than a technical problem, requiring a level of knowledge of the data that couldn't be generalized.
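The index-mismatch problem can be sketched in a few lines. This is a minimal illustration with made-up station data, not the actual CCE/PAL datasets: the shared key is stored as a zero-padded string in one dataset and as an integer in the other, so an equality join silently matches nothing until the keys are normalized.

```python
# Illustrative data only: station -> measurement lookups from two
# hypothetical datasets whose shared index is stored differently.
casts = {"01": 14.2, "02": 13.8, "03": 15.1}   # station (str) -> temperature
tows = {1: 0.8, 2: 1.1, 4: 0.5}                # station (int) -> biomass

# A naive key-equality join silently matches nothing: "01" != 1.
naive = {k: (casts[k], tows[k]) for k in casts if k in tows}
assert naive == {}

# Normalizing both keys to a common type first recovers the real overlap.
casts_n = {int(k): v for k, v in casts.items()}
joined = {k: (casts_n[k], tows[k]) for k in casts_n if k in tows}
# Stations 1 and 2 join; stations 3 and 4 remain unmatched.
```

A generic join interface has no way to know that this normalization step is needed, which is exactly the kind of dataset-specific knowledge the utility could not supply.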
A subsequent approach to data integration benefited from this experience. As part of our support for the CCE LTER project, a database programmer in our group, Mason Kortz, worked on a project to develop a system for providing integrated data results from two research collections, the Brinton and Townsend Euphausiid database and the Marinovic Euphausiid database. For this project, Mason worked closely with the scientists involved in order to narrow the scope of integration across the two data sources, and defined many parameters that were built into the system's functionality to accommodate known issues and differences across the data collections. This project produced a functional system (https://oceaninformatics.ucsd.edu/cequi/) that successfully provided integrated data results across projects. The approach taken here was the reverse of the simple table-join interface described above. To begin with, the developer worked closely with the scientists to understand the data and to narrow and define the context for integration before a technical approach was designed. The data, their condition, and their intended use provided the realistic scope that guided the design of the system.
Currently, within our primary data system (Datazoo), we are treating the issue of data integration differently. Datazoo serves a wide variety of data across projects and scientific domains, and provides a set of generalized interface utilities for searching, browsing, downloading, previewing, and plotting datasets. Because of this, narrowing the scope and developing a well-defined context for integration across the catalog of datasets is not feasible. With a recent redesign of our data management workflow, a different approach was taken: relational data across projects are managed in pre-publication databases, where data managers can work with the data and understand their characteristics, and how they integrate, before they are published to our public system. With this setup we can define integrated products, using database and programming tools, that are then reviewed for quality issues before publication. The workflow for producing these integrated products is saved and can be rerun after new data are added to the source datasets. For now, this seems to be the best approach for maintaining data quality as well as a generalized data interface that can accommodate a growing catalog of datasets.
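A saved, rerunnable integration step can be sketched with a database view. This is a minimal, assumption-laden example (the table, column, and view names are illustrative, not Datazoo's actual schema): the integration logic is stored once in the pre-publication database, the product can be reviewed before publication, and rerunning it after new data arrive requires no additional work.

```python
import sqlite3

# Hypothetical pre-publication database with two related datasets.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE chlorophyll (cruise TEXT, station TEXT, chl REAL)")
con.execute("CREATE TABLE nutrients (cruise TEXT, station TEXT, no3 REAL)")

# The integration logic lives in the database as a saved view, so the
# integrated product can be reviewed for quality, then regenerated at any time.
con.execute("""
    CREATE VIEW integrated_product AS
    SELECT c.cruise, c.station, c.chl, n.no3
    FROM chlorophyll c
    JOIN nutrients n ON c.cruise = n.cruise AND c.station = n.station
""")

con.execute("INSERT INTO chlorophyll VALUES ('P1908', '010', 0.42)")
con.execute("INSERT INTO nutrients VALUES ('P1908', '010', 12.3)")
assert con.execute("SELECT COUNT(*) FROM integrated_product").fetchone()[0] == 1

# New data added later are picked up on the next run with no extra work.
con.execute("INSERT INTO chlorophyll VALUES ('P1908', '020', 0.55)")
con.execute("INSERT INTO nutrients VALUES ('P1908', '020', 11.8)")
assert con.execute("SELECT COUNT(*) FROM integrated_product").fetchone()[0] == 2
```

The design choice here is that the join criteria are encoded once by someone who understands the data, rather than reconstructed by each end user at download time.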