Roswell Park Cancer Institute began an initiative to centralize the management and dissemination of clinical data to support translational research. A new office, the Clinical Data Network (CDN), was created for this purpose. Under the direction of Dr. Carmelo Gaudioso, this office has begun to address the problems caused by storing clinical data in many disparate, heterogeneous databases. This fragmentation leads to duplicated data and data-entry effort, data integration that is difficult and sometimes impossible, and an inability to query across databases when asking research questions.
A proof-of-concept pilot project was initiated in which the caGrid technology and infrastructure would be used to link and extract data from disparate “non-caBIG” data sources. In essence, the pilot would demonstrate the use of federated queries for data retrieval and the creation of a virtual data warehouse.
The goal for the project was to successfully demonstrate federated queries against three internal databases: LIMS Tissue Bank (biospecimens), DBBR Sample Inventory (blood specimens), and ERS Cancer Registry system (tumor registry).
We were able to leverage the caGrid technology and the standards created by the caBIG community to successfully create our own grid services. In summary, data models representing the three databases were built and then “grid-enabled,” which allowed us to query them using a federated approach.
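The idea of a federated query can be illustrated with a minimal Python sketch. It does not use caGrid or caB2B; it only shows the pattern of sending one request to several independent databases and merging the answers. The table names, column names, sample rows, and the shared `patient_id` key are all assumptions made for illustration and do not reflect the actual schemas of the three systems.

```python
import sqlite3
from collections import defaultdict


def make_source(table, column, rows):
    """Create an independent in-memory database holding a single table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} (patient_id TEXT, {column} TEXT)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.commit()
    return conn


# Three separate databases standing in for the three systems.
# All names and rows below are hypothetical.
SOURCES = {
    "tissue_bank": (
        make_source("specimen", "tissue_type", [("P1", "breast"), ("P2", "lung")]),
        "specimen",
        "tissue_type",
    ),
    "blood_inventory": (
        make_source("sample", "sample_type", [("P1", "serum"), ("P3", "plasma")]),
        "sample",
        "sample_type",
    ),
    "tumor_registry": (
        make_source("tumor", "histology", [("P1", "ductal"), ("P2", "adenocarcinoma")]),
        "tumor",
        "histology",
    ),
}


def federated_query(sources, patient_ids):
    """Send the same patient filter to every source and merge the results."""
    merged = defaultdict(dict)
    if not patient_ids:
        return {}
    placeholders = ",".join("?" for _ in patient_ids)
    for name, (conn, table, column) in sources.items():
        sql = (
            f"SELECT patient_id, {column} FROM {table} "
            f"WHERE patient_id IN ({placeholders})"
        )
        for patient_id, value in conn.execute(sql, list(patient_ids)):
            merged[patient_id].setdefault(name, []).append(value)
    return dict(merged)


if __name__ == "__main__":
    for patient, record in sorted(federated_query(SOURCES, ["P1", "P2"]).items()):
        print(patient, record)
```

Each source keeps its own connection and schema; only the merge step knows about all three. In that sense the caller sees a single combined result, a "virtual data warehouse," without any data being copied into a central store.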
During the initial phase of the pilot, it was determined that a robust front-end query tool for writing federated queries was required. We decided to pilot another caBIG tool as the front-end user interface. The tool piloted was caB2B (cancer Bench-to-Bedside).
The caB2B application was designed as a tool to query caBIG data services over the production grid. To utilize caB2B, modifications were made to configuration files which permitted us to load the data models and thus allowed caB2B to recognize the new data services.
The development of the caGrid data service that enabled access to the disparate relational database systems can be summarized in five main steps:
- Installation of the prerequisite software
- Creation of the information model (object model and a data model)
- Semantic annotation of the object model (OpenMDR repository was installed and the common data elements from the caDSR were mapped)
- Creation of a data-oriented system using the caCORE SDK tools (creates a client-server application that provides query support for the backend database)
- Creation of a caGrid data service using the Introduce toolkit
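To make the second step more concrete, the following Python sketch shows what the distinction between an object model and a data model might look like in miniature. It is not output of the caCORE SDK; the `Specimen` class, the legacy column names, and the mapping function are hypothetical and chosen only to illustrate the idea of mapping domain objects onto existing relational columns.

```python
from dataclasses import dataclass, fields


@dataclass
class Specimen:
    """Object model: the domain view exposed to query clients (hypothetical)."""
    specimen_id: str
    patient_id: str
    tissue_type: str


# Data model: how each object attribute maps to a column in the
# existing relational table. Column names are made up for illustration.
COLUMN_MAP = {
    "specimen_id": "SPEC_NO",
    "patient_id": "PT_ID",
    "tissue_type": "TISSUE_CD",
}


def row_to_specimen(row):
    """Build a Specimen from a database row keyed by legacy column names."""
    return Specimen(**{f.name: row[COLUMN_MAP[f.name]] for f in fields(Specimen)})


if __name__ == "__main__":
    legacy_row = {"SPEC_NO": "S-001", "PT_ID": "P1", "TISSUE_CD": "breast"}
    print(row_to_specimen(legacy_row))
```

Keeping the mapping in one place means the backend table can retain its original column names while clients query against the cleaner object view.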
After the data services were created, they were then loaded into caB2B through the backend via configuration file changes. This was required since these data services were not registered in the caDSR.
caB2B has three components:
- Admin Interface: controls which data sources can be queried
- Client Interface: queries can be created and saved
- Web Interface: saved queries can be accessed by users via a web browser
We were able to successfully demonstrate federated queries across three different data sources (LIMS, DBBR, Tumor Registry) using caGrid and caB2B.
Work continues on this project as the caGrid and caB2B Knowledge Centers (KCs) address additional use cases and feature requests. We would like to thank both KCs for all of the assistance they have provided.
[Figure: Federated query result from three data sources]
1. Roswell Park Cancer Institute: Mayurapriyan Sakthivel, Ken Quinn
2. caGrid: William Stephens, David Ervin, Arka Pattnayak
3. caB2B: Gaurav Mehta, Mukesh Sharma, Baris E. Suzek, Jim Humphries
Ken Quinn - Ken.Quinn@RoswellPark.org
Mayur Sakthivel - Mayurapriyan.Sakthivel@RoswellPark.org