Author(s): Tahsin Kurc
caGrid version: 1.2+
This guide uses a number of terms that may be unfamiliar to readers. As you read, please refer back to this section for definitions of unfamiliar terms.
Software that supports the execution and interaction of higher-level (application-level) software components and applications. Middleware consists of a suite of components, services, tools, and a runtime system that can be employed collectively to develop and deploy applications and application-level software components.
A process that facilitates collective access to and use of multiple, disparate, and potentially independently developed resources, services, and applications. It is also the process of such resources and services agreeing on operation and interaction standards to enable collective access and use.
- Grid computing
- Service Oriented Architecture (SOA)
- Model Driven Architecture (MDA)
- Analytical resource (and analytical service)
- Data resource (and data service)
- Administration and Security domain
- Common Data Elements
- Controlled Vocabularies
- caCORE SDK
- Index Service
For other terms, please view the Glossary.
This guide provides an introduction to developing software using caGrid. It is targeted at software developers who are just getting started with caGrid and want to learn how they can use caGrid to develop and deploy Grid services. You do not need to be familiar with the concepts of Grid computing, Service Oriented Architecture, and Model Driven Architecture. This document will walk you through the basics of caGrid and provide pointers to tutorials so that you can get hands-on experience developing Grid services. This guide also provides pointers to additional technical information for those readers who are interested in the design and implementation of caGrid.
In order to get started with caGrid, we recommend the following steps:
1. Read this document
2. Download and install caGrid
3. Complete caGrid introductory tutorials
4. Visit the caGrid Knowledge Center website for more information.
caGrid is middleware designed to facilitate secure and federated access to information and analytical resources in a multi-institutional environment. Typically, resources available in this environment have been developed by independent groups. caGrid provides tools, libraries, and runtime support: 1) for resource providers to implement and deploy their analytical and data resources as secure, interoperable services and 2) for resource consumers to discover available resources and use them (e.g., submit queries to multiple data sources and retrieve the query results).
caGrid is designed to solve the problem of sharing data and analytical resources in an environment where resources are hosted by different organizations and located in different administrative and security domains. In addition, caGrid works just as well within a single institution, providing the tools needed to share data seamlessly across departments. For example, a research project may require integrative analysis of microarray, imaging, and clinical data. These datasets may be collected by different entities (such as shared resources and medical information warehouses) and may not be stored in a centralized system. caGrid can be used to create a "virtually centralized" data warehouse of such datasets. Each dataset is managed by its respective owner, but is integrated into this virtually centralized data warehouse using caGrid service interfaces and tools, so that a researcher can access data from any of those datasets through a common interface. Authentication and authorization controls can be used to limit access to the datasets. A key benefit of using caGrid is that it makes it easy to evolve from sharing data within an institution to sharing data with external collaborators. In most cases, no new software needs to be deployed. Resources can be shared both within an institution and with external collaborators simply by changing the security access restrictions.
caGrid employs Grid computing. Grid computing refers to the notion of using distributed resources hosted at multiple institutions to solve large scale, challenging problems in science and engineering. It was initially conceived as a mechanism to enable remote access to computational and storage machines across the administrative boundaries of supercomputer centers in order to solve large scale, compute intensive scientific and engineering problems. Over the years it has evolved into a platform, made up of standards, tools, and middleware infrastructures, for sharing data and analytical resources as well as computation and storage systems. At its foundation, caGrid employs the basic principles of Grid computing and existing Grid computing tools, more specifically the Globus Toolkit, to enable access to remote and disparate data and analytical resources. As a user of caGrid, you will likely not need to know the details of Grid computing and Grid computing tools. These details are hidden from the caGrid user by higher level tools and middleware components provided by the core infrastructure. For the purposes of getting started with caGrid, it suffices to say that using caGrid one can create an environment where resources are located at multiple institutions, but can be accessed securely across institutional boundaries. In Grid terminology, such an environment is referred to as a Grid environment.
caGrid is a service oriented system. In a service oriented system, each resource is made available to the (Grid) environment as a service. A service wraps the functionality of the resource in a set of well-defined interfaces. These interfaces (and the associated client side application programming interfaces) are used by client applications to interact with the resource. For example, a gene expression database, stored in a relational database system, may be wrapped as a service with two interfaces: query and insert. The query interface allows a client program to issue queries for the gene data. The insert interface can be used to insert data into the database. With the service-oriented interface, the client program does not directly interact with the relational database system. Note that because clients see only the service interface, a service developer can change the implementation without affecting them. For example, a service developer can upgrade the service to use multiple threads in response to tighter performance requirements.
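The query and insert example above can be sketched in plain Java. This is only an illustration of the idea of hiding a resource behind an interface: the names `GeneExpressionService` and `InMemoryGeneExpressionService` are hypothetical, are not part of caGrid or Introduce, and an in-memory list stands in for the relational database.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical service interface; these names are illustrative only and
// are not part of caGrid or the Introduce toolkit.
interface GeneExpressionService {
    // Returns all stored records whose gene symbol equals the argument.
    List<String> query(String geneSymbol);

    // Stores one expression record for the given gene symbol.
    void insert(String geneSymbol, double expressionLevel);
}

// One possible implementation. An in-memory list stands in for the
// relational database; client code never refers to this class's internals.
class InMemoryGeneExpressionService implements GeneExpressionService {
    private final List<String[]> rows = new ArrayList<>();

    @Override
    public synchronized List<String> query(String geneSymbol) {
        List<String> result = new ArrayList<>();
        for (String[] row : rows) {
            if (row[0].equals(geneSymbol)) {
                result.add(row[0] + "=" + row[1]);
            }
        }
        return result;
    }

    @Override
    public synchronized void insert(String geneSymbol, double expressionLevel) {
        rows.add(new String[] { geneSymbol, Double.toString(expressionLevel) });
    }
}

class GeneExpressionDemo {
    public static void main(String[] args) {
        // The client depends only on the interface, so the implementation
        // behind it can be replaced without changing this code.
        GeneExpressionService service = new InMemoryGeneExpressionService();
        service.insert("TP53", 2.5);
        service.insert("BRCA1", 1.0);
        System.out.println(service.query("TP53")); // prints [TP53=2.5]
    }
}
```

Because the client is written against the interface, a later implementation backed by a real database or by multiple threads could be substituted without changes to `GeneExpressionDemo`.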
Most SOA systems employ Web Services technologies as the underlying platform. Web Services provides access to services via standard web protocols. caGrid uses the Web Services Resource Framework (WSRF) standards. The WSRF draws from the Web Services standards, but extends them with such concepts as stateful services, service lifetime, service context, etc. These extensions enable more efficient and richer services to be implemented for scientific application scenarios. The caGrid infrastructure provides the Introduce toolkit for service providers to easily implement service stubs and service interfaces for their resources. The Introduce toolkit also provides support for client application developers to interact with remote services using high-level Java language APIs. You can find more information about Service-Oriented Architecture, Grid Computing, and the WSRF standards in the following references:
• Service Oriented Architecture: http://en.wikipedia.org/wiki/Service-oriented_architecture
• Grid Computing: http://www.globus.org/alliance/publications/papers/anatomy.pdf
• Web Services Resource Framework (WSRF): http://www.globus.org/wsrf
caGrid draws from Model Driven Architecture. The model driven architecture (MDA) paradigm has gained popularity in recent years. This paradigm promotes the use of object-oriented design practices and rich metadata in order to facilitate implementation of interoperable systems. caGrid adopts a Model Driven Architecture approach to enable interoperability through object-oriented abstractions, common data elements, and controlled vocabularies. That is, client and service APIs in caGrid are object-oriented. These objects, in turn, are defined using common data elements and controlled vocabularies registered on the Grid. For example, the names of an object's fields are terms from the controlled vocabularies. In addition, the type of a field (Integer, String, etc.) matches the type specified in a common data element. The benefit of this approach is that resources are defined in one location (the vocabulary or common data element) and used to generate all Grid artifacts, preventing any issues with re-modeling (the same) data at each Grid layer. A caGrid data service abstracts data as objects. Similarly, an analytical resource (e.g., an analysis program) implemented as a caGrid analytical service provides methods that input objects and return objects.
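As a rough illustration of the rule that field names come from a controlled vocabulary and field types match a common data element, the following Java sketch checks a domain class against a small in-memory registry. Everything here is an assumption made for the example: `LungNodule`, `DataElementChecker`, and the registered terms are hypothetical, and a real deployment would consult the registered metadata rather than a local map.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical domain object; its field names are meant to be vocabulary terms.
class LungNodule {
    String patientId;
    Integer diameterMm;
    String lobeLocation;
}

// Stand-in for a registry of common data elements (term -> required type).
// This local map is an assumption made purely for illustration.
class DataElementChecker {
    private final Map<String, Class<?>> elements = new LinkedHashMap<>();

    void register(String term, Class<?> type) {
        elements.put(term, type);
    }

    // Returns one message per field that is not a registered term or whose
    // type differs from the registered type. An empty list means conformant.
    List<String> check(Class<?> modelClass) {
        List<String> problems = new ArrayList<>();
        for (Field field : modelClass.getDeclaredFields()) {
            if (field.isSynthetic()) {
                continue; // skip compiler- or tool-generated fields
            }
            Class<?> expected = elements.get(field.getName());
            if (expected == null) {
                problems.add(field.getName() + ": not a registered term");
            } else if (!expected.equals(field.getType())) {
                problems.add(field.getName() + ": expected " + expected.getSimpleName()
                        + " but found " + field.getType().getSimpleName());
            }
        }
        return problems;
    }
}

class DataElementDemo {
    public static void main(String[] args) {
        DataElementChecker checker = new DataElementChecker();
        checker.register("patientId", String.class);
        checker.register("diameterMm", Integer.class);
        // lobeLocation is deliberately left unregistered to show a failure.
        System.out.println(checker.check(LungNodule.class));
    }
}
```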
While the caGrid infrastructure builds on several complex frameworks and standards, caGrid provides a suite of high-level tools and graphical user interfaces that make it easy to use. Most of the details of the underlying standards and frameworks and lower level middleware tools are hidden from the user. These tools and GUIs are covered extensively in caGrid tutorials. We will provide pointers to those tutorials later in this document. In the next section, we will present a simplified application scenario to better illustrate the use of caGrid for building a multi-institutional collaborative study.
Consider the following simplified scenario. A team of researchers from four different institutions would like to create a collaborative environment to evaluate image analysis algorithms for detection of clinically significant lung nodules. In this collaborative environment, the team would like to be able to share images from multiple image repositories hosted at their institutions among the team members only, because of IRB regulations and intellectual property restrictions at their respective institutions.
The main type of resource in this application scenario is an image archive. Each team has an image archive and would like to share it with team members at other institutions. Using caGrid, each team can develop a secure service for their image archive and deploy it so that it can be accessed remotely. Each team can also develop Grid applications to discover and utilize these services.
Interoperability and caBIG compatibility of data and analytic resources. While caGrid provides support for developing interoperable services, it does not magically make them interoperable and caBIG compatible. Interoperability (and more specifically syntactic and semantic interoperability of resources) and caBIG compatibility are broad topics and are out of the scope of this "Getting Started with caGrid" guide. For more detailed descriptions of interoperability, "semantic and syntactic" compatibility with caBIG resources, common data elements, controlled vocabularies, and semantic annotations on data elements, we refer the reader to the caBIG Compatibility Guidelines document (https://cabig.nci.nih.gov/guidelines_documentation/compat_v3/). For further details on caBIG compatibility, see https://cabig.nci.nih.gov/guidelines_documentation. For the purposes of "Getting Started with caGrid", it suffices to say that in order to implement interoperable services using caGrid, service developers in the collaborative team should first agree on data elements and data structures for the image types in the image archives. Second, they should agree on the semantic meaning of the data elements so that information stored in the data elements can be interpreted and consumed correctly. Third, they should wrap their resources in services and service interfaces so that the resources can be accessed programmatically and remotely.
It is impractical to assume that in a large scale environment such as the caBIG™ program all developers would come together and agree on data structures, common data elements, and semantic meanings of the data elements before starting a new development effort. That is why caBIG provides tools, guidelines, and processes for development of compatible systems. The caBIG™ program requires developers to follow a harmonization process so that independently developed tools, applications, and services can interoperate via well-defined interfaces and reuse of existing object models, data elements, and controlled vocabularies. The caBIG program has developed a suite of tools, such as caDSR, EVS, GME, Semantic Integration Workbench, and caCORE SDK, along with guidelines, to assist developers in implementing caBIG compatible, interoperable systems.
For a collaborative effort not involving many institutions, it could be more efficient if the developers worked together and agreed on the object models. Note that even if the collaborative team did not desire compatibility with the caBIG environment, they could still use caBIG tools, such as the caCORE SDK tools to generate XML schemas and object domain model files for use in the Introduce toolkit of caGrid, when developing data and analytical services.
Each team can use the Introduce toolkit of caGrid to create a Grid service to access their image archive. The Introduce toolkit will also generate the client APIs (currently with binding to Java) to access the service. If the image repositories are implemented as relational databases, then Introduce can be used to generate a data service from caCORE SDK artifacts. Using the caCORE SDK, a developer creates a data-oriented application from an object model; the SDK generates the code to submit object-based queries and to map CQL object queries to relational queries against the backend databases. The Introduce toolkit provides an extension for caCORE SDK-generated systems that allows the developer to create a data service from caCORE SDK artifacts. A developer who wants or needs a custom data service implementation has the option to implement the standard query interface and the execution logic that will map a caGrid CQL query to the query language of their backend system.
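To give a feel for what mapping an object query to a relational query involves, here is a deliberately simplified Java sketch. It does not use the real CQL object model or any caCORE SDK class: `ObjectQuery` and `QueryTranslator` are hypothetical, the query is reduced to a target class with a single attribute-equals-value restriction, and the class-to-table and attribute-to-column mappings are assumed to be supplied by the developer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, heavily simplified object query: one target class and one
// "attribute equals value" restriction. This is NOT the real CQL model.
class ObjectQuery {
    final String targetClass;
    final String attribute;
    final String value;

    ObjectQuery(String targetClass, String attribute, String value) {
        this.targetClass = targetClass;
        this.attribute = attribute;
        this.value = value;
    }
}

// Translates an ObjectQuery into a parameterized SQL string using
// developer-supplied mappings (assumed here for illustration).
class QueryTranslator {
    private final Map<String, String> tables = new HashMap<>();
    private final Map<String, String> columns = new HashMap<>();

    void mapClass(String className, String table) {
        tables.put(className, table);
    }

    void mapAttribute(String className, String attribute, String column) {
        columns.put(className + "." + attribute, column);
    }

    // Only mapped names reach the SQL text; the query value is left as a
    // "?" placeholder to be bound separately by the caller.
    String toSql(ObjectQuery query) {
        String table = tables.get(query.targetClass);
        String column = columns.get(query.targetClass + "." + query.attribute);
        if (table == null || column == null) {
            throw new IllegalArgumentException("Unmapped class or attribute");
        }
        return "SELECT * FROM " + table + " WHERE " + column + " = ?";
    }
}

class QueryTranslatorDemo {
    public static void main(String[] args) {
        QueryTranslator translator = new QueryTranslator();
        translator.mapClass("Image", "image");
        translator.mapAttribute("Image", "modality", "modality_code");
        ObjectQuery query = new ObjectQuery("Image", "modality", "CT");
        System.out.println(translator.toSql(query));
    }
}
```

Restricting table and column names to an explicit mapping and binding the value as a parameter keeps client-supplied text out of the generated SQL.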
When a caGrid service is deployed, the service developer can optionally configure service metadata using the Introduce toolkit. Both the service itself and the service metadata can be registered with the Index Service. The Index Service can be queried to discover services in the caGrid environment. Metadata associated with the caGrid data service in our example can specify what types of image data are served by the source. A client can query the Index Service to discover, for example, caGrid data services that contain chest CT imagery.
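The register-then-discover pattern described above can be sketched with a small in-memory index. This is a conceptual model only: `ServiceIndex`, its methods, and the example URLs are hypothetical and do not mirror the actual Index Service API or its metadata format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// In-memory stand-in for a service index. Class and method names are
// hypothetical and do not mirror the caGrid Index Service API.
class ServiceIndex {
    // Service URL -> data types the service advertises in its metadata.
    private final Map<String, Set<String>> entries = new LinkedHashMap<>();

    void register(String serviceUrl, String... dataTypes) {
        entries.put(serviceUrl, new HashSet<>(Arrays.asList(dataTypes)));
    }

    // Returns the URLs of all registered services that advertise the given
    // data type, in registration order.
    List<String> discover(String dataType) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : entries.entrySet()) {
            if (entry.getValue().contains(dataType)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }
}

class ServiceIndexDemo {
    public static void main(String[] args) {
        ServiceIndex index = new ServiceIndex();
        // The URLs below are placeholders, not real services.
        index.register("https://site-a.example.org/ImageService", "chest CT", "chest X-ray");
        index.register("https://site-b.example.org/ImageService", "brain MRI");
        System.out.println(index.discover("chest CT"));
    }
}
```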
In the example scenario, the data services are deployed securely to allow the data owners to control access to their data. That is, the collaborative team does not want to share the image data with researchers who are not part of the collaborative project. The Introduce toolkit provides the service developer with the capability to set authentication and authorization requirements for a service. Support for implementing authentication and authorization is provided by the caGrid GAARDS (Grid Authentication and Authorization with Reliably Distributed Services) infrastructure. There are several different configuration options to enforce authentication and authorization. Each institution may run an instance of the Dorian service in the GAARDS infrastructure to manage accounts for its own researchers. Another option is to stand up a single consortium-wide Dorian instance to manage accounts for the collaborative consortium. Alternatively, the collaborative team can leverage the NCI caBIG™ managed Dorian instance. This option is likely to reduce the team's administrative overhead for managing the security infrastructure, but would require the team members to apply to the NCI caBIG™ security team for accounts. See Request caBIG Production Account Using GAARDS UI for details. For authorization support, the collaborative team needs to create "roles" (or "groups"), which will be used to control access to the data services. For this purpose, the team can stand up a Grid Grouper service and a GTS instance to manage groups for authorization and trust levels within the collaboration as well as with other institutions. They may also use the NCI caBIG™ Grid Grouper instance to reduce administrative load. As an example, the team may create a "Collaboration Researcher" group in the Grid Grouper service so that only users belonging to this group (or role) are allowed access to image data from the data services.
The GTS can be set up so that only the Dorian instances used by the collaborative team are trusted. This means that if a client with credentials obtained from any other Dorian instance tries to access a service, the service will deny the client access.
When a user is registered in Dorian, the user can be assigned a group; in our case, this would be "Collaboration Researcher". When a data service is implemented using Introduce, the authorization settings of the service can be set to accept requests only from clients in group "Collaboration Researcher". In this way, even if a user provides a Grid credential (certificate) obtained from a trusted Dorian instance, the service will not grant access to the user unless the user is in the "Collaboration Researcher" group. Thus, only members of our collaborative research team, not all Grid users, can access the image data.
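The two-step decision described above (the credential must come from a trusted issuer, and the user must belong to the required group) can be modeled in a few lines of Java. This is a conceptual sketch only: `AccessPolicy` and its methods are hypothetical and are not GAARDS, Dorian, GTS, or Grid Grouper APIs, and issuers and users are represented by plain strings rather than certificates.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Conceptual model of the access decision. All names are hypothetical;
// real services rely on the GAARDS infrastructure and certificates.
class AccessPolicy {
    private final Set<String> trustedIssuers = new HashSet<>();
    private final Map<String, Set<String>> groupMembers = new HashMap<>();
    private final String requiredGroup;

    AccessPolicy(String requiredGroup) {
        this.requiredGroup = requiredGroup;
    }

    // Marks a credential issuer (standing in for a Dorian instance) as trusted.
    void trustIssuer(String issuer) {
        trustedIssuers.add(issuer);
    }

    // Adds a user to a named group.
    void addMember(String group, String userId) {
        groupMembers.computeIfAbsent(group, g -> new HashSet<>()).add(userId);
    }

    // Access is granted only if BOTH checks pass: the issuer is trusted
    // and the user is a member of the required group.
    boolean isAllowed(String issuer, String userId) {
        if (!trustedIssuers.contains(issuer)) {
            return false;
        }
        Set<String> members = groupMembers.get(requiredGroup);
        return members != null && members.contains(userId);
    }
}

class AccessPolicyDemo {
    public static void main(String[] args) {
        AccessPolicy policy = new AccessPolicy("Collaboration Researcher");
        policy.trustIssuer("consortium-dorian");
        policy.addMember("Collaboration Researcher", "alice");

        System.out.println(policy.isAllowed("consortium-dorian", "alice")); // true
        System.out.println(policy.isAllowed("consortium-dorian", "bob"));   // false
        System.out.println(policy.isAllowed("other-dorian", "alice"));      // false
    }
}
```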
Each team in the collaborative study can implement client applications using the client APIs for each service as well as the APIs for discovery, security, and federated query provided by the caGrid infrastructure. The client application can be implemented to use the federated query support. With the federated query support, the client application can be configured to compose queries that span multiple data services. In this way, a user can search for data from multiple image archives or submit queries that will involve joins across two or more data services - e.g., the user may request images from two archives where image acquisition dates are the same.
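The cross-archive join mentioned above (images from two archives with the same acquisition date) can be illustrated with a simple hash join over two in-memory result lists. This sketch does not use the caGrid federated query API: `ImageRecord` and `FederatedJoin` are hypothetical, and each list stands in for the results returned by one data service.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record returned by an image data service.
class ImageRecord {
    final String imageId;
    final String acquisitionDate; // ISO date string such as "2008-03-14"

    ImageRecord(String imageId, String acquisitionDate) {
        this.imageId = imageId;
        this.acquisitionDate = acquisitionDate;
    }
}

class FederatedJoin {
    // Hash join: index the first archive's records by date, then probe the
    // index with each record from the second archive. Each match is
    // reported as "firstId|secondId".
    static List<String> joinOnAcquisitionDate(List<ImageRecord> first,
                                              List<ImageRecord> second) {
        Map<String, List<ImageRecord>> byDate = new HashMap<>();
        for (ImageRecord record : first) {
            byDate.computeIfAbsent(record.acquisitionDate, d -> new ArrayList<>())
                  .add(record);
        }
        List<String> pairs = new ArrayList<>();
        for (ImageRecord record : second) {
            List<ImageRecord> sameDay = byDate.get(record.acquisitionDate);
            if (sameDay == null) {
                continue; // no image in the first archive on this date
            }
            for (ImageRecord match : sameDay) {
                pairs.add(match.imageId + "|" + record.imageId);
            }
        }
        return pairs;
    }
}

class FederatedJoinDemo {
    public static void main(String[] args) {
        List<ImageRecord> archiveA = Arrays.asList(
                new ImageRecord("A1", "2008-03-14"),
                new ImageRecord("A2", "2008-03-15"));
        List<ImageRecord> archiveB = Arrays.asList(
                new ImageRecord("B1", "2008-03-14"),
                new ImageRecord("B2", "2008-04-01"));
        System.out.println(FederatedJoin.joinOnAcquisitionDate(archiveA, archiveB)); // prints [A1|B1]
    }
}
```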
So far we have provided introductory background information on caGrid. Additional information on caGrid can be found on the caGrid Knowledge Center web site. It is now time to get your hands dirty with caGrid. In this section we will outline some of the basic steps to start using caGrid and provide pointers to relevant tutorials.
The first step in using caGrid is to download the caGrid infrastructure distribution and install it on a machine. The caGrid 1.2 distribution can be downloaded from the following URL: caGrid 1.2 Installation Quickstart.
For basic service development and deployment and for the purpose of getting started with caGrid, you need only choose "Install caGrid" and use one of the existing caGrid environments for advanced features such as Grid account management and security. For this walkthrough, choose the "1.2 Training Grid".
There are several tutorials on the caGrid Wiki web site: Tutorials. These tutorials are designed to walk a service developer through analytical service and data service development.
To implement an analytical service, follow the tutorial at the following URL: Develop an Advanced caGrid Analytical Service.
caCORE SDK makes it easy to develop an object-oriented data abstraction layer on top of relational databases. It also generates data-oriented systems that have caBIG compatible APIs. The Introduce toolkit provides plug-ins to support development of caGrid Data Services that use caCORE SDK-generated backend data-oriented systems. This tutorial will walk you through implementing such a data service. The tutorial is accessible at the following URL: Develop a caGrid Data Service Supporting caCORE SDK 4.0
Congratulations, you have run through a set of tutorials taking you through the major caGrid components and are now a caGrid Service Developer!
For more detailed technical information on caGrid, please read the caGrid Technical Overview.
From here on out, you will determine what you need from caGrid to build more complete Grid applications. Here are common topics that caGrid users are interested in:
In addition, we encourage you to view Community Projects. You might find that a project exists to do what you need!
View existing Grid Communities. Public Grid Communities offer Grids for which public account registration is available. Private Grid Communities target a specific community and require approval of your account registration.
To get additional help developing Grid resources, visit our support resources.