A growing list of tools is maintained on the caBIG website, here https://cabig.nci.nih.gov/tools
When trying to create or access service metadata, or information from the caDSR grid service, I get an error like "is not a valid XML character", what is the problem?
You might see something like:
This is caused by characters in the information stored in caDSR that are not valid (in the encodings required by the WS-I profile) to be passed as XML on the grid. Most often the problematic characters are "smart quotes" introduced by MS Word when editing descriptions. The easiest solution is to identify and remove the characters in caDSR. Below is a program created by Satish Patel that can help identify them. The caDSR team is putting checks in place to identify these characters before they are introduced.
Alternatively, the caGrid cadsr project contains a graphical tool which can locate such errors and generate reports which can be saved as text files. From the top level directory of caGrid, navigate to projects/cadsr, and execute the command 'ant runCadsrModelProblemFinder'.
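As an illustration of the kind of check involved, here is a minimal Python sketch. It is not the program or the caGrid tool mentioned above, and the function names are made up for this example. It flags characters that fall outside the XML 1.0 character ranges and, optionally, any non-ASCII character (such as smart quotes) so that it can be reviewed by hand.

```python
import unicodedata


def is_valid_xml_char(ch: str) -> bool:
    """Return True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_problem_chars(text: str, flag_non_ascii: bool = True):
    """Return (index, code point, reason) for each suspicious character."""
    problems = []
    for i, ch in enumerate(text):
        code_point = f"U+{ord(ch):04X}"
        if not is_valid_xml_char(ch):
            problems.append((i, code_point, "not a valid XML 1.0 character"))
        elif flag_non_ascii and ord(ch) > 0x7F:
            # Valid XML, but worth reviewing (e.g. smart quotes pasted from Word).
            name = unicodedata.name(ch, "unnamed")
            problems.append((i, code_point, f"non-ASCII ({name})"))
    return problems


if __name__ == "__main__":
    # Hypothetical description text, not real caDSR content.
    sample = "A \u201csmart\u201d quote and a stray control char \x1a."
    for problem in find_problem_chars(sample):
        print(problem)
```

Running a check like this over exported descriptions gives the positions of the characters to clean up in caDSR.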
There is no requirement for backend databases. However, there is a lot of tooling provided by NCICB and caGrid for taking standard relational databases and gridifying them. Excel or SAS files could be used, but you would have to create the programming logic to query them based upon the object-oriented query language that the grid supports.
There are no API requirements for the backend database. However, if you are using the provided tooling (caCORE SDK, Introduce), then the backend database should be a relational database (e.g. MySQL or Oracle) with a traditional relational model.
At the Annual caBIG meeting there was discussion of silver compliance-accrediting. Does a grid service have to be accredited? Can we make it, following the rules of silver compliance, but not submitting to review?
There is no requirement that a grid service be silver compatible to be used. However, the production Index Service housed at the NCICB will only allow gold-compatible services to advertise. To accommodate others, there will be a sandbox Index Service that will allow any grid service to advertise.
The caCORE SDK provides tooling to take a UML model (defined by XMI), create a database (if one doesn't already exist), generate Java objects, and generate Java APIs. The caCORE SDK is command-line based. The caGrid Introduce toolkit then can be used to create a grid service from the caCORE SDK based system. Introduce is a graphical interface that provides wizard-like functionality for building, configuring, and deploying the service. Details on caCORE SDK Support.
When a user searches across the grid to discover information in the caDSR, what is the query actually looking for: concept codes, public IDs, key words, administered items?
There are currently a number of different implications to what you suggest.
Services register their models in the caDSR. These models are then exposed by grid services through their service-level metadata. This is published to a central Index Service such that people can discover those services. The domain model is a subset of the information in the caDSR, but includes much of the information, such as CUI, class, property, value domain, etc.
All services on the grid that choose to do so may advertise their metadata (e.g. CDEs, API, etc.). Then, clients (users) can search that metadata to find services to invoke (query). For example, a client can search for all services that expose a particular CDE or portion of a CDE.
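To make the discovery idea concrete, here is a small Python sketch that searches advertised metadata in memory. It does not use the caGrid discovery API. The AdvertisedService type, the service URLs, and the CDE public IDs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AdvertisedService:
    """Hypothetical stand-in for the metadata a service advertises."""
    url: str
    cdes: set = field(default_factory=set)         # (public ID, version) pairs
    class_names: set = field(default_factory=set)  # UML class names in the domain model


def discover_by_cde(services, public_id, version):
    """Return the URLs of services that expose the given CDE."""
    return [s.url for s in services if (public_id, version) in s.cdes]


def discover_by_class(services, class_name):
    """Return the URLs of services whose model contains the class (full match)."""
    return [s.url for s in services if class_name in s.class_names]


# Made-up advertisements; the IDs and URLs are illustrative only.
SERVICES = [
    AdvertisedService("https://example.org/services/A", {("1234567", "1.0")}, {"Gene"}),
    AdvertisedService("https://example.org/services/B",
                      {("1234567", "1.0"), ("7654321", "2.0")}, {"Protein"}),
    AdvertisedService("https://example.org/services/C",
                      {("7654321", "2.0")}, {"Gene", "Taxon"}),
]

if __name__ == "__main__":
    print(discover_by_cde(SERVICES, "1234567", "1.0"))
    print(discover_by_class(SERVICES, "Gene"))
```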
Data service querying
Once services have been discovered, clients can query the data services. The query is structured based upon class name, property name, and associations. Value domain is not taken into account.
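The shape of such a query can be sketched in Python as follows. This is only an illustration of "class name, property name, and associations" and is not the actual query language or schema used by caGrid data services. The ObjectQuery type, the matches function, and the class and property names in the example are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ObjectQuery:
    """Hypothetical query: a target class, property constraints, and associations."""
    class_name: str
    attributes: Dict[str, str] = field(default_factory=dict)               # property -> value
    associations: Dict[str, "ObjectQuery"] = field(default_factory=dict)   # role -> nested query


def matches(obj: dict, query: ObjectQuery) -> bool:
    """Check an in-memory object (a dict) against the query."""
    if obj.get("class") != query.class_name:
        return False
    for prop, value in query.attributes.items():
        if obj.get(prop) != value:
            return False
    for role, nested in query.associations.items():
        # At least one associated object must satisfy the nested query.
        if not any(matches(related, nested) for related in obj.get(role, [])):
            return False
    return True


if __name__ == "__main__":
    # Made-up instance data for illustration.
    gene = {
        "class": "Gene",
        "symbol": "Brca1",
        "taxon": [{"class": "Taxon", "scientificName": "Mus musculus"}],
    }
    query = ObjectQuery(
        "Gene",
        attributes={"symbol": "Brca1"},
        associations={"taxon": ObjectQuery("Taxon", {"scientificName": "Mus musculus"})},
    )
    print(matches(gene, query))
```

Note that nothing in the query refers to a value domain, which mirrors the statement above.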
There is a component in caGrid that performs queries across data services, performing "joins" and aggregating data. CDE equivalence is not taken into account by this service, but tools that invoke the federated query processor can be written to leverage CDE equivalence.
There is a grid service that allows you to query the caDSR, which is aptly named the caDSR Service. This allows you to extract pretty much all the information from the caDSR. It is based upon the Java client for the caDSR, and exposes more information than is possible to acquire through the service level metadata (a.k.a. discovery).
This is discussed at a high level on the overview page. Additionally, relevant discussion occurred on the caGrid user's mailing list, and is reproduced below:
From: caGrid Users discussion Forum CAGRID_USERS-L@LIST.NIH.GOV On Behalf Of Shanbhag, Krishnakant (NIH/NCI) [E]
Sent: Friday, January 11, 2008 10:18 AM
Subject: Re: The coorinated use of caDSR and EVS
Here's a potential example where the information provided by caGrid Metadata, caDSR and EVS can be used to coordinate a data workflow in a grid environment. There are many more flavors of using it, but this may give you some ideas.
A typical use case enabled by caGrid infrastructure could involve a bioinformatician wanting to get all information about Gene "Brca1" that is provided by grid enabled data services published by service providers on caGrid infrastructure.
In order to implement the above use case, here's one potential way that the user may get this information programmatically.
Step 1: In order to do any search for genes, it is important that the user use computable semantics and not depend on textual search. This is where EVS Grid service would help. The user would use the EVS Grid service to get the concept code corresponding to search term "Gene", see below for code:
Step 2: Once a set of concept codes is obtained from Step 1, the user could use the caGrid metadata service APIs in caGrid to obtain the list of data services that contain the selected concept code.
Step 3: The user may now want to determine the specific object class that is potentially interesting from the list of selected data services. This will involve calling the caGrid Metadata APIs, which internally use the caDSR Grid service. At the end of this step, the user would like to have a list of specific object classes from the various data services that can be further queried for actual data related to genes. This is primarily where the caDSR Grid service becomes useful. The combination of caGrid Metadata APIs and caDSR Grid service allows users to determine the specific attributes that may be common between different object classes in different data services that are serving the same concept code. This is important because this knowledge essentially allows users to develop what we call joins among multiple data services.
For example, the GridPIR service and caBIO grid service both have a Gene class. The user may be interested in doing a query for the Gene class in GridPIR that has the same Genes that are available in the caBIO grid service and then constrain it to be of type "Brca1". To do this, DCQL (Distributed caBIG Query Language) needs to be written with the appropriate join attributes between the two services. The appropriate join attributes are determined computationally by "Common Data Element (CDE)" value (i.e. public ID and Version Number), which can be accessed from the caDSR Grid Service. In the above example, the join attributes for the Gene class for the two services turn out to be the attributes "name" and "symbol" respectively. As you see, the attributes have different names; still, they are computationally joinable because they have the same CDE.
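The idea of finding join attributes by CDE can be sketched in a few lines of Python. This does not call the caGrid Metadata APIs or the caDSR Grid Service. The CDE public IDs and versions are invented, and every attribute other than "name" and "symbol" is hypothetical.

```python
from typing import Dict, List, Tuple

Cde = Tuple[str, str]  # (public ID, version)


def join_attributes(left: Dict[str, Cde], right: Dict[str, Cde]) -> List[Tuple[str, str]]:
    """Pair up attributes from two classes that share the same CDE."""
    by_cde: Dict[Cde, List[str]] = {}
    for attr, cde in right.items():
        by_cde.setdefault(cde, []).append(attr)
    pairs = []
    for attr, cde in left.items():
        for other in by_cde.get(cde, []):
            pairs.append((attr, other))
    return pairs


# Attribute -> CDE for the Gene class in each service (made-up CDE values).
GRIDPIR_GENE = {"name": ("0000001", "1.0"), "id": ("0000002", "1.0")}
CABIO_GENE = {"symbol": ("0000001", "1.0"), "fullName": ("0000003", "1.0")}

if __name__ == "__main__":
    print(join_attributes(GRIDPIR_GENE, CABIO_GENE))  # [('name', 'symbol')]
```

The attribute names differ, but the pair is found because both attributes carry the same public ID and version.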
The actual code to enable the above is a little more elaborate than can be written in a few lines. But you may want to check the following APIs:
Step 4: The user now has information about specific data services, object classes and join keys. Now, the user can use the Federated query processor service to aggregate data from the selected data objects (e.g. Genes in GridPIR and caBIO) or do a distributed object join based on the selected join keys.
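Conceptually, the join in Step 4 behaves like the following in-memory Python sketch. It is only a stand-in for what a federated query does across services and does not use the Federated query processor. The records and the join_records helper are hypothetical.

```python
def join_records(left, right, left_key, right_key):
    """Join two lists of records where left[left_key] equals right[right_key]."""
    index = {}
    for record in right:
        index.setdefault(record[right_key], []).append(record)
    return [(l, r) for l in left for r in index.get(l[left_key], [])]


# Made-up results standing in for what each data service might return.
GRIDPIR_GENES = [{"name": "Brca1", "source": "GridPIR"},
                 {"name": "Tp53", "source": "GridPIR"}]
CABIO_GENES = [{"symbol": "Brca1", "source": "caBIO"},
               {"symbol": "Egfr", "source": "caBIO"}]

if __name__ == "__main__":
    joined = join_records(GRIDPIR_GENES, CABIO_GENES, "name", "symbol")
    # Constrain the joined result to the gene of interest.
    brca1 = [pair for pair in joined if pair[0]["name"] == "Brca1"]
    print(brca1)
```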
Following is an example of DCQL XML that can be generated dynamically from Step 3, using the information in the caGrid Metadata API and caDSR Grid Service, to do distributed querying.
The user can send the above DCQL to the Federated query processor service and get instance data related to the gene "Brca1".
Are grid queries accessing the metadata in the caDSR as part of their query function, or are they accessing metadata that is from the caDSR that is actually stored in applications/data sets?
Advertisement and Discovery accesses metadata that is extracted from the caDSR at service creation time. It is essentially static once the service is created (though service developers can change it). Queries to the actual services are based upon this metadata.
Is the data type of a data element considered when querying across the grid, or is it just the object and property (Class/attribute) portion that is used in the query?
Datatype can be used during discovery, but not when querying the actual service for data.
Are queries accessing public ids of CDEs, CUIs in data element concepts? Object Classes? Properties?
CUI, object class, and property are all accessible through discovery. Queries for the actual data of the data service are created using the class, property, and associations.
Discovery generally uses a full match of the names. However, there are ways to perform a full-text search. Queries to the data service use the full name.
Is it possible for us to establish equivalency of Person/Patient/Participant by the use of alternate names for standard CDEs?
Alternate name is not included in the metadata for the service. However, this information is exposed by the caDSR service, so potentially you could query the caDSR service for the actual name and then perform discovery. The caGrid team is investigating whether to expose alternate name in the service-level metadata.
Is a user able to query the grid by a CDE alternate name type (ex. Alternate Name "DICOM Tag" has a value of 0008,0008) and get back all CDEs that have this tag associated with the CDE?
See answer above.
There are basically two aspects to that question. The first is just how to use the tools. The second is meeting the compatibility requirements to actually be classified as "Gold." The tools provided by caGrid can help one create systems which should meet Gold level compatibility, but not every system created by caGrid tools is automatically Gold compatible.
In the simplest case, to just use Introduce you really just need XML schemas which describe your data types, and then you can add operations to the service which use those types. Then you can implement the "stubbed" methods Introduce creates for you (you should find sufficient information on this on the wiki and in the release documents for caGrid). However, that's far from a "Gold" service; you basically just have a grid service. Many of the guidelines around Gold compatibility are oriented around both semantic and syntactic interoperability of grid services, and to meet those requirements one must take additional steps to ensure that one doesn't just have a "grid service" but that the grid service is interoperable with the rest of caBIG, and that clients of caBIG can appropriately leverage it. That is one of the key aspects which differentiates caBIG/caGrid from other grid efforts. We are currently working on version 3 of the compatibility guidelines, which will provide much more clarity and detail around the Gold level. Some of the details are aimed at leveraging harmonized and curated data types, semantically describing your data types, and publishing/registering appropriate metadata. The existing information is here. We hope to have version 3 out in the near term.
The caDSR stores the model and its semantics, but doesn't describe how the data moves over the grid as XML. That's what the XML Schemas are for, and why they are registered in the GME. XML Schemas define datatypes in a particular namespace (to avoid naming collisions of elements/types). The particular namespace used by the schema doesn't really matter as long as it's a valid URI and is unique (so as to not be confused with other models). The caDSR team is currently working on registering information in caDSR which identifies the relevant namespaces and XML types a given item (Project, Class, etc.) is represented as on the grid, but that is not yet available. Therefore, we currently use a naming convention for namespaces (based on things like the caDSR Project name and Context) which allows us to locate the appropriate schemas for a given Project. If you follow that convention, things like the caDSR types browser in Introduce can automatically locate and extract the necessary schemas for Projects you want to use. If you don't follow the convention, that is fine, but you must manually associate the schemas with the models you are using (until such information is programmatically available from the caDSR).
There is an effort under way to identify a minimal set of caDSR datatypes which should be used to construct all domain models, and to make sure those datatypes have a formal mapping to a representation in XML Schema. The programming language choices for mapping to and from the XML are then completely irrelevant and can vary from developer to developer. For example, the same grid service may be invoked by multiple clients from multiple programming languages; certainly a Perl client won't actually be using java.lang.Integer. What is important is the registration of the logical structure and semantics, and a formal and consistent mapping to XML Schema.
Now that the caGrid 1.1 infrastructure has become fully operational, one of the most important thrusts in caGrid 1.1 has been to implement the first set of policies and procedures created by the caBIG™ Security Working Group. One of the security requirements of caGrid is that both user and host service credentials adhere to the initial Level of Assurance (LOA1).
For instructions on obtaining caGrid grid accounts, see links below:
Additionally, the caGrid security infrastructure enables institutional credential providers to participate in the caGrid trust fabric. Please contact the caBIG™ Security Working Group for guidance on becoming part of the caGrid trust fabric.
caGrid 0.5 De-support: De-support of the caGrid 0.5 "Test bed" infrastructure was announced on October 1, 2007, and completed on January 1, 2008.
caGrid 0.5 User Accounts: Due to LOA1 requirements, it was not practical to migrate caGrid 0.5 user credentials to caGrid 1.1; users must register for a new account on the caGrid 1.1 Dorian Identity Provider.
If you have any questions or problems obtaining these credentials, please do not hesitate to ask for support.
When a call is made to a secure grid service (using https transport), the following error message might be seen:
The Axis Engine underlying the client API does not have the https transport registered. This can be resolved by adding $GLOBUS_LOCATION to the client's classpath.
What are the differences between the caGrid 1.2 domain model XML file and the data that is in the caDSR?
1) Isn't the "Workflow Status" for CDEs included in the XML?
"Workflow status" for a class is not contained in the domain model. However, you can search for the CDE in the caDSR using the provided publicID and version.
2) Also, the names of the classes/attributes are available in the XML, but the OC Names, Property Names, and CDE Names do not appear in the XML. Is that true? Do I need to explicitly query caDSR to get these details?
All the concept names and IDs are available, but only the UML names are provided as those are what are used for querying. Similar to the above, if you need additional information about the model you would need to look it up in the caDSR. The DomainModel extract you are referring to is used to support the Data Service discovery and query use cases.
3) Also, there seems to be some discrepancy between the ordering of concept codes for Property/OC that appears in caDSR and what appears in the XML. For example, in a domain model file for caTissueCore (1.2), the property Clinical Status has 2 concept codes in the order "Status"(0) and "Clinical"(1).
The ordering is ascending from the primary concept (so 0 is the primary, which is consistent with the caDSR).
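As a small illustration of that ordering, the Python sketch below sorts the concepts of the "Clinical Status" example by their order value, so that the primary concept (0) comes first. The tuple layout is an assumption made for this sketch. Reading the name back as qualifiers followed by the primary concept is an inference from this one example and is not a documented rule.

```python
def primary_first(concepts):
    """Return concept names sorted ascending by order (0 is the primary concept)."""
    return [name for name, order in sorted(concepts, key=lambda c: c[1])]


def primary_concept(concepts):
    """Return the name of the primary concept (the one with the lowest order)."""
    return min(concepts, key=lambda c: c[1])[0]


# (concept name, order) pairs for the property "Clinical Status".
CLINICAL_STATUS = [("Clinical", 1), ("Status", 0)]

if __name__ == "__main__":
    ordered = primary_first(CLINICAL_STATUS)
    print(ordered)                        # ['Status', 'Clinical']
    print(primary_concept(CLINICAL_STATUS))  # Status
    # Assumed reading: qualifiers precede the primary concept in the name.
    print(" ".join(reversed(ordered)))    # Clinical Status
```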