Metadata 1.4 Design Guide


Metadata 1.4 Overview


caGrid Metadata Infrastructure

The caGrid metadata infrastructure consists of numerous components and services, which are described below:

  • Metadata Models - Standardized service metadata models.
  • MMS - Provides ability to generate and semantically annotate standard metadata using an external metadata repository like the caDSR.
  • GME - Acts as the authoritative repository for the XML Schemas used on the grid.
  • Index Service - Provides the white and yellow pages of the grid.
  • caDSR - caDSR provides access to registered models and their semantic annotations.
  • EVS - EVS provides access to controlled terminology.
  • Introduce Support - Introduce Support for caGrid metadata.
  • Advertisement - Provides the means to advertise services.
  • Discovery - Provides the means to locate services and data of interest on the grid.

Metadata Models Overview


A primary distinction between basic grid infrastructure and the requirements identified in caBIG and implemented in caGrid is the attention given to data modeling and semantics. caBIG adopts a model-driven architecture best practice and requires that all data types used on the grid are formally described, curated, and semantically harmonized. These efforts result in the identification of common data elements, controlled vocabularies, and object-based abstractions for all cancer research domains. caGrid leverages existing NCI data modeling infrastructure to manage, curate, and employ these data models. Data types are defined in caCORE UML and converted into ISO/IEC 11179 Administered Components, which are in turn registered in the Cancer Data Standards Repository (caDSR). The definitions draw from vocabulary registered in the Enterprise Vocabulary Services (EVS), and their relationships are thus semantically described.

In caGrid, both the client and service APIs are object oriented and operate over well-defined and curated data types. Clients and services communicate through the grid using Globus grid clients and Globus service infrastructure, respectively. The grid communication protocol is XML, and thus the client and service APIs must transform the transferred objects to and from XML. This XML serialization of caGrid objects is restricted: each object that travels on the grid must do so as XML that adheres to an XML schema registered in the Global Model Exchange (GME). Just as the caDSR and EVS define the properties, relationships, and semantics of caBIG data types, the GME defines the syntax of their XML serialization. Furthermore, Globus services are defined by the Web Services Description Language (WSDL). The WSDL describes the various operations the service provides to the grid. The inputs and outputs of the operations, among other things, are defined in WSDL by XML schemas (XSDs). As caBIG requires that the inputs and outputs of service operations use only registered objects, these input and output data types are defined by the XSDs which are registered in GME. In this way, the XSDs are used both to describe the contract of the service and to validate the XML serialization of the objects which it uses.
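
The validation role of the XSDs can be illustrated with the standard Java XML validation API. This is only a minimal sketch: the schema, the urn:example:gene namespace, and the Gene element are invented for illustration and stand in for a schema that would normally be retrieved from the GME. The class name SchemaCheckSketch is likewise hypothetical.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

// Hypothetical example: namespace, element name, and schema are invented.
class SchemaCheckSketch {

    // Stand-in for a schema that would normally come from the GME.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='urn:example:gene' xmlns='urn:example:gene'"
        + " elementFormDefault='qualified'>"
        + "<xs:element name='Gene'>"
        + "<xs:complexType>"
        + "<xs:attribute name='symbol' type='xs:string' use='required'/>"
        + "</xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    // Returns true when the XML instance adheres to the schema above.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            // SAXException for invalid input, IOException for read failures.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<Gene xmlns='urn:example:gene' symbol='TP53'/>"));
        System.out.println(isValid("<Gene xmlns='urn:example:gene'/>"));
    }
}
```

The same schema object could be used on both sides of an exchange, which mirrors the dual role described above: the XSD documents the contract and checks the serialized instance.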

Cancer Data Standards Repository (caDSR)


Semantic Annotation of UML Domain Models

Proper semantic integration requires that each class and its attributes from the UML domain model be mapped to appropriate concepts in a controlled terminology. The caCORE SDK utilizes the NCI Thesaurus as its primary terminology source, but any well-structured, concept-based description logics terminology should in principle be suitable. The concept selection process can be entirely manual, or it can be partially automated using the Semantic Connector, a tool supplied by the caCORE SDK. The Semantic Connector takes the UML domain model expressed in XMI as input and uses the caCORE EVS APIs hosted at the NCI to search the NCI Thesaurus for appropriate concepts. Semantic annotations for classes and attributes are specified using tagged values in the UML domain model.

UML Domain Model Loader

The UML domain model, annotated with semantic concept codes, contains a considerable amount of metadata about the ultimate system (both data and analytical services) that will be deployed to the grid. However, it is not in a form that is amenable to query and retrieval in a runtime environment, nor is it easily queried by humans who want to use this information for other purposes. The UML domain model loader addresses these limitations by transforming and loading the models into the caDSR, which provides APIs that support runtime access to metadata. A UML domain model annotated with semantic concept information is exported to XMI format using a UML modeling tool such as Enterprise Architect. It is then used as input to the UML domain model loader, which uses a set of mapping rules to load the metadata represented by Classes, Attributes, and Associations into entities of the caDSR. The following section contains the details of the UML to caDSR mapping rules.

UML to caDSR Mapping

Metadata represented in the UML domain model is mapped to caDSR administered component types using the following mapping rules:

  • A UML Class is mapped to an Object Class, which according to the ISO 11179 specification represents a thing in the real world.
  • An attribute of a UML Class is mapped to a Property, which according to the ISO 11179 specification represents an attribute of a real-world thing.
  • The combination of a UML Class and one of its attributes is mapped to a Data Element Concept.
  • The combination of a UML Class, one of its attributes, and the data type of that attribute is mapped to a Data Element, commonly referred to as a Common Data Element (CDE).
  • The Project to which the UML domain model belongs is mapped to a Classification Scheme.
  • Packages in the UML model (which may represent sub-projects within a project) are mapped to Classification Scheme Items.
  • An Association between two classes is mapped to an Object Class Relationship.

Refer to the "Registration of Metadata in the caDSR" chapter of the caCORE SDK Programmer's Guide for complete details on loading UML domain models to caDSR.
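
The class and attribute rules above can be sketched in a few lines of Java. The UmlToCadsrSketch class, its method, and the label format of the returned strings are all hypothetical; they are not the caDSR or caCORE SDK types, and the real loader records far more detail (identifiers, versions, concept codes).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified illustration of the mapping rules; not the real loader.
class UmlToCadsrSketch {

    // Lists the administered components derived from one attribute of a UML class.
    static List<String> mapAttribute(String umlClass, String attribute, String dataType) {
        List<String> components = new ArrayList<>();
        // UML Class -> Object Class
        components.add("ObjectClass: " + umlClass);
        // UML attribute -> Property
        components.add("Property: " + attribute);
        // Class + attribute -> Data Element Concept
        components.add("DataElementConcept: " + umlClass + "." + attribute);
        // Class + attribute + data type -> Data Element (CDE)
        components.add("DataElement: " + umlClass + "." + attribute + " : " + dataType);
        return components;
    }

    public static void main(String[] args) {
        for (String component : mapAttribute("Gene", "symbol", "java.lang.String")) {
            System.out.println(component);
        }
    }
}
```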

caGrid Reliance on caDSR

After a UML domain model is transformed, loaded, and curated in caDSR, the model is ready to be used as the basis of an object-oriented grid client and service. All data movement in caGrid between client and service is done using instances of Classes registered in the caDSR. caGrid requires that all data types used in the grid are registered in caDSR and come from a given Project version. That is, even though Attributes and other items in caDSR can be versioned individually, in order to use those types on the grid, they need to be associated with a specific Project version. Several components of caGrid make use of the wealth of information in the caDSR. As mentioned above, grid services use registered data models as their information model. By doing so, they are able to advertise both the syntax and semantics of the model by exposing an export of the relevant caDSR information as service metadata. The details of the model used to expose this information are shown in the section below. Once the information is exposed in this model, caGrid leverages it for grid service advertisement and discovery. These processes are described in the discovery section. Finally, the information models registered in caDSR are used as the conceptual foundation for the actual communication format used to exchange data on the grid. This process of serializing and deserializing data instances on the grid is detailed in the serialization overview.

Metadata Models


All caGrid Services are expected to publish a set of standard metadata which draws heavily from the metadata registered in caDSR and EVS; it details the functionality of the service, and the institution providing it. The following sections describe these models.

Standard Metadata Model (gov.nih.nci.cagrid.metadata.ServiceMetadata)

The ServiceMetadata class is the main entry point for the standard service metadata. Shown below in the metadata domain, this model draws heavily on the common and service packages, also shown below. Instances of this model describe the grid service, its hosting environment, and the underlying semantics of the data models used by the service's operations.

Standard Data Service Metadata Model (gov.nih.nci.cagrid.metadata.data.DomainModel)

caGrid Data Services, in addition to caGrid standard service metadata, expose standard data service metadata (DomainModel), which details not only the UML Classes exposed by the service, but also their relationships, such as associations and inheritance. This information describes the logical model over which data service queries are executed.

caGrid Metadata in WSRF


Globus Information Services

The Globus Information Services component, realized as the Monitoring and Discovery System (MDS), is a suite of web services to monitor and discover resources and services on Grids. This system allows users to discover what resources are considered part of a Virtual Organization (VO) and to monitor those resources. MDS services provide query and subscription interfaces to arbitrarily detailed resource data and a trigger interface that can be configured to take action when pre-configured trouble conditions are met. MDS is composed of the following three main components:

  • WS MDS Index Service – This service contains a registry of grid resources and collects information from them, making it accessible and queryable from one location. Generally, a virtual organization deploys one or more index services, which then collect data on all of the grid resources available within that VO.
  • WS MDS Trigger Service – This service collects data from grid resources and passes the data to appropriate programs to perform various actions in response to events. It is not currently used by the caGrid metadata infrastructure.
  • WS MDS Aggregator – This is the infrastructure on which the previous services are built. It collects, manages, and indexes data from an aggregator source and sends that data to an aggregator sink for processing.

WSRF Resource Properties

This section provides a brief recap of information about WSRF and the Globus implementation of it. More details can be found in the Globus documentation.

The Globus Toolkit 4 provides a framework for creating WSRF grid services. The WS-Resource Framework (WSRF) is a set of six Web services specifications that define what is termed the WS-Resource approach to modeling and managing state in a Web services context. In this approach, a resource is an entity that encapsulates the state of a stateful web service. Generally, each resource is a separate object, but in certain cases it might be a singleton. A resource may just be a front end for state kept in an external entity, such as a file in a file system, a row in a database, or an entity bean in a J2EE container.

A resource key is represented by a ResourceKey interface. It is a combination of a key name and the actual key value. A resource is represented by a Resource interface. It is a marker interface without any method defined. All resource objects must implement this interface. Resources are managed by an object that implements the ResourceHome interface. The ResourceHome interface provides methods for finding and removing resources as well as methods for identifying the SOAP header element and class for the resource key. In addition to the methods specified by the interface, ResourceHome implementations will generally provide an implementation-specific create() call or any other methods that operate on a set of resources.

Resources may have resource properties. Resource properties are declared in the WSDL of the service as elements of a resource property document. The ResourceProperties interface contains a single accessor method for retrieving the ResourcePropertySet from a resource. It must be implemented by all resources that want to expose resource properties. The ResourcePropertySet is the representation of the resource property document associated with the resource. It contains methods for managing the set of resource properties, e.g. adding and removing resource properties, and for discovering properties of the document itself, e.g. its name. The ResourceProperty interface needs to be implemented by all resource properties. It contains methods for managing the set of values associated with the resource property, discovering properties of the resource property element, and serializing the resource property to an array of SOAP or DOM elements. The ResourcePropertyMetaData interface contains metadata information about a ResourceProperty, such as the resource property name, cardinality, etc.

Once metadata items are exposed as ResourceProperties, they can be queried using standard web service operations defined by the WS-ResourceProperties specification. Consult the specification for more details; a synopsis of the operations is provided here:

  • GetResourceProperty: allows access to the value of any resource property given its QName.
  • GetMultipleResourceProperties: allows access to the value of several resource properties at once, given each of their QNames.
  • QueryResourceProperties: allows complex queries on the resource properties document. Currently, the query language used is XPath.
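
To give a feel for the kind of XPath expression a QueryResourceProperties call might carry, the sketch below evaluates XPath locally against a small sample document using the standard javax.xml.xpath API. It does not call a grid service. The sample document, its element names, and the ResourcePropertyQuerySketch class are invented for illustration; real resource property documents are namespace-qualified and follow the caGrid metadata schemas.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Local illustration only: evaluates XPath over an invented sample document.
class ResourcePropertyQuerySketch {

    // Invented stand-in for a resource property document (no namespaces, for brevity).
    static final String DOC =
        "<ResourceProperties>"
        + "<ServiceMetadata>"
        + "<name>GeneService</name>"
        + "<operation name='query'/>"
        + "<operation name='getGene'/>"
        + "</ServiceMetadata>"
        + "</ResourceProperties>";

    // Evaluates the expression against the sample document and returns its string value.
    static String evaluate(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(DOC)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate("/ResourceProperties/ServiceMetadata/name"));
        System.out.println(evaluate("//operation[2]/@name"));
    }
}
```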

Index Service


For the purposes of Advertisement and Discovery, caGrid leverages the Globus-provided Index Service. The Index Service implements the standard WS-ServiceGroup specification. When services are added to the service group, they specify what and how metadata should be accessed from them, and the Index Service performs this aggregation. Clients can then query this aggregated information using standard Resource Property operations. More information on these operations can be found here.

caGrid services are expected to maintain soft-state registration to a well-known Index Service instance, specifying polling of caGrid standard metadata. For more information, see the section on caGrid Advertisement.

For more information on the Index Service, see the Globus documentation (http://www.globus.org/toolkit/docs/4.0/info/).

Index Service Administrators Guide

The Index Service Administrators Guide is intended for administrators that wish to install the Index Service. The administrators guide provides detailed information regarding the installation, operation, and administration of the Index Service.

Advertisement Overview


caGrid provides the service infrastructure necessary to leverage MDS to enable service advertisement. The advertisement is made possible by realizing the conceptual caGrid standard metadata, described in the previous chapter, as ResourceProperties of caGrid services. Each caGrid service is expected to create a singleton Resource to manage its service metadata. Each service metadata item the service wishes to expose is represented as a separate ResourceProperty of this singleton Resource. For example, a Data Service may expose two metadata items: ServiceMetadata and DomainModel. Each of these items will be represented as a ResourceProperty, contained in the ResourcePropertySet of the singleton Resource of the service.

The caGrid-provided implementation of ResourceHome, BaseResourceHome, manages the initial creation and management of the singleton Resource. The caGrid-provided implementation of Resource, BaseResource, contains all of the logic necessary to manage the collection of service metadata items, populate them from a file, and advertise them to an Index Service. The ResourceConfiguration class maintains the information needed to configure the advertisement process. Each of these classes and the corresponding configuration files are managed by Introduce. When a service developer adds or removes service metadata in Introduce, these files are edited, using code generation capabilities, to reflect the metadata configuration of the service. Developers not using Introduce to create their services can reuse these classes as a starting point, use this design and re-implement it, or use a completely separate process. The only requirement is that the end result is caGrid standard metadata exposed as ResourceProperties of a singleton Resource of their service.

The implementation requires that each metadata item be represented as a Java Bean capable of serializing itself to XML and deserializing itself from XML. caGrid provides XML Schemas describing the XML format of its standard metadata items. These XML schemas are used to generate appropriate Java Beans. The BaseResource, when initialized, reads the ResourceConfiguration. This configuration specifies which metadata items are to be instantiated from corresponding XML files, and where those files are located. The BaseResource then reads each file and deserializes the metadata instances. Those metadata items that are not populated from file are expected to be instantiated from the service's implementation code. Once the metadata items are instantiated, the BaseResource creates and populates its ResourcePropertySet with all of the appropriate items. These metadata items are then made available as service metadata exposed as ResourceProperties of the service's singleton Resource.

The final process the BaseResource performs on initialization is the registration, or advertisement, of the appropriate metadata items to the Index Service. Again, this process is configured through the ResourceConfiguration. For each metadata item, the configuration specifies whether or not the Index Service should aggregate its value. This enables services to expose some service metadata as ResourceProperties without registering it with the central Index. Additionally, the configuration specifies the location of a registration configuration file. This file is used to configure the MDS registration process. The file specifies which Index Service to register with, how the Index Service should obtain the values of the metadata being registered (and the configuration options of that method), and which metadata items are being registered.
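
The split between "exposed" and "registered" metadata can be sketched as a map keyed by QName with a per-item flag. This is a hypothetical stand-in for the idea only: MetadataResourceSketch, its methods, and the urn:example:metadata namespace are invented, and the sketch does not use or reproduce BaseResource, ResourceConfiguration, or any Globus class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.namespace.QName;

// Hypothetical illustration of a singleton resource holding metadata items.
class MetadataResourceSketch {

    // One exposed metadata item and whether the index should aggregate it.
    private static final class Item {
        final Object value;
        final boolean register;

        Item(Object value, boolean register) {
            this.value = value;
            this.register = register;
        }
    }

    private final Map<QName, Item> properties = new LinkedHashMap<>();

    // Exposes a metadata item as a property of this resource.
    void expose(QName name, Object value, boolean register) {
        properties.put(name, new Item(value, register));
    }

    // Every exposed item can be read directly from the service.
    Object get(QName name) {
        Item item = properties.get(name);
        return item == null ? null : item.value;
    }

    // Only flagged items are advertised to the index.
    List<QName> namesToRegister() {
        List<QName> names = new ArrayList<>();
        for (Map.Entry<QName, Item> entry : properties.entrySet()) {
            if (entry.getValue().register) {
                names.add(entry.getKey());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String ns = "urn:example:metadata"; // invented namespace
        MetadataResourceSketch resource = new MetadataResourceSketch();
        resource.expose(new QName(ns, "ServiceMetadata"), "service metadata bean", true);
        resource.expose(new QName(ns, "DomainModel"), "domain model bean", false);
        System.out.println(resource.namesToRegister());
    }
}
```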

Advertisement Client

To perform the actual advertisement of this service metadata, caGrid leverages a component called the Advertisement Client, which uses the MDS ServiceGroupRegistrationClient to perform the Index Service registration. It handles the "soft state registration" process, wherein the service periodically renews its registration with the Index Service for the duration of the service's lifetime. The registration with the Index Service is only valid for a short lifetime (several minutes), and if the service fails to renew its registration, the Index Service will purge its corresponding entry. This dynamic process ensures that the Index Service only contains relevant entries, as expired entries are discarded. It also ensures that the Index Service contains the most recently polled value of the metadata items, as it periodically gets the latest value using the process specified by the service at registration time. This process ensures that the integrity of the caGrid "yellow and white pages" will survive periodic Index Service failure, registered service failure, and general network failures. It should be noted, however, that there are various delays in this process, so the Index Service will always contain slightly stale information; if the most up-to-date information is needed, it should be obtained directly from the service in question.

The Advertisement Client is essentially just a wrapper for the ServiceGroupRegistrationClient that adds some features the current MDS client lacks (such as removal of entries at service shutdown). This approach was taken because future versions of the MDS client are expected to add these features, and the use of a specialized Advertisement Client insulates service developers from these changes. The public API of the Advertisement Client is very simple: it provides register() and unregister() methods, and the constructor takes the aforementioned MDS configuration parameters. The register method is a simple pass-through to the ServiceGroupRegistrationClient, whereas the unregister method is a pass-through to another new caGrid API called the Termination Client (described shortly). The main novel aspect of the Advertisement Client is that its constructor registers a JVM shutdown hook, so that unregister is called, and the service is unregistered, when the JVM hosting its container shuts down. Additionally, the Advertisement Client registers itself to receive asynchronous status update callbacks from the ServiceGroupRegistrationClient, which it logs appropriately.
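
The soft-state behaviour can be modelled in a few lines. The following is a simplified, in-memory sketch and not the MDS implementation: SoftStateRegistrySketch and its methods are hypothetical, and time is passed in explicitly so that the behaviour is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of soft-state registration; not the MDS code.
class SoftStateRegistrySketch {

    private final Map<String, Long> expiryByService = new HashMap<>();
    private final long lifetimeMillis;

    SoftStateRegistrySketch(long lifetimeMillis) {
        this.lifetimeMillis = lifetimeMillis;
    }

    // Registering and renewing are the same operation: push the expiry forward.
    void renew(String serviceAddress, long nowMillis) {
        expiryByService.put(serviceAddress, nowMillis + lifetimeMillis);
    }

    // Drops every entry whose lifetime has elapsed and returns how many were purged.
    int purgeExpired(long nowMillis) {
        int before = expiryByService.size();
        expiryByService.values().removeIf(expiry -> expiry <= nowMillis);
        return before - expiryByService.size();
    }

    boolean contains(String serviceAddress) {
        return expiryByService.containsKey(serviceAddress);
    }

    public static void main(String[] args) {
        SoftStateRegistrySketch registry = new SoftStateRegistrySketch(1_000);
        registry.renew("http://example.org/serviceA", 0);
        registry.renew("http://example.org/serviceB", 0);
        registry.renew("http://example.org/serviceA", 800); // A renews, B does not
        registry.purgeExpired(1_500);
        System.out.println(registry.contains("http://example.org/serviceA"));
        System.out.println(registry.contains("http://example.org/serviceB"));
    }
}
```

A service that stops renewing simply ages out of the registry, which is why no explicit failure notification is needed for the registry to stay relevant.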

The Termination Client is a general API which can be used to unregister a particular EPR from a given (passed into the constructor) Index Service. The Advertisement Client constructs one, passing the Index Service from its configuration, and calls unregister, passing the EPR of the service it registered. The unregister method first locates any registrations in the Index Service which correspond to the given service EPR, and for each, it leverages the WS-ResourceLifetime capability to set the termination time on the resource to 5 seconds in the future. The Index Service then has a background process that looks for registration resources with a termination time in the past, and removes them, thus unregistering the service. This process is slightly complicated, as the Index Service's registration resources don't support immediate termination, nor does the ServiceGroupRegistrationClient return the EPRs to the registration resources (so they must be discovered). The upside to this more complicated process is that any past registrations that may exist and which didn't get removed, are also terminated. The Advertisement Client makes use of this by unregistering the service (using the Termination Client) before it attempts a new registration.
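
The unregistration steps just described (find matching registrations, move their termination time a few seconds ahead, let a background sweep remove them) can be sketched as follows. TerminationSketch is a hypothetical in-memory model; it does not use WS-ResourceLifetime or the Index Service, and service addresses stand in for EPRs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of the termination process; not the caGrid client.
class TerminationSketch {

    static final long GRACE_MILLIS = 5_000; // the 5 second offset mentioned above

    private static final class Registration {
        final String serviceAddress;
        long terminationTime;

        Registration(String serviceAddress, long terminationTime) {
            this.serviceAddress = serviceAddress;
            this.terminationTime = terminationTime;
        }
    }

    private final List<Registration> registrations = new ArrayList<>();

    void add(String serviceAddress, long terminationTime) {
        registrations.add(new Registration(serviceAddress, terminationTime));
    }

    // Schedules every registration of the service, including stale ones, for termination.
    int unregister(String serviceAddress, long nowMillis) {
        int matched = 0;
        for (Registration r : registrations) {
            if (r.serviceAddress.equals(serviceAddress)) {
                r.terminationTime = nowMillis + GRACE_MILLIS;
                matched++;
            }
        }
        return matched;
    }

    // Background sweep: removes registrations whose termination time is in the past.
    void sweep(long nowMillis) {
        registrations.removeIf(r -> r.terminationTime < nowMillis);
    }

    int count(String serviceAddress) {
        int n = 0;
        for (Registration r : registrations) {
            if (r.serviceAddress.equals(serviceAddress)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        TerminationSketch index = new TerminationSketch();
        index.add("serviceA", 600_000); // leftover registration from an earlier run
        index.add("serviceA", 600_000); // current registration
        index.add("serviceB", 600_000);
        index.unregister("serviceA", 1_000);
        index.sweep(7_000);
        System.out.println(index.count("serviceA") + " " + index.count("serviceB"));
    }
}
```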

The Termination Client supports the setting of a security descriptor such that authorization on the registration resources could be supported. While caGrid's Index Service currently does not require authentication or authorization, the Advertisement Client allows such a descriptor to be passed in through the MDS configuration, and appropriately passes it along to the Termination Client if it is present.

Discovery


Overview

Often the address of a service of interest is not a "well known" value, and is something that is discovered at runtime. caGrid provides the means to discover services of interest by querying a live registry of available caGrid services. All caGrid services are required to publish standard metadata (described in the caGrid Metadata Design Document) that describes their functionality. This information is aggregated in the registry Index Service, and can be used to find out information about the currently running services, including their current Endpoint References (EPRs). Building on this information, a Discovery API is provided with caGrid that facilitates the querying of this information toward the aim of discovering service EPRs.

Client API

The Discovery API is intended to be used by any applications or services that wish to consume the data, analytics, and core services provided by caGrid. While there are still cases when interacting with a particular instance of a service is desired, the Discovery API provides a means by which applications can locate services by the information or capabilities they provide. One of the key advantages of the grid approach to caBIG is the dynamic discovery of available resources.

Discovery Process on caGrid
In order to make use of the Discovery API, the discovery process must be "bootstrapped" using a well-known service address of an Index Service. The default constructor of the DiscoveryClient, the main interface to the Discovery API, should default to the official NCI Index Service. However, this behavior can be modified by using the constructor that takes the Index Service URL, or by calling the appropriate setter method (setIndexEPR). Additional details on this, as well as all Discovery API information, can be found in the Discovery section of the caGrid 1.1 Programmer's Guide.


Basic Use

The simplest discovery scenario is to query the Index Service for all registered services. The boolean value specified in line 3 of the example indicates whether services should be ignored if they do not expose the caGrid standard metadata. In most application scenarios, a value of "true" is used, as services without standard metadata are either not compliant, not properly configured, or inaccessible (e.g. behind a misconfigured firewall).

String indexUrl = "http://cagrid-index.nci.nih.gov:8080/wsrf/services/DefaultIndexService";
DiscoveryClient client = new DiscoveryClient(indexUrl);
EndpointReferenceType[] allServices = client.getAllServices(true);
for (EndpointReferenceType epr: allServices) {
    System.out.println(epr.getAddress().toString());
}

As shown in the example on line 3, the method returns an array of EPRs. This is true of all discovery operations. The EPRs in the array represent the services matching the specified criteria (in this case just that it is a valid caGrid service), and can be used to create clients to invoke operations on the corresponding services (detailed later).

client-config.wsdd
All caGrid client programs need to have the file client-config.wsdd on their classpath. There are two recommended ways to arrange this:
  • Include $GLOBUS_LOCATION in the classpath.
  • Copy client-config.wsdd from $GLOBUS_LOCATION to a directory that is already on the classpath.

Advanced Use

There are many discovery operations available in the DiscoveryClient. They provide a range of capabilities, from "full text search" suitable for a freeform webpage-like interface, to simple text-based criteria such as operation names or concept codes, to complex criteria ("query by example") such as point of contact information or UML class criteria. Many discovery methods take a UMLClass prototype in order to discover services based on data types; one of them is shown in the example below. This method, discoverServicesByOperationInput, locates services that provide an operation that takes, as input, an instance of the specified data type. The example below finds services that provide operations that take caBIO's Gene instances as input. The prototype object can be as partially populated as desired (such as only specifying the package name, or being more explicit in specifying the exact project name and version).

EndpointReferenceType[] services = null;
UMLClass inputClass = new UMLClass();
inputClass.setClassName("Gene");
inputClass.setPackageName("gov.nih.nci.cabio.domain");

services = client.discoverServicesByOperationInput(inputClass);
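
The "query by example" idea behind a partially populated prototype can be sketched independently of the grid. The sketch assumes that an unset (null) prototype field acts as a wildcard; PrototypeMatchSketch and UmlClassInfo are hypothetical types, not the caGrid UMLClass bean, and the real DiscoveryClient evaluates its criteria against the Index Service rather than in memory.

```java
// Hypothetical illustration of prototype matching; not the DiscoveryClient logic.
class PrototypeMatchSketch {

    // Simplified stand-in for a UML class description.
    static final class UmlClassInfo {
        final String projectName;
        final String packageName;
        final String className;

        UmlClassInfo(String projectName, String packageName, String className) {
            this.projectName = projectName;
            this.packageName = packageName;
            this.className = className;
        }
    }

    // A null prototype field is treated as "match anything" (an assumption of this sketch).
    private static boolean fieldMatches(String wanted, String actual) {
        return wanted == null || wanted.equals(actual);
    }

    static boolean matches(UmlClassInfo prototype, UmlClassInfo candidate) {
        return fieldMatches(prototype.projectName, candidate.projectName)
            && fieldMatches(prototype.packageName, candidate.packageName)
            && fieldMatches(prototype.className, candidate.className);
    }

    public static void main(String[] args) {
        UmlClassInfo gene = new UmlClassInfo("caCORE", "gov.nih.nci.cabio.domain", "Gene");
        // Only the package is specified, so any class in that package matches.
        UmlClassInfo byPackage = new UmlClassInfo(null, "gov.nih.nci.cabio.domain", null);
        System.out.println(matches(byPackage, gene));
    }
}
```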

Additionally, there are methods to discover services by "type". For example, there are several methods named like discoverDataServices*, which only return services that implement the standard Data Service operations. Services may also be discovered by identifying the concept code matching the service type of interest, and invoking the discoverServicesByConceptCode method, which searches for services based on concepts applied to the service. There is a concept representing "Grid Service" in the ontology and derived concepts such as "Analytical Grid Service" and "Data Grid Service". It is expected additional concepts will be derived in the future, as driven by the community.

Introduce Metadata Support Overview


Several metadata-related features of caGrid manifest themselves as functionality in Introduce. Introduce is the graphical service development environment used in caGrid, and it supports an extension framework whereby functionality can be plugged into Introduce dynamically. While such functionality could alternatively have been implemented directly in Introduce, this approach promotes a loose coupling between the components without loss of functionality or any difference to the end user. This chapter details the various metadata-related Introduce extensions individually. All of these extensions are automatically installed into Introduce during the caGrid build process, and the caBIG Introduce Creation Viewer automatically loads the appropriate service-specific extensions to all services created with Introduce in caBIG.

caDSR Type Discovery

The caDSR Grid Data Service provides read and query access to the information available in the caDSR. As such, the service provides useful information when creating services and so is integrated with Introduce as two extensions. Introduce has two types of datatype "discovery" extensions, which are both implemented for the caDSR. Specifically, there is a Discovery Tools extension, the CaDSRTypeDiscoveryComponent, and a Discovery Selection extension, the CaDSRTypeSelectionComponent.

The CaDSRTypeDiscoveryComponent allows the user to browse registered Projects from the caDSR and view a UML rendering of a selected package. This provides a means to browse the caDSR for available data types which could be used in the development of services. This component is a simple Panel, which uses the caDSR Data Service to populate the Project and Package combo boxes, and makes use of the MMS to generate Domain Models. The caGrid graph project is then used to render UML views of the generated Domain Models.

CaDSRTypeDiscoveryComponent

The CaDSRTypeSelectionComponent extension complements the CaDSRTypeDiscoveryComponent by providing a way to add XML Schemas to an Introduce service, which correspond to projects registered in caDSR. This component integrates into the "Import Data Types" section of the Types panel in Introduce. When a user browses to a particular package and presses the Add button, the component identifies the appropriate XML Schema(s) for that package and retrieves them from the GME. As of caGrid 1.3, the component makes use of the caDSR's ability to annotate Projects and Packages with their appropriate XML Namespaces. If the selected Project and Package do not have such information registered in the caDSR, then the Namespace field will be populated with a guess based on XML Schema naming conventions, and will be shown in a blue font. The user may still attempt to add the schemas to the service using the Add button, but an error should not be unexpected when the GME is consulted for those schemas. If the caDSR does have such namespace annotations, then the Namespace field will be populated with the appropriate information and shown in a black font. This indicates a much stronger confidence that such XML Schemas actually exist in the GME. Upon adding the schemas to the service, the extension will annotate the Introduce Namespace entries with details indicating the caDSR Project name and version from which they were extracted. This information can be used by other extensions (such as the caGrid Service Metadata Generator described below).

caGrid Service Metadata Generator

One of the most metadata-relevant Introduce extensions is a service-specific extension, which is actually a suite of components that hook into the Introduce service synchronization process when the extension is added to a service. These components comprise the Service Metadata Generator extension, and are responsible for creating an instance of the standard Service Metadata for a service whenever it is saved. The components, ServiceMetadataCreationPostProcessor, MetadataCodegenPreProcessor, and MetadataCodegenPostProcessor, run during the post creation, pre code generation, and post code generation processes, respectively (consult the Introduce design document for more details). The ServiceMetadataCreationPostProcessor is responsible for copying and installing the caGrid Service Metadata XML Schemas into the service, adding the appropriate metadata jars, and generating a shell Service Metadata instance. Then, each time a service is saved in Introduce, the code generation components read the Introduce model and edit the Service Metadata instance appropriately. That is, all of the descriptions from Introduce are put into the metadata, and all of the service contexts, operations, metadata, etc. are updated. Essentially these components are responsible for extracting all the metadata-relevant information from the more complex Introduce service metadata model, and representing it in the caGrid standardized Service Metadata model. While somewhat laborious, the process is fairly straightforward. After editing the metadata model appropriately, the extension extracts any Namespace annotations that are present in the Introduce Namespace model (such as would be present if the schemas were added by the caDSR type selection component), and sends the model and annotations to the MMS Service for annotation.

A similar extension exists to generate standardized Data Service Metadata instances as part of the suite of Data Service Introduce extensions. Details about that extension and its functionality can be found in the Data Service design documents.

caGrid Service Metadata Editor

Complementing the Service Metadata Generator, an editor for the user-editable fields of the standardized Service Metadata instance of the service is provided as an Introduce metadata editor extension. The component, ServiceMetadataEditor, is a simple Panel that displays these fields from the current instance of metadata, and allows the user to edit and save them. Specifically, the component allows the hosting research center information to be edited, including points of contact, address, and other information such as the display name and websites. It also allows the points of contact for the service to be edited.

caGrid Data Service Metadata Viewer

Similar to the ServiceMetadataEditor, the DomainModelViewer is an Introduce metadata editor extension. It provides a read-only UML display of the domain model to which the data service is providing query access. It uses the caGrid graph project to render the view, and simply reads the service's current Domain Model metadata instance to populate the view.

DomainModelViewer

caCORE SDK Schema Generation

To simplify the use of the caCORE SDK XML Schema generation capabilities, as described in the section on Schema generation titled caCORE SDK in the Metadata Design, an Introduce extension is provided which can produce XML Schemas from an XMI model. The extension, SDKTypeSelectionComponent, is a type discovery extension, like the caDSR Grid Service Type Discovery component. Rather than using the caDSR, however, this extension makes use of the caCORE SDK transparently, without requiring the user to install or configure the caCORE SDK. The feature can be found on the "Types" tab of Introduce, under the "Create from XMI" subtab.

The component provides the user with a browse button to select their XMI file, and various input boxes to enter supplemental information about the project represented by the XMI file. These fields are used to control the XML Schema generation process (indicating the information needed to select a namespace, and which packages from the Project to process). As the user fills out the form, a status box below is updated, as are status icons on each of the fields. These are validators that ensure valid information is provided. If the user presses the "Add" button before all of the error indicators are cleared, a dialog showing the errors is displayed and no processing occurs. Warnings do not prevent processing, but generally indicate something that should be examined. Once the "Add" button is pressed with no validation errors, the component goes through the process of generating XML Schemas and adding them to the service.
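The distinction between errors (which block processing) and warnings (which do not) can be illustrated with a small sketch. It is a hedged illustration rather than the extension's validator code: the class FormValidation, the two form fields, and the specific rules and messages are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical validation result: errors block processing, warnings do not. */
final class FormValidation {
    final List<String> errors = new ArrayList<>();
    final List<String> warnings = new ArrayList<>();

    /** Processing may proceed only when no errors remain; warnings are advisory. */
    boolean canProceed() {
        return errors.isEmpty();
    }

    /** Illustrative checks on two assumed form fields. */
    static FormValidation validate(String xmiPath, String projectVersion) {
        FormValidation v = new FormValidation();
        if (xmiPath == null || xmiPath.trim().isEmpty()) {
            v.errors.add("An XMI file must be selected.");
        } else if (!xmiPath.toLowerCase().endsWith(".xmi")) {
            v.warnings.add("The selected file does not have an .xmi extension.");
        }
        if (projectVersion == null || projectVersion.trim().isEmpty()) {
            v.errors.add("A project version is required.");
        }
        return v;
    }

    public static void main(String[] args) {
        FormValidation v = validate("model.xml", "1.0");
        // A warning is recorded, but processing is still allowed.
        System.out.println("proceed=" + v.canProceed() + ", warnings=" + v.warnings);
    }
}
```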

The components that implement the logic of the extension use the following process when a type is to be added. First, as mentioned above, the input is validated. If everything is valid, the component extracts a local copy of the caCORE SDK version 4.1.1 to a temporary directory. The SDKExecutor is then used to execute the process; it is passed an instance of SDKGenerationInformation, a bean that represents the input gathered from the user. The executor first applies the necessary configuration changes to the SDK configuration file by reading the values of the SDKGenerationInformation bean, and then executes the SDK as an Ant process. The results are validated and returned as an instance of SDKExecutionResult, a bean that provides access to the artifacts created by executing the SDK. If everything is valid, the component copies and installs the generated schemas into the service.

Serialization and Deserialization


Overview

XML schemas play a role in several aspects of the caGrid runtime environment. Both the client and service APIs are object oriented, and operate over well-defined and curated data types. Clients and services in caGrid communicate through the grid using Globus grid clients and service infrastructure, respectively. The grid communication protocol is XML, and thus the client and service APIs must transform the transferred objects to and from XML. This XML serialization of objects is restricted in that each object that travels on the grid should do so as XML which adheres to an XML schema registered in the Global Model Exchange (GME). Whereas metadata registries like the caDSR and EVS define the properties, relationships, and semantics of data types, the GME defines the syntax of their XML serialization. Furthermore, Globus services are defined by the Web Service Description Language (WSDL). The WSDL describes the various operations the service provides to the grid. The inputs and outputs of the operations, among other things, are defined in WSDL by XML schemas (XSDs). As caBIG requires that the inputs and outputs of service operations use only registered objects, these input and output data types are defined by the XSDs which are registered in GME. In this way, the XSDs are used both to describe the contract of the service and to validate the XML serialization of the objects which it uses.

Object Serialization

As mentioned in a preceding section, objects must serialize to and from XML as they traverse the grid. This section details the alternative approaches for said process.

Standard Globus Serialization

caGrid is built using the Globus 4 toolkit. The Globus toolkit has a complete set of tools for automatically generating serializable objects from models defined in XSDs. Using this mechanism, Globus can automatically create a set of Java Beans that represent the model and that are serialized and deserialized automatically at client and service runtime by the Globus toolkit, with no extra configuration.
In order to use these types in a grid service, the developer must describe the types they will be using in the WSDL file. This enables Globus to locate the types and generate the required beans at stub generation time. For more information on this process, please see the Globus documentation.
Using this approach, the object-oriented client and service APIs are written using the Globus-generated Java objects. The toolkit automatically serializes and deserializes the objects as they travel to and from the grid.

Custom Serialization

If a developer already has a Java object model that is either not serializable or uses a custom serializer, the developer will need to configure the Globus service and client to use the custom serializer. This can be done by using the Globus type mapping configuration XML in the service's WSDD and in the client's configuration WSDD. This type mapping paradigm is documented in the Globus documentation, but the basics are covered here.
If a user has a class called MyType which has its own serializer, then the following configuration must be placed in the service WSDD and in the client configuration WSDD:

<service ...>
  ...
 <typeMapping encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"
   serializer="gov.nih.nci.cabig.MySerializerFactory"
   deserializer="gov.nih.nci.cabig.MyDeserializerFactory"
   type="java:gov.nih.nci.cabig.bean.MyType"
   qname="ns1:myType" xmlns:ns1="http://cabig.nci.nih.gov/1/myType"/>
   ...
</service>

This configuration allows the service to use the MySerializerFactory and MyDeserializerFactory to marshal its objects back and forth across the grid. The typeMapping element must be present in both the server and client WSDD in order for the custom serializer and deserializer to be invoked.
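To make the role of the typeMapping entry more concrete, the following conceptual sketch models a registry that ties an XML qualified name to a serializer and a deserializer. It deliberately does not use the Axis or Globus APIs: TypeMappingSketch and its methods are hypothetical stand-ins, and the trivial string-based serializer is only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import javax.xml.namespace.QName;

/** Conceptual stand-in for a type-mapping registry; this is not the Axis API. */
final class TypeMappingSketch {
    private final Map<QName, Function<Object, String>> serializers = new HashMap<>();
    private final Map<QName, Function<String, Object>> deserializers = new HashMap<>();

    /** Registers both directions for one XML type, much as a typeMapping entry does. */
    void register(QName qname, Function<Object, String> ser, Function<String, Object> deser) {
        serializers.put(qname, ser);
        deserializers.put(qname, deser);
    }

    String serialize(QName qname, Object value) {
        Function<Object, String> ser = serializers.get(qname);
        if (ser == null) {
            throw new IllegalStateException("No mapping registered for " + qname);
        }
        return ser.apply(value);
    }

    Object deserialize(QName qname, String xml) {
        Function<String, Object> deser = deserializers.get(qname);
        if (deser == null) {
            throw new IllegalStateException("No mapping registered for " + qname);
        }
        return deser.apply(xml);
    }

    public static void main(String[] args) {
        QName myType = new QName("http://cabig.nci.nih.gov/1/myType", "myType");
        TypeMappingSketch service = new TypeMappingSketch();
        service.register(myType,
                v -> "<myType>" + v + "</myType>",
                xml -> xml.replaceAll("</?myType>", ""));
        String wire = service.serialize(myType, "example");
        System.out.println(wire);
        System.out.println(service.deserialize(myType, wire));
    }
}
```

A side that has not registered the mapping cannot interpret the type, which mirrors why the entry is needed in both the service and the client configuration.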

caCORE SDK Serialization

caGrid provides an implementation of the previously mentioned custom serialization process for objects generated with the caCORE SDK version 4. This feature takes advantage of the fact that the caCORE SDK generates an "XML mapping" file which specifies the mapping from every Class and attribute to a corresponding XML entity. This mapping file is of the format specified by Castor (http://castor.codehaus.org/xml-mapping.html), and can be used by the Castor Marshalling framework to marshal between an XML document conforming to the corresponding XML schema and the corresponding Java objects. Similar to the process used for the Globus generated Java Beans, this provides sufficient functionality to use these Java classes in the service and client code, and have them automatically serialized and deserialized to and from the grid. Castor provides the additional ability, however, to separate the Java Beans themselves from the serialization process. In this way, Castor can be used to serialize and deserialize between arbitrary Beans and XML, given an appropriately defined XML mapping. The caCORE SDK creates such a file for each model for which it creates Java Beans and XML Schemas. The caGrid SDK Serializer and Deserializer make use of this functionality by providing the necessary hooks to automatically invoke Castor, using the appropriate mapping, from the Globus infrastructure.

[Figure: components supporting caCORE SDK serialization]

The components used to support this functionality are shown above. The Factory classes' sole responsibility is to be hooked into the underlying Axis framework, and return an appropriate instance of the Serializer and Deserializer classes as needed. Both of these classes internally use the Castor APIs to marshal and unmarshal the Java Beans as needed. They rely on the EncodingUtils to access the appropriate Castor mapping. This utility first attempts to read a configuration parameter "castorMapping" from the current Axis context. This property can be specified in the client and service WSDD file, as shown in the highlighted section of the example above. The value of this parameter is expected to be a classpath reference to the Castor XML mapping file which should be used. This parameter allows multiple services or clients running in the same environment to use different mappings. If this parameter is not set, the mapping is loaded from the default location (/xml-mapping.xml). This location will work for SDK generated Java Beans, as the mapping is included in their jar files at that location. However, as mentioned, the preferred approach is to explicitly specify a unique location, because if two SDK generated systems were used in the same environment, only one of the mappings would be loadable (determined by the classpath settings).
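The lookup order described above (explicit "castorMapping" parameter first, default location otherwise) can be sketched as follows. This is not the EncodingUtils implementation: the class CastorMappingLocator is hypothetical, and a plain Map stands in for the Axis context, which is an assumption made to keep the example self-contained.

```java
import java.io.InputStream;
import java.util.Collections;
import java.util.Map;

/** Hypothetical sketch of resolving the Castor mapping location. */
final class CastorMappingLocator {
    static final String PROPERTY = "castorMapping";
    static final String DEFAULT_LOCATION = "/xml-mapping.xml";

    /** Returns the configured classpath location, or the default if none is set. */
    static String resolve(Map<String, ?> contextProperties) {
        Object configured = contextProperties.get(PROPERTY);
        if (configured instanceof String && !((String) configured).trim().isEmpty()) {
            return (String) configured;
        }
        return DEFAULT_LOCATION;
    }

    /** Opens the mapping from the classpath; returns null if it is not found. */
    static InputStream open(Map<String, ?> contextProperties) {
        return CastorMappingLocator.class.getResourceAsStream(resolve(contextProperties));
    }

    public static void main(String[] args) {
        // The explicit location below is a made-up example path.
        Map<String, String> explicit =
                Collections.singletonMap(PROPERTY, "/org/example/myservice-xml-mapping.xml");
        System.out.println(resolve(explicit));
        System.out.println(resolve(Collections.<String, String>emptyMap()));
    }
}
```

Giving each service or client its own mapping location avoids the case where two mappings compete for the single default path on the classpath.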

Schema Creation

As detailed above, XML Schemas play an important role in the runtime environment of caBIG. This section details some of the alternatives for creating these schemas. It should be clear from the preceding sections that the mechanism used to serialize the objects for a given service is highly dependent on the schemas it uses, as the XML serialization must conform to the schemas. In other words, if the runtime objects were not generated from an existing schema, either the serialization process must adhere to some generated schema, or a schema must be created to describe the serialization format; the two are tightly coupled.

XMI Based Generation

The Object Management Group's standard for XML Metadata Interchange (XMI) is a data representation developed to enable easy interchange of metadata between modeling tools (based on the OMG-UML) and metadata repositories (OMG-MOF based) in distributed heterogeneous environments.
As caBIG requires the modeling of data types in UML, and provides its developers with a UML modeling tool capable of generating XMI (Enterprise Architect), it is reasonable to create schema generation tools that process XMI. Furthermore, the UML Loader, which is responsible for registering metadata in the caDSR, takes a semantically annotated XMI file as input. The following sections detail approaches for generating XML schemas from XMI.

Enterprise Architect

In Enterprise Architect, an XML schema corresponds to a UML package. Therefore the XSD generation is a package-level operation in Enterprise Architect (EA). The basic generation is fairly simple to invoke, and is detailed below in Table 2. Using this process on the caBIO model, data types such as the Chromosome shown below are produced.

1. Select the package to be converted to XSD by right-clicking on the package in the Project Browser.
2. Select Project | Generate XML Schema from the main menu.
3. Set the desired output file using the Filename field.
4. Set the desired xml encoding using the Encoding field.
5. Click on the Generate button to generate the schema.
6. The progress of the schema generator will be shown in the Progress edit box.

Table 2: Steps to generate an XSD in EA

As can be seen in the example, the naming scheme used by EA may not be what one would expect. The parent data type's name is prefixed to the names of all of its children. This is not only overly verbose, but would also produce non-standardly named Objects if it were used. Another disadvantage is the fact that schemas are generated on a package-level basis, and thus if a project contains several packages, the process must be done manually for each. EA does provide several configuration points for customizing the XSD generation. More details can be found in the user documentation: http://www.sparxsystems.com.au/resources/xml_schema_generation.html. Of particular interest are the mechanisms to set the namespaces. All customizations can be set using the "tagged values" feature of EA. See the EA documentation for more details on how to create tagged values. The basic steps are to open the tagged values view, click on a package of interest, click the new tag button in the tagged values view, type the appropriate name (see below), and then edit its value. Specifically, the "targetNamespace" should be set to the namespace under which the generated schema will be published. Additionally, the "targetNamespacePrefix" should be set to a meaningful abbreviation (e.g. cabio for the caBIO schema). An important final step in generating the XML Schemas is to correct any xsd:import statements EA generates. The schemaLocation attribute of each should hold an appropriate relative path to the imported schema.
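The final step of correcting xsd:import statements is done by hand in the process described here, but it could also be scripted. The following is a minimal sketch using only the standard Java XML APIs; the class ImportLocationFixer and the idea of supplying a namespace-to-relative-path map are assumptions of this example, not part of EA or caGrid.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper that rewrites schemaLocation on xsd:import elements. */
final class ImportLocationFixer {
    static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

    /**
     * Sets schemaLocation on each xsd:import whose namespace has an entry in
     * relativePaths, and returns the number of imports updated.
     */
    static int fix(Document schema, Map<String, String> relativePaths) {
        NodeList imports = schema.getElementsByTagNameNS(XSD_NS, "import");
        int updated = 0;
        for (int i = 0; i < imports.getLength(); i++) {
            Element imp = (Element) imports.item(i);
            String path = relativePaths.get(imp.getAttribute("namespace"));
            if (path != null) {
                imp.setAttribute("schemaLocation", path);
                updated++;
            }
        }
        return updated;
    }

    /** Parses schema text with namespace awareness enabled. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse schema", e);
        }
    }
}
```

The modified document would still need to be written back to disk (for example with a javax.xml.transform Transformer), which is omitted here for brevity.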
The tagged values described by the EA documentation are an implementation of the "UML Profile for XML Schema." A UML profile has three key items: stereotypes, tagged values (properties), and constraints. A profile provides a definition of these items and explains how they extend the UML in a particular domain, which is XML schema in this case. Each of the configuration points in the profile gives the user control over the generation capabilities. Leveraging the UML profile, EA can generate highly customizable XML Schemas. The details of the profile, and the effect they have, can be found in the EA documentation linked above, and in the following articles:

  1. http://www.xml.com/lpt/a/2001/08/22/uml.html
  2. http://www.xml.com/lpt/a/2001/09/19/uml.html
  3. http://www.xml.com/lpt/a/2001/10/10/uml.html


The following table details some best practices for modifying the default settings of EA.

UML Construct | Tagged Value Name | Value | Example | Notes
Package | targetNamespace | {set according to caGrid Recommendations below} | gme://caBIO.caCORE/3.1/gov.nih.nci.cabio.domain | Used to specify the namespace of the XSD. This is important, as it uniquely identifies the XSD on the grid.
Package | targetNamespacePrefix | {something short and unique within the project} | cabio | Used to specify namespace prefixes in the XSD.
Package | memberName | unqualified |  | This prevents EA from prefixing every element and attribute with the Class name. For example, taxon instead of Chromosome.taxon.
Package | elementFormDefault | qualified |  | Necessary to ensure created elements have the proper namespace.
Package/Association Source | anonymousType | false |  | Ensures that created elements are of the proper xsd:type; forces element references instead of anonymous elements of a particular xsd:type. (This is supposed to be the default in EA, but doesn't appear to be, at least in version 6.1.) You may want to set this to true on some associations. Experiment with this value for your Package and see what works best.
Package/Association Source | anonymousRole | false |  | Ensures that a new element is created for each association, and given the name of the target rolename. This is the default value in EA. You may want to set this to true on some associations (such as when the max cardinality is 1, to prevent the creation of the "wrapper" element). Experiment with this value for your Package and see what works best.

Table 3: EA UML Profile Best Practices

The caGrid standard metadata, detailed in Chapter 3, was modeled in EA and its corresponding XML Schemas were generated using the mechanisms described above. You can view the EA project file for the caGrid metadata to see some of these settings used in practice.

hyperModel

Another tool capable of generating XSDs from XMI, amongst other things, is hyperModel (www.xmlmodeling.com). It leverages the same UML Profile as EA, and can import an XMI file. It also suffers from the same package-level issues as EA. It claims to offer more configuration points than can be easily identified through its user interface. The documentation of hyperModel is fairly lacking. There is a book published on its use and philosophies, so it is likely more detail is provided therein.

caCORE SDK

The caCORE SDK can generate XML Schemas, plain Java Beans, and Castor mapping files that configure serialization and deserialization to and from XML which adheres to those schemas. The XML Schemas can be used stand-alone with default serialization, or the generated plain Java Beans can be used with Castor serialization, following the approach described above in the caCORE SDK Serialization section. The input to this process is an XMI file that follows the SDK guidelines. Consult the caCORE SDK Programmer's Guide for specific details on the appropriate build targets and XMI guidelines.

User Authored Approach

If an existing set of Java Objects that is planned to be used on the grid already has the ability to serialize and deserialize to XML, it is likely desirable to use those capabilities. In this case, unless an XML schema already exists to describe the serialized XML, a schema will probably need to be written by hand to describe the format of the XML. Most XML Schema editing tools can create a schema from an existing XML document. This mechanism could be used to create a starting schema from an existing XML serialization, though care should be taken to review the schema for things like cardinality, enumerations, and data types, as those may vary from document to document.

XML Schema Namespace Conventions

As GME manages schemas by their respective namespaces, a consistent approach to selecting namespaces for data types is necessary. This section details the current recommendation for assigning namespaces to objects based on the information about them in caDSR. This section assumes a working knowledge of caDSR terminology, and the caDSR documentation should be consulted for reference.

Namespace Format

In caDSR, each project (application) will have its own Classification Scheme (e.g. caCORE). A Classification Scheme may define a subproject, which is represented as a Classification Scheme Item (CSI) (e.g. caBIO). For projects creating new XML Schemas, a reasonable approach to modeling is to assign each CSI its own schema. One could certainly create an XML schema for each object in a Classification Scheme, but that seems unnecessary. caGrid 0.5 did not support this; caGrid now handles it using the process for mapping caDSR items to GME items defined in the following section. The current GME implementation also lifts the namespace format restrictions that previously existed.
While not required (as the caDSR now supports XML Namespace annotations of UML Projects), the current recommendation for assigning namespaces for caBIG objects is shown below. For example, gme://caTIES.caBIG/3/edu.upmc.opi.cabig.caties.document.domain (where caTIES.caBIG is the <domain>) or if there is an alias (document) for the CSI: gme://caTIES.caBIG/3/document.

<Classification Scheme>.<Context>/<Classification Scheme Version>/<Classification Scheme Item>
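As an illustration, the recommended format can be assembled with a small helper like the one below. This is a sketch only: the class GmeNamespace and its method are hypothetical, and the gme:// prefix is taken from the examples in this section rather than from the format line itself.

```java
/** Hypothetical helper that assembles a namespace in the recommended format. */
final class GmeNamespace {
    /**
     * Builds gme://<scheme>.<context>/<version>/<item>, where item is either
     * the Classification Scheme Item name or its alias.
     */
    static String build(String scheme, String context, String version, String item) {
        for (String part : new String[] {scheme, context, version, item}) {
            if (part == null || part.trim().isEmpty()) {
                throw new IllegalArgumentException("All namespace parts are required");
            }
        }
        return "gme://" + scheme + "." + context + "/" + version + "/" + item;
    }

    public static void main(String[] args) {
        // Reproduces the caTIES examples given in the text above.
        System.out.println(build("caTIES", "caBIG", "3",
                "edu.upmc.opi.cabig.caties.document.domain"));
        System.out.println(build("caTIES", "caBIG", "3", "document"));
    }
}
```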

Internally, each Object will have an element/complex type defined using its name. For example, if the caTIES model has an object Document, it would serialize to a document similar to that shown below.

<Document xmlns="gme://caTIES.caBIG/3/edu.upmc.opi.cabig.caties.document.domain">
...
</Document>

Use of maxOccurs and minOccurs

Previously, a user needed to pass some optional "Double" values in a service. When a "Double" attribute is defined, it is converted to the type "xs:double". In the Introduce generated framework (Java stub code), this attribute is declared as a primitive Java type (double) instead of java.lang.Double, which prevents the user from detecting null values in the Java code.

The explanation is that when Introduce writes the schema for the service interface, it does not set minOccurs or maxOccurs, as it is intended to use the defaults. For a primitive-valued element to be optional, minOccurs must be set to 0; this tells Axis to map it to Double rather than double. You can still achieve this by creating your own type in a schema that sets minOccurs to 0, importing that schema into Introduce, and using that type.
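The practical difference between the two mappings can be seen in a small bean. This is a hand-written illustration, not Axis-generated stub code; MeasurementBean and its field names are hypothetical.

```java
/** Hypothetical bean contrasting a required and an optional xs:double mapping. */
final class MeasurementBean {
    // Mapped from an element using the default minOccurs: a primitive, never null.
    private double requiredValue;
    // Mapped from an element with minOccurs set to 0: a wrapper, null when absent.
    private Double optionalValue;

    double getRequiredValue() {
        return requiredValue;
    }

    Double getOptionalValue() {
        return optionalValue;
    }

    void setOptionalValue(Double optionalValue) {
        this.optionalValue = optionalValue;
    }

    /** Only the wrapper type can distinguish "absent" from "present and 0.0". */
    boolean hasOptionalValue() {
        return optionalValue != null;
    }

    public static void main(String[] args) {
        MeasurementBean bean = new MeasurementBean();
        System.out.println("required=" + bean.getRequiredValue()
                + ", optional present=" + bean.hasOptionalValue());
    }
}
```

With the primitive field, an unset value and an explicit 0.0 are indistinguishable, which is the problem described above.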

Last edited by
Sarah Honacki (998 days ago) , ...