The caGrid Federated Query Infrastructure provides a mechanism to perform basic distributed aggregations and joins of queries over multiple data services. As caGrid data services all use a uniform query language, CQL, the Federated Query Infrastructure can be used to express queries over any combination of caGrid data services. Federated queries are expressed with a query language, DCQL, which is an extension to CQL to express such concepts as joins, aggregations, and target services. The infrastructure, shown below, is composed of a core engine and grid services which provide access to and management of the use of the core engine.
DCQL, the language used to express federated queries, is an extension to CQL, the language used to express single data service queries. Both CQL and DCQL use a declarative approach to describe the desired data by identifying the nature of the instance data with respect to its containing UML information model. That is, a query can be seen as identifying a class in a UML model, and restricting its instances to those which meet criteria defined over that class's UML attributes and UML associations.
The primary additions to CQL, which DCQL provides, are the introduction of the ability to specify multiple target services (aggregations), and the ability to specify object restrictions through relationships to Objects on remote data services (joins). The other primary difference between the languages is that CQL is context dependent, meaning the language must be interpreted against the service receiving the query, and DCQL itself specifies the context of the queries (by identifying the target services). As such, services accepting DCQL (such as the FQP service), generally don't expose any local data.
As DCQL is modeled as an extension to CQL, it mirrors the general structure wherein a target object Class is identified, and instances are specified by restrictions over its attributes and associated objects. The specifics of each component of a DCQL query are described below.
The root of a DCQL Query is the DCQLQuery object. It contains a collection of data service URLs which identify the services that should be queried for instances of data matching the Objects described by the TargetObject. This mechanism provides the basic support for simple aggregation queries, as each identified service will be queried for the identified data, and the results will be aggregated and returned to the client. DCQL is a recursively defined language, as queries are ultimately descriptions of instance objects, and relationships between associated objects can be described using the same mechanism. That is, just as the target object is described by restrictions of its attributes and associations to other objects, the associated objects are defined using the same syntax. These restrictions are defined below by the Object type, as the DCQLQuery's TargetObject is just an instance of this type.
The Object type is the core component of CQL and therefore of DCQL. First, it identifies the Class of targeted instances. That is, all data instances matching the description will be of the class type identified by the "name" of the Object. When used as the TargetObject of a DCQL query, it describes the return type of the results of the query. In addition to specifying the class of objects, the Object type provides the basic recursive structure of DCQL and CQL queries. That is, an object is described through a restriction over one of its attributes with the Attribute type, an association to another object with the Association type, a relationship to another object on a remote data service with the ForeignAssociation type, or a logical grouping of two or more of those types using the Group type. For example, a Group of several restrictions over attributes and associations can be specified. Each of these types is described below.
The Attribute type is the simplest of restriction types, and the terminator of query recursion (as it allows no children). DCQL simply makes use of the CQL Attribute type; its syntax and semantics are the same. The type allows basic restriction of a single attribute of an object. The restriction is expressed as name, value, and predicate. The name component defines the name of the attribute. The value component defines the expected value of the attribute, with respect to the predicate. The predicate is the operator that should be used to evaluate whether or not a given attribute instance matches the specified value. For example, an Attribute with name="size", value="5", and predicate="LESS_THAN" would restrict the results to contain data instances which had an attribute called "size" with a value of less than 5. The predicate values are generally self-descriptive: "EQUAL_TO", "NOT_EQUAL_TO", "LIKE", "LESS_THAN", "LESS_THAN_EQUAL_TO", "GREATER_THAN", and "GREATER_THAN_EQUAL_TO". Two additional predicates, "IS_NULL" and "IS_NOT_NULL", check only for the absence or presence, respectively, of an attribute value, and do not restrict the value itself. Therefore, any value component will be ignored when using these predicates. "EQUAL_TO" is the default predicate, so an Attribute need not explicitly specify predicate="EQUAL_TO" to define equivalence restrictions.
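The predicate semantics described above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the caGrid implementation: AttributeSketch and its matches method are hypothetical stand-ins, only integer-valued attributes are compared, "LIKE" is left out, and treating an absent value as a non-match for the comparison predicates is an assumption of the sketch.

```java
// Hypothetical stand-in for illustration only; not the caGrid CQL Attribute bean.
class AttributeSketch {
    enum Predicate {
        EQUAL_TO, NOT_EQUAL_TO, LESS_THAN, LESS_THAN_EQUAL_TO,
        GREATER_THAN, GREATER_THAN_EQUAL_TO, IS_NULL, IS_NOT_NULL
    }

    final String name;
    final String value;
    final Predicate predicate;

    AttributeSketch(String name, String value, Predicate predicate) {
        this.name = name;
        this.value = value;
        // EQUAL_TO is the default when no predicate is specified.
        this.predicate = (predicate == null) ? Predicate.EQUAL_TO : predicate;
    }

    // Returns true if the instance's attribute value satisfies this restriction.
    boolean matches(Integer actual) {
        // The two null checks ignore the value component entirely.
        if (predicate == Predicate.IS_NULL) return actual == null;
        if (predicate == Predicate.IS_NOT_NULL) return actual != null;
        if (actual == null) return false; // assumption: absent values never match a comparison
        int cmp = actual.compareTo(Integer.valueOf(value));
        switch (predicate) {
            case EQUAL_TO: return cmp == 0;
            case NOT_EQUAL_TO: return cmp != 0;
            case LESS_THAN: return cmp < 0;
            case LESS_THAN_EQUAL_TO: return cmp <= 0;
            case GREATER_THAN: return cmp > 0;
            default: return cmp >= 0; // GREATER_THAN_EQUAL_TO
        }
    }

    public static void main(String[] args) {
        AttributeSketch sizeBelowFive = new AttributeSketch("size", "5", Predicate.LESS_THAN);
        System.out.println(sizeBelowFive.matches(3));
        System.out.println(sizeBelowFive.matches(7));
    }
}
```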
The Association type is used to describe a relationship, or association, between the containing object and the object identified by the Association type. The type is an extension of the Object type, in that it too describes an object (the associated object). The Association type is always used in the context of a containing Object type. The containing object is the source of the UML association, and the object described by the Association type is the target. Beyond everything it inherits from the Object type, the Association type introduces an additional attribute called roleName. The roleName attribute is optional, and can be used to name the role the target object plays in the UML association. The roleName can be omitted if the UML information model only describes a single association between the source and target Classes. If more than one association between the two classes is present in the model, then the roleName must be used to disambiguate the relationship. The query is considered invalid if the roleName is omitted and multiple associations between the two classes exist in the model.
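The roleName rule can be made concrete with a small sketch. The code below is illustrative only: RoleNameSketch, ModelAssociation, and the sample model (a Person with two associations to Address and one to Study) are invented for this example and do not come from caGrid. It assumes the information model can be enumerated as a flat list of associations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a UML model lookup; not part of the caGrid API.
class RoleNameSketch {
    // One association in the model: source class, target class, target role name.
    static final class ModelAssociation {
        final String source;
        final String target;
        final String roleName;

        ModelAssociation(String source, String target, String roleName) {
            this.source = source;
            this.target = target;
            this.roleName = roleName;
        }
    }

    // Invented sample model used only for demonstration.
    static List<ModelAssociation> sampleModel() {
        return Arrays.asList(
            new ModelAssociation("Person", "Address", "homeAddress"),
            new ModelAssociation("Person", "Address", "workAddress"),
            new ModelAssociation("Person", "Study", "study"));
    }

    // Resolves the role an Association restriction refers to, or rejects the query.
    static String resolveRole(List<ModelAssociation> model, String source,
                              String target, String roleName) {
        List<String> roles = new ArrayList<String>();
        for (ModelAssociation a : model) {
            if (a.source.equals(source) && a.target.equals(target)) {
                roles.add(a.roleName);
            }
        }
        if (roles.isEmpty()) {
            throw new IllegalArgumentException(
                "Invalid query: no association from " + source + " to " + target);
        }
        if (roleName != null) {
            if (!roles.contains(roleName)) {
                throw new IllegalArgumentException("Invalid query: unknown role " + roleName);
            }
            return roleName;
        }
        // roleName omitted: only valid when exactly one association exists.
        if (roles.size() > 1) {
            throw new IllegalArgumentException(
                "Invalid query: roleName omitted but " + roles.size()
                + " associations exist from " + source + " to " + target);
        }
        return roles.get(0);
    }

    public static void main(String[] args) {
        System.out.println(resolveRole(sampleModel(), "Person", "Study", null));
        System.out.println(resolveRole(sampleModel(), "Person", "Address", "workAddress"));
    }
}
```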
The primary distinction between DCQL and CQL is the addition of the ability to identify data of interest through relationships with data on remote services. The ForeignAssociation type provides this means. Similar to the Association type, this type is only used within the context of a containing Object type, and contains a description of another Object type; in this case the ForeignObject. The type describes a relationship between the containing object and an object on another data service. The objects in the other data service are identified by both the targetServiceURL attribute of the ForeignAssociation, which identifies the remote data service, and the ForeignObject, which is just an Object type that defines the desired instances just as "local" queries do. Conceptually, the ForeignAssociation results in a new CQL query being sent to the data service running at targetServiceURL with the Target of the query being the ForeignObject. This again shows how DCQL just makes use of minor extensions to CQL to express the notions of aggregation and joins. In addition to the notion of describing a query to a remote service, the way in which the "results" should relate to the containing object must also be defined. For this purpose, DCQL uses a JoinCondition type, which is described below. In this way, the ForeignAssociation type describes a query to a remote data service, and defines that the containing objects which should be kept are those that meet the requirements of the JoinCondition when compared against the results of the remote query. In terms of database query languages, the ForeignAssociation type is roughly equivalent to an SQL subselect.
Used only as a part of a ForeignAssociation, the JoinCondition type specifies the desired relationship between the ForeignObject and the containing Object type. It currently supports only single simple attribute comparisons. The type is similar to the Attribute type, in that a predicate is used to identify the relationship between the entities, but differs slightly in that rather than a value being specified as one entity, both entities are attribute names. That is, while an Attribute type compares an attribute of its containing Object to a specified value, the JoinCondition type compares an attribute of the containing Object to an attribute of another Object (that which is identified by the ForeignAssociation's ForeignObject). The syntax of the JoinCondition is such that the containing Object's attribute is named by the required localAttributeName attribute, the ForeignObject's attribute is named by the required foreignAttributeName attribute, and the predicate is named by the optional predicate attribute. Just as with the Attribute type, "EQUAL_TO" is the default value when the predicate is omitted. The set of allowable predicate values is the same as for the Attribute type, with the exception of the "IS_NULL" and "IS_NOT_NULL" values being disallowed, as they are only applicable against a single attribute.
The Group type provides the capability to express two or more constraints. Whenever a single constraint (Object, Association, ForeignAssociation, or Attribute) needs to be combined with one or more additional constraints, a Group must be used to express their relationship with each other. This relationship is described by the logicalRelation attribute of the Group. The logicalRelation attribute can assume the value of either "OR" or "AND". If the value is "AND", all of the contained constraints must be met for the Group constraint to be met. If the value is "OR", at least one of the contained constraints must be met for the Group constraint to be met. In addition to grouping other constraints, the Group type can also contain nested Groups. This simple construct allows for arbitrarily complex constraints to be modeled.
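As a rough illustration of how nested Groups compose, the following sketch models each constraint as a java.util.function.Predicate and a Group as a combinator over two or more of them. GroupSketch and its group method are hypothetical names chosen for this example; the real CQL Group is a data bean that is evaluated by the data service, not a Java predicate.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration of Group semantics; not the caGrid Group bean.
class GroupSketch {
    enum LogicalRelation { AND, OR }

    // Combines two or more constraints into a single constraint.
    static <T> Predicate<T> group(LogicalRelation relation, List<Predicate<T>> constraints) {
        if (constraints.size() < 2) {
            throw new IllegalArgumentException("a Group needs two or more constraints");
        }
        if (relation == LogicalRelation.AND) {
            // AND: every contained constraint must be met.
            return candidate -> constraints.stream().allMatch(c -> c.test(candidate));
        }
        // OR: at least one contained constraint must be met.
        return candidate -> constraints.stream().anyMatch(c -> c.test(candidate));
    }

    public static void main(String[] args) {
        Predicate<Integer> positive = n -> n > 0;
        Predicate<Integer> small = n -> n < 10;
        Predicate<Integer> even = n -> n % 2 == 0;

        Predicate<Integer> both = group(LogicalRelation.AND, Arrays.asList(positive, small));
        // A Group may itself contain a nested Group.
        Predicate<Integer> nested = group(LogicalRelation.OR, Arrays.asList(both, even));

        System.out.println(both.test(5));
        System.out.println(nested.test(12));
        System.out.println(nested.test(11));
    }
}
```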
An example DCQL query, represented in XML, is shown below. In this fictitious example, a PersonRegistry Data Service is joined with a StudyRegistry Data Service. The query specifies that Persons in the PersonRegistry should be returned which have an "ssn" equal to a Participant's "patientSSN", where the Participant has an "age" greater than 18. The specification of the target service can be seen on line 18 in the example (in this case only one service is targeted, though many could have been listed). Additionally, the "join" is specified starting on line 6, wherein the second target service is identified, and the join condition is defined. The join condition creates a link between the containing Object (in this case, Person), and an Object (in this case Participant, as defined on line 10) in the second target service. The condition specifies a predicate to be evaluated against an attribute in each of the two linked Objects (in this case Person.ssn and Participant.patientSSN). It is worth noting that as DCQL is a recursive language, the ForeignObject defined on line 10 could have also specified a join to a third Data Service, or other more complex criteria.
The Federated Query Engine is a simple but powerful design. The main functionality of the engine is to process a DCQL query by converting it into regular CQL queries to the targeted data services, appropriately aggregating results. As such, all of the actual "joining" of data is offloaded to the remote data services. This allows the engine to be reused as a client API, as no databases or complex service infrastructure are needed; it's simply a client-side querying tool. The trade-off of this simplicity is efficiency, as more optimized approaches to federated joins exist. Future versions may adopt a more optimized, but less portable, approach.
The engine requires no special support from data services. Each service which is contacted to satisfy the distributed query is only sent one or more standard, but potentially complex, CQL queries. It is possible to construct a DCQL query which is essentially a standard CQL query, with the addition of specifying one or more target data services. In this case, the engine simply "forwards" that query on to the targeted services, and aggregates their results. In fact, this is always the final step of processing a DCQL query. The primary addition DCQL makes to CQL is the capability to express relationships between objects on multiple data services. The processing of these relationships yields "value conditions" against the source data service, which are expressed in standard CQL, and ultimately sent to the target data service. Take, for example, a query that asked for all the Genes in service A that shared a symbol with Genes in service B which had an id greater than 50. In this case, first service B would be asked for all the symbols of the Genes which had an id greater than 50. Then service A would be asked for all the Genes which had a symbol that matched one from the list returned from service B. These results would then be returned as the results of the distributed query. Note the final step is just a standard "value condition" CQL query sent to a data service. While the queries can become quite complex, this basic process of converting cross-data service relationships into single data service value comparisons, and appropriately aggregating results, is the core functionality of the federated query engine.
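The two-step Gene example can be simulated with in-memory lists standing in for the two data services. Everything in this sketch is invented for illustration (the GeneJoinSketch class, the Gene type, and the sample data); a real engine would send CQL queries over the network rather than filter local collections.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// In-memory simulation of the two-step join; the "services" are plain lists.
class GeneJoinSketch {
    static final class Gene {
        final int id;
        final String symbol;

        Gene(int id, String symbol) {
            this.id = id;
            this.symbol = symbol;
        }
    }

    // Invented sample data for service A and service B.
    static List<Gene> sampleServiceA() {
        return Arrays.asList(new Gene(1, "BRCA1"), new Gene(2, "TP53"), new Gene(3, "EGFR"));
    }

    static List<Gene> sampleServiceB() {
        return Arrays.asList(new Gene(40, "BRCA1"), new Gene(60, "TP53"),
                             new Gene(75, "KRAS"), new Gene(90, "TP53"));
    }

    // Step 1: ask service B for the distinct symbols of Genes with id > 50.
    static Set<String> symbolsFromServiceB(List<Gene> serviceB) {
        return serviceB.stream()
            .filter(g -> g.id > 50)
            .map(g -> g.symbol)
            .collect(Collectors.toCollection(TreeSet::new));
    }

    // Step 2: ask service A for Genes whose symbol matches one of those values.
    static List<Gene> genesFromServiceA(List<Gene> serviceA, Set<String> symbols) {
        return serviceA.stream()
            .filter(g -> symbols.contains(g.symbol))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> symbols = symbolsFromServiceB(sampleServiceB());
        for (Gene g : genesFromServiceA(sampleServiceA(), symbols)) {
            System.out.println(g.id + " " + g.symbol);
        }
    }
}
```

Only the distinct symbols cross from service B to the engine, which is why restricting the foreign side of a join keeps the second query small.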
The engine is composed of three core components, shown below. The first is the FederatedQueryEngine itself, which is the client-facing entry point to using the engine. The FederatedQueryEngine's main responsibility is accepting the DCQL query, orchestrating the use of the other two components, and returning the appropriate results to the client. The second component is the FederatedQueryProcessor, which performs the bulk of the work of the engine in that it is responsible for traversing the query, executing the logic of the engine, and processing intermediate aggregations. Both of the first two components make use of the third component, the DataServiceQueryExecutor, to invoke standard CQL queries on remote data services. Each of these components is described in greater detail in the following section.
The FederatedQueryEngine is the client-facing entry point to the engine. Its sole responsibility is to use the FederatedQueryProcessor to convert the specified DCQL query into a CQL query, and then query each specified target data service with that query using the DataServiceQueryExecutor. The results are then appropriately aggregated according to the method that was invoked.
It provides two methods which accept DCQL queries, and return the results. Each of the two methods provides a different variant on how results are represented. The first method is the executeAndAggregateResults method, which returns the standard CQLQueryResults (the same result type returned by data services' query method). Each CQLResult obtained from each targeted data service is merged into an aggregate list, and a master CQLQueryResults object is constructed which contains them all. The information about which result came from which data service is lost in this scenario, but this provides the ability to reuse existing data service tooling and APIs when that information is not relevant.
In cases where it is important to know from which data service a given result came, the second query method called execute can be used. This method returns a new type called DCQLQueryResultsCollection. The DCQLQueryResultsCollection contains a list of DCQLResult, wherein each DCQLResult specifies a CQLQueryResults object, and the data service URL from which it came. That is, the result type is a collection of tuples containing the standard data service results, and that service's URL.
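The difference between the two result shapes can be sketched with plain collections. In this illustration a map from service URL to a list of results stands in for DCQLQueryResultsCollection, and a flat list stands in for the aggregated CQLQueryResults; ResultShapesSketch, the example.org URLs, and the string results are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the two result shapes; not the caGrid result types.
class ResultShapesSketch {
    // Shape of execute: each service URL is kept alongside its own results.
    static Map<String, List<String>> perServiceResults() {
        Map<String, List<String>> results = new LinkedHashMap<String, List<String>>();
        results.put("https://a.example.org/DataService", Arrays.asList("Person:1", "Person:2"));
        results.put("https://b.example.org/DataService", Arrays.asList("Person:9"));
        return results;
    }

    // Shape of executeAndAggregateResults: one merged list, source URL discarded.
    static List<String> aggregate(Map<String, List<String>> perService) {
        List<String> merged = new ArrayList<String>();
        for (List<String> serviceResults : perService.values()) {
            merged.addAll(serviceResults);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<String>> perService = perServiceResults();
        System.out.println(perService.keySet());
        System.out.println(aggregate(perService));
    }
}
```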
Both query methods will throw a RemoteDataServiceException in the event a queried data service returns invalid results (such as the wrong target class type), or if a data service itself throws an exception when being queried, or if there is any problem querying the data service. Additionally, a FederatedQueryProcessingException, which is the parent class of RemoteDataServiceException, may be thrown if there is a problem processing the query itself.
The Federated Query Engine supports a callback framework by which developers may be notified of various query processing events. This works much as it does in Swing GUIs: An instance of the FQPProcessingStatusListener interface is implemented and passed to the addStatusListener method of the engine. Listeners can be removed via the removeStatusListener and enumerated by way of the getStatusListeners method. The Federated Query Engine provides various protected visibility methods which may be called to send event notifications to all listeners. This mechanism is used to provide support for the Federated Query Processing Status resource property, which clients to the grid service may use in conjunction with WS-Notification to know when a long-running query has completed without implementing a busy-wait loop.
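The listener mechanism follows the familiar observer pattern, which the sketch below illustrates. The method names addStatusListener, removeStatusListener, and getStatusListeners come from the text above; everything else is an assumption made for the example, including the StatusListener interface with its single statusChanged method, the fireStatusChanged helper, and the status strings. The actual FQPProcessingStatusListener interface is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Observer-pattern sketch; the listener interface and events are hypothetical.
class StatusListenerSketch {
    // Stand-in for FQPProcessingStatusListener (its real methods are not shown in this guide).
    interface StatusListener {
        void statusChanged(String status);
    }

    static class Engine {
        private final List<StatusListener> listeners = new CopyOnWriteArrayList<StatusListener>();

        void addStatusListener(StatusListener listener) {
            listeners.add(listener);
        }

        void removeStatusListener(StatusListener listener) {
            listeners.remove(listener);
        }

        List<StatusListener> getStatusListeners() {
            return new ArrayList<StatusListener>(listeners);
        }

        // Protected notification helper, called by the engine as processing progresses.
        protected void fireStatusChanged(String status) {
            for (StatusListener listener : listeners) {
                listener.statusChanged(status);
            }
        }

        // Pretends to run a query, emitting two invented status events.
        void execute() {
            fireStatusChanged("PROCESSING");
            fireStatusChanged("COMPLETE");
        }
    }

    public static void main(String[] args) {
        Engine engine = new Engine();
        engine.addStatusListener(status -> System.out.println("status: " + status));
        engine.execute();
    }
}
```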
The heart of the engine is the FederatedQueryProcessor. It contains all the logic to convert a DCQL query into a CQL query. The basic process flow is to create a new CQL query, and recursively walk the DCQL query, depth first, until a terminating leaf is reached. These query leaves are either Attribute types, or Association or Object types with no children (class-only criteria). Such criteria are converted to their CQL counterparts, and added to the CQL query as the recursion unrolls. The main special processing, described below, occurs when a ForeignAssociation type is encountered. The processing of a ForeignAssociation yields a CQL Group type, which is added to the converted type previously containing the ForeignAssociation. The end result is a standard CQL query, which is returned to the caller.
ForeignAssociations are processed by converting the foreign relationship to a standard CQL query to the specified data service, and converting the results of that query into attribute value comparisons (Attribute type) against the containing Object type. The ForeignObject, a DCQL Object type, in the ForeignAssociation is converted to a standard CQL Object just as the main DCQL Object is. That is, the same depth-first recursion logic is used, and the Object may very well contain a nested ForeignAssociation. Once the conversion of the ForeignObject is complete, the resulting CQL Object is used as the target in a new CQL Query. The processor then applies a CQL QueryModifier to the CQL query which specifies that only distinct attribute values should be retrieved from the data service. The attribute requested is the one named in the JoinCondition as the foreignAttributeName. This optimization ensures that only the information necessary to execute the value join is transmitted from the remote data service. It also relieves the processor of the need to understand the structure of the resulting objects, as attribute-only CQL queries always yield a common name-value result format. Given the results of the query, a CQL Group is constructed which represents the value comparisons that are necessary to join these results with the containing object. If the remote query returned no results, then this criterion would never yield any results (as the client requested a join against an empty set). In order to maintain the semantics of the query, this criterion must yield no results, but cannot be simply omitted (as it would be inappropriately inclusive), nor should the query be aborted, as other criteria may yield results. The most appropriate replacement for such a query would be a "false" criterion, but CQL has no such construct.
In order to replicate this behavior, a Group is constructed which will always be evaluated as false as it contains two mutually exclusive criteria ANDed together. Namely, two Attribute types using the localAttributeName as the name and "IS_NULL" and "IS_NOT_NULL" respectively as the predicates. If results are actually returned from the remote data service query, each is converted into an Attribute type with localAttributeName as the name, a query result value as the value, and the JoinCondition's predicate as the predicate. In this way the cross-data service joins are executed by first extracting the join values from the remote service, and then converting them into local value queries to the "local" data service.
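The construction of the replacement criteria can be sketched as follows. This is a hedged illustration, not the processor's code: JoinGroupSketch is a hypothetical name, the Group is rendered as text rather than built from CQL beans, and two details are assumptions of the sketch because the text does not state them: the per-value Attributes are combined with "OR", and a single returned value is emitted as a lone Attribute rather than a one-element Group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Renders the criteria that replace a ForeignAssociation; names are hypothetical.
class JoinGroupSketch {
    static String buildJoinCriteria(String localAttributeName, String predicate,
                                    List<String> remoteValues) {
        List<String> criteria = new ArrayList<String>();
        String relation;
        if (remoteValues.isEmpty()) {
            // No remote results: emit an always-false Group, since CQL has no "false" literal.
            relation = "AND";
            criteria.add(localAttributeName + " IS_NULL");
            criteria.add(localAttributeName + " IS_NOT_NULL");
        } else {
            // One Attribute per remote value, using the JoinCondition's predicate.
            relation = "OR"; // assumption: any matching value satisfies the join
            for (String value : remoteValues) {
                criteria.add(localAttributeName + " " + predicate + " '" + value + "'");
            }
        }
        if (criteria.size() == 1) {
            return criteria.get(0); // assumption: a Group needs two or more members
        }
        return relation + "(" + String.join(", ", criteria) + ")";
    }

    public static void main(String[] args) {
        System.out.println(buildJoinCriteria("ssn", "EQUAL_TO", Arrays.asList("111", "222")));
        System.out.println(buildJoinCriteria("ssn", "EQUAL_TO", Collections.<String>emptyList()));
    }
}
```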
The net result of this processing is that the DCQL query is converted into a standard CQL query, wherein all cross-data service joins are evaluated by converting them to local attribute value restrictions. Notice that this processing flow dictates that the joins specified in nested ForeignAssociations are processed by the immediately containing data service. This is also true for joins executed across multiple data services (using a Group of ForeignAssociations). As such, the amount of processing required by the outermost data service does not increase with the complexity of the query, but only with the number of results returned by the outermost ForeignAssociation. Clients creating federated joins using DCQL, who are interested in performance, should therefore take care to avoid, when possible, unrestricted joins, that is, joins which don't restrict the return set of the join predicate. For example, joining Gene symbols across data services without attaching criteria to the ForeignObject would require one data service to execute a query with all Gene symbols from the other data service listed as possible values.
The DataServiceQueryExecutor is a simple wrapper around the standard data service client that ensures uniform exception handling and logging while communicating with remote data services.
The federated query service infrastructure provides remote access to the federated query engine through the use of two grid services, shown below. The first, the Federated Query Processor service, is the "entry point" service, and provides the means to execute DCQL queries. Queries can either be executed by a blocking call, and the results returned directly in the response to the query request, or they can be executed asynchronously. If a query is executed asynchronously, the service returns an End Point Reference (EPR, basically a service address), to the second service, the Federated Query Results service. The results service provides information about the current status of the query processing, and provides access to the results once the query processing is complete. These services are implemented using a standard WSRF factory pattern. That is, clients may access the Federated Query Processor service directly, and are given a resource-specified EPR with which the Federated Query Results service can be contacted. The services must be deployed together, in the same container, and can operate in both secure and insecure deployment scenarios. These services were developed using the Introduce grid service authoring tool, and can be viewed or edited using it.
The Federated Query Processor Service is the grid service which allows clients to execute DCQL queries and generate results using a remote service rather than a locally running Federated Query Engine API. It leverages the same APIs as a client would for local invocation.
The Federated Query Processor service is the main service interface to the federated query engine. It provides four query execution operations. The first two are execute, which takes a DCQL query and returns a DCQLQueryResultsCollection, and executeAndAggregateResults, which returns a CQLQueryResults. These are both simple grid service wrappers for the corresponding methods in the FederatedQueryEngine API. The third operation, executeAsynchronously, provides asynchronous, non-blocking access to the execute method, and returns a FederatedQueryResultsReference. It is also a convenience method which passes default values to the fourth and most powerful query method, query. The query method also executes asynchronously and returns a FederatedQueryResultsReference, but it allows the client to specify a delegated credential for use in queries, and the behavior of the service in the event of various failures while executing the DCQL query.
The FederatedQueryResultsReference is a typed container for an EPR to the Federated Query Results service. The Federated Query Results service client API can be used to subsequently retrieve the DCQLQueryResultsCollection directly, or via one of the other results transport mechanisms (WS-Enumeration or Transfer).
In order to process the asynchronous query requests (via the executeAsynchronously or query operations), the service employs a worker thread pool. A thread pool from the Java 5 concurrent package is leveraged for this purpose. Its initial and maximum sizes are configured via a service property. This provides a mechanism to manage the amount of system resources consumed by the service for the purposes of scheduling background tasks. Each task concurrently scheduled beyond the maximum number of worker threads is placed in a queue for processing once a currently executing task completes, and a thread becomes available. Using this framework, each asynchronous query is modeled as a Runnable instance which is passed to the thread pool for scheduling and eventual execution. Once processing of the DCQL query is complete and only the final CQL query remains, the same thread pool is used by the Federated Query Engine to execute it in parallel against all target data services in the DCQL query.
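The queuing behavior of a fixed-size worker pool can be demonstrated with the standard java.util.concurrent classes. The sketch below is not the service's code: QueryPoolSketch is a hypothetical name, and each "query" is simulated by a short sleep. It submits more tasks than there are worker threads and records the peak number running at once.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Demonstrates a fixed-size worker pool with queuing; the "queries" are simulated.
class QueryPoolSketch {
    // Returns { number of completed tasks, peak number of tasks running at once }.
    static int[] runQueries(int threadPoolSize, int queryCount) {
        ExecutorService pool = Executors.newFixedThreadPool(threadPoolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        AtomicInteger completed = new AtomicInteger();

        for (int i = 0; i < queryCount; i++) {
            // Each asynchronous query is modeled as a Runnable handed to the pool.
            pool.execute(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(20); // stand-in for executing a DCQL query
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                running.decrementAndGet();
                completed.incrementAndGet();
            });
        }

        // Tasks beyond the pool size wait in the queue until a worker is free.
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("queries did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] { completed.get(), peak.get() };
    }

    public static void main(String[] args) {
        int[] stats = runQueries(2, 5);
        System.out.println("completed=" + stats[0] + " peakConcurrent=" + stats[1]);
    }
}
```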
When a new Federated Query Results Resource is created, it is passed the ExecutorService which will handle all subsequent query related work. The resource then creates the runnable instance which is passed in to the executor, and is used to represent the status of processing and the results, and the DCQL query which needs to be executed. Once given priority to process, the worker uses the Federated Query Engine to execute the query, and stores the results in the resource. Any exceptions that occur during processing are also captured and stored in the resource and in a special execution status resource property (discussed later). After initializing a Results Resource, the service constructs an EPR to the resource and returns it to the caller.
The service makes a few aspects of its implementation configurable through deploy-time options. This feature, provided by Introduce, allows the service to access configuration options through the ServiceConfiguration class, where the values can be set by deployers of the service. The values can be set in the Introduce GUI when deploying the service using Introduce, or as Ant properties when deploying using Ant from the command line; the defaults are set in the service.properties file. One configurable aspect is the number of threads used by the thread pool. This option, named threadPoolSize, defaults to 10. Another option, initialResultLeaseInMinutes, defaults to 30, and defines the number of minutes an asynchronous query result resource will live, unless renewed, before it is discarded. The specifics of how result lifetime is managed are described below in the description of the results service.
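The two options and their defaults can be illustrated with java.util.Properties. This sketch only shows the fallback-to-default behavior; ServiceConfigSketch is a hypothetical class, not the Introduce-generated ServiceConfiguration, and reading the values from an in-memory string rather than service.properties is a simplification.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical reader for the two options; not the generated ServiceConfiguration class.
class ServiceConfigSketch {
    final int threadPoolSize;
    final int initialResultLeaseInMinutes;

    ServiceConfigSketch(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalArgumentException("unreadable properties", e);
        }
        // Fall back to the documented defaults when an option is not set.
        threadPoolSize = Integer.parseInt(
            props.getProperty("threadPoolSize", "10").trim());
        initialResultLeaseInMinutes = Integer.parseInt(
            props.getProperty("initialResultLeaseInMinutes", "30").trim());
    }

    public static void main(String[] args) {
        ServiceConfigSketch config = new ServiceConfigSketch("threadPoolSize=4\n");
        System.out.println(config.threadPoolSize);
        System.out.println(config.initialResultLeaseInMinutes);
    }
}
```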
The service operations signal errors through the use of service faults. A base fault, FederatedQueryProcessingFault, acts as the main service-specific fault which clients can handle. It extends from BaseFault and represents a problem satisfying the request. All engine exceptions are wrapped with this fault. Additionally, an InternalErrorFault, which extends from FederatedQueryProcessingFault, can be thrown if there is some internal problem with the service. For example, an unrecoverable configuration issue or unexpected logic error in the service would yield such a fault. It is not expected the client could recover from such an issue. A ProcessingNotCompleteFault will be thrown if a client attempts to retrieve DCQL query results before all query execution operations have completed.
The service contains a single WSRF Resource, which exposes service-level metadata as resource properties. The only resource property currently exposed by the service is the standard caGrid ServiceMetadata, which describes the datatypes and operations of the service.
The Federated Query Results service is the service responsible for providing access to query results and processing status for asynchronously executed queries. The service can only be contacted with a resource-qualified EPR, provided by the Federated Query Processor service. As described above, whenever the query processor service is requested to execute an asynchronous query, a Federated Query Results Resource is created and an EPR, which identifies that resource in the results service, is returned. The Federated Query Results service's only purpose is to expose information about, and management of, the Federated Query Results Resource instances. As such, the only business logic in the service is to call through to the resources. The Globus toolkit provides all the necessary logic for the service to operate in the context of the appropriate resource.
The Federated Query Results Resource is a WSRF Resource, and its instances act as the managers of the state of asynchronous queries. It is exposed to clients through the results service. It contains the current status of the query it corresponds to, any exceptions which occurred during processing, and eventually the results of the query. Also contained in the resource, but not visible to clients, are the Executor Service used to perform asynchronous work, and the client's Delegated Credential Reference, if one was supplied. It supports standard WSRF Resource Lifetime behavior. As such, it exposes, as Resource Properties, the current time (as believed by the local system), and the termination time of the resource. Once created, the resource will be terminated/destroyed by the service once its termination time has passed. This lifetime is initially controlled by the service property setting in the processor service. The client can also immediately destroy the resource with the Destroy operation, or change its termination time with the SetTerminationTime operation. Both of these operations are standardized operations for resources supporting Resource Lifetime and their implementations are plugged into the service via the provider framework, with the DestroyProvider and SetTerminationTimeProvider.
In addition to the operations and resource properties necessary to support Resource Lifetime on the resource, the service also provides the getResults and isProcessingComplete operations. The isProcessingComplete operation returns a simple Boolean value, indicating whether or not the query processing has completed. Once the query processing has completed, the results can be accessed via the getResults operation, which returns a DCQLQueryResultsCollection. If the operation is invoked prior to the processing being complete, a ProcessingNotCompleteFault will be thrown. If the processing is complete, but an exception occurred, a FederatedQueryProcessingFault will be thrown, and its cause will be the exception that occurred during query processing.
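The three possible outcomes of getResults can be sketched with a small in-memory stand-in for the results resource. The operation and fault names below mirror those in the text, but the classes are local, hypothetical stand-ins: the real faults are service faults rather than RuntimeException subclasses, the results are a DCQLQueryResultsCollection rather than a list of strings, and the finish and fail methods exist only to drive the example.

```java
import java.util.Arrays;
import java.util.List;

// In-memory stand-in for the results resource; all types here are hypothetical.
class ResultsResourceSketch {
    static class ProcessingNotCompleteFault extends RuntimeException {
        ProcessingNotCompleteFault() {
            super("query processing has not completed");
        }
    }

    static class FederatedQueryProcessingFault extends RuntimeException {
        FederatedQueryProcessingFault(Throwable cause) {
            super("query processing failed", cause);
        }
    }

    private boolean complete;
    private List<String> results;
    private Exception failure;

    // Called by the worker when the query succeeds.
    synchronized void finish(List<String> queryResults) {
        results = queryResults;
        complete = true;
    }

    // Called by the worker when the query throws.
    synchronized void fail(Exception cause) {
        failure = cause;
        complete = true;
    }

    synchronized boolean isProcessingComplete() {
        return complete;
    }

    synchronized List<String> getResults() {
        if (!complete) {
            throw new ProcessingNotCompleteFault();
        }
        if (failure != null) {
            throw new FederatedQueryProcessingFault(failure);
        }
        return results;
    }

    public static void main(String[] args) {
        ResultsResourceSketch resource = new ResultsResourceSketch();
        System.out.println(resource.isProcessingComplete());
        resource.finish(Arrays.asList("Person:1"));
        System.out.println(resource.getResults());
    }
}
```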
The FQP Results Resource publishes a special resource property which contains the current status of Federated Query Processing, which is an instance of the FederatedQueryExecutionStatus bean. It is updated via a custom implementation of the FQPProcessingStatusListener interface which in turn edits and stores the resource property. The results resource creates this instance when query processing is initiated, and passes it to the query execution task, which in turn passes it along to the Federated Query Engine's addStatusListener() method.
The Federated Query Processor service supports two deployment scenarios. The first is an insecure deployment, wherein no security (authentication or authorization) is enforced by the container. In this scenario, no encryption is used and no protection of query results is enforced. That is, anonymous communication is used over an open channel, and it is possible for one client to manipulate the query resources of another, given it knows the EPR.
The second scenario is when the services are deployed securely, such as with transport level security (https). In this scenario, no authorization is performed by the processor service, but any resources created via asynchronous queries are created with resource-level security descriptors that enforce that only the creator of the resource can call any operations on it. To accomplish this, the processor service which creates the resource applies Grid Map authorization on the resource, and adds the creator's identity as the sole entry in the Grid Map.
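The effect of the two deployment scenarios on resource access can be summarized in a short sketch. It models only the decision logic: GridMapSketch, its ResultsResource type, and the identity strings are invented for illustration, and the real enforcement is performed by the container's security descriptors rather than by application code like this.

```java
import java.util.Collections;
import java.util.Set;

// Models the access decision only; real enforcement is done by the container.
class GridMapSketch {
    static final class ResultsResource {
        // A null grid map means no resource-level security (insecure deployment).
        private final Set<String> gridMap;

        private ResultsResource(Set<String> gridMap) {
            this.gridMap = gridMap;
        }

        boolean mayInvoke(String callerIdentity) {
            return gridMap == null || gridMap.contains(callerIdentity);
        }
    }

    // In a secure deployment the creator's identity is the sole grid map entry.
    static ResultsResource create(String creatorIdentity, boolean secureDeployment) {
        if (!secureDeployment) {
            return new ResultsResource(null);
        }
        return new ResultsResource(Collections.singleton(creatorIdentity));
    }

    public static void main(String[] args) {
        ResultsResource secure = create("/O=Example/CN=alice", true);
        System.out.println(secure.mayInvoke("/O=Example/CN=alice"));
        System.out.println(secure.mayInvoke("/O=Example/CN=bob"));
    }
}
```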
The Federated Query Processor service is Introduce-created, and as such, the security settings can be modified in Introduce to support additional security enforcement. The only security enforced in the actual source code is the programmatic application of the Grid Map authorization to the resources.
The Federated Query Processor service allows clients to make use of delegated credentials when calling the query method. When this is done, all remote data services involved in the DCQL query are accessed using the delegated credential.