Difference between revisions of "Search Planning and Execution Specification"

From Gcube Wiki
Jump to: navigation, search
(Overview)
 
(25 intermediate revisions by 2 users not shown)
Line 8: Line 8:
 
the IR framework to scale in the number of Data Sources that are integrated in an infrastructure.
 
the IR framework to scale in the number of Data Sources that are integrated in an infrastructure.
  
Search Planner produces an plan that combines Data Sources and a selection from various Search Operators.  
+
Search Planner produces an plan that combines Data Sources and a selection from various Search Operators. A distributed execution environment ensures the efficient execution of the plan. Information travels from Data Sources to Search Operators and, in turn, to the Search clients through a pipelining data transferring mechanism that provides low latency, large throughput and a flow control facility.  
 
+
+
The Data Sources Subsystem constitutes the framework we provide in order to integrate heterogeneous data from different providers in our Information Retrieval(IR) process. Using an
+
Indexing Layer and the [http://www.opensearch.org  OpenSearch] standard, Data Sources framework provides fast access and direct connection to the information hosted in the
+
heterogeneous environment.
+
  
 
=== Key features ===
 
=== Key features ===
  
;Unification of heterogenous Data and different IR capabilities
+
; On-the-fly Integration of New Data Sources
: Using the [http://www.loc.gov/standards/sru/specs/cql.html CQL] standard, different gCube IR providers that host data with diverse representations and semantics, can be involved the overall IR process.
+
: CQL-compliant Sources that publish their capabilities, are dynamically involved in the IR process.
  
;Indexing Layer for advanced IR functionality
+
; Involves the minimum number of Data Sources
: Full-text retrieval, Multidimensional Range queries and Spatiotemporal search functionality
+
: Detection of all the Sources that contribute to the result of a query.
  
;Access to the information hosted by external Providers
+
; Scalability in the number of Sources integrated in an Infrastructure
: External providers can provide their results during the IR process through the [http://www.opensearch.org OpenSearch] standard.
+
: Planning and Execution components designed to scale.
 +
 
 +
; Dynamic Integration of New Search Operators
 +
: Operators with new functionality can be dynamically integrated in the IR process.
 +
 
 +
; Pipelining Mechanism
 +
: Offers flow control, low latency and high throughput.
  
 
== Design ==
 
== Design ==
  
 
=== Philosophy ===
 
=== Philosophy ===
The Data Sources framework is implemented in order to:
+
Search Planning and Execution components are designed in order to:
  
* simplify the integration of different IR providers in the gCube IR framework, using the appropriate standards.
+
* allow the efficient and flexible integration of new Sources and Operators.
* provide Replication and High Availability through a distributed architecture.
+
* exploit the IR capabilities and functionality of various information providers.
* exploit the information and IR capabilities of external providers.
+
* scale in environments with a large number of heterogeneous Sources.
 +
* decouple and eliminate the dependencies among the Planner, the Execution environment and the information providers.
  
 
=== Architecture ===
 
=== Architecture ===
The Data Sources framework is composed by the Index and OpenSearch Systems. The architecture is depicted in the following figure:
+
The main components of the Search system are the Planner, the Search Operators and the distributed Execution engine. The architecture is shown in the following figure:
  
The Index System is designed using a distributed architecture that involves three entities:
+
[[Image:Search_system_architecture.jpg|thumb|center|1000px|Search System Architecture]]
  
* Updater: An Updater instance enables the on-the-fly update on an Index partition. It applies the preprocessing steps required to transform the data to be indexed into an appropriate format.  
+
The Planner is aware of the available Data Sources, through the Resource Registry that provides an interaction mechanism with the [[gCore Based Information System]]. The Information System is used by Data Sources in order to publish their capabilities. Planner computes a plan for answering a [http://www.loc.gov/standards/sru/specs/cql.html CQL] query received by a search client. This plan contains also some hints about specific functionality required for the involved Search Operators.  
* Manager: A Manager instance ensures the correct synchronization and application of update actions on all the Replicas of a specific Index partition. Moreover, it handles abnormal conditions that affect the operation of the related Index partition.
+
* Replica: A Replica hosts the actual data being indexed for an Index partition. It dynamically applies update actions on the index structure it maintains, without ceasing its operation.
+
  
The OpenSearch framework uses the OpenSearch specification in order to connect to external IR providers and exploit their information. A different OpenSearch instance is used to connect to each provider. In such way, the IR capabilities of external providers are published to gCube infrastructure and can be utilized by the gCube IR framework.
+
The distributed Execution environment computes a preferred allocation for the execution of received plans through the Execution planner.  
 
+
During this procedure the possible invocation options for each Data Source and Search Operator are taken into account. The outcome of execution is the endpoint of a [[GRS2]] pipeline that can transfer the results to the search client that initiated the query.
On the top layer of the Index and OpenSearch Sources the CQL standard provides the link to the gCube Search System. While only Index and OpenSearch Sources are internal parts of gCube, other IR providers can be wrapped as Data Sources, as long as they support CQL.
+
  
 
== Deployment ==
 
== Deployment ==
 +
The deployment schema must be decided based on:
  
Data Sources are deployed over [https://gcore.wiki.gcube-system.org/gCube/index.php/Main_Page gCore] containers. The [[gRS2]] pipelining mechanism must also be part of the node. Index Replicas and OpenSearch instances can use a large number of nodes in cases where load balancing is required for large scale infrastructures. For better synchronization, Index Managers and Updaters can be co-deployed, while it is preferable to deploy Lookups in different nodes. Note that Data Sources instances are most commonly deploy on a [[Administrator's_Guide:How_to_create_a_Virtual_Organization | VO]] level.
+
1. the workload from the search clients
  
== Use Cases ==
+
2. the extent of the Data Sources' space
The suitability of the gCube Data Source specifications for IR components is strongly related to the two standards adopted:
+
  
* CQL: IR providers that support functionality which can be directly mapped in the CQL standard are good candidates for being wrapped into Data Sources.
+
=== Large Deployment ===
* OpenSearch: IR providers that implement the OpenSearch API can be directly wrapped into Data Sources.
+
Planner, Search Operators and Execution Engine are deployed over [https://gcore.wiki.gcube-system.org/gCube/index.php/Main_Page gCore] containers. The [[gRS2]] pipelining mechanism and the Resource Registry must also be part of the node. In case of a high frequency of received queries from search clients, a lot of Planner, Operators and Execution Engine instances have to be deployed. Such instances can be co-deployed for minimizing different types of overhead. A Planner instance can use only the co-deployed Execution Engine instance for the generation of execution plans. If Search Operators are also co-deployed on the same node, the Search System will be able to perform execution of the entire or part of the execution plan by exploiting the local resources, or in case the complexity of the execution plan is high, it can exploit the computational resources of all nodes where an Execution Engine is deployed. The following figure depicts a deployment configuration with multiple Search System endpoints where execution can be performed in a distributed manner with each node expoiting the computational capabilities both of itself and also of all other nodes. Another option for a deployment scenario could be to separate Search System endpoint nodes with execution nodes by maintaining a set of nodes with Planner and Execution Engine deployed and another set of nodes with Execution Engine and Operators. Deployment is usually performed on a [[VRE_Administration | VRE]] level.
  
=== Well suited Use Cases ===
+
[[Image:Search_system_deployment_large.jpg|frame|center|Search System Large Deployment]]
  
Components that provide IR functionality are well-suited for forming Data Sources based on their relation to the above standards. Integration of an IR provider
+
=== Small Deployment ===
through the OpenSearch Data Source is preferable in cases where there is no direct mapping of the provider's functionality to the CQL standard. However, if CQL
+
In a smaller scale, with a small frequency of received queries and a reasonable Sources' space, only one node with Planner, Operators and Execution Engine co-deployed, is preferable.
can express accurately the provided IR capabilities, the direct integration of the corresponding IR component as a separate Data Source can be advantageous. The
+
 
advantages in that case are mainly related to the better exploitation of the component's IR functionality. Note that CQL is chosen as the standard in our framework,  
+
[[Image:Search_system_deployment_small.jpg|frame|center|Search System Small Deployment]]
because it is a highly expressive query language that suits the IR functionality of most general-case IR systems.
+
 
 +
== Use Cases ==
 +
The suitability of the Search System is based primarily on the characteristics of the underlying environment
 +
 
 +
=== Well suited Use Cases ===
 +
In cases where the environment's underlying information providers are CQL-compliant, gCube Search System can unify them and provide dynamic integration in the IR process. Additionally the optimization mechanisms it employs, in the planning and execution layer, offer high performance and scalability in case of large heterogeneous Data Sources' spaces.
  
 
=== Less well suited Use Cases ===
 
=== Less well suited Use Cases ===
  
In case a Data provider can not be associated with any of the two standards, the alternative approach is to apply an intermediate step by inserting the provider's data into an Index partition. In this case the provider's information will be exploited through the Index System functionality. However, this alternative implies a significant overhead when the content of the provider is frequently updated.
+
Less well suited cases are those where the underlying information providers can not be wrapped into gCube Data Sources through the CQL and [http://www.opensearch.org OpenSearch] standards, or through the [[Data_Sources_Specification | Index Data Source]]. These rare cases will mostly arise when the IR functionality of information providers can not be exploited using CQL expressions.

Latest revision as of 14:09, 19 October 2016

Overview

A fundamental part of the gCube Information Retrieval framework consists of the Search Planning and Execution components. The Search Planner enables the on-the fly integration of CQL-compliant Data Sources. The key concept in this process is the publication of CQL capabilities by the integrated Sources. The Search Planner will involve any of the Sources that have published their capabilities on a given infrastructure, as long as they contribute to the result for a query.

The optimization mechanisms of the Planner detect the smallest set of Sources required to answer a query. Moreover, using a probabilistic approach, a near-optimal plan for execution is found. The algorithms of the Planning and Optimization stages allow the IR framework to scale in the number of Data Sources that are integrated in an infrastructure.

Search Planner produces an plan that combines Data Sources and a selection from various Search Operators. A distributed execution environment ensures the efficient execution of the plan. Information travels from Data Sources to Search Operators and, in turn, to the Search clients through a pipelining data transferring mechanism that provides low latency, large throughput and a flow control facility.

Key features

On-the-fly Integration of New Data Sources
CQL-compliant Sources that publish their capabilities, are dynamically involved in the IR process.
Involves the minimum number of Data Sources
Detection of all the Sources that contribute to the result of a query.
Scalability in the number of Sources integrated in an Infrastructure
Planning and Execution components designed to scale.
Dynamic Integration of New Search Operators
Operators with new functionality can be dynamically integrated in the IR process.
Pipelining Mechanism
Offers flow control, low latency and high throughput.

Design

Philosophy

Search Planning and Execution components are designed in order to:

  • allow the efficient and flexible integration of new Sources and Operators.
  • exploit the IR capabilities and functionality of various information providers.
  • scale in environments with a large number of heterogeneous Sources.
  • decouple and eliminate the dependencies among the Planner, the Execution environment and the information providers.

Architecture

The main components of the Search system are the Planner, the Search Operators and the distributed Execution engine. The architecture is shown in the following figure:

Search System Architecture

The Planner is aware of the available Data Sources, through the Resource Registry that provides an interaction mechanism with the gCore Based Information System. The Information System is used by Data Sources in order to publish their capabilities. Planner computes a plan for answering a CQL query received by a search client. This plan contains also some hints about specific functionality required for the involved Search Operators.

The distributed Execution environment computes a preferred allocation for the execution of received plans through the Execution planner. During this procedure the possible invocation options for each Data Source and Search Operator are taken into account. The outcome of execution is the endpoint of a GRS2 pipeline that can transfer the results to the search client that initiated the query.

Deployment

The deployment schema must be decided based on:

1. the workload from the search clients

2. the extent of the Data Sources' space

Large Deployment

Planner, Search Operators and Execution Engine are deployed over gCore containers. The gRS2 pipelining mechanism and the Resource Registry must also be part of the node. In case of a high frequency of received queries from search clients, a lot of Planner, Operators and Execution Engine instances have to be deployed. Such instances can be co-deployed for minimizing different types of overhead. A Planner instance can use only the co-deployed Execution Engine instance for the generation of execution plans. If Search Operators are also co-deployed on the same node, the Search System will be able to perform execution of the entire or part of the execution plan by exploiting the local resources, or in case the complexity of the execution plan is high, it can exploit the computational resources of all nodes where an Execution Engine is deployed. The following figure depicts a deployment configuration with multiple Search System endpoints where execution can be performed in a distributed manner with each node expoiting the computational capabilities both of itself and also of all other nodes. Another option for a deployment scenario could be to separate Search System endpoint nodes with execution nodes by maintaining a set of nodes with Planner and Execution Engine deployed and another set of nodes with Execution Engine and Operators. Deployment is usually performed on a VRE level.

Search System Large Deployment

Small Deployment

In a smaller scale, with a small frequency of received queries and a reasonable Sources' space, only one node with Planner, Operators and Execution Engine co-deployed, is preferable.

Search System Small Deployment

Use Cases

The suitability of the Search System is based primarily on the characteristics of the underlying environment

Well suited Use Cases

In cases where the environment's underlying information providers are CQL-compliant, gCube Search System can unify them and provide dynamic integration in the IR process. Additionally the optimization mechanisms it employs, in the planning and execution layer, offer high performance and scalability in case of large heterogeneous Data Sources' spaces.

Less well suited Use Cases

Less well suited cases are those where the underlying information providers can not be wrapped into gCube Data Sources through the CQL and OpenSearch standards, or through the Index Data Source. These rare cases will mostly arise when the IR functionality of information providers can not be exploited using CQL expressions.