Difference between revisions of "Data Sources Specification"

Latest revision as of 00:22, 5 March 2012

Overview

The Data Sources Subsystem is the framework provided to integrate heterogeneous data from different providers into the gCube Information Retrieval (IR) process. Using an Indexing Layer and the OpenSearch standard, the Data Sources framework provides fast access and direct connection to the information hosted in the heterogeneous environment.

Key features

  • Unification of heterogeneous data and different IR capabilities
    Using the CQL standard, different gCube IR providers that host data with diverse representations and semantics can be involved in the overall IR process.
  • Indexing Layer for advanced IR functionality
    Full-text retrieval, multidimensional range queries and spatiotemporal search functionality.
  • Access to the information hosted by external providers
    External providers can contribute their results to the IR process through the OpenSearch standard.

Design

Philosophy

The Data Sources framework is implemented in order to:

  • simplify the integration of different IR providers in the gCube IR framework, using the appropriate standards.
  • provide Replication and High Availability through a distributed architecture.
  • exploit the information and IR capabilities of external providers.

Architecture

The Data Sources framework is composed of the Index and OpenSearch Systems. The architecture is depicted in the figure below.

The Index System is designed using a distributed architecture that involves three entities:

  • Updater: An Updater instance enables on-the-fly updates of an Index partition. It applies the preprocessing steps required to transform the data to be indexed into an appropriate format.
  • Manager: A Manager instance ensures the correct synchronization and application of update actions on all the Replicas of a specific Index partition. Moreover, it handles abnormal conditions that affect the operation of the related Index partition.
  • Replica: A Replica hosts the actual data being indexed for an Index partition. It dynamically applies update actions on the index structure it maintains, without ceasing its operation.

Figure: Data Sources Architecture
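
The division of labour between the three Index entities can be illustrated with a minimal sketch. All class and method names here (`Replica.apply`, `Manager.submit`, `Updater.update`) are hypothetical and are not taken from the gCube code base; the lower-casing step merely stands in for whatever preprocessing a real Updater performs.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Holds the indexed data of one Index partition and answers lookups."""
    documents: dict = field(default_factory=dict)

    def apply(self, doc_id, fields):
        # The update is applied in place; lookups keep working meanwhile.
        self.documents[doc_id] = fields

    def lookup(self, term):
        # Return the ids of all documents whose text contains the term.
        return sorted(doc_id for doc_id, fields in self.documents.items()
                      if term in fields.get("text", "").split())


@dataclass
class Manager:
    """Propagates every update action to all Replicas of one partition."""
    replicas: list

    def submit(self, doc_id, fields):
        for replica in self.replicas:
            replica.apply(doc_id, fields)


@dataclass
class Updater:
    """Preprocesses raw records and hands them to the partition's Manager."""
    manager: Manager

    def update(self, doc_id, raw_text):
        # Stand-in preprocessing (assumption): trim and lower-case the text.
        fields = {"text": raw_text.strip().lower()}
        self.manager.submit(doc_id, fields)


replicas = [Replica(), Replica()]
updater = Updater(Manager(replicas))
updater.update("doc-1", "  Tuna Catch Statistics ")
results = [replica.lookup("tuna") for replica in replicas]
```

After the single `update` call, both Replicas answer the same lookup identically, which is the synchronization property the Manager is responsible for. Failure handling, which the Manager also covers, is left out of this sketch.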

The OpenSearch framework uses the OpenSearch specification to connect to external IR providers and exploit their information. A separate OpenSearch instance is used to connect to each provider. In this way, the IR capabilities of external providers are published to the gCube infrastructure and can be used by the gCube IR framework.
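
As an illustration of what connecting through the OpenSearch specification involves, the sketch below fills in an OpenSearch URL template. The `{searchTerms}` and `{startPage?}` parameters come from the OpenSearch specification, where a trailing `?` marks a parameter as optional; the provider URL and the function name `fill_template` are assumptions made for this example and do not describe the gCube implementation.

```python
import re
from urllib.parse import quote


def fill_template(template, values):
    """Substitute OpenSearch template parameters such as {searchTerms}."""
    def substitute(match):
        name, optional = match.group(1), match.group(2) == "?"
        if name in values:
            return quote(str(values[name]), safe="")
        if optional:
            # Optional parameters without a value become the empty string.
            return ""
        raise KeyError(f"missing required parameter: {name}")

    return re.sub(r"\{([A-Za-z:]+)(\??)\}", substitute, template)


# Hypothetical provider template, as it might appear in a description document.
template = "http://example.org/search?q={searchTerms}&page={startPage?}"
url = fill_template(template, {"searchTerms": "tuna catch"})
paged_url = fill_template(template, {"searchTerms": "tuna", "startPage": 2})
```

Each provider publishes its own template, which fits the one-instance-per-provider arrangement described above.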

On top of the Index and OpenSearch Sources, the CQL standard provides the link to the gCube Search System. While only the Index and OpenSearch Sources are internal parts of gCube, other IR providers can be wrapped as Data Sources, as long as they support CQL.
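
To give a sense of what a CQL query passed to a Data Source looks like, the following sketch assembles one from simple clauses. The index names `title` and `year` are hypothetical, as are the helper names; the field names a real gCube Data Source exposes are not stated here.

```python
def cql_term(term):
    """Quote a term when it is empty or contains whitespace or CQL specials."""
    if term == "" or any(ch in term for ch in ' \t"()=<>/'):
        escaped = term.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    return term


def cql_clause(index, relation, term):
    """Build a single search clause of the form: index relation term."""
    return f"{index} {relation} {cql_term(term)}"


def cql_and(*clauses):
    """Combine clauses with the boolean operator 'and'."""
    return " and ".join(f"({clause})" for clause in clauses)


query = cql_and(
    cql_clause("title", "=", "tuna catch"),  # hypothetical index name
    cql_clause("year", ">=", "2010"),        # hypothetical index name
)
```

Here `query` evaluates to `(title = "tuna catch") and (year >= 2010)`. Because every Data Source accepts the same query language, the Search System does not need to know whether an Index or an OpenSearch Source answers the query.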

Deployment

Data Sources are deployed over gCore containers. The gRS2 pipelining mechanism must also be available on the node. Index Replicas and OpenSearch instances can be spread over a large number of nodes when load balancing is required for large-scale infrastructures. For better synchronization, Index Managers and Updaters can be co-deployed, while it is preferable to deploy Lookups on different nodes. Note that Data Sources instances are most commonly deployed at the VO level.

Figure: Data Sources Deployment

Use Cases

The suitability of the gCube Data Source specifications for IR components is strongly related to the two standards adopted:

  • CQL: IR providers whose functionality can be directly mapped to the CQL standard are good candidates for being wrapped into Data Sources.
  • OpenSearch: IR providers that implement the OpenSearch API can be directly wrapped into Data Sources.

Well suited Use Cases

Components that provide IR functionality are well suited for forming Data Sources, depending on their relation to the above standards. Integrating an IR provider through the OpenSearch Data Source is preferable when the provider's functionality has no direct mapping to the CQL standard. However, if CQL can accurately express the provided IR capabilities, integrating the corresponding IR component directly as a separate Data Source can be advantageous, mainly because the component's IR functionality is better exploited. CQL was chosen as the standard in the framework because it is a highly expressive query language that suits the IR functionality of most general-purpose IR systems.

Less well suited Use Cases

If a data provider cannot be associated with either of the two standards, the alternative is to add an intermediate step that inserts the provider's data into an Index partition. The provider's information is then exploited through the Index System functionality. However, this alternative implies a significant overhead when the content of the provider is frequently updated.
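
The guidance in the use-case sections can be condensed into a small decision function. This is only a reading of the text above, not a gCube API; the function name, parameter names and returned labels are all illustrative.

```python
def choose_integration(cql_maps_directly, implements_opensearch):
    """Pick an integration path for an IR provider (illustrative only)."""
    if cql_maps_directly:
        # CQL expresses the provider's capabilities accurately.
        return "dedicated CQL Data Source"
    if implements_opensearch:
        # No direct CQL mapping, but the provider speaks OpenSearch.
        return "OpenSearch Data Source"
    # Neither standard applies: feed the data into an Index partition,
    # accepting the overhead when the content changes frequently.
    return "Index partition ingestion"


choice = choose_integration(cql_maps_directly=False, implements_opensearch=True)
```

The order of the checks reflects the preference stated above: direct CQL integration where it is accurate, OpenSearch otherwise, and indexing as the fallback.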