Difference between revisions of "Index Management Framework"

(Contextual Query Language Compliance)
(Deployment Instructions)
 
(96 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 
=Contextual Query Language Compliance=
 
The gCube Index Framework consists of the Full Text Index, the Geospatial Index and the Forward Index. All of them are able to answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports, depends on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:
+
The gCube Index Framework consists of the Index Service, which provides both FullText Index and Forward Index capabilities. Both capabilities can answer [http://www.loc.gov/standards/sru/specs/cql.html CQL] queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:
  
* Full Text Index: =, adj, fuzzy, proximity, within
+
* [[Index_Management_Framework#CQL_capabilities_implementation | Index Service]] : =, ==, within, >, >=, <=, adj, fuzzy, proximity
* Geospatial Index: geosearch
+
<!--* [[Index_Management_Framework#CQL_capabilities_implementation_2 | Geo-Spatial Index]] : geosearch
* Forward: ==, within
+
* [[Index_Management_Framework#CQL_capabilities_implementation_3 | Forward Index]] : -->
  
=Full Text Index=
+
=Index Service=
 +
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.
 +
 
 +
Index Service exposes a REST API, thus it can be used by different general purpose libraries that support REST.
 +
For example, the following HTTP GET call is used in order to query the index:
 +
 
 +
http://'''{host}'''/index-service-1.0.0-SNAPSHOT/'''{resourceID}'''/query?queryString=((e24f6285-46a2-4395-a402-99330b326fad = tuna) and (((gDocCollectionID == 8dc17a91-378a-4396-98db-469280911b2f)))) project ae58ca58-55b7-47d1-a877-da783a758302
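
The query string in such a call contains spaces, parentheses and equals signs, so a client will generally need to URL-encode it. The following is a minimal sketch using only the JDK HTTP client. The host and resource ID values are placeholders, the path layout is simply copied from the example above, and the response format is not specified here, so the sketch only prints the status code and body.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class IndexQueryExample {

    // Builds the query URI following the path layout of the example above.
    // The CQL query is URL-encoded (spaces become '+', '(' becomes %28, etc.).
    static URI buildQueryUri(String host, String resourceId, String cql) {
        String encoded = URLEncoder.encode(cql, StandardCharsets.UTF_8);
        return URI.create("http://" + host + "/index-service-1.0.0-SNAPSHOT/"
                + resourceId + "/query?queryString=" + encoded);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder values: replace with a real host and resource ID.
        URI uri = buildQueryUri("localhost:8080", "myResourceID",
                "(gDocCollectionID == 8dc17a91-378a-4396-98db-469280911b2f)");

        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Any other general purpose HTTP library could be used in the same way, since only a plain GET request is involved.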
 +
 
 +
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:
 +
 
 +
<source lang="xml">
 +
 
 +
<!-- index service web app -->
 +
<groupId>org.gcube.index</groupId>
 +
<artifactId>index-service</artifactId>
 +
<version>...</version>
 +
 
 +
 
 +
<!-- index service commons library -->
 +
<groupId>org.gcube.index</groupId>
 +
<artifactId>index-service-commons</artifactId>
 +
<version>...</version>
 +
 
 +
<!-- index service client library -->
 +
<groupId>org.gcube.index</groupId>
 +
<artifactId>index-service-client-library</artifactId>
 +
<version>...</version>
 +
 
 +
<!-- helper common library -->
 +
<groupId>org.gcube.index</groupId>
 +
<artifactId>indexcommon</artifactId>
 +
<version>...</version>
 +
 
 +
</source>
 +
 
 +
==Implementation Overview==
 +
===Services===
 +
The new index is implemented through one service, following the Factory pattern:
 +
*The '''Index Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the 3 services that were used in the old Full Text Index.
 +
 
 +
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:
 +
 
 +
[[File:FullTextIndexNodeService.png|frame|none|Generic Editor]]
 +
 
 +
It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch Node. For this reason, creating multiple resources of the IndexNode service is discouraged; the best case is to have one resource (one node) in each container that makes up the cluster.
 +
 
 +
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.
 +
 
 +
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the ''defaultSameCluster'' variable in the ''deploy.properties'' file to true or false respectively.
 +
 
 +
Example
 +
<pre>
 +
defaultSameCluster=true
 +
</pre>
 +
or
 +
 
 +
<pre>
 +
defaultSameCluster=false
 +
</pre>
 +
 
 +
''ElasticSearch'', which is the underlying technology of the new Index Service, allows the number of replicas and shards to be configured for each index. This is done by setting the variables ''noReplicas'' and ''noShards'' in the ''deploy.properties'' file.
 +
 
 +
Example:
 +
<pre>
 +
noReplicas=1
 +
noShards=2
 +
</pre>
 +
 
 +
 
 +
'''Highlighting''' is a feature supported by the Index Service (it was also supported in the old Full Text Index). If highlighting is enabled, the index returns a snippet of the text that matches the query, taken from the presentable fields. This snippet is usually a concatenation of a number of matching fragments from those fields. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables ''maxFragmentSize'' and ''maxFragmentCnt'' in the ''deploy.properties'' file respectively:
 +
 
 +
Example:
 +
<pre>
 +
maxFragmentCnt=5
 +
maxFragmentSize=80
 +
</pre>
 +
 
 +
 
 +
 
 +
The folder where the index data are stored can be configured by setting the variable ''dataDir'' in the ''deploy.properties'' file (if the variable is not set, the default location is the folder from which the container runs).
 +
 
 +
Example :
 +
<pre>
 +
dataDir=./data
 +
</pre>
 +
 
 +
Whether the Resource Registry is used (for translation of field ids to field names) is configured through the variable ''useRRAdaptor'' in the ''deploy.properties'' file.
 +
 
 +
Example :
 +
<pre>
 +
useRRAdaptor=true
 +
</pre>
 +
 
 +
 
 +
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable ''resourcesFoldername'' in the ''deploy.properties'' file.
 +
 
 +
Example :
 +
<pre>
 +
resourcesFoldername=./resources/index
 +
</pre>
 +
 
 +
Finally, the hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties'' file.
 +
 
 +
Example :
 +
<pre>
 +
hostname=dl015.madgik.di.uoa.gr
 +
port=8080
 +
scope=/gcube/devNext
 +
</pre>
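
To check a configuration before deployment, the properties described above can be read with the standard `java.util.Properties` class. The sketch below is only illustrative: it parses an in-memory copy of the example values rather than the real ''deploy.properties'' file, and the fallback values passed to `getProperty` are assumptions, not documented defaults of the service.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class DeployPropertiesExample {

    // Parses properties text; in a deployment the text would come from deploy.properties.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "defaultSameCluster=true",
                "noReplicas=1",
                "noShards=2",
                "maxFragmentCnt=5",
                "maxFragmentSize=80",
                "dataDir=./data",
                "useRRAdaptor=true",
                "resourcesFoldername=./resources/index",
                "hostname=dl015.madgik.di.uoa.gr",
                "port=8080",
                "scope=/gcube/devNext");
        Properties p = parse(sample);

        // The fallback values below are illustrative assumptions, not documented defaults.
        boolean sameCluster = Boolean.parseBoolean(p.getProperty("defaultSameCluster", "false"));
        int shards = Integer.parseInt(p.getProperty("noShards", "1"));
        int replicas = Integer.parseInt(p.getProperty("noReplicas", "0"));
        String endpoint = p.getProperty("hostname") + ":" + p.getProperty("port");

        System.out.println("same cluster: " + sameCluster);
        System.out.println("shards/replicas: " + shards + "/" + replicas);
        System.out.println("endpoint: " + endpoint + ", scope: " + p.getProperty("scope"));
    }
}
```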
 +
 
 +
===CQL capabilities implementation===
 +
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in Lucene. This transformation is explained through the following examples:
 +
 
 +
{| border="1"
 +
! CQL triple !! explanation !! Lucene equivalent
 +
|-
 +
! title adj "sun is up"
 +
| documents with this phrase in their title
 +
| title:"sun is up"
 +
|-
 +
! title fuzzy "invorvement"
 +
| documents with words "similar" to invorvement in their title
 +
| title:invorvement~
 +
|-
 +
! allIndexes = "italy" (documents have 2 fields; title and abstract)
 +
| documents with the word italy in some of their fields
 +
| title:italy OR abstract:italy
 +
|-
 +
! title proximity "5 sun up"
 +
| documents with the words sun, up inside an interval of 5 words in their title
 +
| title:"sun up"~5
 +
|-
 +
! date within "2005 2008"
 +
| documents with a date between 2005 and 2008
 +
| date:[2005 TO 2008]
 +
|}
 +
 
 +
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a Lucene query, we first transform the CQL triples and then connect them with the equivalent AND, OR and NOT operators.
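
The triple transformations listed in the table can be sketched as a small function. This is only an illustration of the mapping shown above, not the translator used by the service: the class and method names are hypothetical, the list of fields expanded for ''allIndexes'' is passed in by the caller, and escaping of Lucene special characters is not handled.

```java
import java.util.List;
import java.util.stream.Collectors;

class CqlTripleToLucene {

    // Translates one CQL Index-Relation-Term triple into a Lucene query string,
    // following the examples in the table above.
    static String translate(String index, String relation, String term, List<String> allFields) {
        switch (relation) {
            case "adj":
                // phrase query
                return index + ":\"" + term + "\"";
            case "fuzzy":
                return index + ":" + term + "~";
            case "proximity": {
                // the first token of the term is the distance, the rest is the phrase
                int space = term.indexOf(' ');
                String distance = term.substring(0, space);
                String phrase = term.substring(space + 1);
                return index + ":\"" + phrase + "\"~" + distance;
            }
            case "within": {
                // the term holds the lower and upper bound separated by a space
                String[] bounds = term.split(" ");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            case "=":
                if (index.equals("allIndexes")) {
                    // expand to a disjunction over every known field
                    return allFields.stream()
                            .map(field -> field + ":" + term)
                            .collect(Collectors.joining(" OR "));
                }
                return index + ":" + term;
            default:
                throw new IllegalArgumentException("Unsupported relation: " + relation);
        }
    }

    public static void main(String[] args) {
        List<String> fields = List.of("title", "abstract");
        System.out.println(translate("title", "adj", "sun is up", fields));
        System.out.println(translate("title", "proximity", "5 sun up", fields));
        System.out.println(translate("allIndexes", "=", "italy", fields));
    }
}
```

A complete query would then be assembled by joining the translated triples with the boolean operators of the original CQL query.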
 +
 
 +
===RowSet===
 +
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:
 +
<pre>
 +
<ROWSET idxType="IndexTypeName" colID="colA" lang="en">
 +
    <ROW>
 +
        <FIELD name="ObjectID">doc1</FIELD>
 +
        <FIELD name="title">How to create an Index</FIELD>
 +
        <FIELD name="contents">Just read the WIKI</FIELD>
 +
    </ROW>
 +
    <ROW>
 +
        <FIELD name="ObjectID">doc2</FIELD>
 +
        <FIELD name="title">How to create a Nation</FIELD>
 +
        <FIELD name="contents">Talk to the UN</FIELD>
 +
        <FIELD name="references">un.org</FIELD>
 +
    </ROW>
 +
</ROWSET>
 +
</pre>
 +
 
 +
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.
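
A ROWSET of this shape can be produced with plain string building. The sketch below is a hypothetical helper, not part of the index libraries: it enforces the required "ObjectID" field, treats the ''lang'' attribute as optional, and escapes XML special characters in names and values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RowsetBuilderExample {

    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Serializes one document; the "ObjectID" field is required for every ROW.
    static String row(Map<String, String> fields) {
        if (!fields.containsKey("ObjectID")) {
            throw new IllegalArgumentException("Every ROW needs an ObjectID field");
        }
        StringBuilder sb = new StringBuilder("    <ROW>\n");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("        <FIELD name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue())).append("</FIELD>\n");
        }
        return sb.append("    </ROW>\n").toString();
    }

    // idxType and colID are required; lang may be null because it is optional.
    static String rowset(String idxType, String colID, String lang, String... rows) {
        StringBuilder sb = new StringBuilder("<ROWSET idxType=\"").append(escape(idxType))
                .append("\" colID=\"").append(escape(colID)).append("\"");
        if (lang != null) {
            sb.append(" lang=\"").append(escape(lang)).append("\"");
        }
        sb.append(">\n");
        for (String r : rows) {
            sb.append(r);
        }
        return sb.append("</ROWSET>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc1 = new LinkedHashMap<>();
        doc1.put("ObjectID", "doc1");
        doc1.put("title", "How to create an Index");
        doc1.put("contents", "Just read the WIKI");
        System.out.println(rowset("IndexTypeName", "colA", "en", row(doc1)));
    }
}
```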
 +
 
 +
===IndexType===
 +
How the different fields in the [[Full_Text_Index#RowSet|ROWSET]] should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:
 +
 
 +
<source lang="xml">
 +
    <index-type>
 +
        <field-list>
 +
            <field name="title">
 +
                <index>yes</index>
 +
                <store>yes</store>
 +
                <return>yes</return>
 +
                <tokenize>yes</tokenize>
 +
                <sort>no</sort>
 +
                <highlightable>yes</highlightable>
 +
                <boost>1.0</boost>
 +
            </field>
 +
            <field name="contents">
 +
                <index>yes</index>
 +
                <store>yes</store>
 +
                <return>yes</return>
 +
                <tokenize>yes</tokenize>
 +
                <sort>no</sort>
 +
                <boost>1.0</boost>
 +
            </field>
 +
            <field name="references">
 +
                <index>yes</index>
 +
                <store>no</store>
 +
                <return>no</return>
 +
                <tokenize>yes</tokenize>
 +
                <sort>no</sort>
 +
                <highlightable>no</highlightable> <!-- will not be included in the highlight snippet -->
 +
                <boost>1.0</boost>
 +
            </field>
 +
            <field name="gDocCollectionID">
 +
                <index>yes</index>
 +
                <store>yes</store>
 +
                <return>yes</return>
 +
                <tokenize>yes</tokenize>
 +
                <sort>no</sort>
 +
                <boost>1.0</boost>
 +
            </field>
 +
            <field name="gDocCollectionLang">
 +
                <index>yes</index>
 +
                <store>yes</store>
 +
                <return>yes</return>
 +
                <tokenize>yes</tokenize>
 +
                <sort>no</sort>
 +
                <boost>1.0</boost>
 +
            </field>
 +
        </field-list>
 +
    </index-type>
 +
</source>
 +
 
 +
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:
 +
 
 +
*'''index'''
 +
:specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)
 +
*'''store'''
 +
:specifies whether the field should be stored in its original format to be returned in the results from a query.
 +
*'''return'''
 +
:specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)
 +
*'''highlightable'''
 +
:specifies whether a returned field should be included or not in the highlight snippet. If not specified then every returned field will be included in the snippet.
 +
*'''tokenize'''
 +
:Not used
 +
*'''sort'''
 +
:Not used
 +
*'''boost'''
 +
:Not used
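
The rule that fields present in the ROWSET but absent from the IndexType are skipped can be illustrated with a small filter. This is a sketch of the described behavior only, with hypothetical names; it does not reproduce how the service parses the IndexType XML.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class IndexTypeFilterExample {

    // Keeps only the ROW fields whose names are declared in the IndexType field list.
    static Map<String, String> keepDeclaredFields(Map<String, String> rowFields,
                                                  Set<String> indexTypeFields) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : rowFields.entrySet()) {
            if (indexTypeFields.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("title", "How to create a Nation");
        row.put("contents", "Talk to the UN");
        row.put("publisher", "not declared in the IndexType");

        Set<String> declared = Set.of("title", "contents", "references",
                "gDocCollectionID", "gDocCollectionLang");
        System.out.println(keepDeclaredFields(row, declared).keySet());
    }
}
```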
 +
 
 +
We currently have five standard index types, loosely based on the available metadata schemas. However any data can be indexed using each, as long as the RowSet follows the IndexType:
 +
*index-type-default-1.0 (DublinCore)
 +
*index-type-TEI-2.0
 +
*index-type-eiDB-1.0
 +
*index-type-iso-1.0
 +
*index-type-FT-1.0
 +
 
 +
===Query language===
 +
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.
 +
 
 +
 
 +
==Deployment Instructions==
 +
 
 +
In order to deploy and run Index Service on a node we will need the following:
 +
* index-service-''{version}''.war
 +
* smartgears-distribution-''{version}''.tar.gz (to publish the running instance of the service on the IS and be discoverable)
 +
** see [http://gcube.wiki.gcube-system.org/gcube/index.php/SmartGears_gHN_Installation here] for installation
 +
* an application server (such as Tomcat, JBoss, Jetty)
 +
 
 +
There are a few things that need to be configured in order for the service to be functional. All the service configuration is done in the file ''deploy.properties'' that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is ''webapps/service/WEB-INF/classes''.
 +
 
 +
The hostname of the node, as well as the port and the scope that the node is running on, have to be set in the variables ''hostname'', ''port'' and ''scope'' in the ''deploy.properties'' file.
 +
 
 +
Example :
 +
<pre>
 +
hostname=dl015.madgik.di.uoa.gr
 +
port=8080
 +
scope=/gcube/devNext
 +
</pre>
 +
 
 +
Finally, [http://gcube.wiki.gcube-system.org/gcube/index.php/Resource_Registry Resource Registry] should be configured to not run in client mode. This is done in the ''deploy.properties'' by setting:
 +
 
 +
<pre>
 +
clientMode=false
 +
</pre>
 +
 
 +
'''NOTE''': the ''resourcesFoldername'' and ''dataDir'' properties have relative paths as their default values. In some cases these values may be evaluated relative to the folder from which the container was started, so to avoid problems related to this behavior it is better to give these properties absolute paths as values.
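
The effect described in the note can be seen with the standard `java.nio.file` API: a relative value resolves differently depending on the base folder, while an absolute value is unaffected. The folder names below are made-up examples.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class AbsolutePathExample {

    // Resolves a configured path against a base folder; absolute values are returned unchanged.
    static Path resolveAgainst(Path baseDir, String configured) {
        return baseDir.resolve(configured).normalize();
    }

    public static void main(String[] args) {
        // The same relative value ends up in different places for different start folders.
        System.out.println(resolveAgainst(Paths.get("/opt/tomcat"), "./data"));
        System.out.println(resolveAgainst(Paths.get("/opt/tomcat/bin"), "./data"));

        // An absolute value does not depend on where the container was started.
        System.out.println(resolveAgainst(Paths.get("/opt/tomcat/bin"), "/var/index/data"));

        // The folder the current JVM was started from:
        System.out.println(Paths.get("").toAbsolutePath());
    }
}
```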
 +
 
 +
==Usage Example==
 +
 
 +
===Create an Index Service Node, feed and query using the corresponding client library===
 +
 
 +
The following example demonstrates the usage of the IndexFactoryClient and IndexClient.
 +
Both are created according to the Builder pattern.
 +
 
 +
<source lang="java">
 +
 
 +
final String scope = "/gcube/devNext";
 +
 
 +
// create a client for the given scope (we can provide endpoint as extra filter)
 +
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();
 +
 
 +
indexFactoryClient.createResource("myClusterID", scope);
 +
 
 +
// create a client for the same scope (we can provide endpoint, resourceID, collectionID, clusterID, indexID as extra filters)
 +
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();
 +
 
 +
try{
 +
  indexClient.feedLocator(locator);
 +
  indexClient.query(query);
 +
} catch (IndexException e) {
 +
  // handle the exception
 +
}
 +
</source>
 +
 
 +
<!--=Full Text Index=
 
The Full Text Index is responsible for providing quick full text data retrieval capabilities in the gCube environment.
 
  
Line 18: Line 311:
 
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Full Text Index:
 
 
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]
 
 +
 +
===CQL capabilities implementation===
 +
Full Text Index uses [http://lucene.apache.org/ Lucene] as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in lucene. This transformation is explained through the following examples:
 +
 +
{| border="1"
 +
! CQL triple !! explanation !! lucene equivalent
 +
|-
 +
! title adj "sun is up"
 +
| documents with this phrase in their title
 +
| title:"sun is up"
 +
|-
 +
! title fuzzy "invorvement"
 +
| documents with words "similar" to invorvement in their title
 +
| title:invorvement~
 +
|-
 +
! allIndexes = "italy" (documents have 2 fields; title and abstract)
 +
| documents with the word italy in some of their fields
 +
| title:italy OR abstract:italy
 +
|-
 +
! title proximity "5 sun up"
 +
| documents with the words sun, up inside an interval of 5 words in their title
 +
| title:"sun up"~5
 +
|-
 +
! date within "2005 2008"
 +
| documents with a date between 2005 and 2008
 +
| date:[2005 TO 2008]
 +
|}
 +
 +
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR, NOT(AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a lucene query, we first transform CQL triples and then we connect them with AND, OR, NOT equivalently.
  
 
===RowSet===
 
 
The content to be fed into an Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain of any number of FIELD elements with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:
 
 
<pre>
 
<pre>
<ROWSET>
+
<ROWSET idxType="IndexTypeName" colID="colA" lang="en">
     <ROW id="doc1">
+
     <ROW>
 +
        <FIELD name="ObjectID">doc1</FIELD>
 
         <FIELD name="title">How to create an Index</FIELD>
 
         <FIELD name="title">How to create an Index</FIELD>
 
         <FIELD name="contents">Just read the WIKI</FIELD>
 
         <FIELD name="contents">Just read the WIKI</FIELD>
 
     </ROW>
 
     </ROW>
     <ROW id="doc2">
+
     <ROW>
 +
        <FIELD name="ObjectID">doc2</FIELD>
 
         <FIELD name="title">How to create a Nation</FIELD>
 
         <FIELD name="title">How to create a Nation</FIELD>
 
         <FIELD name="contents">Talk to the UN</FIELD>
 
         <FIELD name="contents">Talk to the UN</FIELD>
Line 34: Line 358:
 
</ROWSET>
 
</ROWSET>
 
</pre>
 
</pre>
 +
 +
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.
  
 
===IndexType===
 
Line 65: Line 391:
 
                 <boost>1.0</boost>
 
                 <boost>1.0</boost>
 
             </field>
 
             </field>
 +
            <field name="gDocCollectionID">
 +
                <index>yes</index>
 +
<store>yes</store>
 +
<return>yes</return>
 +
<tokenize>yes</tokenize>
 +
<sort>no</sort>
 +
<boost>1.0</boost>
 +
    </field>
 +
    <field name="gDocCollectionLang">
 +
<index>yes</index>
 +
<store>yes</store>
 +
<return>yes</return>
 +
<tokenize>yes</tokenize>
 +
<sort>no</sort>
 +
<boost>1.0</boost>
 +
    </field>
 
         </field-list>
 
         </field-list>
 
     </index-type>
 
     </index-type>
 
</pre>
 
</pre>
  
Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained bellow:
+
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required, because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:
  
 
*'''index'''
 
Line 96: Line 438:
 
             <boost>1.0</boost>  
 
             <boost>1.0</boost>  
 
   
 
   
             <span style="color:green"><nowiki><!-- subfields of contents --></nowiki></span>
+
             <span style="color:green"><nowiki>// subfields of contents</nowiki></span>
 
             <field name="title">
 
             <field name="title">
 
                 <index>yes</index>
 
                 <index>yes</index>
Line 105: Line 447:
 
                 <boost>1.0</boost>  
 
                 <boost>1.0</boost>  
 
   
 
   
                 <span style="color:green"><nowiki><!-- subfields of title which itself is a subfield of contents --></nowiki></span>
+
                 <span style="color:green"><nowiki>// subfields of title which itself is a subfield of contents</nowiki></span>
 
                 <field name="bookTitle">
 
                 <field name="bookTitle">
 
                     <index>yes</index>
 
                     <index>yes</index>
Line 150: Line 492:
 
         </field>  
 
         </field>  
 
   
 
   
         <span style="color:green"><nowiki><!-- not a subfield --></nowiki></span>
+
         <span style="color:green"><nowiki>// not a subfield</nowiki></span>
 
         <field name="references">
 
         <field name="references">
 
             <index>yes</index>
 
             <index>yes</index>
Line 178: Line 520:
  
 
===Query language===
 
The Full Text Index uses the Lucene query language, but does not allow the use of fuzzy searches, proximity searches, range searches or boosting of a term. In addition, queries using wildcards will not return usable query statistics.
+
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.
  
 
===Statistics===
 
Line 526: Line 868:
 
  input.close();
 
  input.close();
 
  output.close();
 
  output.close();
 +
-->
  
=Geo-Spatial Index=
+
<!--=Geo-Spatial Index=
 
==Implementation Overview==
 
 
===Services===
 
Line 537: Line 880:
 
It is important to note that none of the three services have to reside on the same node; they are only connected through WebService calls, the [[IS-Notification | gCube notifications' framework]] and the [[ Content Manager (NEW) | gCube Content Management System]]. The following illustration shows the information flow and responsibilities for the different services used to implement the Geo Index:
 
 
[[Image:GeneralIndexDesign.png|frame|none|Generic Editor]]
 
 +
 +
===Underlying Technology===
 +
Geo-Spatial Index uses [http://geotools.org/ Geotools] as its underlying technology. The documents hosted in a Geo-Spatial Index Lookup belong to different collections and languages. For each language of each collection, the Geo-Spatial Index Lookup uses a separate R-tree to index the corresponding documents. As described in the following section, the CQL queries received by a Geo Index Lookup may refer to specific languages and collections. Through this design we aim at high performance for complicated queries that involve many collections and languages.
 +
 +
===CQL capabilities implementation===
 +
Geo-Spatial Index Lookup supports one custom CQL relation. The "geosearch" relation has a number of modifiers, that specify the collection and language of the results, the inclusion type of the query, a refiner for filtering further the results and a ranker for computing a score for each result. All of the modifiers are optional. Let's see the following example in order to understand better the geosearch relation and its modifiers:
 +
 +
<pre>
 +
geo geosearch/colID="colA"/lang="en"/inclusion="0"/ranker="rankerA false arg1 arg2"/refiner="refinerA arg1 arg2 arg3" "1 1 1 10 10 1 10 10"
 +
</pre>
 +
 +
In this example the results will be the documents that belong to the collection with ID "colA", are in English and intersect with the polygon defined by the points (1,1), (1,10), (10,1), (10,10). These results will be filtered by the refiner with ID "refinerA", which takes "arg1 arg2 arg3" as arguments for the filtering operation, then ordered by "rankerA", which takes "arg1 arg2" as arguments for the ranking operation, and finally returned as the output of this simple CQL query. The "false" indication in the ranker modifier signifies that we don't want reverse ordering of the results (true signifies that the higher scores must be placed at the end). There are three inclusion types: 0 is "intersects", 1 is "contains" (documents that are contained in the specified polygon) and 2 is "inside" (documents that are inside the specified polygon). The next sections will provide the details for the refiners and rankers.
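
Since all modifiers are optional, a geosearch triple can be assembled by appending only the modifiers that are present. The helper below is a hypothetical sketch that reproduces the textual form of the example above; it performs no validation of ranker or refiner IDs, and the order of the modifiers simply follows that example.

```java
class GeosearchQueryExample {

    // Builds a geosearch triple; any modifier passed as null is left out.
    static String geosearch(String colID, String lang, Integer inclusion,
                            String ranker, String refiner, String coordinates) {
        StringBuilder sb = new StringBuilder("geo geosearch");
        if (colID != null) sb.append("/colID=\"").append(colID).append('"');
        if (lang != null) sb.append("/lang=\"").append(lang).append('"');
        if (inclusion != null) sb.append("/inclusion=\"").append(inclusion).append('"');
        if (ranker != null) sb.append("/ranker=\"").append(ranker).append('"');
        if (refiner != null) sb.append("/refiner=\"").append(refiner).append('"');
        // the term is the space separated list of polygon coordinates
        return sb.append(" \"").append(coordinates).append('"').toString();
    }

    public static void main(String[] args) {
        System.out.println(geosearch("colA", "en", 0,
                "rankerA false arg1 arg2", "refinerA arg1 arg2 arg3",
                "1 1 1 10 10 1 10 10"));
        // Only the collection and the inclusion type:
        System.out.println(geosearch("colA", null, 1, null, null, "1 1 1 10 10 1 10 10"));
    }
}
```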
 +
 +
A complete CQL query for a Geo-Spatial Index Lookup will contain many CQL geosearch triples, connected with AND, OR, NOT operators. The approach we follow in order to execute CQL queries, is to apply boolean algebra rules, and transform the initial query to an equivalent one. We aim at producing a query which is a union of operations that each refers to a single R-tree. Additionally we apply "cut-off" rules that eliminate parts of the initial query that have a zero number of results. Consider the following example of a "cut-off" rule:
 +
 +
<pre>
 +
(geo geosearch/colID="colA"/inclusion="1" <P1>) AND (geo geosearch/colID="colA"/inclusion="1" <P2>)
 +
</pre>
 +
 +
Here we want documents of collection "colA" that are contained in polygon P1 AND are also contained in polygon P2. Note that if the collections in the two criteria were different, we could eliminate this subquery, since it could not produce any result (each document belongs to one collection only). Since the two criteria specify the same collection, we must examine the relation of the two polygons. If the two polygons do not intersect, then there is no area in which the documents should be contained, so no document can satisfy the conjunction of the 2 criteria. This is depicted in the following figure:
 +
 +
[[Image:GeoInter.jpg|500px|frame|center|Intersection of polygons P1 and P2]]
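
The "cut-off" rule for a conjunction of two "contains" criteria can be sketched with the JDK geometry classes. This is an illustration of the rule as described above, not the index implementation (which relies on Geotools): the class and method names are hypothetical, and polygons are given as flat x, y coordinate lists in drawing order.

```java
import java.awt.geom.Area;
import java.awt.geom.Path2D;

class CutOffRuleExample {

    // Builds a closed polygon from a flat list of x, y coordinates.
    static Area polygon(double... xy) {
        Path2D.Double path = new Path2D.Double();
        path.moveTo(xy[0], xy[1]);
        for (int i = 2; i < xy.length; i += 2) {
            path.lineTo(xy[i], xy[i + 1]);
        }
        path.closePath();
        return new Area(path);
    }

    // Returns true when "contained in p1 AND contained in p2" cannot have results:
    // either the collections differ, or the two polygons share no area.
    static boolean canBeCutOff(String colA, Area p1, String colB, Area p2) {
        if (!colA.equals(colB)) {
            return true; // each document belongs to one collection only
        }
        Area overlap = new Area(p1); // copy, so p1 is left untouched
        overlap.intersect(p2);
        return overlap.isEmpty();
    }

    public static void main(String[] args) {
        Area p1 = polygon(0, 0, 0, 2, 2, 2, 2, 0);
        Area p2 = polygon(5, 5, 5, 6, 6, 6, 6, 5);
        System.out.println(canBeCutOff("colA", p1, "colA", p2));
    }
}
```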
 +
 +
The transformation of an initial CQL query to a union of R-tree operations is depicted in the following figure:
 +
 +
[[Image:GeoXform.jpg|500px|frame|center|Geo-Spatial Index Lookup transformation of CQL queries]]
 +
 +
Each R-tree operation refers to a lookup operation for a given polygon on a single R-tree. The MergeSorter component implements the union of the individual R-tree operations, based on the scores of the results. Flow control is supported by the MergeSorter component in order to pause and synchronize the workers that execute the single R-tree operations, depending on the behavior of the client that reads the results.
  
 
===RowSet===
 
The content to be fed into a Geo Index, must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should containan id, start and end X coordinates (x1-mandatory and x2-set to equal x1 if not provided) as well as start and end Y coordinates (y1-mandatory and y2-set to equal y1 if not provided). In addition, and of any number of FIELD elements containing a name attribute and information to be stored and perhaps used for [[Geographical/Spatial Index#refinement|refinement]] of a query or [[Geographical/Spatial Index#ranking|ranking]] of results. As opposed to the ROWSETs used for fulltext indices, all rows in a GeoROWSET must contain ''all'' fields specified in the [[Geographical/Spatial Index#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:
+
The content to be fed into a Geo Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the GeoROWSET schema. This is a very simple schema, declaring that an object (ROW element) should contain an id, start and end X coordinates (x1 is mandatory; x2 is set equal to x1 if not provided) as well as start and end Y coordinates (y1 is mandatory; y2 is set equal to y1 if not provided). In addition, it may contain any number of FIELD elements, each containing a name attribute and information to be stored and perhaps used for [[Index Management Framework#Refinement|refinement]] of a query or [[Index Management Framework#Ranking|ranking]] of results. As with fulltext indices, a row in a GeoROWSET may contain only a subset of the fields specified in the [[Index Management Framework#GeoIndexType|IndexType]]. The following is a simple but valid GeoROWSET containing two objects:
 
<pre>
 
<pre>
<ROWSET>
+
<ROWSET colID="colA" lang="en">
 
     <ROW id="doc1" x1="4321" y1="1234">
 
     <ROW id="doc1" x1="4321" y1="1234">
 
         <FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD>
 
         <FIELD name="StartTime">2001-05-27T14:35:25.523</FIELD>
Line 552: Line 924:
 
</ROWSET>
 
</ROWSET>
 
</pre>
 
</pre>
 +
 +
The attributes colID and lang specify the collection ID and the language of the documents under the <ROWSET> element. The first one is required, while the second one is optional.
  
 
===GeoIndexType===
 
===GeoIndexType===
Which fields should be present in the [[Geographical/Spatial Index#RowSet|RowSet]], and how these fields are to be handled by the Geo Index is specified through a GeoIndexType; an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance, is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list which contains all the fields which should be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:
+
Which fields may be present in the [[Index Management Framework#RowSet_2|RowSet]], and how these fields are to be handled by the Geo Index is specified through a GeoIndexType; an XML document conforming to the GeoIndexType schema. Which GeoIndexType to use for a specific GeoIndex instance, is specified by supplying a GeoIndexType ID during initialization of the GeoIndexManagement resource. A GeoIndexType contains a field list which contains all the fields which can be stored in order to be presented in the query results or used for refinement. The following is a possible IndexType for the type of ROWSET shown above:
  
 
<pre>
 
<pre>
Line 571: Line 945:
 
</pre>
 
</pre>
  
Fields present in the ROWSET but not in the IndexType will be skipped. Fields present in the IndexType but not in a ROW in the ROWSET will cause an exception. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:
+
Fields present in the ROWSET but not in the IndexType will be skipped. The two elements under each "field" element are used to define how that field should be handled. The meaning and expected content of each of them is explained below:
  
 
*'''type''' specifies the data type of the field. Accepted values are:
 
*'''type''' specifies the data type of the field. Accepted values are:
Line 580: Line 954:
 
**FLOAT - A decimal number fitting into a Java "float"
 
**FLOAT - A decimal number fitting into a Java "float"
 
**DOUBLE - A decimal number fitting into a Java "double"
 
**DOUBLE - A decimal number fitting into a Java "double"
**STRING - A string with a maximum length of 40 (or so...)
+
**STRING - A string with a maximum length of 100 (or so...)
 
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.
 
*'''return''' specifies whether the field should be returned in the results from a query. "yes" and "no" are the only accepted values.
  
Line 926: Line 1,300:
 
  }
 
  }
  
 +
 +
-->
 +
 +
<!--
 
=Forward Index=
 
=Forward Index=
 +
 +
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents whose key values fall within the corresponding range. The Forward Index Service follows the same design pattern as the Full Text Index Service. The forward index supports the following schema for each key-value pair:
 +
key; integer, value; string
 +
key; float, value; string
 +
key; string, value; string
 +
key; date, value; string
 +
The schema for an index is given as a parameter when the index is created. The schema must be known in order to be able to build the indices with correct type for each field. The Objects stored in the database can be anything.
 +
 +
 +
==Implementation Overview==
 +
===Services===
 +
The new forward index is implemented through one service. It is implemented according to the Factory pattern:
 +
*The '''ForwardIndexNode Service''' represents an index node. It is used for management, lookup and updating of the node. It consolidates the 3 services that were used in the old Full Text Index.
 +
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:
 +
 +
[[File:ForwardIndexNodeService.png|frame|none|Generic Editor]]
 +
 +
 +
It is actually a wrapper over Couchbase, and each ForwardIndexNode has a 1-1 relationship with a Couchbase Node. For this reason, creating multiple resources of the ForwardIndexNode service is discouraged; the best setup is to have one resource (one node) on each gHN that is part of the cluster.
 +
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.
 +
The cluster distinction is done through a clusterID which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the '''useClusterId''' variable in the '''deploy-jndi-config.xml''' file to true or false, respectively.
 +
Example
 +
 +
<pre>
 +
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" />
 +
</pre>
 +
 +
or
 +
<pre>
 +
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" />
 +
</pre>
 +
 +
 +
Couchbase, which is the underlying technology of the new Forward Index, allows the number of replicas to be configured for each index. This is done by setting the variable '''noReplicas''' in the '''deploy-jndi-config.xml''' file.
 +
Example:
 +
 +
<pre>
 +
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" />
 +
</pre>
 +
 +
 +
After deployment of the service it is important to change some properties in the deploy-jndi-config.xml in order for the service to communicate with the couchbase server. The IP address of the gHN, the port of the couchbase server and the credentials for the couchbase server need to be specified. It is important that all the couchbase servers within the cluster share the same credentials, so that they know how to connect with each other.
 +
 +
Example:
 +
 +
<pre>
 +
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" />
 +
 +
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" />
 +
 +
 +
// should be the same for all nodes in the cluster
 +
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String"  override="false" />
 +
 +
// should be the same for all nodes in the cluster
 +
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" />
 +
</pre>
 +
 +
 +
Total '''RAM''' of each bucket-index can be specified as well in the deploy-jndi-config.xml
 +
 +
Example:
 +
<pre>
 +
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" />
 +
</pre>
 +
 +
 +
The fact that Couchbase is not embedded (unlike ElasticSearch) means that a ForwardIndexNode may stop while the respective Couchbase server keeps running. Also, the initialization of the couchbase server (setting of credentials, port, data_path etc.) needs to be done separately the first time after the couchbase server is installed, so that it can be used by the service. In order to automate these routine processes (and some others), the following bash scripts have been developed and come with the service.
 +
 +
<source lang="bash">
 +
# To initialize a node, run:
 +
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD
 +
 +
# If the service is down, remove the couchbase server from the cluster and reinitialize it so that it can be restarted later:
 +
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD
 +
 +
# If the service is down and you want to delete the bucket (index) in order to rebuild it, run:
 +
$ cb_delete_bucket BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD
 +
</source>
 +
 +
===CQL capabilities implementation===
 +
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:
 +
 +
[[Image:FWDCQL.jpg|frame|center|FWD CQL transformation]]
 +
 +
The Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.
 +
 +
===RowSet===
 +
The content to be fed into an Index must be served as a [[ResultSet Framework|ResultSet]] containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the '''Forward Index Updater''':
 +
 +
The row set "schema"
 +
 +
<pre>
 +
<ROWSET>
 +
<INSERT>
 +
<TUPLE>
 +
<KEY>
 +
<KEYNAME>title</KEYNAME>
 +
<KEYVALUE>sun is up</KEYVALUE>
 +
</KEY>
 +
<KEY>
 +
<KEYNAME>ObjectID</KEYNAME>
 +
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE>
 +
</KEY>
 +
<KEY>
 +
<KEYNAME>gDocCollectionID</KEYNAME>
 +
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE>
 +
</KEY>
 +
<KEY>
 +
<KEYNAME>gDocCollectionLang</KEYNAME>
 +
<KEYVALUE>es</KEYVALUE>
 +
</KEY>
 +
<VALUE>
 +
<FIELD name="title">sun is up</FIELD>
 +
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD>
 +
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD>
 +
<FIELD name="gDocCollectionLang">es</FIELD>
 +
</VALUE>
 +
</TUPLE>
 +
</INSERT>
 +
</ROWSET>
 +
</pre>
 +
 +
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.
 +
 +
==Usage Example==
 +
 +
===Create a ForwardIndex Node, feed and query using the corresponding client library===
 +
 +
 +
<source lang="java">
 +
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();
 +
 +
//Create a resource
 +
CreateResource createResource = new CreateResource();
 +
CreateResourceResponse output = proxyRandomf.createResource(createResource);
 +
 +
//Get the reference
 +
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();
 +
 +
//or get a random reference
 +
// StatefulQuery q = ForwardIndexNodeDSL.getSource().build();
 +
 +
List<EndpointReference> refs = q.fire();
 +
 +
//Get a proxy
 +
try {
 +
  ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();
 +
        //Feed
 +
proxyRandom.feedLocator(locator);
 +
        //Query
 +
proxyRandom.query(query);
 +
} catch (ForwardIndexNodeException e) {
 +
        //Handle the exception
 +
}
 +
 +
</source>
 +
 +
 +
-->
 +
 +
<!--=Forward Index=
 
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents whose key values fall within the corresponding range. The '''Forward Index Service''' follows the same design pattern as the '''Full Text Index''' Service and the '''Geo Index''' Service.
 
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents whose key values fall within the corresponding range. The '''Forward Index Service''' follows the same design pattern as the '''Full Text Index''' Service and the '''Geo Index''' Service.
 
The forward index supports the following schema for each key value pair:
 
The forward index supports the following schema for each key value pair:
Line 957: Line 1,497:
  
 
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]
 
[[Image:BDBFWD.jpg|frame|center|Design of Forward Index Lookup]]
 +
 +
===CQL capabilities implementation===
 +
As stated in the previous section, the design of the Forward Index Lookup enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query is a conjunction of single criteria, and each single criterion refers to a single indexed key of the Forward Index. This transformation is described by the following figure:
 +
 +
[[Image:FWDCQL.jpg|frame|center|FWD Lookup CQL transformation]]
 +
 +
As with the Geo-Spatial Index Lookup, the Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index Lookup.
  
 
===RowSet===
 
===RowSet===
Line 963: Line 1,510:
 
The row set "schema"
 
The row set "schema"
  
<rowset>
+
<pre>
<insert>
+
<ROWSET>
<tuple><key></key><value></value></tuple>
+
<INSERT>
<tuple><key></key><value></value></tuple>
+
<TUPLE>
</insert>
+
<KEY>
<delete>
+
<KEYNAME>title</KEYNAME>
<key></key>
+
<KEYVALUE>sun is up</KEYVALUE>
<key></key>
+
</KEY>
</delete>
+
<KEY>
</rowset>
+
<KEYNAME>ObjectID</KEYNAME>
 +
<KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE>
 +
</KEY>
 +
<KEY>
 +
<KEYNAME>gDocCollectionID</KEYNAME>
 +
<KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE>
 +
</KEY>
 +
<KEY>
 +
<KEYNAME>gDocCollectionLang</KEYNAME>
 +
<KEYVALUE>es</KEYVALUE>
 +
</KEY>
 +
<VALUE>
 +
<FIELD name="title">sun is up</FIELD>
 +
<FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD>
 +
<FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD>
 +
<FIELD name="gDocCollectionLang">es</FIELD>
 +
</VALUE>
 +
</TUPLE>
 +
</INSERT>
 +
</ROWSET>
 +
</pre>
  
The rowset may contain an insert section, a delete section, or both. The key and value pairs (tuples) in the insert section may be repeated one or more times. The key in the delete section may likewise be repeated one or more times.
+
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.
  
 
===Test Client ForwardIndexClient===
 
===Test Client ForwardIndexClient===
Line 1,004: Line 1,571:
 
to create the stateful web services:
 
to create the stateful web services:
  
ForwardIndexManagementService - responsible for holding the list of delta files that
+
* ForwardIndexManagementService : responsible for holding the list of delta files that in sum is the index. The service also relays Notifications from the ForwardIndexUpdaterService to the ForwardIndexLookupService when new delta files must be merged into the index.
                              in sum is the index. The service also relays Notifications
+
                              from the ForwardIndexUpdaterService to the ForwardIndexLookupService
+
                              when new delta files must be merged into the index.
+
ForwardIndexUpdaterService - responsible for creating new delta files with tuples that shall
+
                            be deleted from the index or inserted into the index.
+
  
ForwardIndexLookupService -  responsible for looking up queries and returning the answer.
+
* ForwardIndexUpdaterService : responsible for creating new delta files with tuples that shall be deleted from the index or inserted into the index.
  
The test clients creates one WS - resource of each type, inserts some data into the update, and queries
+
* ForwardIndexLookupService : responsible for looking up queries and returning the answer.
 +
 
 +
The test client creates one WS-Resource of each type, inserts some data into the updater, and queries
 
the data by using the lookup WS resource.
 
the data by using the lookup WS resource.
  
Line 1,036: Line 1,600:
  
 
The result is provided to the client by using the Result Set service.
 
The result is provided to the client by using the Result Set service.
 
+
-->
=Storage Handling layer=
+
<!--=Storage Handling layer=
 
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.
 
The Storage Handling layer is responsible for the actual storage of the index data to the infrastructure. All the index components rely on the functionality provided by this layer in order to store and load their data. The implementation of the Storage Handling layer can be easily modified independently of the way the index components work in order to produce their data. The current implementation splits the index data to chunks called "delta files" and stores them in the Content Management Layer, through the use of the Content Management and Collection Management services.
 
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.
 
The Storage Handling service must be deployed together with any other index. It cannot be invoked directly, since it's meant to be used only internally by the upper layers of the index components hierarchy.
 +
-->
  
 
=Index Common library=
 
=Index Common library=
 
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.
 
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.

Latest revision as of 13:37, 19 June 2014

Contextual Query Language Compliance

The gCube Index Framework consists of the Index Service, which provides both FullText Index and Forward Index capabilities. Both are able to answer CQL queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:

  • Index Service : =, ==, within, >, >=, <=, adj, fuzzy, proximity

Index Service

The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.

The Index Service exposes a REST API, so it can be used from any general-purpose library that supports REST. For example, the following HTTP GET call is used to query the index:

http://{host}/index-service-1.0.0-SNAPSHOT/{resourceID}/query?queryString=((e24f6285-46a2-4395-a402-99330b326fad = tuna) and (((gDocCollectionID == 8dc17a91-378a-4396-98db-469280911b2f)))) project ae58ca58-55b7-47d1-a877-da783a758302
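
The GET call above can be assembled programmatically. The following is a minimal Java sketch, not part of the Index Service client library: the class name `IndexQueryUrl`, the method `build` and the host and resource ID passed in `main` are illustrative assumptions. Only the URL pattern and the CQL query come from the example above. The query is percent-encoded because it contains spaces, parentheses and `=` characters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (an assumption for illustration, not a gCube class).
final class IndexQueryUrl {

    // Builds a query URL following the pattern of the example above.
    // The web app name "index-service-1.0.0-SNAPSHOT" is copied from that example.
    static String build(String host, String resourceId, String cql) {
        // URLEncoder emits form encoding ('+' for spaces); literal '+' signs are
        // already encoded as %2B, so replacing '+' with %20 afterwards is safe.
        String encoded = URLEncoder.encode(cql, StandardCharsets.UTF_8).replace("+", "%20");
        return "http://" + host + "/index-service-1.0.0-SNAPSHOT/" + resourceId
                + "/query?queryString=" + encoded;
    }

    public static void main(String[] args) {
        String cql = "((e24f6285-46a2-4395-a402-99330b326fad = tuna) and "
                + "(((gDocCollectionID == 8dc17a91-378a-4396-98db-469280911b2f)))) "
                + "project ae58ca58-55b7-47d1-a877-da783a758302";
        // "localhost:8080" and "myResourceID" are placeholder values.
        System.out.println(build("localhost:8080", "myResourceID", cql));
    }
}
```

The resulting string can be passed to any HTTP client; whether the service also accepts `+` for spaces is not stated here, so the sketch uses `%20`.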

The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:

<!-- index service web app -->
<groupId>org.gcube.index</groupId>
<artifactId>index-service</artifactId>
<version>...</version>
 
 
<!-- index service commons library -->
<groupId>org.gcube.index</groupId>
<artifactId>index-service-commons</artifactId>
<version>...</version>
 
<!-- index service client library -->
<groupId>org.gcube.index</groupId>
<artifactId>index-service-client-library</artifactId>
<version>...</version>
 
<!-- helper common library -->
<groupId>org.gcube.index</groupId>
<artifactId>indexcommon</artifactId>
<version>...</version>

Implementation Overview

Services

The new index is implemented through one service. It is implemented according to the Factory pattern:

  • The Index Service represents an index node. It is used for management, lookup and updating of the node. It consolidates the 3 services that were used in the old Full Text Index.

The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:

[Figure: Generic Editor]

It is actually a wrapper over ElasticSearch, and each IndexNode has a 1-1 relationship with an ElasticSearch Node. For this reason, creating multiple resources of the IndexNode service is discouraged; the best setup is to have one resource (one node) on each container that is part of the cluster.

Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.

The cluster distinction is done through a clusterID which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the defaultSameCluster variable in the deploy.properties file to true or false, respectively.

Example

defaultSameCluster=true

or

defaultSameCluster=false

ElasticSearch, which is the underlying technology of the new Index Service, allows the number of replicas and shards to be configured for each index. This is done by setting the variables noReplicas and noShards in the deploy.properties file.

Example:

noReplicas=1
noShards=2


Highlighting is supported by the Full Text Index (as it was in the old Full Text Index). If highlighting is enabled, the index returns a snippet of the text that matches the query, taken from the presentable fields. This snippet is usually a concatenation of several matching fragments from those fields. The maximum size of each fragment and the maximum number of fragments used to construct a snippet can be configured by setting the variables maxFragmentSize and maxFragmentCnt, respectively, in the deploy.properties file:

Example:

maxFragmentCnt=5
maxFragmentSize=80


The folder where the index data are stored can be configured by setting the variable dataDir in the deploy.properties file (if the variable is not set, the default location is the folder from which the container runs).

Example :

dataDir=./data

Whether to use the Resource Registry (for translating field ids to field names) is configured through the variable useRRAdaptor in the deploy.properties file.

Example :

useRRAdaptor=true


Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable resourcesFoldername in the deploy.properties file.

Example :

resourcesFoldername=./resources/index

Finally, the hostname of the node as well as the port and the scope that the node is running on have to be set in the variables hostname, port and scope in the deploy.properties file.

Example :

hostname=dl015.madgik.di.uoa.gr
port=8080
scope=/gcube/devNext
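
The settings described in this section all live in one deploy.properties file. As a rough illustration of how such a file can be read, the sketch below parses the example values with the standard `java.util.Properties` class. The class `DeployPropertiesSketch` and its methods are assumptions made for this example; how the Index Service itself loads the file is not shown in this page.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical reader for deploy.properties (illustration only).
final class DeployPropertiesSketch {

    // Parses properties text; in a real deployment the file would be read from the classpath.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Reads an integer property and fails loudly if it is missing.
    static int requiredInt(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        // Keys and values are the examples given in this section.
        String text = String.join("\n",
                "defaultSameCluster=true",
                "noReplicas=1",
                "noShards=2",
                "maxFragmentCnt=5",
                "maxFragmentSize=80",
                "dataDir=./data",
                "useRRAdaptor=true",
                "resourcesFoldername=./resources/index",
                "hostname=dl015.madgik.di.uoa.gr",
                "port=8080",
                "scope=/gcube/devNext");
        Properties props = parse(text);
        System.out.println("shards = " + requiredInt(props, "noShards"));
        System.out.println("same cluster = " + Boolean.parseBoolean(props.getProperty("defaultSameCluster")));
        System.out.println("scope = " + props.getProperty("scope"));
    }
}
```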

CQL capabilities implementation

The Full Text Index uses Lucene as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation into Lucene. This transformation is explained through the following examples:

CQL triple | Explanation | Lucene equivalent
title adj "sun is up" | documents with this phrase in their title | title:"sun is up"
title fuzzy "invorvement" | documents with words "similar" to invorvement in their title | title:invorvement~
allIndexes = "italy" (documents have 2 fields: title and abstract) | documents with the word italy in some of their fields | title:italy OR abstract:italy
title proximity "5 sun up" | documents with the words sun, up inside an interval of 5 words in their title | title:"sun up"~5
date within "2005 2008" | documents with a date between 2005 and 2008 | date:[2005 TO 2008]

In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR and NOT (AND-NOT) connections between single criteria. Thus, to transform a complete CQL query into a Lucene query, we first transform the CQL triples and then connect them with the corresponding AND, OR and NOT operators.
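
The triple-by-triple transformation can be sketched in a few lines of Java. This is an illustration of the mapping in the examples above, not the Index Service code: the class `CqlTripleToLucene`, its method names, and the handling of `=` on a single named field are assumptions, and the sketch does not escape Lucene special characters in terms.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical translator for single CQL triples (illustration only).
final class CqlTripleToLucene {

    // allFields lists the index fields that "allIndexes" expands to.
    static String translate(String index, String relation, String term, List<String> allFields) {
        switch (relation) {
            case "adj":
                // exact phrase
                return index + ":\"" + term + "\"";
            case "fuzzy":
                return index + ":" + term + "~";
            case "proximity": {
                // term is "<distance> <word> <word> ..."
                int space = term.indexOf(' ');
                String distance = term.substring(0, space);
                String words = term.substring(space + 1);
                return index + ":\"" + words + "\"~" + distance;
            }
            case "within": {
                // term is "<lower> <upper>"
                String[] bounds = term.split(" ");
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            case "=":
                if (index.equals("allIndexes")) {
                    return allFields.stream()
                            .map(field -> field + ":" + term)
                            .collect(Collectors.joining(" OR "));
                }
                // Assumption: "=" on a single field maps to a plain term query.
                return index + ":" + term;
            default:
                throw new IllegalArgumentException("Unsupported relation: " + relation);
        }
    }

    // Connects two translated criteria with a CQL boolean operator (and, or, not).
    static String combine(String left, String operator, String right) {
        return "(" + left + ") " + operator.toUpperCase(Locale.ROOT) + " (" + right + ")";
    }
}
```

For instance, `translate("title", "proximity", "5 sun up", fields)` yields `title:"sun up"~5`, matching the corresponding example above.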

RowSet

The content to be fed into an Index must be served as a ResultSet containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should consist of any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:

<ROWSET idxType="IndexTypeName" colID="colA" lang="en">
    <ROW>
        <FIELD name="ObjectID">doc1</FIELD>
        <FIELD name="title">How to create an Index</FIELD>
        <FIELD name="contents">Just read the WIKI</FIELD>
    </ROW>
    <ROW>
        <FIELD name="ObjectID">doc2</FIELD>
        <FIELD name="title">How to create a Nation</FIELD>
        <FIELD name="contents">Talk to the UN</FIELD>
        <FIELD name="references">un.org</FIELD>
    </ROW>
</ROWSET>

The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.
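
A ROWSET of this shape can be generated with plain string building, as in the hedged Java sketch below. The class `RowsetSketch` and its methods are hypothetical and only mirror the rules stated above (idxType and colID required, lang optional, an "ObjectID" field in every row); wrapping the result into a ResultSet is outside the scope of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical ROWSET builder (illustration only, not a gCube class).
final class RowsetSketch {

    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Each row maps field names to field text; lang may be null because it is optional.
    static String build(String idxType, String colId, String lang, List<Map<String, String>> rows) {
        StringBuilder xml = new StringBuilder();
        xml.append("<ROWSET idxType=\"").append(escape(idxType))
           .append("\" colID=\"").append(escape(colId)).append('"');
        if (lang != null) {
            xml.append(" lang=\"").append(escape(lang)).append('"');
        }
        xml.append(">\n");
        for (Map<String, String> row : rows) {
            if (!row.containsKey("ObjectID")) {
                throw new IllegalArgumentException("Every ROW needs an ObjectID field");
            }
            xml.append("    <ROW>\n");
            for (Map.Entry<String, String> field : row.entrySet()) {
                xml.append("        <FIELD name=\"").append(escape(field.getKey())).append("\">")
                   .append(escape(field.getValue())).append("</FIELD>\n");
            }
            xml.append("    </ROW>\n");
        }
        xml.append("</ROWSET>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc1 = new LinkedHashMap<>();
        doc1.put("ObjectID", "doc1");
        doc1.put("title", "How to create an Index");
        doc1.put("contents", "Just read the WIKI");
        System.out.print(build("IndexTypeName", "colA", "en", List.of(doc1)));
    }
}
```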

IndexType

How the different fields in the ROWSET should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:

    <index-type>
        <field-list>
            <field name="title">
                <index>yes</index>
                <store>yes</store>
                <return>yes</return>
                <tokenize>yes</tokenize>
                <sort>no</sort>
                <highlightable>yes</highlightable>
                <boost>1.0</boost>
            </field>
            <field name="contents">
                <index>yes</index>
                <store>yes</store>
                <return>yes</return>
                <tokenize>yes</tokenize>
                <sort>no</sort>
                <boost>1.0</boost>
            </field>
            <field name="references">
                <index>yes</index>
                <store>no</store>
                <return>no</return>
                <tokenize>yes</tokenize>
                <sort>no</sort>
                <highlightable>no</highlightable> <!-- will not be included in the highlight snippet -->
                <boost>1.0</boost>
            </field>
            <field name="gDocCollectionID">
                <index>yes</index>
                <store>yes</store>
                <return>yes</return>
                <tokenize>yes</tokenize>
                <sort>no</sort>
                <boost>1.0</boost>
            </field>
            <field name="gDocCollectionLang">
                <index>yes</index>
                <store>yes</store>
                <return>yes</return>
                <tokenize>yes</tokenize>
                <sort>no</sort>
                <boost>1.0</boost>
            </field>
        </field-list>
    </index-type>

Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element are used to define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:

  • index: specifies whether the field should be indexed (i.e. whether the index should look for hits within this field).
  • store: specifies whether the field should be stored in its original format, so that it can be returned in the results from a query.
  • return: specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices.)
  • highlightable: specifies whether a returned field should be included in the highlight snippet. If not specified, every returned field will be included in the snippet.
  • tokenize: not used.
  • sort: not used.
  • boost: not used.
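
Two of the rules above interact: a field must be stored to be returned, and a returned field is part of the highlight snippet unless highlightable is explicitly "no". The following Java sketch models just these two rules. The class `IndexTypeFieldSketch` is a hypothetical illustration and not how the Index Service represents an IndexType internally.

```java
// Hypothetical model of one <field> entry of an IndexType (illustration only).
final class IndexTypeFieldSketch {
    final String name;
    final boolean index;
    final boolean store;
    final boolean returned;      // the <return> element ("return" is a Java keyword)
    final Boolean highlightable; // null when the element is absent

    IndexTypeFieldSketch(String name, boolean index, boolean store,
                         boolean returned, Boolean highlightable) {
        this.name = name;
        this.index = index;
        this.store = store;
        this.returned = returned;
        this.highlightable = highlightable;
    }

    // Converts the "yes"/"no" text of an element to a boolean.
    static boolean yesNo(String value) {
        if ("yes".equals(value)) {
            return true;
        }
        if ("no".equals(value)) {
            return false;
        }
        throw new IllegalArgumentException("Expected \"yes\" or \"no\" but got: " + value);
    }

    // A field must have been stored to be returned.
    void validate() {
        if (returned && !store) {
            throw new IllegalStateException("Field '" + name + "' is returned but not stored");
        }
    }

    // Returned fields are part of the snippet unless highlightable is explicitly "no".
    boolean includedInSnippet() {
        return returned && (highlightable == null || highlightable);
    }
}
```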

We currently have five standard index types, loosely based on the available metadata schemas. However any data can be indexed using each, as long as the RowSet follows the IndexType:

  • index-type-default-1.0 (DublinCore)
  • index-type-TEI-2.0
  • index-type-eiDB-1.0
  • index-type-iso-1.0
  • index-type-FT-1.0

Query language

The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.


Deployment Instructions

In order to deploy and run Index Service on a node we will need the following:

  • index-service-{version}.war
  • smartgears-distribution-{version}.tar.gz (to publish the running instance of the service on the IS and be discoverable)
    • see here for installation
  • an application server (such as Tomcat, JBoss, Jetty)

There are a few things that need to be configured in order for the service to be functional. All the service configuration is done in the file deploy.properties that comes within the service war. Typically, this file should be loaded in the classpath so that it can be read. The default location of this file (in the exploded war) is webapps/service/WEB-INF/classes.

The hostname of the node as well as the port and the scope that the node is running on have to be set in the variables hostname, port and scope in the deploy.properties file.

Example :

hostname=dl015.madgik.di.uoa.gr
port=8080
scope=/gcube/devNext

Finally, Resource Registry should be configured to not run in client mode. This is done in the deploy.properties by setting:

clientMode=false

NOTE: the default values of the resourcesFoldername and dataDir properties are relative paths. In some cases these values may be evaluated relative to the folder from which the container was started, so to avoid problems related to this behavior it is better to set these properties to absolute paths.
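
To see what a relative value would resolve to on a given node, the standard `java.nio.file` API can be used, as in this small sketch. The class `AbsolutePathSketch` is an illustrative assumption; the resolution rule it shows (relative paths resolve against the JVM working directory) is standard Java behavior, which is why the result depends on where the container was started.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper showing how relative configuration paths resolve (illustration only).
final class AbsolutePathSketch {

    // Resolves a configured path against the current working directory of the JVM.
    static Path toAbsolute(String configured) {
        return Paths.get(configured).toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        // Default-style relative values from deploy.properties.
        System.out.println("dataDir -> " + toAbsolute("./data"));
        System.out.println("resourcesFoldername -> " + toAbsolute("./resources/index"));
    }
}
```

The printed paths can be copied into deploy.properties so that the values no longer depend on the start folder.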

Usage Example

Create an Index Service Node, feed and query using the corresponding client library

The following example demonstrates the usage of the IndexClient and IndexServiceClient. Both are created according to the Builder pattern.

final String scope = "/gcube/devNext";

// create a factory client for the given scope (an endpoint can be provided as an extra filter)
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();

// create a new Index Service resource in the cluster "myClusterID"
indexFactoryClient.createResource("myClusterID", scope);

// create a client for the same scope (endpoint, resourceID, collectionID, clusterID and indexID can be provided as extra filters)
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();

try {
    // locator: locator of the ResultSet with the ROWSETs to feed (assumed to be defined earlier)
    indexClient.feedLocator(locator);
    // query: a CQL query string (assumed to be defined earlier)
    indexClient.query(query);
} catch (IndexException e) {
    // handle the exception
}



Index Common library

The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.