Index Management Framework
Contextual Query Language Compliance
The gCube Index Framework consists of the Index Service, which provides both Full Text Index and Forward Index capabilities. Both are able to answer CQL queries. The CQL relations that each of them supports depend on the underlying technologies. The mechanisms for answering CQL queries, using the internal design and technologies, are described later for each case. The supported relations are:
- Index Service : =, ==, within, >, >=, <=, adj, fuzzy, proximity
Index Service
The Index Service is responsible for providing quick full text data retrieval and forward index capabilities in the gCube environment.
The Index Service consists of a few components that are available in our Maven repositories with the following coordinates:
<!-- index service web app -->
<groupId>org.gcube.index</groupId>
<artifactId>index-service</artifactId>
<version>...</version>

<!-- index service commons library -->
<groupId>org.gcube.index</groupId>
<artifactId>index-service-commons</artifactId>
<version>...</version>

<!-- index service client library -->
<groupId>org.gcube.index</groupId>
<artifactId>index-service-client-library</artifactId>
<version>...</version>

<!-- helper common library -->
<groupId>org.gcube.index</groupId>
<artifactId>indexcommon</artifactId>
<version>...</version>
Implementation Overview
Services
The new index is implemented through one service. It is implemented according to the Factory pattern:
- The Index Service represents an index node. It is used for management, lookup and updating of the node. It consolidates the 3 services that were used in the old Full Text Index.
The following illustration shows the information flow and responsibilities for the different services used to implement the Index Service:
It is actually a wrapper over ElasticSearch and each IndexNode has a 1-1 relationship with an ElasticSearch Node. For this reason, creating multiple resources of the IndexNode service is discouraged; the best case is to have one resource (one node) on each container that makes up the cluster.
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Full Text Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the defaultSameCluster variable in the deploy.properties file to true or false respectively.
Example
defaultSameCluster=true
or
defaultSameCluster=false
ElasticSearch, which is the underlying technology of the new Index Service, allows the number of replicas and shards to be configured for each index. This is done by setting the variables noReplicas and noShards in the deploy.properties file.
Example:
noReplicas=1
noShards=2
Highlighting is supported by the Full Text Index (as it was in the old Full Text Index). If highlighting is enabled, the index returns a snippet of the text that matches the query, taken from the presentable fields. This snippet is usually a concatenation of several matching fragments from those fields. The maximum size of each fragment, as well as the maximum number of fragments used to construct a snippet, can be configured by setting the variables maxFragmentSize and maxFragmentCnt respectively in the deploy.properties file:
Example:
maxFragmentCnt=5
maxFragmentSize=80
The folder where the data of the index are stored can be configured by setting the variable dataDir in the deploy-jndi-config.xml file (if the variable is not set, the default location is the folder from which the container runs).
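The deploy.properties values shown above can be read and sanity-checked with the standard java.util.Properties class. The following is only a minimal sketch: the class name IndexDeployConfig and the lower bounds it enforces (0 for noReplicas, 1 for the other three) are assumptions made for illustration, not part of the Index Service code base.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper (not part of the Index Service): reads the four
// tuning variables described above from deploy.properties-style text.
final class IndexDeployConfig {
    final int noReplicas;
    final int noShards;
    final int maxFragmentCnt;
    final int maxFragmentSize;

    private IndexDeployConfig(Properties p) {
        // The lower bounds are assumptions chosen for this sketch.
        noReplicas = intAtLeast(p, "noReplicas", 0);
        noShards = intAtLeast(p, "noShards", 1);
        maxFragmentCnt = intAtLeast(p, "maxFragmentCnt", 1);
        maxFragmentSize = intAtLeast(p, "maxFragmentSize", 1);
    }

    // Parses key=value lines, one property per line.
    static IndexDeployConfig parse(String propertiesText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalArgumentException("unreadable properties", e);
        }
        return new IndexDeployConfig(p);
    }

    // Fails loudly on a missing, non-numeric or too-small value.
    private static int intAtLeast(Properties p, String key, int min) {
        String raw = p.getProperty(key);
        if (raw == null) {
            throw new IllegalArgumentException("missing property: " + key);
        }
        int value = Integer.parseInt(raw.trim());
        if (value < min) {
            throw new IllegalArgumentException(key + " must be >= " + min);
        }
        return value;
    }
}
```

No default values are filled in, because the defaults used by the service itself are not stated here; a missing variable is reported instead.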
Example :
dataDir=./data
Whether or not the Resource Registry is used (for the translation of field ids to field names) is configured through the variable useRRAdaptor in the deploy-jndi-config.xml.
Example :
useRRAdaptor=true
Since the Index Service creates resources for each Index instance that is running (instead of running multiple Running Instances of the service), the folder where the instances will be persisted locally has to be set in the variable resourcesFoldername in the deploy-jndi-config.xml.
Example :
resourcesFoldername=/tmp/resources/index
Finally, the hostname of the node, as well as the scope that the node is running on, have to be set in the variables hostname and scope in the deploy-jndi-config.xml.
Example :
hostname=dl015.madgik.di.uoa.gr
scope=/gcube/devNext
CQL capabilities implementation
The Full Text Index uses Lucene as its underlying technology. A CQL Index-Relation-Term triple has a straightforward transformation in Lucene. This transformation is explained through the following examples:
CQL triple | explanation | lucene equivalent |
---|---|---|
title adj "sun is up" | documents with this phrase in their title | title:"sun is up" |
title fuzzy "invorvement" | documents with words "similar" to invorvement in their title | title:invorvement~ |
allIndexes = "italy" (documents have 2 fields; title and abstract) | documents with the word italy in some of their fields | title:italy OR abstract:italy |
title proximity "5 sun up" | documents with the words sun, up inside an interval of 5 words in their title | title:"sun up"~5 |
date within "2005 2008" | documents with a date between 2005 and 2008 | date:[2005 TO 2008] |
In a complete CQL query, the triples are connected with boolean operators. Lucene supports AND, OR and NOT (AND-NOT) connections between single criteria. Thus, in order to transform a complete CQL query to a Lucene query, we first transform the CQL triples and then connect them with the corresponding AND, OR and NOT operators.
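The rows of the table above can be reproduced with a few lines of string handling. The sketch below is illustrative only: the class CqlTripleToLucene and its translate method are hypothetical names, it produces plain Lucene query strings rather than using the Lucene API, it covers only the relations shown in the table, and it does no escaping of special characters.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: maps one CQL index-relation-term triple to a
// Lucene query string, following the examples in the table above.
final class CqlTripleToLucene {

    // allFields lists the fields that "allIndexes" expands to.
    static String translate(String index, String relation, String term, List<String> allFields) {
        if ("allIndexes".equals(index)) {
            // Search every field and OR the per-field queries together.
            return allFields.stream()
                    .map(field -> translate(field, relation, term, allFields))
                    .collect(Collectors.joining(" OR "));
        }
        switch (relation) {
            case "=":
                return index + ":" + term;
            case "adj":
                // Phrase query.
                return index + ":\"" + term + "\"";
            case "fuzzy":
                return index + ":" + term + "~";
            case "proximity": {
                // The first token of the term is the word distance.
                String[] parts = term.trim().split("\\s+", 2);
                if (parts.length < 2) {
                    throw new IllegalArgumentException("proximity needs a distance and words");
                }
                int distance = Integer.parseInt(parts[0]);
                return index + ":\"" + parts[1] + "\"~" + distance;
            }
            case "within": {
                // The term holds the lower and upper bound.
                String[] bounds = term.trim().split("\\s+");
                if (bounds.length != 2) {
                    throw new IllegalArgumentException("within needs exactly two bounds");
                }
                return index + ":[" + bounds[0] + " TO " + bounds[1] + "]";
            }
            default:
                throw new IllegalArgumentException("unsupported relation: " + relation);
        }
    }
}
```

A complete query would then be assembled by joining the translated triples with the boolean operators of the original CQL query.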
RowSet
The content to be fed into an Index must be served as a ResultSet containing XML documents conforming to the ROWSET schema. This is a very simple schema, declaring that a document (ROW element) should contain any number of FIELD elements, each with a name attribute and the text to be indexed for that field. The following is a simple but valid ROWSET containing two documents:
<ROWSET idxType="IndexTypeName" colID="colA" lang="en">
  <ROW>
    <FIELD name="ObjectID">doc1</FIELD>
    <FIELD name="title">How to create an Index</FIELD>
    <FIELD name="contents">Just read the WIKI</FIELD>
  </ROW>
  <ROW>
    <FIELD name="ObjectID">doc2</FIELD>
    <FIELD name="title">How to create a Nation</FIELD>
    <FIELD name="contents">Talk to the UN</FIELD>
    <FIELD name="references">un.org</FIELD>
  </ROW>
</ROWSET>
The attributes idxType and colID are required and specify the Index Type that the Index must have been created with, and the collection ID of the documents under the <ROWSET> element. The lang attribute is optional, and specifies the language of the documents under the <ROWSET> element. Note that for each document a required field is the "ObjectID" field that specifies its unique identifier.
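A ROWSET of this shape can be generated programmatically. The following is a minimal sketch; RowSetWriter is a hypothetical helper and not part of the index client library. It emits the required idxType and colID attributes, omits lang when none is given, and rejects rows without the required ObjectID field.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper: serialises documents into the ROWSET format
// described above.
final class RowSetWriter {

    // Escapes the characters that are special in XML text and attributes.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    // Each map is one document (ROW); keys are field names. lang may be null.
    static String rowSet(String idxType, String colID, String lang,
                         List<Map<String, String>> rows) {
        StringBuilder xml = new StringBuilder();
        xml.append("<ROWSET idxType=\"").append(escape(idxType))
           .append("\" colID=\"").append(escape(colID)).append('"');
        if (lang != null) {
            xml.append(" lang=\"").append(escape(lang)).append('"');
        }
        xml.append(">\n");
        for (Map<String, String> row : rows) {
            if (!row.containsKey("ObjectID")) {
                throw new IllegalArgumentException("every ROW needs an ObjectID field");
            }
            xml.append("  <ROW>\n");
            for (Map.Entry<String, String> field : row.entrySet()) {
                xml.append("    <FIELD name=\"").append(escape(field.getKey()))
                   .append("\">").append(escape(field.getValue()))
                   .append("</FIELD>\n");
            }
            xml.append("  </ROW>\n");
        }
        return xml.append("</ROWSET>").toString();
    }
}
```

Passing a LinkedHashMap per document keeps the fields in insertion order, which makes the output easier to compare with hand-written samples.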
IndexType
How the different fields in the ROWSET should be handled by the Index, and how the different fields in an Index should be handled during a query, is specified through an IndexType; an XML document conforming to the IndexType schema. An IndexType contains a field list which contains all the fields which should be indexed and/or stored in order to be presented in the query results, along with a specification of how each of the fields should be handled. The following is a possible IndexType for the type of ROWSET shown above:
<index-type>
  <field-list>
    <field name="title" lang="en">
      <index>yes</index>
      <store>yes</store>
      <return>yes</return>
      <tokenize>yes</tokenize>
      <sort>no</sort>
      <boost>1.0</boost>
    </field>
    <field name="contents" lang="en">
      <index>yes</index>
      <store>no</store>
      <return>no</return>
      <tokenize>yes</tokenize>
      <sort>no</sort>
      <boost>1.0</boost>
    </field>
    <field name="references" lang="en">
      <index>yes</index>
      <store>no</store>
      <return>no</return>
      <tokenize>yes</tokenize>
      <sort>no</sort>
      <boost>1.0</boost>
    </field>
    <field name="gDocCollectionID">
      <index>yes</index>
      <store>yes</store>
      <return>yes</return>
      <tokenize>yes</tokenize>
      <sort>no</sort>
      <boost>1.0</boost>
    </field>
    <field name="gDocCollectionLang">
      <index>yes</index>
      <store>yes</store>
      <return>yes</return>
      <tokenize>yes</tokenize>
      <sort>no</sort>
      <boost>1.0</boost>
    </field>
  </field-list>
</index-type>
Note that the fields "gDocCollectionID" and "gDocCollectionLang" are always required because, by default, all documents will have a collection ID and a language ("unknown" if no collection is specified). Fields present in the ROWSET but not in the IndexType will be skipped. The elements under each "field" element define how that field should be handled, and they should contain either "yes" or "no". The meaning of each of them is explained below:
- index
- specifies whether the specific field should be indexed or not (i.e. whether the index should look for hits within this field)
- store
- specifies whether the field should be stored in its original format to be returned in the results from a query.
- return
- specifies whether a stored field should be returned in the results from a query. A field must have been stored to be returned. (This element is not available in the currently deployed indices)
- tokenize
- specifies whether the field should be tokenized. Should usually contain "yes".
- sort
- Not used
- boost
- Not used
For more complex content types, one can also specify sub-fields as in the following example:
<index-type>
  <field-list>
    <field name="contents">
      <index>yes</index>
      <store>no</store>
      <return>no</return>
      <tokenize>yes</tokenize>
      <sort>no</sort>
      <boost>1.0</boost>
      <!-- subfields of contents -->
      <field name="title">
        <index>yes</index>
        <store>yes</store>
        <return>yes</return>
        <tokenize>yes</tokenize>
        <sort>no</sort>
        <boost>1.0</boost>
        <!-- subfields of title which itself is a subfield of contents -->
        <field name="bookTitle">
          <index>yes</index>
          <store>yes</store>
          <return>yes</return>
          <tokenize>yes</tokenize>
          <sort>no</sort>
          <boost>1.0</boost>
        </field>
        <field name="chapterTitle">
          <index>yes</index>
          <store>yes</store>
          <return>yes</return>
          <tokenize>yes</tokenize>
          <sort>no</sort>
          <boost>1.0</boost>
        </field>
      </field>
      <field name="foreword">
        <index>yes</index>
        <store>yes</store>
        <return>yes</return>
        <tokenize>yes</tokenize>
        <sort>no</sort>
        <boost>1.0</boost>
      </field>
      <field name="startChapter">
        <index>yes</index>
        <store>yes</store>
        <return>yes</return>
        <tokenize>yes</tokenize>
        <sort>no</sort>
        <boost>1.0</boost>
      </field>
      <field name="endChapter">
        <index>yes</index>
        <store>yes</store>
        <return>yes</return>
        <tokenize>yes</tokenize>
        <sort>no</sort>
        <boost>1.0</boost>
      </field>
    </field>
    <!-- not a subfield -->
    <field name="references">
      <index>yes</index>
      <store>no</store>
      <return>no</return>
      <tokenize>yes</tokenize>
      <sort>no</sort>
      <boost>1.0</boost>
    </field>
  </field-list>
</index-type>
Querying the field "contents" in an index using this IndexType would return hits in all its sub-fields, which is all fields except "references". Querying the field "title" would return hits in both "bookTitle" and "chapterTitle" in addition to hits in the "title" field. Querying the field "startChapter" would only return hits from "startChapter", since this field does not contain any sub-fields. Please be aware that using sub-fields adds extra fields to the index, and therefore uses more disk space.
We currently have five standard index types, loosely based on the available metadata schemas. However, any data can be indexed using each of them, as long as the RowSet follows the IndexType:
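The sub-field behaviour described above amounts to expanding a queried field into itself plus all of its descendants. The sketch below shows that expansion on the example IndexType; SubFieldExpansion is a hypothetical class, and it assumes that field names are unique and that the hierarchy has no cycles. How the index resolves sub-fields internally is not specified here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of sub-field expansion: a query on a field also
// searches every field nested below it.
final class SubFieldExpansion {
    private final Map<String, List<String>> children = new HashMap<>();

    // Registers a field together with its direct sub-fields.
    SubFieldExpansion addField(String name, String... subFields) {
        children.put(name, Arrays.asList(subFields));
        return this;
    }

    // Returns the field followed by all of its descendants, depth-first.
    List<String> expand(String field) {
        List<String> result = new ArrayList<>();
        collect(field, result);
        return result;
    }

    private void collect(String field, List<String> result) {
        result.add(field);
        for (String child : children.getOrDefault(field, Collections.<String>emptyList())) {
            collect(child, result);
        }
    }
}
```

With the hierarchy of the example (contents containing title, foreword, startChapter and endChapter, and title containing bookTitle and chapterTitle), expanding "title" yields title, bookTitle and chapterTitle, while expanding "startChapter" yields only startChapter.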
- index-type-default-1.0 (DublinCore)
- index-type-TEI-2.0
- index-type-eiDB-1.0
- index-type-iso-1.0
- index-type-FT-1.0
Query language
The Full Text Index receives CQL queries and transforms them into Lucene queries. Queries using wildcards will not return usable query statistics.
Deployment Instructions
In order to deploy and run Index Service on a node we will need the following:
- index-service-....war
- smartgears (to publish the running instance of the service on the IS and be discoverable)
- an application container (such as Tomcat, JBoss, Jetty)
Before starting the application service, we should provide the configuration needed by the Resource Registry. This configuration should be placed in the file $CATALINA/conf/infrastructure.properties and the variables that need to be set are: infrastructure, scopes and clientMode (clientMode should be set to false).
Example :
infrastructure=gcube
scopes=devNext
clientMode=false
Usage Example
Create an Index Service Node, feed and query using the corresponding client library
The following example demonstrates the usage of the IndexFactoryClient and the IndexClient. Both are created according to the Builder pattern.
final String scope = "/gcube/devNext";

// create a factory client for the given scope (an endpoint can be provided as an extra filter)
IndexFactoryClient indexFactoryClient = new IndexFactoryClient.Builder().scope(scope).build();
indexFactoryClient.createResource("myClusterID", scope);

// create a client for the same scope (endpoint, resourceID, collectionID, clusterID and indexID can be provided as extra filters)
IndexClient indexClient = new IndexClient.Builder().scope(scope).build();

try {
    // locator and query are assumed to have been defined by the caller
    indexClient.feedLocator(locator);
    indexClient.query(query);
} catch (IndexException e) {
    // handle the exception
}
Forward Index
The Forward Index provides storage and retrieval capabilities for documents that consist of key-value pairs. It is able to answer one-dimensional (referring to one key only) or multi-dimensional (referring to many keys) range queries, retrieving documents whose key values fall within the corresponding range. The design of the Forward Index Service is similar to that of the Full Text Index Service. The Forward Index supports the following schema for each key-value pair:
- key: integer, value: string
- key: float, value: string
- key: string, value: string
- key: date, value: string
The schema for an index is given as a parameter when the index is created. The schema must be known in order to build the indices with the correct type for each field. The objects stored in the database can be anything.
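A multi-dimensional range query can be pictured as a set of per-key ranges that a document must satisfy simultaneously. The following sketch illustrates that idea in memory only; RangeQuerySketch is a hypothetical class, the bounds are assumed to be inclusive, and nothing here reflects how the Forward Index actually evaluates queries against its storage.

```java
import java.util.Map;

// Hypothetical in-memory illustration of one- and multi-dimensional
// range queries over typed keys (integer, float, string or date).
final class RangeQuerySketch {

    // True if low <= value <= high; all three must share one Comparable type.
    @SuppressWarnings("unchecked")
    static boolean inRange(Object value, Object low, Object high) {
        Comparable<Object> v = (Comparable<Object>) value;
        return v.compareTo(low) >= 0 && v.compareTo(high) <= 0;
    }

    // keys:   the indexed key values of one document
    // ranges: key name -> {low, high}; one entry is a one-dimensional
    //         query, several entries form a multi-dimensional query
    static boolean matches(Map<String, Object> keys, Map<String, Object[]> ranges) {
        for (Map.Entry<String, Object[]> range : ranges.entrySet()) {
            Object value = keys.get(range.getKey());
            if (value == null) {
                return false; // document has no value for this key
            }
            if (!inRange(value, range.getValue()[0], range.getValue()[1])) {
                return false;
            }
        }
        return true;
    }
}
```

The comparison relies on the natural ordering of the key type, which is why the schema has to declare the type of each key.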
Implementation Overview
Services
The new forward index is implemented through one service. It is implemented according to the Factory pattern:
- The ForwardIndexNode Service represents an index node. It is used for management, lookup and updating of the node. It consolidates the 3 services that were used in the old Forward Index.
The following illustration shows the information flow and responsibilities for the different services used to implement the Forward Index:
It is actually a wrapper over Couchbase and each ForwardIndexNode has a 1-1 relationship with a Couchbase Node. For this reason, creating multiple resources of the ForwardIndexNode service is discouraged; the best case is to have one resource (one node) on each gHN that makes up the cluster.
Clusters can be created in almost the same way that a group of lookups and updaters and a manager were created in the old Forward Index (using the same indexID). Having multiple clusters within a scope is feasible but discouraged, because it is usually better to have one large cluster than multiple small clusters.
The cluster distinction is done through a clusterID, which is either the same as the indexID or the scope. The deployer of the service can choose between these two by setting the value of the useClusterId variable in the deploy-jndi-config.xml file to true or false respectively.
Example
<environment name="useClusterId" value="false" type="java.lang.Boolean" override="false" />
or
<environment name="useClusterId" value="true" type="java.lang.Boolean" override="false" />
Couchbase, which is the underlying technology of the new Forward Index, allows the number of replicas to be configured for each index. This is done by setting the variable noReplicas in the deploy-jndi-config.xml file.
Example:
<environment name="noReplicas" value="1" type="java.lang.Integer" override="false" />
After deployment of the service, it is important to change some properties in the deploy-jndi-config.xml so that the service can communicate with the Couchbase server. The IP address of the gHN, the port of the Couchbase server and the credentials for the Couchbase server need to be specified. It is important that all the Couchbase servers within the cluster share the same credentials, so that they know how to connect with each other.
Example:
<environment name="couchbaseIP" value="127.0.0.1" type="java.lang.String" override="false" />
<environment name="couchbasePort" value="8091" type="java.lang.String" override="false" />
<!-- should be the same for all nodes in the cluster -->
<environment name="couchbaseUsername" value="Administrator" type="java.lang.String" override="false" />
<!-- should be the same for all nodes in the cluster -->
<environment name="couchbasePassword" value="mycouchbase" type="java.lang.String" override="false" />
The total RAM of each bucket (index) can be specified as well in the deploy-jndi-config.xml.
Example:
<environment name="ramQuota" value="512" type="java.lang.Integer" override="false" />
The fact that Couchbase is not embedded (as ElasticSearch is) means that a ForwardIndexNode may stop while the respective Couchbase server is still running. Also, the initialization of the Couchbase server (setting of credentials, port, data_path etc.) needs to be done separately the first time after the Couchbase server is installed, so that it can be used. In order to automate these routine processes (and some others), the following bash scripts have been developed and come with the service.
# Initialize the node:
$ cb_initialize_node.sh HOSTNAME PORT USERNAME PASSWORD

# If the service is down, remove the Couchbase server from the cluster as well and reinitialize it in order to restart it later:
$ cb_remove_node.sh HOSTNAME PORT USERNAME PASSWORD

# If the service is down and you want for some reason to delete the bucket (index) in order to rebuild it:
$ cb_delete_bucket BUCKET_NAME HOSTNAME PORT USERNAME PASSWORD
CQL capabilities implementation
As stated in the previous section, the design of the Forward Index enables the efficient execution of range queries that are conjunctions of single criteria. An initial CQL query is transformed into an equivalent one that is a union of range queries. Each range query will be a conjunction of single criteria. A single criterion will refer to a single indexed key of the Forward Index. This transformation is described by the following figure:
The Forward Index Lookup will exploit any possibilities to eliminate parts of the query that will not produce any results, by applying "cut-off" rules. The merge operation of the range queries' results is performed internally by the Forward Index.
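Rewriting a boolean query into a union of conjunctions corresponds to the standard conversion to disjunctive normal form, in which AND is distributed over OR. The sketch below shows that textbook conversion and is an assumption about the general shape of the transformation, not the Forward Index implementation: UnionOfRangeQueries and its node classes are hypothetical, criteria are kept as opaque strings, and neither NOT nor the "cut-off" rules are modelled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: rewrites an AND/OR tree of single criteria into
// a union (outer list) of conjunctions (inner lists).
final class UnionOfRangeQueries {
    interface Node {}

    static final class Criterion implements Node {
        final String text;
        Criterion(String text) { this.text = text; }
    }

    static final class And implements Node {
        final Node left;
        final Node right;
        And(Node left, Node right) { this.left = left; this.right = right; }
    }

    static final class Or implements Node {
        final Node left;
        final Node right;
        Or(Node left, Node right) { this.left = left; this.right = right; }
    }

    static List<List<String>> toUnion(Node node) {
        if (node instanceof Criterion) {
            // A single criterion is a union of one one-element conjunction.
            return Collections.singletonList(
                    Collections.singletonList(((Criterion) node).text));
        }
        if (node instanceof Or) {
            // OR: concatenate the conjunctions of both sides.
            Or or = (Or) node;
            List<List<String>> union = new ArrayList<>(toUnion(or.left));
            union.addAll(toUnion(or.right));
            return union;
        }
        if (node instanceof And) {
            // AND: combine every left conjunction with every right one.
            And and = (And) node;
            List<List<String>> union = new ArrayList<>();
            for (List<String> l : toUnion(and.left)) {
                for (List<String> r : toUnion(and.right)) {
                    List<String> merged = new ArrayList<>(l);
                    merged.addAll(r);
                    union.add(merged);
                }
            }
            return union;
        }
        throw new IllegalArgumentException("unsupported node: " + node);
    }
}
```

For a query of the form (a OR b) AND c, this produces two conjunctions, [a, c] and [b, c], each of which could be executed as one range query before the results are merged.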
RowSet
The content to be fed into an Index must be served as a ResultSet containing XML documents conforming to the ROWSET schema. This is a simple schema, containing key and value pairs. The following is an example of a ROWSET that can be fed into the Forward Index Updater:
The row set "schema"
<ROWSET>
  <INSERT>
    <TUPLE>
      <KEY>
        <KEYNAME>title</KEYNAME>
        <KEYVALUE>sun is up</KEYVALUE>
      </KEY>
      <KEY>
        <KEYNAME>ObjectID</KEYNAME>
        <KEYVALUE>cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</KEYVALUE>
      </KEY>
      <KEY>
        <KEYNAME>gDocCollectionID</KEYNAME>
        <KEYVALUE>dfe59f60-f876-11dd-8103-acc6e633ea9e</KEYVALUE>
      </KEY>
      <KEY>
        <KEYNAME>gDocCollectionLang</KEYNAME>
        <KEYVALUE>es</KEYVALUE>
      </KEY>
      <VALUE>
        <FIELD name="title">sun is up</FIELD>
        <FIELD name="ObjectID">cms://dfe59f60-f876-11dd-8103-acc6e633ea9e/16b96b70-f877-11dd-8103-acc6e633ea9e</FIELD>
        <FIELD name="gDocCollectionID">dfe59f60-f876-11dd-8103-acc6e633ea9e</FIELD>
        <FIELD name="gDocCollectionLang">es</FIELD>
      </VALUE>
    </TUPLE>
  </INSERT>
</ROWSET>
The <KEY> elements specify the indexable information, while the <FIELD> elements under the <VALUE> element specify the presentable information.
Usage Example
Create a ForwardIndex Node, feed and query using the corresponding client library
ForwardIndexNodeFactoryCLProxyI proxyRandomf = ForwardIndexNodeFactoryDSL.getForwardIndexNodeFactoryProxyBuilder().build();

// Create a resource
CreateResource createResource = new CreateResource();
CreateResourceResponse output = proxyRandomf.createResource(createResource);

// Get the reference of the created index
StatefulQuery q = ForwardIndexNodeDSL.getSource().withIndexID(output.IndexID).build();

// or get a random reference instead:
// StatefulQuery q = ForwardIndexNodeDSL.getSource().build();

List<EndpointReference> refs = q.fire();

// Get a proxy
try {
    ForwardIndexNodeCLProxyI proxyRandom = ForwardIndexNodeDSL.getForwardIndexNodeProxyBuilder().at((W3CEndpointReference) refs.get(0)).build();

    // Feed (locator is assumed to have been defined by the caller)
    proxyRandom.feedLocator(locator);

    // Query (query is assumed to have been defined by the caller)
    proxyRandom.query(query);
} catch (ForwardIndexNodeException e) {
    // Handle the exception
}
Index Common library
The Index Common library is another component which is meant to be used internally by the other index components. It provides some common functionality required by the various index types, such as interfaces, XML parsing utilities and definitions of some common attributes of all indices. The jar containing the library should be deployed on every node where at least one index component is deployed.