Revision as of 18:49, 1 September 2008

Search Intro

The Search Framework consists of three major component categories: Search Master Service, Search Library and Search Operators. The first two categories are atomic in the sense that they consist of a single entity. The third one is a family of gCube services which expose various search functionalities.

The Search Master Service provides the access point to the gCube Search Engine. It receives a user query from the Portal and, along with some environment information received from the Information Service (IS), initiates the query processing procedure. This procedure produces an ExecutionPlan, which is fed to the appropriate Execution Engine Connector which, in turn, forwards it to its corresponding execution engine (external to the Search Framework).

The Search Library is an all-in-one bundle that incorporates all the query processing, as well as the actual implementation of the Search Operators. The most critical subcomponents are: Query Processor Bundle (QueryObject, QueryParser, QueryPlanner, QueryOptimizer), Search Operator Library and the eXecution ENgine API (XENA). We will analyze these components here.

Finally, the Search Operators bundle is decomposed into several physical packages offering low level data processing elements. It is a family of gCube services, each of which is dedicated to a specific data processing facility. As previously mentioned, the SearchOperators bundle is a thin wrapper over the SearchLibrary which provides the actual implementation.

In the following paragraphs we shall present these three components along with their constituent subcomponents, and attempt to analyze their inner mechanics and relationships with the rest of the gCore/gCube context.

Query Processing Chain

Query Expression --> Search Master --> Query Parser --> Query Planner --> XENA --> Execution Engine --> Results

((Under construction))

Search from the User Perspective: Querying

In general, a query is a form of questioning in a line of inquiry; a statement of information needs, typically keywords combined with boolean operators and other modifiers; a specification of a result to be calculated from one or more sets of data. In the gCube environment, a query object contains the information for a simple or complex search operation on one or more collections of data. The principle behind the Query is that for each search operation there is a respective query node class. So, for example, there is a Join class which represents the Join search operation. Every class of the Query framework is a sub-class of SearchOperation. In this way we can build query trees, in which a query node has one or more query nodes as its children. The root of this query tree is handled by the Query class, which provides methods for (de)serializing a query and locating a specific query node in the tree. In order to express the appropriate information, a query must describe the following elements:

Search operation: Every search operation corresponds to a search service, which implements it. Available operations are:
- FieldedSearch
- FullTextSearch
- Join
- KeepTop
- Merge
- Project
- QueryExternalSource
- SimilaritySearch
- SpatialSearch
- Conditional
- Sort
- TransformResultSet

Source Selection: The Source Selection part is responsible for identifying the resources on which the Query should be performed, defining the collections against which the query criteria should be executed. There are also provisions for the case in which the submitted query is to be performed against the ResultSet computed by a previous query. Since the ResultSet is treated as a WS-Resource, this is done by passing the End Point Reference of the previous result set in the Metadata Source subpart of the Source definition section of any following query.
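
The query tree described above can be illustrated with a short sketch. It is written in Python for brevity and is not the gCube Java implementation: `SearchOperation`, `Join` and `Query` echo the names used in the text, while the fields of each node and the `to_dict` and `find` methods are assumptions made only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SearchOperation:
    """Base class of every query node; children are the node's sources."""
    children: List["SearchOperation"] = field(default_factory=list)

    def to_dict(self) -> dict:
        # Hypothetical serialization: node type plus serialized children.
        return {"op": type(self).__name__,
                "children": [child.to_dict() for child in self.children]}


@dataclass
class FieldedSearch(SearchOperation):
    expression: str = ""   # e.g. "creator contains 'dorothea'"
    source: str = ""       # source string identifier


@dataclass
class Join(SearchOperation):
    key: str = ""
    mode: str = "inner"


@dataclass
class Query:
    """Holds the root of the query tree."""
    root: SearchOperation

    def find(self, op_type: type) -> Optional[SearchOperation]:
        # Depth-first, left-to-right search for the first node of a type.
        stack = [self.root]
        while stack:
            node = stack.pop()
            if isinstance(node, op_type):
                return node
            stack.extend(reversed(node.children))
        return None
```

A join of two fielded searches would then be built as `Query(root=Join(children=[left, right], key="DocID"))`, and `query.find(FieldedSearch)` returns the left-most fielded search node.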

Available Operations

Operation	Semantics
project	Perform projection over a result set on a set of elements. By default, all header data (DocID, RankID, CollID) are kept in the projected result set; that is, they need not be specified. If the projected term set is empty, then only the header data are kept.
sort	Perform a sort operation on a result set based on an element of that result set, either ascending (ASC) or descending (DESC).
merge	Concatenate two or more result sets. The records are non-deterministically concatenated in the final result set.
join	Join two result sets on a join key, using one of the available modes (currently only 'inner' is implemented). The semantics of the inner join is similar to the respective operation found in the relational algebra, with the difference that in our case, only the join key from the left RS is kept and joined with the right payload.
fielded search	Keep only those records of a source which conform to a selection expression. This expression is a relation between a key and a value. The key is an element of the result set and the relation is one of: ==, !=, >, <, <=, >=, contains. The 'contains' relation refers to whether a string is a substring of another string. With this comparison function, one can use the wildcard '*', which means any sequence of characters. We distinguish these cases: '*ad*' can match any of 'ad', 'add', 'mad', 'ladder'. '*der' can match 'der' or 'ladder', but not 'derm' or 'ladders'. 'ad*' can match 'ad' or 'additional', but not 'mad' or 'ladder'. 'ad' can only match 'ad'. If we search on a text field, then contains refers to any of its constituent words. For example, if we search on the field title, whose value is 'the rain in spain stays mainly in the plane', then the matching criterion '*ain*' refers to any of 'rain', 'spain', 'mainly'. If the predicate is '==' then we search for an exact match of the whole field; that is, in the previous example, title == 'stays' won't succeed. Predicates can be combined with ORs and ANDs (currently under development). The source of this operation can be either a result set generated by another search operation or a content source. In the latter case, you should use a source string identifier.
full text search	Perform a full text search on a given source based on a set of terms. The full text search source must be a string identifier of the content source. Each full text search term may contain a single word or multiple words. In both cases, all terms are combined with a logical AND. In the second case, if a term is e.g. 'hello nasty', we search for the words 'hello' and 'nasty', with the latter following the former, as stated in the term; text that does not contain this exact succession of the two words won't match the search criteria. Another feature of fulltextsearch is lemmatization: the terms are processed and a set of related words is generated and also used in the full text search.
filter by xpath, xslt, math	Perform a low level XPath or XSLT operation on a result set. The math type refers to a mathematical language and is used by advanced users who are acquainted with that language. For more details about the semantics and syntax of that language, please see the documentation for the ResultSetScanner service, which implements this language.
keep top	Keep only a given number of records of a result set.
retrieve metadata	Retrieve ALL metadata associated to the search results of a previous search operation.
read	Read a result set endpoint reference, in order to process it. This operation can be used for further processing the results of a previous search operation.
external search (deprecated)	Perform a search on an external (diligent-disabled) source. Currently, Google, RDBMS and the OSIRIS infrastructures can be queried. Depending on the source, the query string can vary. As far as Google is concerned, the query string must conform to the query language of Google. In the case of RDBMS, the query must have the following form in order to be executed successfully: <root> <driverName>your jdbc driver</driverName> <connectionString>your jdbc connection string</connectionString> <query>your sql query</query> </root> Finally, in the OSIRIS case, the query string must have the following format: <root> <collection>your osiris collection</collection> <imageURL>your image URL to be searched for similar images</imageURL> <numberOfResults>the number of results</numberOfResults> </root>
similarity search	Perform a similarity search on a source for a multimedia content (currently, only images). The image URL is defined, along with the source string identifier and pairs of feature and weight.
spatial search	Perform a classic spatial search against a user-defined shape (a polygon, to be exact) and a spatial relation (contains, crosses, disjoint, equals, inside, intersect, overlaps, touches).
conditional search	Classic If-Then-Else construct. The hypothesis clause involves the (potentially aggregated) value of one or more fields which are part of the result of previous search operation(s). The central predicate involves a comparison of two clauses, which are combinations (with the basic math functions +, -, *, /) of these values.
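
The wildcard behaviour of the 'contains' relation in the fielded search row can be mimicked with a few lines of Python. This is only a sketch of the semantics as described in the table, not the operator's actual code: the function name `contains` is hypothetical, words are assumed to be separated by whitespace, and the case-insensitive comparison is an assumption.

```python
import re


def contains(field_value: str, pattern: str) -> bool:
    """Return True if any word of field_value matches the wildcard pattern.

    '*' stands for any sequence of characters; the absence of '*' at either
    end of the pattern acts as a word delimiter.
    """
    # Escape the literal parts and turn every '*' into the regex '.*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Assumption: matching is done word by word and ignores case.
    return any(re.fullmatch(regex, word, flags=re.IGNORECASE)
               for word in field_value.split())
```

With this sketch, `contains("ladder", "*der")` holds while `contains("ladders", "*der")` does not, and `'*ain*'` matches the title 'the rain in spain stays mainly in the plane' through the words 'rain', 'spain' and 'mainly'.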

Syntax

  <function> ::= <project_fun> | <sort_fun> | <filter_fun> | <merge_fun> | <join_fun> | <keeptop_fun> | <fulltexts_fun> | 
     <fieldedsearch_fun> | <extsearch_fun> | <read_fun> | <similsearch_fun> | <spatialsearch_fun> | <retrieve_metadata_fun>

  <read_fun> ::= <read_fun_name> <epr>
  <read_fun_name> ::= 'read'
  <epr> ::= string

  <project_fun> ::= <project_fun_name> <by> <project_key> <project_source>
  <project_fun_name> ::= 'project'
  <project_key> ::= string
  <project_source> ::= <non_leaf_source>

  <sort_fun> ::= <sort_fun_name> <sort_order> <by> <sort_key> <sort_source>
  <sort_fun_name> ::= 'sort'
  <sort_key> ::= string
  <sort_order> ::= 'ASC' | 'DESC'
  <sort_source> ::= <non_leaf_source>

  <filter_fun> ::= <filter_fun_name> <filter_type> <by> <filter_statement> <filter_source>
  <filter_fun_name> ::= 'filter'
  <filter_type> ::= string
  <filter_statement> ::= string
  <filter_source> ::= <non_leaf_source> | <leaf_source>

  <merge_fun> ::= <merge_fun_name> <on> <merge_sources>
  <merge_fun_name> ::= 'merge'
  <merge_sources> ::= <merge_source> <and> <merge_source> <merge_sources2>
  <merge_sources2> ::= <and> <merge_source> <merge_sources2> | φ
  <merge_source> ::= <left_parenthesis> <function> <right_parenthesis>

  <join_fun> ::= <join_fun_name> <join_type> <by> <join_key> <on> <join_source> <and> <join_source>
  <join_fun_name> ::= 'join'
  <join_key> ::= string
  <join_type> ::= 'inner' | 'fullOuter' | 'leftOuter' | 'rightOuter'
  <join_source> ::= <left_parenthesis> <function> <right_parenthesis>

  <keeptop_fun> ::= <keeptop_fun_name> <keeptop_number> <keeptop_source>
  <keeptop_fun_name> ::= 'keeptop'
  <keeptop_number> ::= integer
  <keeptop_source> ::= <non_leaf_source>

  <fulltexts_fun> ::= <fulltexts_fun_name> <by> <fulltexts_term> <fulltexts_terms> <in> <language> <on> <fulltexts_sources>
  <fulltexts_fun_name> ::= 'fulltextsearch'
  <fulltexts_terms> ::= <comma> <fulltexts_term> <fulltexts_terms> | φ
  <fulltexts_sources> ::= <fulltexts_source> <fulltexts_sources_2>
  <fulltexts_sources_2> ::= <comma> <fulltexts_source> <fulltexts_sources_2> | φ
  <fulltexts_source> ::= string

  <fieldedsearch_fun> ::= <fieldedsearch_fun_name> <by> <query> <fieldedsearch_source>
  <fieldedsearch_fun_name> ::= 'fieldedsearch'
  <query> ::= string
  <fieldedsearch_source> ::= <non_leaf_source> | <leaf_source>

  <extsearch_fun> ::= <extsearch_fun_name> <by> <extsearch_query> <on> <extsearch_source>
  <extsearch_fun_name> ::= 'externalsearch'
  <extsearch_query> ::= string
  <extsearch_source> ::= string

  <similsearch_fun> ::= <similsearch_fun_name> <as> <URL> <by> <pair> <pairs> <similarity_source>
  <similsearch_fun_name> ::= 'similaritysearch'
  <URL> ::= string
  <pair> ::= <feature> <equal> <weight>
  <pairs> ::= <and> <pair> <pairs> | φ
  <similarity_source> ::= <leaf_source>

  <if-syntax> ::= <if> <left_parenthesis> <function-st> <compare-sign> <function-st> <right_parenthesis> <then> <search-op> <else> <search-op>
  <compare-sign> ::= '==' | '>' | '<' | '>=' | '<='
  <function-st> ::= <left-op> <math-op> <right-op> | <left-op>
  <math-op> ::= '+' | '-' | '*' | '/'
  <left-op> ::= <function> <left_parenthesis> <left-op> <right_parenthesis> | <literal>
  <function> ::= <max-fun> | <min-fun> | <sum-fun> | <av-fun> | <var-fun> | <size-fun>
  <max-fun> ::= 'max' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <min-fun> ::= 'min' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <sum-fun> ::= 'sum' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <av-fun> ::= 'av' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <var-fun> ::= 'var' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <size-fun> ::= 'size' <left_parenthesis> <search-op> <right_parenthesis>
  <right-op> ::= <function-st> | <left-op>
  <element> ::= an element of the result set payload (either XML element, or XML attribute)

  <retrieve_metadata_fun> ::= <rm_fun_name> <in> <language> <on> <rm_source> <as> <schema>
  <rm_fun_name> ::= 'retrievemetadata'
  <schema> ::= string
  <rm_source> ::= <left_parenthesis> <function> <right_parenthesis>

  <spatialsearch_fun> ::= <spatialsearch_fun_name> <relation> <geometry> [<timeBoundary>] <spatial_source>
  <spatialsearch_fun_name> ::= 'spatialsearch'
  <relation> ::= {'intersects', 'contains', 'isContained'}
  <geometry> ::= <polygon_name> <left_parenthesis> <points> <right_parenthesis>
  <polygon_name> ::= 'polygon'
  <timeBoundary> ::= 'within' <startTime> <stopTime>
  <startTime> ::= double
  <stopTime> ::= double
  <spatial_source> ::= <leaf_source>
  <points> ::= <point> {<comma> <point>}+
  <point> ::= <x> <y>
  <x> ::= long
  <y> ::= long

   <leaf_source> ::= [<in> <language>] <on> <source> [<as> <schema>]
   <non_leaf_source>  ::= <on> <left_parenthesis> <function> <right_parenthesis>

----

   <language>  ::= 'AFRIKAANS' | 'ARABIC' | 'AZERI' | 'BYELORUSSIAN' | 'BULGARIAN' | 'BANGLA' | 'BRETON' | 'BOSNIAN' | 'CATALAN' | 
      'CZECH' | 'WELSH' |    'DANISH' | 'GERMAN' | 'GREEK' | 'ENGLISH' | 'ESPERANTO' | 'SPANISH' | 'ESTONIAN' | 'BASQUE' | 'FARSI' |
      'FINNISH' | 'FAEROESE' | 'FRENCH' | 'FRISIAN' | 'IRISH_GAELIC' | 'GALICIAN' | 'HAUSA' | 'HEBREW' | 'HINDI' | 'CROATIAN' | 
      'HUNGARIAN' | 'ARMENIAN' | 'INDONESIAN' | 'ICELANDIC' | 'ITALIAN' | 'JAPANESE' | 'GEORGIAN' | 'KAZAKH' | 'GREENLANDIC' | 'KOREAN' |
      'KURDISH' | 'KIRGHIZ' | 'LATIN' | 'LETZEBURGESCH' | 'LITHUANIAN' | 'LATVIAN' | 'MAORI' | 'MONGOLIAN' | 'MALAY' | 'MALTESE' |
      'NORWEGIAN_BOKMAAL' | 'DUTCH' | 'NORWEGIAN_NYNORSK' | 'POLISH' | 'PASHTO' | 'PORTUGUESE' | 'RHAETO_ROMANCE' | 'ROMANIAN' | 'RUSSIAN' | 
      'SAMI_NORTHERN' | 'SLOVAK' | 'SLOVENIAN' | 'ALBANIAN' | 'SERBIAN' | 'SWEDISH' | 'SWAHILI' | 'TAMIL' | 'THAI' | 'FILIPINO' | 'TURKISH' |
      'UKRAINIAN' | 'URDU' | 'UZBEK' | 'VIETNAMESE' | 'SORBIAN' | 'YIDDISH' | 'CHINESE_SIMPLIFIED' | 'CHINESE_TRADITIONAL' | 'ZULU'
   <source> ::= string
   <schema>  ::= string
   <left_parenthesis> ::= '('
   <right_parenthesis> ::= ')'
   <comma> ::= ','
   <and> ::= 'and'
   <on> ::= 'on'
   <as> ::= 'as'
   <by> ::= 'by'
   <sort_by> ::= 'sort'
   <from> ::= 'from'
   <if> ::= 'if'
   <then> ::= 'then'
   <else> ::= 'else'
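
To show how the grammar above maps onto code, here is a minimal recursive-descent parser for a small subset of it (`sort`, `keeptop` and `fulltextsearch` only). It is a sketch in Python, not the gCube Query Parser: the class and function names and the dictionary-based tree are assumptions, and literals are accepted in quotes because that is how the examples on this page write them.

```python
import re

# Quoted strings, parentheses, commas and bare words; other input is ignored.
TOKEN = re.compile(r"'[^']*'|[(),]|\w+")


class Parser:
    def __init__(self, text):
        self.tokens = TOKEN.findall(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, got {token!r}")
        self.pos += 1
        return token

    def string(self):
        token = self.take()
        if not token.startswith("'"):
            raise SyntaxError(f"expected quoted string, got {token!r}")
        return token[1:-1]

    def non_leaf_source(self):
        # <non_leaf_source> ::= <on> '(' <function> ')'
        self.take("on")
        self.take("(")
        node = self.function()
        self.take(")")
        return node

    def function(self):
        name = self.take()
        if name == "sort":
            order = self.take().strip("'")
            if order not in ("ASC", "DESC"):
                raise SyntaxError(f"bad sort order {order!r}")
            self.take("by")
            key = self.string()
            return {"op": "sort", "order": order, "key": key,
                    "source": self.non_leaf_source()}
        if name == "keeptop":
            number = int(self.take().strip("'"))
            return {"op": "keeptop", "number": number,
                    "source": self.non_leaf_source()}
        if name == "fulltextsearch":
            self.take("by")
            terms = [self.string()]
            while self.peek() == ",":
                self.take(",")
                terms.append(self.string())
            self.take("in")
            language = self.string()
            self.take("on")
            source = self.string()
            schema = None
            if self.peek() == "as":
                self.take("as")
                schema = self.string()
            return {"op": "fulltextsearch", "terms": terms,
                    "language": language, "source": source, "schema": schema}
        raise SyntaxError(f"unsupported function {name!r}")


def parse(text):
    parser = Parser(text)
    tree = parser.function()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected trailing token {parser.peek()!r}")
    return tree
```

Each grammar rule becomes one method, and nesting in the query text turns into nesting in the returned dictionaries, which is the same shape as the query tree handled by the Query class.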

== Examples ==

{| border="1"
|+ '''Example 1'''
|-
! User Request
| Give me back all documents whose metadata contain the word ''woman'' from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''>
|-
! Actual Query
| <pre>fulltextsearch by 'woman' in 'ENGLISH' on '0a952bf0-fa44-11db-aab8-f715cb72c9ff' as 'dc'</pre>
|-
! Explanation
| We perform the ''fulltextsearch'' operation, using the term ''woman'', in the data source identified by the language ''ENGLISH'', source number ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'' and schema ''dc''.
|}

----

{| border="1"
|+ '''Example 2'''
|-
! User Request
| Give me back all documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''> that are created by dorothea; that is, the creator's name contains the separate word dorothea, e.g. ''Hemans, Felicia Dorothea Browne''
|-
! Actual Query
| <pre>fieldedsearch by 'creator' contains 'dorothea' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc'</pre>
|-
! Explanation
| We perform the ''fieldedsearch'' operation in the data source identified by the language ''ENGLISH'', source number ''568a5220-fa43-11db-82de-905c553f17c3'' and schema ''dc'', and retrieve only those documents whose creator's name contains the word 'dorothea'. CAUTION: This does not cover creator names such as 'abcdorothea'. In this case, users should use the wildcard '*'. The absence of '*' implies a string delimiter. E.g. '*dorothea' matches 'abcdorothea' but not 'dorotheas', while 'dorothea*' matches 'dorotheas' but not 'abcdorothea'. Another critical issue is the data source identifier. Example 1 and Example 2 refer to the 'A Celebration of Women Writers' collection. However, in Example 1 we refer to the metadata of this collection, whereas in Example 2 we refer to the actual content.
|}

----

{| border="1"
|+ '''Example 3'''
|-
! User Request
| Give me back the ''creator'' and ''subject'' of the first 10 documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''>, sorted by the ''DocID'' field, whose creator's name contains ''ro''.
|-
! Actual Query
| <pre>project by 'creator', 'subject' on 
   (keeptop '10' on 
      (sort 'ASC' by 'DocID' on 
         (fieldedsearch by 'creator' contains '*ro*' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')))</pre>
|-
! Explanation
| First of all, we perform the fieldedsearch operation in the data source identified by the language ''ENGLISH'', source number ''568a5220-fa43-11db-82de-905c553f17c3'' and schema ''dc'', and retrieve only those documents whose creator's name contains 'ro'. On this result set, we apply the sort operation on the ''DocID'' field. Then, we apply the keep top operation, in order to keep only the first 10 sorted documents. Finally, we apply the project operation, keeping only the ''creator'' and ''subject'' fields.
|}

----

{| border="1"
|+ '''Example 4'''
|-
! User Request
| Perform a spatial search against the collection identified by the triplet <''ENGLISH'',''6cbe79b0-fbe0-11db-857b-d6a400c8bdbb'',''eiDB''>, defining a search rectangle (0,0), (0,50), (50,50), (50,0).
|-
! Actual Query
| <pre>spatialsearch contains polygon(0 0, 0 50, 50 50, 50 0) 
   in 'ENGLISH' on '6cbe79b0-fbe0-11db-857b-d6a400c8bdbb' as 'eiDB'</pre>
|-
! Explanation
| Search in the collection identified by the triplet <''ENGLISH'',''6cbe79b0-fbe0-11db-857b-d6a400c8bdbb'',''eiDB''> for records that define geometric shapes which include the rectangle identified by the points {(0,0), (0,50), (50,50), (50,0)}.
|}

----

{| border="1"
|+ '''Example 5'''
|-
! User Request
| Give me back all documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''> that are created by dorothea and all documents from the same collection whose title contains the word ''woman''
|-
! Actual Query
| <pre>merge on 
    (fieldedsearch by 'creator' contains 'dorothea' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc') 
and (fieldedsearch by 'title' contains 'woman' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')</pre>
|-
! Explanation
| This is an example of how the user can merge the results of more than one subquery.
|}

----

{| border="1"
|+ '''Example 6'''
|-
! User Request
| Give me back all documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''> that are created by dorothea AND whose description contains the word ''London''
|-
! Actual Query
| <pre>join inner by 'DocID' on 
    (fieldedsearch by 'description' contains 'London' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc') 
and (fieldedsearch by 'creator' contains 'dorothea' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')</pre>
|-
! Explanation
| This is an example of how the user can perform a join (logical AND) of subqueries. We perform the ''join'' operation on the field ''DocID'', which is the unique document identifier. In this way, only documents that are members of the result sets of both subqueries participate in the final result set.
|}
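
The inner join used in Example 6 can be sketched in a few lines of Python. This is one reading of the semantics given in the Available Operations table (only the join key of the left result set is kept and combined with the right payload), not the JoinInnerOperator code; the record layout as dictionaries and the function name `inner_join` are assumptions.

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]


def inner_join(left: Iterable[Record], right: Iterable[Record],
               key: str) -> List[Record]:
    """Join two result sets on `key`, keeping the right-hand payload."""
    # Index the right result set by join key for constant-time lookups.
    right_by_key: Dict[str, List[Record]] = {}
    for record in right:
        right_by_key.setdefault(record[key], []).append(record)

    joined = []
    for record in left:
        for match in right_by_key.get(record[key], []):
            # Left contributes only the key; the payload comes from the right.
            joined.append({**match, key: record[key]})
    return joined
```

Joining on 'DocID' therefore returns exactly those documents that appear in both subquery results, which is why the join acts as a logical AND.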

----


= Search Orchestrator (Search Master Service)=

The Search Master Service (SMS) is the main entry point to the functionality of the search engine. It contains the elements that will organize the execution of the search operators for the various tasks the Search engine is responsible for.

There are two steps in achieving the necessary Query processing before the Query can be forwarded to the Search engine for the actual execution and evaluation of the net results. The first step is the transformation of the abstract Query terms into a series of WS invocations. The output of this step is an enriched execution plan mapping the abstract Query to a workflow of service invocations. These invocations are calls to services providing basic search functionality, called Search Operators. The second step is the optimization of the calculated execution plan.

The SMS is responsible for the first stage of query processing. This stage produces a query execution plan, which in the gCube implementation is a directed acyclic graph of Search Operator invocations. This element is responsible for gathering the whole set of information that is expected to be needed by the various search services and provides it as context to the processed query. In this manner, delays for gathering info at the various services are significantly reduced and assist responsiveness.
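
As an illustration of an execution plan expressed as a directed acyclic graph, the sketch below orders operator invocations so that each one runs after the invocations whose result sets it consumes. It uses Python's standard `graphlib` module (Python 3.9 or later); the node labels and the dictionary representation are hypothetical and do not reflect the actual ExecutionPlan format.

```python
from graphlib import TopologicalSorter

# Each key is an operator invocation; its value is the set of invocations
# whose result sets it consumes. The labels are hypothetical.
plan = {
    "fieldedsearch_creator": set(),
    "fieldedsearch_title": set(),
    "merge": {"fieldedsearch_creator", "fieldedsearch_title"},
    "keeptop": {"merge"},
}


def invocation_order(dag):
    """Return an order in which every operator runs after its inputs."""
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(dag).static_order())
```

The two fielded searches have no dependency on each other, so an execution engine is free to run them in parallel; only `merge` and `keeptop` are constrained to come afterwards.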

The information gathered is produced by various components or services of the gCube Infrastructure. They include the gCube Information Service (IS), Content and Metadata Management, Indexing service etc. The process of gathering all needed information proves to be very time consuming. To this end, the SMS keeps a cache of previously discovered information and state. 
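
A cache of this kind can be sketched as follows. The text does not say how the SMS cache is keyed or when entries are refreshed, so the time-based expiry, the class name `TtlCache` and the `loader` callback are all assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TtlCache:
    """Cache values for a fixed time so slow lookups are not repeated."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]            # still fresh: skip the slow lookup
        value = loader()               # e.g. a query to the IS
        self._entries[key] = (now, value)
        return value
```

A call such as `cache.get("collections", load_collections)` would invoke the hypothetical `load_collections` only when no fresh entry exists.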

The SMS validates the received Query using SearchLibrary elements. It validates the user supplied query against the elements of the specific Virtual Organisation (VO). This ensures that content collections are available, metadata elements (e.g. fields) are present, operators (i.e. services) are accessible etc. Afterwards it performs a number of preprocessing steps, invoking functionality offered by services such as the Query Personalisation and the DIR (former Content Source Selection) service, in order to refine the context of the search or inject extra information into the query. These are specializations of the general Query Preprocessor Element. An ordering of Query Preprocessor calls is necessary in the case where they might inject conflicting information. Otherwise, a method for weighting the importance of the source of the conflicting information is necessary. Furthermore, a number of exceptions may occur during the operation of a preprocessor, as during the normal flow of any operation. The difference is that, although useful in the context of gCube, preprocessors are not necessary for a Search execution. So errors during Query Preprocessing must not stop the search flow of execution.
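
The rule that preprocessing errors must not stop the search can be sketched like this. The code is illustrative Python rather than the SMS implementation: the function `run_preprocessors`, the dictionary used as query context and the returned list of warnings are assumptions.

```python
import logging
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger("search.preprocessing")

QueryContext = Dict[str, object]
Preprocessor = Callable[[QueryContext], QueryContext]


def run_preprocessors(query: QueryContext,
                      preprocessors: List[Preprocessor]
                      ) -> Tuple[QueryContext, List[str]]:
    """Apply preprocessors in the given order, downgrading failures to warnings."""
    warnings: List[str] = []
    for preprocessor in preprocessors:
        try:
            # Work on a copy so a failing step cannot leave partial changes.
            query = preprocessor(dict(query))
        except Exception as exc:
            message = f"{preprocessor.__name__}: {exc}"
            warnings.append(message)
            logger.warning("preprocessor skipped: %s", message)
    return query, warnings
```

Because the list is applied in order, a later preprocessor overrides an earlier one when both set the same piece of information, which is one simple way to resolve the conflicts mentioned above.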

The above statement is a sub case of the more general need of a framework for defining fatal errors and warnings. During the entire Search operation a number of errors and/or warnings may emerge. Depending on the context in which they appear, they may have great or no significance to the overall progress of execution. Currently, these cases are handled separately but a uniform management may come into play as the specifications of each service’s needs in the grand plan of the execution become more apparent at a low enough level of detail.

After the above pre-processing steps are completed successfully, the SMS dispatches a QueryPlanner thread to create the Query Execution Plan. Its job is first to map the provided Query, as enriched by the preprocessors, to a concrete workflow of WS invocations. Subsequently, the Query Planner uses the information encapsulated in the provided Query, the information gathered by the SMS about the available gCube environment and a set of transformation rules to come up with a near optimal plan. When certain conditions are met (e.g. the Query Planner has finished, time has elapsed, all plans have been evaluated), the planner returns the best plan calculated to its caller. If more than one Query Planner is utilized, the plans calculated by each Query Planner are gathered by the SMS. It then chooses the overall optimal plan and passes it to a suitable execution engine, where execution and scheduling are achieved in a generic manner. The actual integration with the available execution engines and the formalization of their interaction with the SMS is accomplished through the introduction of the eXecution ENgine API (XENA), which is thoroughly analyzed in the Search Library section. Through this formal methodology, the SMS is able to select among the various available engines, such as the Process Execution Service, the Internal Execution Engine or any other WS-workflow engine. These engines are free to enforce their own optimization strategies, as long as they respect the semantic invariants dictated by the original Execution Plan.

Finally, the SMS receives the final ResultSet from the execution engine and passes its end point reference back to the requestor.

== DL Description ==
Through the Search Master, external services can receive a structured overview of the VO resources available and usable during a search operation. An example of this summary is shown below:

   <SearchConfig>
     <collections>
       <collection name="Example Collection Name 1" id="1fc1fbf0-fa3c-11db-82de-905c553f17c3">
         <TYPE>DATA</TYPE>
         <ASSOCIATEDWITH>d510a060-fa3c-11db-aa91-f715cb72c9ff</ASSOCIATEDWITH>
         <ASSOCIATEDWITH>g45612f7-dth5-23fg-45df-45dfg5b1r34s</ASSOCIATEDWITH>
       </collection>
       <collection name="Example Collection Name 2" id="c3f685b0-fdb6-11db-a573-e4518f2111ab">
         <TYPE>DATA</TYPE>
         <ASSOCIATEDWITH>7bb87410-fdb7-11db-8476-f715cb72c9ff</ASSOCIATEDWITH>
         <INDEX>FEATURE</INDEX>
       </collection>
       <collection name="Example Collection Name 3" id="d510a060-fa3c-11db-aa91-f715cb72c9ff">
         <TYPE>METADATA</TYPE>
         <LANGUAGE>en</LANGUAGE>
         <SCHEMA>dc</SCHEMA>
         <ASSOCIATEDWITH>1fc1fbf0-fa3c-11db-82de-905c553f17c3</ASSOCIATEDWITH>
         <INDEX>FTS</INDEX>
         <INDEX>XML</INDEX>
       </collection>
       <collection name="Example Collection Name 4" id="g45612f7-dth5-23fg-45df-45dfg5b1r34s">
         <TYPE>METADATA</TYPE>
         <LANGUAGE>en</LANGUAGE>
         <SCHEMA>tei</SCHEMA>
         <ASSOCIATEDWITH>1fc1fbf0-fa3c-11db-82de-905c553f17c3</ASSOCIATEDWITH>
         <INDEX>FTS</INDEX>
         <INDEX>XML</INDEX>
       </collection>
       <collection name="Example Collection Name 5" id="7bb87410-fdb7-11db-8476-f715cb72c9ff">
         <TYPE>METADATA</TYPE>
         <LANGUAGE>en</LANGUAGE>
         <SCHEMA>dc</SCHEMA>
         <ASSOCIATEDWITH>c3f685b0-fdb6-11db-a573-e4518f2111ab</ASSOCIATEDWITH>
         <INDEX>FTS</INDEX>
         <INDEX>XML</INDEX>
       </collection>
     </collections>
   </SearchConfig>
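
A client could read such a description with any XML library. The sketch below uses Python's standard `xml.etree.ElementTree` to list the metadata collections associated with a data collection; the function name and the shortened sample document are assumptions, while the element and attribute names are taken from the example above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example above (two of the five collections).
CONFIG = """
<SearchConfig>
  <collections>
    <collection name="Example Collection Name 1" id="1fc1fbf0-fa3c-11db-82de-905c553f17c3">
      <TYPE>DATA</TYPE>
      <ASSOCIATEDWITH>d510a060-fa3c-11db-aa91-f715cb72c9ff</ASSOCIATEDWITH>
    </collection>
    <collection name="Example Collection Name 3" id="d510a060-fa3c-11db-aa91-f715cb72c9ff">
      <TYPE>METADATA</TYPE>
      <LANGUAGE>en</LANGUAGE>
      <SCHEMA>dc</SCHEMA>
      <ASSOCIATEDWITH>1fc1fbf0-fa3c-11db-82de-905c553f17c3</ASSOCIATEDWITH>
      <INDEX>FTS</INDEX>
      <INDEX>XML</INDEX>
    </collection>
  </collections>
</SearchConfig>
"""


def metadata_collections(config_xml, data_collection_id):
    """Return the metadata collections associated with a data collection."""
    root = ET.fromstring(config_xml)
    by_id = {c.get("id"): c for c in root.iter("collection")}
    result = []
    for ref in by_id[data_collection_id].findall("ASSOCIATEDWITH"):
        meta = by_id.get(ref.text)
        # Skip references to collections that are absent or not metadata.
        if meta is not None and meta.findtext("TYPE") == "METADATA":
            result.append({
                "id": meta.get("id"),
                "language": meta.findtext("LANGUAGE"),
                "schema": meta.findtext("SCHEMA"),
                "indices": [index.text for index in meta.findall("INDEX")],
            })
    return result
```

For the first data collection this yields one entry with schema `dc` and the indices `FTS` and `XML`, that is, the information needed to decide which search operations can be applied to it.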

= Query Processing (Search Library) =

The core query processing functionality is provided by the Search Library component and is orchestrated by the SMS. In the following paragraphs we analyze the major subcomponents of the Search Library component.

== Query Parser ==

The Query Parser is responsible for transforming query expressions, coming either from the Search Portlet or directly provided by the end-user, into instances of the ''Query'' class (and its subordinate classes). These instances are afterwards forwarded to the Query Planner and, along with some environment information (extracted by the SMS from the IS), initiate the query processing procedure. To minimize external dependencies as well as the total execution time we chose not to use a pre-existing Java parser generator (compiler compiler), such as [https://javacc.dev.java.net/ JavaCC], but instead to build our own parser.


== Query Planner ==

The objective of query processing in the gCube distributed context is to transform a high-level query into an efficient execution strategy that takes into account the status and nature of the infrastructure, which lies beneath. The need for optimization is evident in many use cases, including the mis-utilization of resources and the beneficial use of existing structures. The task of query processing includes the mapping of the Query to an execution plan, consisting of WS invocations, and an initial domain specific optimisation of the produced execution plan. This task is the responsibility of the QueryPlanner, which will be thoroughly analysed in this paragraph.

Generally there are two main layers, which are involved in mapping the query into an optimized sequence of local operations, each acting on a specific node. These layers perform the functions of query materialization and query optimization. But before analysing these layers, first we will present the most critical part of the query planning procedure, the service profile structure. 

The most important architectural aspects of the component are presented below.

=== Service Profiles ===
Each WSRF service that participates in the search framework is accompanied by a profile. This profile contains vital information concerning the usability and applicability of the service. It is primarily used for describing the connection between the service itself and the queries it can execute. For example, “serviceA” declares in its profile that it can answer a sort query. During the production of the execution plan from a query tree, the query planner matches the various query nodes to the services which can answer them, thus constructing the execution plan step-by-step. 

In more detail, each service first describes its invocation information, which includes its service path name (without the host machine address), its port type and operation, and optionally its resource Endpoint Reference (if the service is stateful). Besides the execution information, services declare a semantic descriptor of the queries they can compute. For that purpose we employ XML Schema Definitions (XSD), so that every service can define the schema of the XML queries that it is able to answer. Moreover, each service defines a generic transformation of the matching XML query that produces its invocation message (actually the body of the invocation SOAP message). This is done using XSL Transformations of the XML query to the XML SOAP message body. Finally, in order to accommodate the combination of more than one service invocation computing a single sub query, we have introduced a generic replacement strategy, through which a pseudo-service can compute one sub query by splitting it into several sub queries, each of which can be directly computed by a single service instance. This is also done using XSL Transformations of the matching XML query to the combination of other XML sub queries.
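
The XSL-based generation of an invocation message can be sketched with the standard <code>javax.xml.transform</code> API. This is only an illustration: the <code>sort</code> query layout, the stylesheet and the <code>sortRequest</code> message are hypothetical and are not taken from any real service profile.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ProfileTransformSketch {

    // Hypothetical XML form of a "sort" query node.
    static final String QUERY = "<sort><field>title</field><order>ASC</order></sort>";

    // Hypothetical stylesheet of the kind a profile might carry:
    // it maps the XML query to the body of the invocation message.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:template match=\"/sort\">"
      + "<sortRequest><key><xsl:value-of select=\"field\"/></key>"
      + "<direction><xsl:value-of select=\"order\"/></direction></sortRequest>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the XML query and returns the message body as text. */
    static String toMessageBody(String queryXml, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(queryXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toMessageBody(QUERY, XSLT));
    }
}
```

The same mechanism would cover the replacement strategy, with a stylesheet that emits several XML sub queries instead of a message body.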

Service Profiles are basically XML documents that can be found in the DIS component. However, the query planner deserializes them into Java classes via a data-binding technology, JAXB. These classes are located in the SearchLibrary. Since the service profiles are only used internally by the Planner component, their design will not be analyzed further.

==== Matching Rules ====

As previously stated ([[#Service_Profiles]]), the matching between subqueries and services is performed via the SearchServiceProfiles. Assuming that all D4S services are up and running, the matching rules are the following:
{| class="wikitable"
! project
| TransformByXSLTOperator
|-
! sort
| SortOperator
|-
! merge
| MergeOperator
|-
! join
|
* If the join is inner, then JoinInnerOperator
* Otherwise, no match
|-
! fielded search
|
* If the value starts with '*', then XMLIndexer
* Otherwise
** if it exists, forward to FTS
** if not, the XMLIndexer
|-
! full text search
| FTS
|-
! filter by xpath, xslt
| FilterXPath, TransformByXSLT respectively
|-
! keep top
| KeepTopOperator
|-
! retrieve metadata
| MetadataManagerFactory
|-
! read
| built-in (does not correspond to any service)
|-
! external search
| Performs a search on an external source. Currently, only Google can be queried.
|-
! similarity search
| ?
|-
! spatial search
| GeoIndexLookup
|-
! conditional search
| built-in (does not correspond to any service)
|}
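
Part of the matching rules can be read as a simple lookup, sketched below in Java. The class and method names are hypothetical, the real planner matches XML queries against profile schemas rather than strings, and the <code>ftsAvailable</code> flag is one reading of the "if it exists" condition for fielded search.

```java
import java.util.Optional;

class MatchingRulesSketch {

    /** Returns the service matched to a query operation, or empty if no service applies. */
    static Optional<String> match(String operation, String detail, boolean ftsAvailable) {
        switch (operation) {
            case "project":          return Optional.of("TransformByXSLTOperator");
            case "sort":             return Optional.of("SortOperator");
            case "merge":            return Optional.of("MergeOperator");
            case "keep top":         return Optional.of("KeepTopOperator");
            case "full text search": return Optional.of("FTS");
            case "spatial search":   return Optional.of("GeoIndexLookup");
            case "join":
                // Only inner joins have a matching service.
                return "inner".equals(detail)
                    ? Optional.of("JoinInnerOperator")
                    : Optional.empty();
            case "fielded search":
                // Values starting with '*' go to the XMLIndexer;
                // otherwise FTS is preferred when available (assumed reading).
                if (detail != null && detail.startsWith("*")) {
                    return Optional.of("XMLIndexer");
                }
                return Optional.of(ftsAvailable ? "FTS" : "XMLIndexer");
            default:
                // Built-in operations and rules not covered by this sketch.
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(match("fielded search", "*ology", true));
        System.out.println(match("join", "outer", true));
    }
}
```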


=== ExecutionPlan ===
The output of the Planner component is an instance of the ExecutionPlan class, which in principle is just a Java graph of service invocations. Nodes denote specific service invocations and edges denote parent-child relationships. Note that these invocations are abstract in the sense that no specific endpoint reference is defined. To accommodate the WS scheduling that will be performed by the execution engine, the ExecutionPlan also includes a set of candidate concrete service endpoint references, from which the execution engine selects later on.

The best way to describe an execution plan is through its Schema definition:

  <?xml version="1.0" encoding="UTF-8"?>
  <schema xmlns="http://www.w3.org/2001/XMLSchema" 
        xmlns:xsd="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://gcube.org/searchservice/qep"
        xmlns:tns="http://gcube.org/searchservice/qep"
        xmlns:jxb="http://java.sun.com/xml/ns/jaxb" jxb:version="2.0"
        xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
  
        <xsd:import namespace="http://schemas.xmlsoap.org/ws/2004/03/addressing"
             schemaLocation="../../external/WS-Addressing.xsd" />
  
        <xsd:annotation>
                <xsd:appinfo>
                        <jxb:schemaBindings>
                                <jxb:package name="org.gcube.searchservice.searchlibrary.qep"/>
                        </jxb:schemaBindings>
                </xsd:appinfo>
        </xsd:annotation>
  
        <!-- TODO Types -->
        <xsd:complexType name="securityType">
                <xsd:complexContent>
                        <xsd:restriction base="xsd:anyType" />
                </xsd:complexContent>
        </xsd:complexType>
  
  
        <!-- Simple Types -->
  
        <xsd:simpleType name="scopeType">
                <xsd:restriction base="xsd:string" />
        </xsd:simpleType>
  
        <xsd:simpleType name="wsdlType">
                <xsd:restriction base="xsd:string" />
        </xsd:simpleType>
  
        <xsd:simpleType name="substituteTypeType">
                <xsd:annotation>
                        <xsd:documentation>
                                The 'substituteTypeType' defines the role of the input
                                message's element to be substituted. There are 2
                                possible values: 'child' and 'children'. The 'child'
                                value expresses the fact that the element to be
                                substituted holds the endpoint reference of one input
                                source of the service to be invoked. The 'children'
                                value expresses the fact that the element to be
                                substituted holds the endpoint references of ALL input
                                sources of the service to be invoked; that is, an array
                                of endpoint references.
                        </xsd:documentation>
                </xsd:annotation>
                <xsd:restriction base="xsd:string">
                        <xsd:enumeration value="child" />
                        <xsd:enumeration value="children" />
                </xsd:restriction>
        </xsd:simpleType>
  
        <xsd:simpleType name="substituteNameType">
                <xsd:annotation>
                        <xsd:documentation>
                                The 'substituteNameType' defines the element of a
                                service's input message, that should be substituted.
                                Actually, this element is populated at runtime with
                                values that are not available beforehand. These values
                                concern the output of the children services of that
                                specific service. This output refers to the result set
                                endpoint reference of the resource of the results of a
                                child service. Note that a service is child to another
                                service if the former's output is the latter's input.
                        </xsd:documentation>
                </xsd:annotation>
                <xsd:restriction base="xsd:string" />
        </xsd:simpleType>
  
        <xsd:simpleType name="messageType">
                <xsd:annotation>
                        <xsd:documentation>
                                This is the type of the message used as input to a
                                service.
                        </xsd:documentation>
                </xsd:annotation>
                <xsd:restriction base="xsd:string" />
        </xsd:simpleType>
  
        <xsd:simpleType name="portTypeType">
                <xsd:annotation>
                        <xsd:documentation>
                                This is the type of a portType; actually, an xsd string
                        </xsd:documentation>
                </xsd:annotation>
                <xsd:restriction base="xsd:string" />
        </xsd:simpleType>
  
        <xsd:simpleType name="operationType">
                <xsd:annotation>
                        <xsd:documentation>
                                This is the type of an operation; actually, an xsd
                                string
                        </xsd:documentation>
                </xsd:annotation>
                <xsd:restriction base="xsd:string" />
        </xsd:simpleType>
  
        <xsd:simpleType name="msgType">
                <xsd:annotation>
                        <xsd:documentation>
                                This is the type of a msgType; actually, an xsd string
                        </xsd:documentation>
                </xsd:annotation>
                <xsd:restriction base="xsd:string" />
        </xsd:simpleType>
  
        <xsd:simpleType name="partType">
                <xsd:annotation>
                        <xsd:documentation>
                                This is the type of a message part; actually, an xsd
                                string
                        </xsd:documentation>
                </xsd:annotation>
                <xsd:restriction base="xsd:string" />
        </xsd:simpleType>
  
        <xsd:simpleType name="elmtType">
                <xsd:annotation>
                        <xsd:documentation>
                                This is the type of a part element; actually, a
                                qualified name
                        </xsd:documentation>
                </xsd:annotation>
                <xsd:restriction base="xsd:QName" />
        </xsd:simpleType>
  
  
        <!-- Complex Types -->
  
        <xsd:complexType name="resourceType">
                <xsd:annotation>
                        <xsd:documentation>
                                This is the type of the resource of a web service. Note
                                that a service's resource is an identifier of one of its
                                states.
                        </xsd:documentation>
                </xsd:annotation>
                <xsd:complexContent>
                        <xsd:restriction base="xsd:anyType" />
                </xsd:complexContent>
        </xsd:complexType>
  
        <xsd:complexType name="substituteType">
                <xsd:annotation>
                        <xsd:documentation>
                                This is the type of a message's element to be
                                substituted/populated at runtime. This element should
                                be populated with the output of the children services of
                                another service. For further detail see the
                                'substituteNameType' and 'substituteTypeType'
                        </xsd:documentation>
                </xsd:annotation>
                <xsd:sequence>
                        <xsd:element name="name" type="tns:substituteNameType" />
                        <xsd:element name="type" type="tns:substituteTypeType" />
                </xsd:sequence>
        </xsd:complexType>
  
        <xsd:complexType name="inputMessageType">
                <xsd:annotation>
                        <xsd:documentation>
                                The 'inputMessageType' defines the complex type of the
                                message that serves as input to a service, plus the set
                                of elements to be populated at runtime with the output
                                of the children services.
                        </xsd:documentation>
                </xsd:annotation>
                <xsd:sequence>
                        <xsd:element name="message" minOccurs="0" maxOccurs="1"
                                type="tns:messageType" />
                        <xsd:element name="substitute" minOccurs="0"
                                maxOccurs="unbounded" type="tns:substituteType" />
                </xsd:sequence>
        </xsd:complexType>
  
        <xsd:complexType name="wsResourceType">
                <xsd:annotation>
                        <xsd:documentation>
                                The 'wsResourceType' defines a WSRF resource (web
                                service + resource). The complex type consists of
                                the web service's wsdl URI and the XML representation of
                                a resource.
                        </xsd:documentation>
                </xsd:annotation>
                <xsd:sequence>
                        <xsd:element name="wsdl" type="tns:wsdlType" />
                        <xsd:element name="resource" type="xsd:string" /><!-- tns:resourceType"/> -->
                </xsd:sequence>
        </xsd:complexType>
  
        <xsd:complexType name="executionEnvelopeType">
                <xsd:annotation>
                        <xsd:documentation>
                                This element describes some execution information
                                concerning the service invocation. This info include,
                                namespace, portType, operation, I/O msg, part, element.
                        </xsd:documentation>
                </xsd:annotation>
                <xsd:sequence>
                        <xsd:element name="namespace" type="xsd:string" />
                        <xsd:element name="portType" type="tns:portTypeType" />
                        <xsd:element name="operation" type="tns:operationType" />
                        <xsd:element name="inputMsg" type="tns:messageType" />
                        <xsd:element name="outputMsg" type="tns:messageType" />
                        <xsd:element name="inputPart" type="tns:partType" />
                        <xsd:element name="outputPart" type="tns:partType" />
                        <xsd:element name="inputElement" type="tns:elmtType" />
                        <xsd:element name="outputElement" type="tns:elmtType" />
                        <xsd:element name="endpointReference" type="wsa:EndpointReferenceType"/>
                </xsd:sequence>
        </xsd:complexType>
  
        <xsd:complexType name="qepNode">
                <xsd:annotation>
                        <xsd:documentation>
                                This is the datatype of the query execution plan node.
                                It declares the ws resource, its input message and the
                                set of children nodes.
                        </xsd:documentation>
                </xsd:annotation>
                <xsd:sequence>
                        <xsd:element name="serviceClass" type="xsd:string" minOccurs="0" maxOccurs="1" default="Search"/>
                        <xsd:element name="wsResource" type="tns:wsResourceType" />
                        <xsd:element name="inputMessage"
                                type="tns:inputMessageType" />
                        <xsd:element name="executionEnvelope"
                                type="tns:executionEnvelopeType" />
                        <xsd:element name="managementInfo" type="tns:managementInfoType" minOccurs="0" maxOccurs="1"/>
                        <xsd:element name="sources" type="tns:qepNode" minOccurs="0"
                                maxOccurs="unbounded" />
                </xsd:sequence>
        </xsd:complexType>
  
        <xsd:complexType name="managementInfoType">
                <xsd:annotation>
                        <xsd:documentation>
                                The 'managementInfoType' type defines management information
                                that might be useful to the execution engine. This information includes
                                security parameters, (D4S) scope attributes etc. It can be applied either
                                globally (affecting all service invocations), or locally (affecting a single service)
                        </xsd:documentation>
                </xsd:annotation>
                <xsd:sequence>
                        <xsd:element name="scope" type="tns:scopeType" minOccurs="0" maxOccurs="1"/>
                        <xsd:element name="security" type="tns:securityType" minOccurs="0" maxOccurs="1"/>
                </xsd:sequence>
        </xsd:complexType>
  
        <!-- Root Element -->
  
        <xsd:element name="queryExecutionPlan">
                <xsd:annotation>
                        <xsd:documentation>
                                This is the head of the query execution plan. Actually,
                                it is a 'pointer' to the qep root node.
                        </xsd:documentation>
                </xsd:annotation>
                <xsd:complexType>
                        <xsd:sequence>
                                <xsd:element name="rootNode" type="tns:qepNode" />
                                <xsd:element name="managementInfo" type="tns:managementInfoType" minOccurs="0" maxOccurs="1"/>
                        </xsd:sequence>
                </xsd:complexType>
        </xsd:element>
  
  </schema>

As the schema shows, an execution plan is a graph of qepNode instances, along with a global managementInfo envelope. This envelope refers to managerial factors, such as scope handling and security. Currently, only scope handling is employed. Each qepNode instance has an arbitrary number of child qepNode instances, its local managementInfo envelope (valid only for this instance, not for its children), the message payload and its execution context. The message payload consists of the raw message as well as the part of the message that will be dynamically populated by the execution of other qepNode instances. In other words, each qepNode consumes the results of its child qepNode instances, and this consumption is described by its input variables (the substitute elements), which store the produced output. Finally, the execution context refers to the service invocation details, such as the service EPR.
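
The way a node consumes the results of its children can be illustrated with a small Java sketch. The class below is hypothetical and far simpler than the JAXB classes generated from the schema: a plain string stands in for the wsResource and executionEnvelope, and another string stands in for a result set endpoint reference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class QepNodeSketch {
    final String service;               // stands in for wsResource + executionEnvelope
    final List<QepNodeSketch> sources;  // children whose output feeds this node

    QepNodeSketch(String service, QepNodeSketch... sources) {
        this.service = service;
        this.sources = Arrays.asList(sources);
    }

    /**
     * Post-order walk: children run first, and their result references are
     * substituted into the parent's input before the parent is "invoked".
     */
    String execute(List<String> invocationLog) {
        List<String> childResults = new ArrayList<>();
        for (QepNodeSketch child : sources) {
            childResults.add(child.execute(invocationLog));
        }
        invocationLog.add(service + childResults);
        return "rs:" + service;  // placeholder for a result set endpoint reference
    }

    public static void main(String[] args) {
        QepNodeSketch plan = new QepNodeSketch("Sort",
            new QepNodeSketch("Merge",
                new QepNodeSketch("FTS"),
                new QepNodeSketch("XMLIndexer")));
        List<String> log = new ArrayList<>();
        plan.execute(log);
        System.out.println(log);
    }
}
```

In this sketch a 'children' substitution corresponds to passing the whole list of child results, as the Merge node does.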

== Search Operators Core Library ==
The classes that belong to the [[#Search_Operators | Search Operators]] set are the core classes of the data retrieval and processing procedure. They implement the necessary processing algorithms and are wrapped by the respective Search Services, in order to expose their functionality to the gCube infrastructure. However, the SearchOperator classes implement only a subset of the whole gCube functionality; there are some services that do not rely on these classes and incorporate the full functionality themselves, without depending on any library (apart from the ResultSet and ws-core, of course). These services are the IndexLookupService, GeoIndexLookupService, ForwardIndex, FeatureIndex and Fuse; they are analyzed in separate sections.

=== Query Materialization ===
The main role of this layer is to transform the query operations into concrete search operators, which are provided by the Search Service, together with the corresponding gHNs that host these operators. The appropriate information concerning the existing operators and their hosting gHNs is passed to the QueryPlanner by the SearchMasterService, which in turn receives it from the gCube Information Service. This information is the set of available service profiles, described above, and the set of hosting nodes and their respective services. The execution plan is constructed in steps, by matching each query node to the respective search service. There is an M-to-N relation between sub queries and service invocations (M, N >= 1). The output of this layer is an initial query execution plan which can be forwarded to the Process Execution service (§6.6) and run as is (the BPEL part, see previous paragraph). However, this may lead to poor performance, since no optimization is performed upon the query, which can lead to a substantial increase in execution cost.

=== Query Optimization ===
The goal of query optimization is to find an execution strategy for the query which is as close to optimal as possible. However, selecting the optimal execution strategy for a query is an NP-hard problem. For complex queries with many sources (collections) involving a large number of operations hosted in a complex infrastructure, this can incur a prohibitive optimization cost. Therefore, the actual objective of the optimization is to find a strategy close to optimal within a reasonable amount of time. The output of the optimization layer, which is also the final output of the QueryPlanner, is an optimized QueryExecutionPlan object.
The optimization process followed by the planner makes extensive use of response time statistics of service instances and of operator cardinalities. Service instances are ranked based on their response times and query operations are ranked based on their selectivity factor. In principle, the Query Planner would enumerate the equivalent plans for a given query and select the best one according to its execution cost. However, an exhaustive search over all possible plans leads to prohibitive time and resource consumption. Therefore, we employ a greedy heuristic that minimizes the cost in each step of the plan construction. In the first step, the initial query tree is rewritten using a set of XSL Transformations, in order to produce a plan with minimum intermediate result size. In the second step, the query tree produced earlier is transformed into its corresponding execution plan, selecting the best service instances that can compute the sub queries of the query tree. If two or more candidate services correspond to a given sub query, we select only the one that minimizes the total cost up to this point. We note that only abstract services are selected in this step. The actual scheduling takes place in the third and final optimization step, performed by the ProcessOptimizationService component (POS). The POS is responsible for allocating tasks to gHNs, based on criteria such as gHN load.
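
The greedy choice made in the second step can be sketched as follows. The class is hypothetical and the cost values are placeholders for response time statistics. Because the cost of the earlier steps is already fixed, minimizing the total cost up to the current step amounts to picking the cheapest candidate for that step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GreedyPlannerSketch {

    /**
     * For each sub query, in plan-construction order, keeps the candidate
     * service with the lowest estimated cost.
     */
    static List<String> choose(Map<String, Map<String, Double>> candidatesPerSubQuery) {
        List<String> chosen = new ArrayList<>();
        for (Map.Entry<String, Map<String, Double>> step : candidatesPerSubQuery.entrySet()) {
            String best = null;
            double bestCost = Double.POSITIVE_INFINITY;
            for (Map.Entry<String, Double> candidate : step.getValue().entrySet()) {
                if (candidate.getValue() < bestCost) {
                    best = candidate.getKey();
                    bestCost = candidate.getValue();
                }
            }
            if (best == null) {
                throw new IllegalStateException("no service for " + step.getKey());
            }
            chosen.add(best);
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> candidates = new LinkedHashMap<>();
        Map<String, Double> sortCandidates = new LinkedHashMap<>();
        sortCandidates.put("sortServiceA", 40.0);  // hypothetical cost estimates
        sortCandidates.put("sortServiceB", 25.0);
        candidates.put("sort", sortCandidates);
        System.out.println(choose(candidates));
    }
}
```

A greedy choice of this kind is fast but does not guarantee a globally optimal plan, which is the trade-off described above.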

To sum up, the QueryPlanner receives a query and returns a near-optimal query execution plan object that contains a sequence of concrete search operations. The final mapping of operations to gHNs is done at runtime by the execution service.


== Execution (Engine) Integration ==

The term Execution (Engine) Integration refers to a logical group of components which materialize the bridge between query planning and plan execution. As previously written, the QueryPlanner produces an ExecutionPlan which should be forwarded to the execution engine(s). The procedure of feeding the engine(s) with the ExecutionPlan is performed by the components described in this paragraph.

Search-Execution integration is far from a trivial task, because the components on each side were developed by different groups, resulting in heterogeneous interfaces. The purpose of decoupling components is exactly the ability to obtain them from various institutions. All in all, the problem is that we have an ExecutionPlan and a set of available execution engines, which potentially have completely different interfaces, and we must find a way to feed each engine with a corresponding execution plan, expressed in its native language but based on the original ExecutionPlan. The solution to this integration problem is the introduction of the eXecution ENgine API (XENA), which is described in the following paragraphs. Before that, we present the available execution engines, which are or will be employed in D4Science.
Currently, there are three execution engines available in gCube: the ProcessExecutionService, the internal QEPExecution engine and ActiveBPEL, an open-source BPEL engine that is very popular in the web service world. For further details regarding the execution engines, please refer to [[#Execution_Engines]].

=== eXecution ENgine Api (XENA) ===
As the name reveals, XENA provides the abstractions needed by any engine to execute plans generated by the Search Framework, namely instances of the ExecutionPlan class. XENA can be thought of as middleware between the Search Framework and the execution engine, with the responsibility of ‘translating’ the ExecutionPlan instances to the engine’s internal plan representation.

However, the integration issue cannot be treated so simplistically. First of all, one cannot assume that every engine conforms to the XENA paradigm and therefore offers inherent support for the Search Framework. Another problem is that the XENA design itself must be flexible and expressive enough to encompass all of the significant syntactic and semantic attributes of a process execution. Therefore, it is imperative to clarify two major factors: the real-world architectural solution to the search-execution integration and the data model that XENA adopts.

The idea behind XENA is the following: between the XENA API and the execution engine there is a proxy component called Connector, which is responsible for translating the XENA API model to the engine’s internal model. The idea is not new; it has been applied in many middleware solutions and is best known from the JDBC API. The JDBC paradigm dictates the introduction of connectors that bridge JDBC and the underlying RDBMS, with a different software connector for every RDBMS product. In our context, we likewise employ different XENA connectors for different workflow engines. Each connector, apart from the XENA artifacts, is fed with a set of execution engine endpoint references, so that it knows which engine instance(s) it may refer to. Each XENA connector is first registered in a formal way with a special class called ExecutionEngineConnectorFactoryBuilder, publishing itself and the set of features it supports. These features are key-value(s) pairs and describe the abilities of the execution engine. They may include persistence, automatic recovery, performance metrics, quality of service, etc. The registration procedure enables the dynamic binding to a specific connector (and thus execution engine) made by the SearchMasterService, based on user preferences or predefined system parameters. So, for example, if a user desires maximum availability for his/her processes, the SearchMasterService can select an execution engine and respective connector that support persistence and failsafe capabilities. The dynamic binding of the XENA API to a specific connector is accomplished by employing some of the Java Classloader capabilities.
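
Feature-based connector selection can be sketched as a small registry. This is not the API of ExecutionEngineConnectorFactoryBuilder: the class below, its methods and the engine names are hypothetical, and features are simplified to plain names instead of key-value(s) pairs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

class ConnectorRegistrySketch {
    // connector name -> advertised features (simplified to plain feature names)
    private final Map<String, Set<String>> registered = new LinkedHashMap<>();

    /** A connector publishes itself together with the features it supports. */
    void register(String connector, String... features) {
        registered.put(connector, new HashSet<>(Arrays.asList(features)));
    }

    /** Returns the first registered connector that supports every required feature. */
    Optional<String> select(String... required) {
        Set<String> needed = new HashSet<>(Arrays.asList(required));
        return registered.entrySet().stream()
            .filter(entry -> entry.getValue().containsAll(needed))
            .map(Map.Entry::getKey)
            .findFirst();
    }

    public static void main(String[] args) {
        ConnectorRegistrySketch registry = new ConnectorRegistrySketch();
        registry.register("engineA", "performance metrics");
        registry.register("engineB", "persistence", "automatic recovery");
        // A user asking for maximum availability would require both features.
        System.out.println(registry.select("persistence", "automatic recovery"));
    }
}
```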

Regarding the data model that is adopted by XENA, it has borrowed the design philosophy of another middleware solution in the area of web service registries, JAXR, a Java Specification (JSR) for that field. The main abstract classes are: 

* ExecutionPlan: analyzed in [[#ExecutionPlan]]
* ExecutionVariable: Variables that participate either as data transfer containers or as control flow variables.
* ExecutionResult: The current result of the plan execution. We don’t use the term final, since the ExecutionResult can be retrieved at any time during the process execution. See the description of ExecutionConnector. Through this class, one can get the current ExecutionTrace.
* ExecutionTrace: Current trace of execution. It contains the image of the already done/committed actions defined in the execution plan and the ExecutionVariables.
* ExecutionConnector: Provides the actual abstraction of the execution engine. It declares some ‘execution’ methods. The most primitive task is a plain, blocking execute method. However, users may want a more fine-grained control over their process executions. For that reason we have defined three Connector levels:
** Basic Level: it defines a single, blocking execute method, which receives an ExecutionPlan and returns back an ExecutionResult.
** Advanced Procedural Level: Basic Level + process management methods, such as executeAsync, pause, resume, cancel, getStatus, waitCompletion.
** Event-Driven Level: Apart from the purely procedural level, there are many workflow engines which adopt a different paradigm, the event-driven one. According to this, any action that takes place during a process execution, e.g. service invocation finished or variable initialized, produces a message which can be handled by a corresponding callback method. So the “service invocation finished” event causes the invocation of its associated callback method, within which the developer can perform any management action, housekeeping, logging, etc. The Event-Driven Level offers all the mechanics for registering the set of events which the developer wishes to handle and their associated callback methods. Note that XENA does not offer a callback system. This is the responsibility of the underlying engine. Consequently, only engines that adopt the event model should implement the Event-Driven ExecutionConnector Level.
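
The three connector levels could be expressed as Java interfaces along the following lines. The method names of the advanced level come from the list above; every signature, the <code>PlanSketch</code> and <code>ResultSketch</code> stand-ins, the listener type and the choice to let the event-driven level extend only the basic one are assumptions made for illustration.

```java
interface PlanSketch { String describe(); }   // stands in for ExecutionPlan
interface ResultSketch { String trace(); }    // stands in for ExecutionResult

/** Basic Level: a single, blocking execute method. */
interface BasicConnector {
    ResultSketch execute(PlanSketch plan);
}

/** Advanced Procedural Level: Basic Level plus process management methods. */
interface AdvancedConnector extends BasicConnector {
    String executeAsync(PlanSketch plan);     // returns a hypothetical process id
    void pause(String processId);
    void resume(String processId);
    void cancel(String processId);
    String getStatus(String processId);
    ResultSketch waitCompletion(String processId);
}

/** Callback invoked by the underlying engine when an event occurs. */
interface ExecutionListener {
    void onEvent(String event);
}

/** Event-Driven Level: registration of callbacks for selected events. */
interface EventDrivenConnector extends BasicConnector {
    void register(String eventType, ExecutionListener listener);
}

/** Trivial Basic Level connector that only records what it was asked to run. */
class LoggingConnector implements BasicConnector {
    @Override
    public ResultSketch execute(PlanSketch plan) {
        String trace = "executed " + plan.describe();
        return () -> trace;
    }
}
```

A caller that only needs the blocking behaviour can depend on <code>BasicConnector</code> alone, so a connector for a more capable engine remains usable in the same place.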

Since there are three available execution engines (PES, QEPExecution, ActiveBPEL), the initial design includes three corresponding connectors:
* A PES connector. The workflow (process) language employed by PES is a shortened version of BPEL.
* An ActiveBPEL connector, which fully conforms to the BPEL OASIS standard.
* A QEPExecution connector. This engine is implemented internally within the Search Framework and can directly execute ExecutionPlans.

= Search Operators =

== Introduction ==
The Search Operator family of services comprises the building blocks of any search operation. Together with services external to Search, these handle the production, filtering and refinement of available data according to the user queries. The various intermediate steps towards producing the final search output are handled by Search Operator services. In this section we describe only the services internal to the Search Service, listed below, although the Search Operator Framework also reaches out to "integrate", at a high level, other services that can be utilized within a Search operation context.

The following operators are implemented as stateless services. They receive their input and produce their output in the context of a single invocation without holding any intermediate state. In case any data transferring is necessary either as input to a service or as output from the processing, the [[ResultSet Framework]] is employed.

The search operators cover the basic functionality that could be encountered in a typical search operation. A search can be decomposed into indivisible units consisting of these operators, whose interaction constructs a workflow producing the net result delivered to the requester. The external source search and the service invocation services provide some extensibility for future operators by offering a method for invoking a service “unknown” to the Search framework and importing its results into the search operator workflow. The currently available search operators are listed below.
== Example Code ==
[[media:SearchClients.tar.gz|Search Operators Usage Examples]]

== Operators ==
=== BooleanOperator ===
==== Description ====
The Boolean Operator is used in conditional execution and, more specifically, in evaluating the condition. It thus offers the ability to select between alternative execution plans. For example, one can follow a plan (let’s say a projection on a given field of a set of data) if a given precondition is valid; otherwise, one may follow the alternative plan (e.g. a projection on another field of the same set of data, followed by a sort on that field). The precondition validation is the responsibility of this Service.

The condition is a Boolean expression. Basically, it involves comparisons using the operations equal, not_equal, greater_than, lower_than, greater_equal and lower_equal. The compared parts are either literals (date, string, integer and double literals are supported) or aggregate functions on the results of a search service execution. These aggregate functions include max, min, average, size and sum, and they can be applied to a given field of the result set of a search service execution, by referring to that field with XPath expressions.
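
A condition of this kind can be sketched in Java as an aggregate followed by a comparison. The operation and function names are those listed above; the class, its methods and the restriction to numeric values already extracted from a field are assumptions of the sketch, not the operator's actual interface.

```java
import java.util.Arrays;
import java.util.List;

class ConditionSketch {

    /** Applies one of the supported aggregate functions to the values of a field. */
    static double aggregate(String function, List<Double> values) {
        switch (function) {
            case "max":
                return values.stream().mapToDouble(Double::doubleValue).max().orElse(Double.NaN);
            case "min":
                return values.stream().mapToDouble(Double::doubleValue).min().orElse(Double.NaN);
            case "average":
                return values.stream().mapToDouble(Double::doubleValue).average().orElse(Double.NaN);
            case "sum":
                return values.stream().mapToDouble(Double::doubleValue).sum();
            case "size":
                return values.size();
            default:
                throw new IllegalArgumentException("unknown function: " + function);
        }
    }

    /** Evaluates a single comparison between two numeric operands. */
    static boolean compare(double left, String operation, double right) {
        switch (operation) {
            case "equal":         return left == right;
            case "not_equal":     return left != right;
            case "greater_than":  return left > right;
            case "lower_than":    return left < right;
            case "greater_equal": return left >= right;
            case "lower_equal":   return left <= right;
            default:
                throw new IllegalArgumentException("unknown operation: " + operation);
        }
    }

    public static void main(String[] args) {
        List<Double> years = Arrays.asList(1999.0, 2005.0, 2008.0);
        // "max(year) greater_than 2000" selects the first of two alternative plans.
        System.out.println(compare(aggregate("max", years), "greater_than", 2000.0));
    }
}
```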
==== Dependencies ====
*jdk 1.5
*gCore
*ResultSetClientLibrary
*SearchLibrary
=== FilterByXPathOperator ===
==== Description ====
The role of the FilterByXPath Operator is to perform a search by evaluating an expression against an XML structure. Such an expression could be an XPath query. The XML structure against which the expression is evaluated is a ResultSet, previously constructed by another operator or by a complete search execution. The result of the operation is a new ResultSet, whose endpoint reference is returned to the caller.
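
The filtering idea can be sketched with the standard <code>javax.xml.xpath</code> API. In this sketch a list of XML strings stands in for a ResultSet, and the record layout and class name are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathFilterSketch {

    /** Returns a new list with the records for which the expression evaluates to true. */
    static List<String> filter(List<String> records, String expression) {
        try {
            XPathExpression xpath = XPathFactory.newInstance().newXPath().compile(expression);
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            List<String> kept = new ArrayList<>();
            for (String record : records) {
                Document doc = builder.parse(new InputSource(new StringReader(record)));
                if ((Boolean) xpath.evaluate(doc, XPathConstants.BOOLEAN)) {
                    kept.add(record);
                }
            }
            return kept;  // the input list is left untouched
        } catch (Exception e) {
            throw new IllegalArgumentException("filtering failed", e);
        }
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
            "<record><year>2005</year></record>",
            "<record><year>1999</year></record>");
        System.out.println(filter(records, "/record[year > 2000]"));
    }
}
```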
==== Dependencies ====
*jdk 1.5
*gCore
*ResultSetClientLibrary
*SearchLibrary
=== JoinInnerOperator ===
==== Description ====
The role of the JoinResultSetService is to perform a join operation on a specific field using a set of ResultSets whose endpoint references are provided. This operation produces a new ResultSet, leaving the input untouched. The newly created ResultSet is wrapped in a WS-Resource and its endpoint reference is returned to the caller. The join is performed by an in-memory hash-join algorithm.
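
The general idea of an in-memory hash join can be sketched as follows. This is a generic textbook version for two inputs, not the service's code: records are modelled as Python dictionaries and the name <code>hash_join</code> is an assumption.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner-join two lists of dict records on the field `key`."""
    # Build phase: index the left input by the value of the join field.
    buckets = defaultdict(list)
    for record in left:
        if key in record:
            buckets[record[key]].append(record)

    # Probe phase: look up each right record and combine the matches.
    joined = []
    for record in right:
        if key not in record:
            continue
        for match in buckets.get(record[key], []):
            joined.append({**match, **record})
    return joined


# Hypothetical inputs standing in for two ResultSets.
left = [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}]
right = [{"id": 2, "rank": 0.9}, {"id": 3, "rank": 0.1}]
result = hash_join(left, right, "id")
```

Only records whose join field appears in both inputs survive, and both inputs are left unmodified.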
==== Dependencies ====
*jdk 1.5
*gCore
*ResultSetClientLibrary
*SearchLibrary
=== KeepTopOperator ===
==== Description ====
The role of the KeepTop Operator is to perform a simple filtering operation on its input ResultSet and to produce as output a new ResultSet that holds a defined number of leading records.
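
A minimal sketch of keeping a fixed number of leading records, assuming the input can be read as a stream; the name <code>keep_top</code> is hypothetical.

```python
from itertools import islice


def keep_top(records, count):
    """Return a new list with at most `count` leading records."""
    # islice stops reading after `count` items, so a long or lazily
    # produced input does not have to be consumed in full.
    return list(islice(records, count))


top_three = keep_top(iter(range(10)), 3)
```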
==== Dependencies ====
*jdk 1.5
*gCore
*ResultSetClientLibrary
*SearchLibrary
=== MergeOperator ===
==== Description ====
The role of the Merge Operator is to perform a merge operation using a set of ResultSets whose end point references are provided. This operation produces a new ResultSet leaving the input untouched. The newly created ResultSet is wrapped around a WS-Resource and its endpoint reference is returned to the caller.
==== Dependencies ====
*jdk 1.5
*gCore
*ResultSetClientLibrary
*SearchLibrary
=== GoogleOperator ===
==== Description ====
The role of the GoogleOperator is to redirect a query to the Google search engine through its Web Service interface and to wrap the output produced by the external service in a ResultSet, whose endpoint reference is returned to the caller. This functionality is supported by elements residing in the SearchLibrary.
==== Dependencies ====
*jdk 1.5
*gCore
*ResultSetClientLibrary
*SearchLibrary

=== SortOperator ===
==== Description ====
The role of the Sort Operator is to sort the provided ResultSet using as key a specific field. This operation produces a new ResultSet leaving the input untouched. The newly created ResultSet is wrapped around a WS-Resource and its end point reference is returned to the caller. The algorithm used is merge sort. The comparison rules differ depending on the type of the elements to be sorted.
The key of the sort operator can be expressed in one of the ways defined in the following method org.diligentproject.searchservice.searchlibrary.resultset.elements.ResultElementGeneric#extractValue
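
The following Python sketch shows a plain merge sort driven by a key function, and how the choice of comparison rule changes the order. It is a generic illustration: the record layout and the <code>key</code> callables are assumptions and do not reproduce the <code>extractValue</code> method referenced above.

```python
def merge_sort(records, key):
    """Return a new sorted list; the input list is not modified."""
    if len(records) <= 1:
        return list(records)
    mid = len(records) // 2
    left = merge_sort(records[:mid], key)
    right = merge_sort(records[mid:], key)

    # Merge step: take from the right half only when it is strictly
    # smaller, which keeps equal elements in their original order.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if key(right[j]) < key(left[i]):
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


# Hypothetical records: the same field compared as a number or as a string.
records = [{"year": "2008"}, {"year": "999"}]
as_numbers = merge_sort(records, key=lambda r: int(r["year"]))
as_strings = merge_sort(records, key=lambda r: r["year"])
```

Compared as numbers, 999 precedes 2008; compared as strings, "2008" precedes "999".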

==== Dependencies ====
*jdk 1.5
*gCore
*ResultSetClientLibrary
*SearchLibrary
=== TransformByXSLTOperator ===
==== Description ====
The role of the TransformByXSLT Operator is to transform a ResultSet it receives as input from one schema to another through a transformation technology such as XSL / XSLT. These transformations are directly supplied as input to the service. The output of the transformation, which could be a projection of the initial ResultSet, is a new ResultSet wrapped in a WS-Resource whose endpoint reference is returned to the caller.
==== Dependencies ====
*jdk 1.5
*gCore
*ResultSetClientLibrary
*SearchLibrary

= Execution Engines =

As previously mentioned, there are three execution engines available in D4Science:
* Process Execution Engine: This is the project's official engine. Further details can be found [https://technical.wiki.d4science.research-infrastructures.eu/documentation/index.php/Process_Management here].
* ActiveBPEL: This is a widely used, open-source, BPEL-enabled execution engine. It supports various high-level features, such as fault tolerance, persistence and advanced monitoring. Further details can be found [http://www.activevos.com/community-open-source.php here].
* QEPExecution: The QEPExecution is the internal, simple, generic, WS-compliant execution engine of the Search Infrastructure. It is basically a web service execution engine, which orchestrates the execution of a set of services. It is designed to work with any web service and therefore communicates via the exchange of SOAP messages. Its input is the raw ExecutionPlan produced by the QueryPlanner, and its output is a string containing the Endpoint Reference of the final results (exactly like the output of the PES). The QEPExecution is able to work with any WSRF service (both stateful and stateless). However, since the engine is designed to be used internally as a final solution, it lacks various features of a full-fledged execution engine, such as advanced error handling.
