Search Framework

From Gcube Wiki
Revision as of 14:44, 28 August 2008 by Pavlos.polydoras


= Search Intro =

= Query Processing Chain =

= Search from the User Perspective: Querying =

== Available Operations ==

=== Operation Semantics ===
'''project''': Perform a projection over a result set on a set of elements. By default, all header data (DocID, RankID, CollID) are kept in the projected result set; that is, they need not be specified. If the projected term set is empty, then only the header data are kept.

'''sort''': Perform a sort operation on a result set based on an element of that result set, either ascending (ASC) or descending (DESC).

'''merge''': Concatenate two or more result sets. The records are non-deterministically ordered in the final result set.

'''join''': Join two result sets on a join key, using one of the available modes (currently only 'inner' is implemented). The semantics of the inner join are similar to those of the respective operation in relational algebra, with the difference that in our case only the join key from the left result set is kept and joined with the right payload.

'''fielded search''': Keep only those records of a source which conform to a selection expression. This expression is a relation between a key and a value. The key is an element of the result set and the relation is one of: ==, !=, >, <, <=, >=, contains. The 'contains' relation tests whether a string is a substring of another string. With this comparison function, one can use the wildcard '*', which matches any sequence of characters. We discriminate these cases:
  * '*ad*' matches any of 'ad', 'add', 'mad', 'ladder'.
  * '*der' matches 'der' and 'ladder', but not 'derm' or 'ladders'.
  * 'ad*' matches 'ad' and 'additional', but not 'mad' or 'ladder'.
  * 'ad' matches only 'ad'.

If we search on a text field, then 'contains' refers to any of its constituent words. For example, if we search on the field title whose value is ''the rain in spain stays mainly in the plane'', then the matching criterion '*ain*' matches any of 'rain', 'spain', 'mainly'. If the predicate is '==' then we search for an exact match; that is, in the previous example, title == 'stays' won't succeed. Predicates can be combined with ORs and ANDs (currently under development). The source of this operation can be either a result set generated by another search operation or a content source. In the latter case, you should use a source string identifier.
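The wildcard and word-level rules above can be sketched as follows. This is a hypothetical Python illustration, not part of the gCube API; the actual matching is performed by the index services:

```python
import re

def contains_match(pattern: str, word: str) -> bool:
    """Sketch of the 'contains' wildcard semantics described above: a '*'
    at either end of the pattern relaxes the corresponding string boundary;
    its absence anchors the match at that end."""
    core = re.escape(pattern.strip('*'))
    prefix = '.*' if pattern.startswith('*') else ''
    suffix = '.*' if pattern.endswith('*') else ''
    return re.fullmatch(prefix + core + suffix, word) is not None

def field_contains(pattern: str, text: str) -> bool:
    # On a text field, 'contains' is evaluated against each constituent word.
    return any(contains_match(pattern, w) for w in text.split())
```

For instance, `field_contains('*ain*', 'the rain in spain stays mainly in the plane')` matches via 'rain', 'spain' and 'mainly', while the anchored pattern `'*der'` rejects 'ladders'.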

'''full text search''': Perform a full text search on a given source based on a set of terms. The full text search source must be a string identifier of the content source.

Each full text search term may contain a single word or multiple words. In both cases, all terms are combined with a logical AND. In the multi-word case, if a term is e.g. 'hello nasty', we search for the words 'hello' and 'nasty', with the latter immediately following the former, as stated in the term; text that does not contain this exact succession of the two words will not match the search criteria. Another feature of fulltextsearch is lemmatization: the terms are processed, and a set of related words is generated and also used in the full text search.
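The term combination rules just described can be sketched like this (an illustrative model only; lemmatization and the actual index machinery are omitted):

```python
def fulltext_match(terms, text):
    """Sketch of the full text matching rules described above: all terms
    are ANDed, and a multi-word term such as 'hello nasty' requires that
    exact succession of words in the text."""
    words = text.lower().split()
    for term in terms:
        term_words = term.lower().split()
        n = len(term_words)
        # The term matches only if its word sequence occurs contiguously.
        if not any(words[i:i + n] == term_words
                   for i in range(len(words) - n + 1)):
            return False
    return True
```

So `fulltext_match(['hello nasty'], 'she said hello nasty weather')` succeeds, while text containing 'nasty' before 'hello' does not.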

'''filter by xpath, xslt, math''': Perform a low level xpath or xslt operation on a result set. The math type refers to a mathematical language and is used by advanced users who are acquainted with that language. For more details about the semantics and syntax of that language, please see the documentation of the ResultSetScanner service, which implements this language.

'''keep top''': Keep only a given number of records of a result set.

'''retrieve metadata''': Retrieve ALL metadata associated with the search results of a previous search operation.

'''read''': Read a result set endpoint reference, in order to process it. This operation can be used for further processing the results of a previous search operation.

'''external search (deprecated)''': Perform a search on an external (diligent-disabled) source. Currently, google, RDBMS and the OSIRIS infrastructures can be queried. Depending on the source, the query string can vary. As far as google is concerned, the query string must conform to the query language of google. In the case of an RDBMS, the query must have the following form in order to be executed successfully:
<root>
<driverName>your jdbc driver</driverName>
<connectionString>your jdbc connection string</connectionString>
<query>your sql query</query>
</root>
Finally, in the OSIRIS case, the query string must have the following format:
<root>
<collection>your osiris collection</collection>
<imageURL>your image URL to be searched for similar images</imageURL>
<numberOfResults>the number of results</numberOfResults>
</root>
'''similarity search''': Perform a similarity search on a source for multimedia content (currently, only images). The image URL is defined, along with the source string identifier and pairs of feature and weight.

'''spatial search''': Perform a classic spatial search against a user defined shape (polygon, to be exact) and a spatial relation (contains, crosses, disjoint, equals, inside, intersect, overlaps, touches).

'''conditional search''': Classic If-Then-Else construct. The hypothesis clause involves the (potentially aggregated) value of one or more fields which are part of the result of previous search operation(s). The central predicate involves a comparison of two clauses, which are combinations (with the basic math functions +, -, *, /) of these values.

== Syntax ==

  <function> ::= <project_fun> | <sort_fun> | <filter_fun> | <merge_fun> | <join_fun> | <keeptop_fun> | <fulltexts_fun> | 
     <fieldedsearch_fun> | <extsearch_fun> | <read_fun> | <similsearch_fun> | <spatialsearch_fun> | <retrieve_metadata_fun> 

  <read_fun> ::= <read_fun_name> <epr>
  <read_fun_name> ::= 'read'
  <epr> ::= string

  <project_fun> ::= <project_fun_name> <by> <project_key> <project_source>
  <project_fun_name> ::= 'project'
  <project_key> ::= string
  <project_source> ::= <non_leaf_source>

  <sort_fun> ::= <sort_fun_name> <sort_order> <by> <sort_key> <sort_source>
  <sort_fun_name> ::= 'sort'
  <sort_key> ::= string
  <sort_order> ::= 'ASC' | 'DESC'
  <sort_source> ::= <non_leaf_source>

  <filter_fun> ::= <filter_fun_name> <filter_type> <by> <filter_statement> <filter_source>
  <filter_fun_name> ::= 'filter'
  <filter_type> ::= string
  <filter_statement> ::= string
  <filter_source> ::= <non_leaf_source> | <leaf_source>

  <merge_fun> ::= <merge_fun_name> <on> <merge_sources>
  <merge_fun_name> ::= 'merge'
  <merge_sources> ::= <merge_source> <and> <merge_source> <merge_sources2>
  <merge_sources2> ::= <and> <merge_source> <merge_sources2> | φ
  <merge_source> ::= <left_parenthesis> <function> <right_parenthesis>

  <join_fun> ::= <join_fun_name> <join_type> <by> <join_key> <on> <join_source> <and> <join_source>
  <join_fun_name> ::= 'join'
  <join_key> ::= string
  <join_type> ::= 'inner' | 'fullOuter' | 'leftOuter' | 'rightOuter'
  <join_source> ::= <left_parenthesis> <function> <right_parenthesis>

  <keeptop_fun> ::= <keeptop_fun_name> <keeptop_number> <keeptop_source>
  <keeptop_fun_name> ::= 'keeptop'
  <keeptop_number> ::= integer
  <keeptop_source> ::= <non_leaf_source>

  <fulltexts_fun> ::= <fulltexts_fun_name> <by> <fulltexts_term> <fulltexts_terms> <in> <language> <on> <fulltexts_sources>
  <fulltexts_fun_name> ::= 'fulltextsearch'
  <fulltexts_terms> ::= <comma> <fulltexts_term> <fulltexts_terms> | φ
  <fulltexts_sources> ::= <fulltexts_source> <fulltexts_sources_2>
  <fulltexts_sources_2> ::= <comma> <fulltexts_source> <fulltexts_sources_2> | φ
  <fulltexts_source> ::= string

  <fieldedsearch_fun> ::= <fieldedsearch_fun_name> <by> <query> <fieldedsearch_source>
  <fieldedsearch_fun_name> ::= 'fieldedsearch'
  <query> ::= string
  <fieldedsearch_source> ::= <non_leaf_source> | <leaf_source>

  <extsearch_fun> ::= <extsearch_fun_name> <by> <extsearch_query> <on> <extsearch_source>
  <extsearch_fun_name> ::= 'externalsearch'
  <extsearch_query> ::= string
  <extsearch_source> ::= string

  <similsearch_fun> ::= <similsearch_fun_name> <as> <URL> <by> <pair> <pairs> <similarity_source>
  <similsearch_fun_name> ::= 'similaritysearch'
  <URL> ::= string
  <pair> ::= <feature> <equal> <weight>
  <pairs> ::= <and> <pair> <pairs> | φ
  <similarity_source> ::= <leaf_source>

  <if-syntax> ::= <if> <left_parenthesis> <function-st> <compare-sign> <function-st> <right_parenthesis> <then> <search-op> <else> <search-op>
  <compare-sign> ::= '==' | '>' | '<' | '>=' | '<='
  <function-st> ::= <left-op> <math-op> <right-op> | <left-op>
  <math-op> ::= '+' | '-' | '*' | '/'
  <left-op> ::= <agg-fun> | <literal>
  <agg-fun> ::= <max-fun> | <min-fun> | <sum-fun> | <av-fun> | <var-fun> | <size-fun>
  <max-fun> ::= 'max' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <min-fun> ::= 'min' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <sum-fun> ::= 'sum' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <av-fun> ::= 'av' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <var-fun> ::= 'var' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <size-fun> ::= 'size' <left_parenthesis> <search-op> <right_parenthesis>
  <right-op> ::= <function-st> | <left-op>
  <element> ::= an element of the result set payload (either XML element, or XML attribute)

  <retrieve_metadata_fun> ::= <rm_fun_name> <in> <language> <on> <rm_source> <as> <schema>
  <rm_fun_name> ::= 'retrievemetadata'
  <schema> ::= string
  <rm_source> ::= <left_parenthesis> <function> <right_parenthesis>

  <spatialsearch_fun> ::= <spatialsearch_fun_name> <relation> <geometry> [<timeBoundary>] <spatial_source>
  <spatialsearch_fun_name> ::= 'spatialsearch'
  <relation> ::= 'intersects' | 'contains' | 'isContained'
  <geometry> ::= <polygon_name> <left_parenthesis> <points> <right_parenthesis>
  <polygon_name> ::= 'polygon'
  <timeBoundary> ::= 'within' <startTime> <stopTime>
  <startTime> ::= double
  <stopTime> ::= double
  <spatial_source> ::= <leaf_source>
  <points> ::= <point> {<comma> <point>}+
  <point> ::= <x> <y>
  <x> ::= long
  <y> ::= long

   <leaf_source>  ::= [<in> <language>] <on> <source> [<as> <schema>]
   <non_leaf_source>  ::= <on> <left_parenthesis> <function> <right_parenthesis>

----

   <language>  ::= 'AFRIKAANS' | 'ARABIC' | 'AZERI' | 'BYELORUSSIAN' | 'BULGARIAN' | 'BANGLA' | 'BRETON' | 'BOSNIAN' | 'CATALAN' | 
      'CZECH' | 'WELSH' |    'DANISH' | 'GERMAN' | 'GREEK' | 'ENGLISH' | 'ESPERANTO' | 'SPANISH' | 'ESTONIAN' | 'BASQUE' | 'FARSI' |
      'FINNISH' | 'FAEROESE' | 'FRENCH' | 'FRISIAN' | 'IRISH_GAELIC' | 'GALICIAN' | 'HAUSA' | 'HEBREW' | 'HINDI' | 'CROATIAN' | 
      'HUNGARIAN' | 'ARMENIAN' | 'INDONESIAN' | 'ICELANDIC' | 'ITALIAN' | 'JAPANESE' | 'GEORGIAN' | 'KAZAKH' | 'GREENLANDIC' | 'KOREAN' |
      'KURDISH' | 'KIRGHIZ' | 'LATIN' | 'LETZEBURGESCH' | 'LITHUANIAN' | 'LATVIAN' | 'MAORI' | 'MONGOLIAN' | 'MALAY' | 'MALTESE' |
      'NORWEGIAN_BOKMAAL' | 'DUTCH' | 'NORWEGIAN_NYNORSK' | 'POLISH' | 'PASHTO' | 'PORTUGUESE' | 'RHAETO_ROMANCE' | 'ROMANIAN' | 'RUSSIAN' | 
      'SAMI_NORTHERN' | 'SLOVAK' | 'SLOVENIAN' | 'ALBANIAN' | 'SERBIAN' | 'SWEDISH' | 'SWAHILI' | 'TAMIL' | 'THAI' | 'FILIPINO' | 'TURKISH' |
      'UKRAINIAN' | 'URDU' | 'UZBEK' | 'VIETNAMESE' | 'SORBIAN' | 'YIDDISH' | 'CHINESE_SIMPLIFIED' | 'CHINESE_TRADITIONAL' | 'ZULU'
   <source> ::= string
   <schema>  ::= string
   <left_parenthesis> ::= '('
   <right_parenthesis> ::= ')'
   <comma> ::= ','
   <and> ::= 'and'
   <on> ::= 'on'
   <as> ::= 'as'
   <by> ::= 'by'
   <sort_by> ::= 'sort'
   <from> ::= 'from'
   <if> ::= 'if'
   <then> ::= 'then'
   <else> ::= 'else'

== Examples ==

{| border="1"
|+ '''Example 1'''
|-
! User Request
| Give me back all documents whose metadata contain the word ''woman'' from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''>
|-
! Actual Query
| <pre>fulltextsearch by 'woman' in 'ENGLISH' on '0a952bf0-fa44-11db-aab8-f715cb72c9ff' as 'dc'</pre>
|-
! Explanation
| We perform the ''fulltextsearch'' operation, using the ''woman'' term, on the data source identified by the language ''ENGLISH'', source number ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'' and schema ''dc''
|}

----

{| border="1"
|+ '''Example 2'''
|-
! User Request
| Give me back all documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''> that are created by dorothea; that is, the creator's name contains the separate word dorothea, e.g. ''Hemans, Felicia Dorothea Browne''
|-
! Actual Query
| <pre>fieldedsearch by 'creator' contains 'dorothea' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc'</pre>
|-
! Explanation
| We perform the ''fieldedsearch'' operation on the data source identified by the language ''ENGLISH'', source number ''568a5220-fa43-11db-82de-905c553f17c3'' and schema ''dc'' and retrieve only those records whose creator's name contains the word 'dorothea'. CAUTION: This does not cover creator names such as 'abcdorothea'. In that case, users should use the wildcard '*'. The absence of '*' implies a string boundary. E.g. '*dorothea' matches 'abcdorothea' but not 'dorotheas', and 'dorothea*' matches 'dorotheas' but not 'abcdorothea'. Another critical issue is the data source identifier. Example 1 and Example 2 both refer to the 'A Celebration of Women Writers' collection. However, in Example 1 we refer to the metadata of this collection, whereas in Example 2 we refer to the actual content.
|}

----

{| border="1"
|+ '''Example 3'''
|-
! User Request
| Give me back the ''creator'' and ''subject'' of the first 10 documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''> sorted by the ''DocID'' field, whose creator's name contains ''ro''.
|-
! Actual Query
| <pre>project by 'creator', 'subject' on 
   (keeptop '10' on 
      (sort 'ASC' by 'DocID' on 
         (fieldedsearch by 'creator' contains '*ro*' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')))</pre>
|-
! Explanation
| First of all, we perform the ''fieldedsearch'' operation on the data source identified by the language ''ENGLISH'', source number ''568a5220-fa43-11db-82de-905c553f17c3'' and schema ''dc'' and retrieve only those records whose creator's name contains 'ro'. On this result set, we apply the sort operation on the ''DocID'' field. Then, we apply the keep top operation, in order to keep only the first 10 sorted documents. Finally, we apply the project operation, keeping only the ''creator'' and ''subject'' fields.
|}

----

{| border="1"
|+ '''Example 4'''
|-
! User Request
| Perform a spatial search against the collection identified by the triplet <''ENGLISH'',''6cbe79b0-fbe0-11db-857b-d6a400c8bdbb'',''eiDB''>, defining a search rectangle (0,0), (0,50), (50,50), (50,0).
|-
! Actual Query
| <pre>spatialsearch contains polygon(0 0, 0 50, 50 50, 50 0) 
   in 'ENGLISH' on '6cbe79b0-fbe0-11db-857b-d6a400c8bdbb' as 'eiDB'</pre>
|-
! Explanation
| Search in the collection identified by the triplet <''ENGLISH'',''6cbe79b0-fbe0-11db-857b-d6a400c8bdbb'',''eiDB''> for records that define geometric shapes which include the rectangle identified by the points {(0,0), (0,50), (50,50), (50,0)}.
|}

----

{| border="1"
|+ '''Example 5'''
|-
! User Request
| Give me back all documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''> that are created by dorothea, together with all documents from the same collection whose title contains the word ''woman''
|-
! Actual Query
| <pre>merge on 
    (fieldedsearch by 'creator' contains 'dorothea' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc') 
and (fieldedsearch by 'title' contains 'woman' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')</pre>
|-
! Explanation
| This is an example of how the user can merge the results of more than one subquery.
|}

----

{| border="1"
|+ '''Example 6'''
|-
! User Request
| Give me back all documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''> that are created by dorothea AND whose description contains the word ''London''
|-
! Actual Query
| <pre>join inner by 'DocID' on 
    (fieldedsearch by 'description' contains 'London' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc') 
and (fieldedsearch by 'creator' contains 'dorothea' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')</pre>
|-
! Explanation
| This is an example of how the user can perform a join (logical AND) of subqueries. We perform the ''join'' operation on the field ''DocID'', which is the document's unique identifier. In this way, documents that are members of both subquery result sets participate in the final result set.
|}

----


= Search Orchestrator (Search Master Service)=

The Search Master Service (SMS) is the main entry point to the functionality of the search engine. It contains the elements that will organize the execution of the search operators for the various tasks the Search engine is responsible for.

There are two steps of Query processing that must be completed before the Query can be forwarded to the Search engine for the actual execution and evaluation of the net results. The first step is the transformation of the abstract Query terms into a series of WS invocations. The output of this step is an enriched execution plan mapping the abstract Query to a workflow of service invocations. These invocations are calls to Search Service operators providing basic functionality, called Search Operators. The second step is the optimization of the calculated execution plan.

The SMS is responsible for the first stage of query processing. This stage produces a query execution plan, which in the gCube implementation is a directed acyclic graph of Search Operator invocations. This element is responsible for gathering the whole set of information that is expected to be needed by the various search services and for providing it as context to the processed query. In this manner, delays for gathering information at the various services are significantly reduced, which improves responsiveness.
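Such a plan can be pictured as a small DAG of operator invocations. The sketch below uses a hypothetical PlanNode class (not the gCube data model) and orders the nested query of Example 3 above so that every operator runs after its inputs:

```python
from collections import deque

class PlanNode:
    """Hypothetical stand-in for one Search Operator invocation."""
    def __init__(self, operator, inputs=()):
        self.operator = operator    # e.g. 'fieldedsearch', 'sort', 'keeptop'
        self.inputs = list(inputs)  # upstream nodes whose result sets we consume

def execution_order(root):
    """Return operator names in dependency order (Kahn's topological sort)."""
    # Collect every node reachable from the root of the plan.
    nodes, stack, seen = [], [root], set()
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            nodes.append(node)
            stack.extend(node.inputs)
    by_id = {id(n): n for n in nodes}
    indegree = {id(n): len(n.inputs) for n in nodes}
    dependents = {id(n): [] for n in nodes}
    for n in nodes:
        for inp in n.inputs:
            dependents[id(inp)].append(id(n))
    ready = deque(i for i in indegree if indegree[i] == 0)
    order = []
    while ready:
        i = ready.popleft()
        order.append(by_id[i].operator)
        for d in dependents[i]:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    return order

# Example 3's nested query as a four-node chain:
plan = PlanNode('project',
                [PlanNode('keeptop',
                          [PlanNode('sort', [PlanNode('fieldedsearch')])])])
```

Running `execution_order(plan)` yields the leaf search first and the projection last, mirroring the inside-out evaluation of the nested query syntax.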

The information gathered is produced by various components and services of the gCube Infrastructure, including the gCube Information Service (IS), Content and Metadata Management, the Indexing service, etc. The process of gathering all needed information proves to be very time consuming. To this end, the SMS keeps a cache of previously discovered information and state.

The SMS validates the received Query using SearchLibrary elements. It validates the user supplied query against the elements of the specific Virtual Organisation (VO). This ensures that content collections are available, metadata elements (e.g. fields) are present, operators (i.e. services) are accessible, etc. Afterwards it performs a number of preprocessing steps invoking functionality offered by services such as the Query Personalisation and the DIR (former Content Source Selection) service, in order to refine the context of the search or inject extra information into the query. These are specializations of the general Query Preprocessor Element. An ordering of Query Preprocessor calls is necessary in cases where they might inject conflicting information; otherwise, a method for weighting the importance of the sources of the conflicting information is necessary. Furthermore, a number of exceptions may occur during the operation of a preprocessor, as during the normal flow of any operation. The difference is that, although useful in the context of gCube, preprocessors are not necessary for a Search execution, so errors during Query Preprocessing must not stop the search flow of execution.

The above statement is a sub-case of the more general need for a framework for defining fatal errors and warnings. During the entire Search operation a number of errors and/or warnings may emerge. Depending on the context in which they appear, they may have great or no significance for the overall progress of execution. Currently, these cases are handled separately, but a uniform management scheme may come into play as the specifications of each service's needs in the grand plan of the execution become apparent at a low enough level of detail.

After the above pre-processing steps are completed successfully, the SMS dispatches a QueryPlanner thread to create the Query Execution Plan. Its job is firstly to map the provided Query, as enriched by the preprocessors, to a concrete workflow of WS invocations. Subsequently, the Query Planner uses the information encapsulated with the provided Query, the information gathered by the SMS about the available gCube environment and a set of transformation rules to come up with a near optimal plan. When certain conditions are met (e.g. the Query Planner has finished, time has elapsed, all plans have been evaluated), the planner returns the best plan calculated to its caller. If more than one Query Planner is utilized, the plans calculated by each Query Planner are gathered by the SMS, which then chooses the overall optimal plan and passes it to a suitable execution engine, where execution and scheduling are achieved in a generic manner. The actual integration with the available execution engines and the formalization of their interaction with the SMS is accomplished through the introduction of the eXecution Engine Api (XENA), which is thoroughly analyzed in the Search Library section. Through this formal methodology, the SMS is able to select among the various available engines, such as the Process Execution Service, the Internal Execution Engine or any other WS-workflow engine. These engines are free to enforce their own optimization strategies, as long as they respect the semantic invariants dictated by the original Execution Plan.

Finally, the SMS receives the final ResultSet from the execution engine and passes its endpoint reference back to the requestor.

== DL Description ==
Through the Search Master, external services can receive a structured overview of the VO resources available and usable during a search operation. An example of this summarization is shown below:

   <SearchConfig>
     <collections>
       <collection name="Example Collection Name 1" id="1fc1fbf0-fa3c-11db-82de-905c553f17c3">
         <TYPE>DATA</TYPE>
         <ASSOCIATEDWITH>d510a060-fa3c-11db-aa91-f715cb72c9ff</ASSOCIATEDWITH>
         <ASSOCIATEDWITH>g45612f7-dth5-23fg-45df-45dfg5b1r34s</ASSOCIATEDWITH>
       </collection>
       <collection name="Example Collection Name 2" id="c3f685b0-fdb6-11db-a573-e4518f2111ab">
         <TYPE>DATA</TYPE>
         <ASSOCIATEDWITH>7bb87410-fdb7-11db-8476-f715cb72c9ff</ASSOCIATEDWITH>
         <INDEX>FEATURE</INDEX>
       </collection>
       <collection name="Example Collection Name 3" id="d510a060-fa3c-11db-aa91-f715cb72c9ff">
         <TYPE>METADATA</TYPE>
         <LANGUAGE>en</LANGUAGE>
         <SCHEMA>dc</SCHEMA>
         <ASSOCIATEDWITH>1fc1fbf0-fa3c-11db-82de-905c553f17c3</ASSOCIATEDWITH>
         <INDEX>FTS</INDEX>
         <INDEX>XML</INDEX>
       </collection>
       <collection name="Example Collection Name 4" id="g45612f7-dth5-23fg-45df-45dfg5b1r34s">
         <TYPE>METADATA</TYPE>
         <LANGUAGE>en</LANGUAGE>
         <SCHEMA>tei</SCHEMA>
         <ASSOCIATEDWITH>1fc1fbf0-fa3c-11db-82de-905c553f17c3</ASSOCIATEDWITH>
         <INDEX>FTS</INDEX>
         <INDEX>XML</INDEX>
       </collection>
       <collection name="Example Collection Name 5" id="7bb87410-fdb7-11db-8476-f715cb72c9ff">
         <TYPE>METADATA</TYPE>
         <LANGUAGE>en</LANGUAGE>
         <SCHEMA>dc</SCHEMA>
         <ASSOCIATEDWITH>c3f685b0-fdb6-11db-a573-e4518f2111ab</ASSOCIATEDWITH>
         <INDEX>FTS</INDEX>
         <INDEX>XML</INDEX>
       </collection>
     </collections>
   </SearchConfig>
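As a sketch of how a client might consume such a summary, the following parses a trimmed copy of the XML above and maps each METADATA collection to the collections it is associated with. The helper function is hypothetical, not a gCube API:

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the SearchConfig summary shown above.
SEARCH_CONFIG = """<SearchConfig>
  <collections>
    <collection name="Example Collection Name 1" id="1fc1fbf0-fa3c-11db-82de-905c553f17c3">
      <TYPE>DATA</TYPE>
      <ASSOCIATEDWITH>d510a060-fa3c-11db-aa91-f715cb72c9ff</ASSOCIATEDWITH>
    </collection>
    <collection name="Example Collection Name 3" id="d510a060-fa3c-11db-aa91-f715cb72c9ff">
      <TYPE>METADATA</TYPE>
      <LANGUAGE>en</LANGUAGE>
      <SCHEMA>dc</SCHEMA>
      <ASSOCIATEDWITH>1fc1fbf0-fa3c-11db-82de-905c553f17c3</ASSOCIATEDWITH>
      <INDEX>FTS</INDEX>
    </collection>
  </collections>
</SearchConfig>"""

def metadata_links(xml_text):
    """Map each METADATA collection id to the ids it is ASSOCIATEDWITH."""
    root = ET.fromstring(xml_text)
    return {
        coll.get('id'): [e.text for e in coll.findall('ASSOCIATEDWITH')]
        for coll in root.iter('collection')
        if coll.findtext('TYPE') == 'METADATA'
    }
```

This association is what lets a caller resolve, for instance, which metadata collection (and hence which language and schema) describes a given data collection.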

= Query Processing (Search Library) =
== Query Planner ==
== Execution (Engine) Integration ==

The term Execution (Engine) Integration refers to a logical group of components which materialize the bridge between query planning and plan execution. As described previously, the QueryPlanner produces an ExecutionPlan which should be forwarded to the execution engine(s). The procedure of feeding the engine(s) with the ExecutionPlan is performed by the components described in this section.

Search-Execution integration is far from a trivial task, for the obvious reason that components on each side were developed by different groups, resulting in heterogeneity in their interfaces. The purpose of decoupling components is exactly the ability to employ them from various institutions. All in all, the problem comes down to the fact that we have an ExecutionPlan and a set of available execution engines with potentially completely different interfaces, and we must find a way to feed each engine with a corresponding execution plan, expressed in its native language, but obviously based on the original ExecutionPlan. The solution to this integration problem is the introduction of the eXecution Engine Api (XENA), which is described in the following paragraphs. Before that, we present the available execution engines, which are or will be employed in D4Science.
Currently, there are two execution engines available in gCube: the ProcessExecutionService and the internal QEPExecution engine. The first is an autonomous service developed separately from the Search Framework; it is thoroughly described in section 8.3.3.4. The second is a simple SOAP engine without advanced features such as persistence or failover recovery. Further details regarding QEPExecution can be found below.

Besides these engines, we would like to have the option of employing a widely used, popular workflow (or, better, process) engine. We have selected ActiveBPEL, an open-source BPEL engine that is very popular in the web service world. For further details regarding ActiveBPEL, please refer to [http://www.activevos.com/community-open-source.php].

=== eXecution ENgine API (XENA) ===
As the name reveals, XENA provides the abstractions needed by any engine to execute plans generated by the Search Framework, namely instances of the ExecutionPlan class. XENA can be thought of as middleware between the Search Framework and the execution engine, with the responsibility of 'translating' ExecutionPlan instances to the engine's internal plan representation.

However, the integration issue cannot be treated so simplistically. First of all, one cannot assume that every engine will respect the XENA paradigm and therefore offer inherent support for integration with the Search Framework. Another problem is that the XENA design itself must be very flexible and expressive in order to encompass all of the significant syntactic and semantic attributes of a process execution. Therefore, it is imperative to clarify two major factors: the real-world architectural solution to the search-execution integration, and the data model that XENA adopts.

The idea behind XENA is the following: between the XENA API and the execution engine there is a proxy component called a Connector, which is responsible for translating the XENA API model to the engine's internal model. The idea is not new; it has been applied in many middleware solutions and is best known from the JDBC API. The JDBC-like paradigm dictates the introduction of connectors that bridge JDBC and the underlying RDBMS: for every RDBMS product there is a different software connector. So, in our context, we employ different XENA connectors for different workflow engines. Each connector, apart from the XENA artifacts, is fed with a set of execution engine endpoint references, in order for the connector to know which engine instance(s) it may refer to. Each XENA connector is first registered in a formal way with a special class called ExecutionEngineConnectorFactoryBuilder, publishing itself and the set of supported features. These features are key-value(s) pairs and describe the abilities of the execution engine. They may include persistence, automatic recovery, performance metrics, quality of service, etc. The registration procedure accommodates the dynamic binding to a specific connector (and thus execution engine) made by the SearchMasterService, based on user preferences or predefined system parameters. So, for example, if a user desires maximum availability for his/her processes, then the SearchMasterService can select an execution engine and respective connector that support persistence and failsafe capabilities. The dynamic binding of the XENA API to a specific connector is accomplished by employing some of the Java Classloader capabilities.
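The registration and feature-based selection scheme can be sketched as follows. The class and connector names here are hypothetical Python stand-ins; the real builder is the Java ExecutionEngineConnectorFactoryBuilder:

```python
class ConnectorRegistry:
    """Sketch of connector registration and dynamic binding: each
    connector factory registers with the set of features its engine
    supports, and a caller selects one that covers its requirements."""
    def __init__(self):
        self._connectors = []

    def register(self, connector_factory, features):
        self._connectors.append((connector_factory, set(features)))

    def select(self, required_features):
        # Dynamic binding: the first connector advertising every
        # required feature wins.
        for factory, features in self._connectors:
            if set(required_features) <= features:
                return factory()
        raise LookupError('no connector supports %r' % (required_features,))

# Hypothetical usage: two connectors advertising their feature sets.
registry = ConnectorRegistry()
registry.register(lambda: 'QEPExecutionConnector', {'basic'})
registry.register(lambda: 'PESConnector', {'basic', 'persistence', 'failover'})
```

A request for persistence and failover capabilities would then bind to the second connector, much as the SearchMasterService picks an engine matching user preferences.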

Regarding the data model adopted by XENA, it borrows the design philosophy of another middleware solution, this time from the area of web service registries: JAXR, a Java Specification Request (JSR) for that field. The main abstract classes are:

  * ExecutionVariable: Variables that participate either as data transfer containers or as control flow variables.
  * ExecutionResult: The current result of the plan execution. We don't use the term final, since the ExecutionResult can be retrieved at any time during the process execution. See the description of ExecutionConnector. Through this class, one can get the current ExecutionTrace.
  * ExecutionTrace: The current trace of the execution. It contains the image of the already performed/committed actions defined in the execution plan and of the ExecutionVariables.
  * ExecutionConnector: Provides the actual abstraction of the execution engine. It declares some ‘execution’ methods. The most primitive task is a plain, blocking execute method. However, users may want a more fine-grained control over their process executions. For that reason we have defined three Connector levels:
  * Basic Level: it defines a single, blocking execute method, which receives an ExecutionPlan and returns back an ExecutionResult.
  * Advanced Procedural Level: Basic Level + process management methods, such as executeAsync, pause, resume, cancel, getStatus, waitCompletion.
  * Event-Driven Level: Apart from the purely procedural level, there are many workflow engines which adopt a different paradigm, the event-driven one. According to this, any action that takes place during a process execution, e.g. service invocation finished, or variable initialized, produces a message which can be handled by a corresponding callback method. So the “service invocation finished” event causes the invocation of its associated callback method, within which the developer can perform any management action, housekeeping, logging, etc. The Event-Driven Level offers all the mechanics for registering the set of event which the developer wishes to handle and their associated callback methods. Note that XENA does not offer a callback system. This is the responsibility of the underlying engine. Consequently, only engines that adopt the event model should implement the Event-Driven ExecutionConnector Level.
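The three connector levels above can be sketched as a small interface hierarchy. The type names ExecutionPlan, ExecutionResult and the method names from the text are kept; everything else (the payload shapes, the toy in-memory connector) is an illustrative assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the three ExecutionConnector levels described above.
// Method names follow the text; payload shapes are assumptions.
public class ConnectorLevels {

    public interface ExecutionPlan { String describe(); }
    public interface ExecutionResult { String summary(); }

    // Basic Level: a single, blocking execute method.
    public interface BasicConnector {
        ExecutionResult execute(ExecutionPlan plan);
    }

    // Advanced Procedural Level: Basic Level plus process-management methods.
    public interface ProceduralConnector extends BasicConnector {
        String executeAsync(ExecutionPlan plan); // returns an execution id
        void pause(String executionId);
        void resume(String executionId);
        void cancel(String executionId);
        String getStatus(String executionId);
        ExecutionResult waitCompletion(String executionId);
    }

    // Event-Driven Level: register callbacks for events produced by the engine.
    public interface EventListener { void onEvent(String event); }
    public interface EventDrivenConnector extends BasicConnector {
        void register(String event, EventListener listener);
    }

    // A toy in-memory connector demonstrating the event-driven contract;
    // a real connector would delegate event delivery to the underlying engine.
    public static class ToyConnector implements EventDrivenConnector {
        private final Map<String, EventListener> listeners =
                new HashMap<String, EventListener>();
        public void register(String event, EventListener l) { listeners.put(event, l); }
        public ExecutionResult execute(final ExecutionPlan plan) {
            fire("executionStarted");
            final String s = "executed: " + plan.describe();
            fire("executionFinished");
            return new ExecutionResult() { public String summary() { return s; } };
        }
        private void fire(String event) {
            EventListener l = listeners.get(event);
            if (l != null) l.onEvent(event);
        }
    }
}
```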

Since there are three available execution engines (PES, QEPExecution, ActiveBPEL), the initial design includes three corresponding connectors:
  * A PES connector; the workflow (process) language employed by PES is a shortened version of BPEL.
  * An ActiveBPEL connector, which fully conforms to the OASIS BPEL standard.
  * A QEPExecution connector, internally implemented within the Search Framework, which can directly execute ExecutionPlans.

== Search Operators Core Library ==
The classes that belong to the [[#Search_Operators | Search Operators]] set are the core classes of the data retrieval and processing procedure. They implement the necessary processing algorithms and are wrapped by the respective Search Services, in order to expose their functionality to the gCube infrastructure. However, the SearchOperator classes implement only a subset of the whole gCube functionality; some services do not rely on these classes and incorporate their full functionality without dependencies on any library (apart from the ResultSet and ws-core, of course). These services are the IndexLookupService, GeoIndexLookupService, ForwardIndex, FeatureIndex and Fuse; they are analyzed in separate paragraphs below.
= Search Operators =

== Introduction ==
The Search Operator family of services are the building blocks of any search operation. These, along with services external to Search, handle the production, filtering and refinement of the available data according to user queries. The various intermediate steps towards producing the final search output are handled by Search Operator services. In this section we only describe the Search-internal services listed below, although the Search Operator framework also integrates, at a high level, other services that can be utilized within a search operation context.

The following operators are implemented as stateless services. They receive their input and produce their output in the context of a single invocation without holding any intermediate state. In case any data transferring is necessary either as input to a service or as output from the processing, the [[ResultSet Framework]] is employed.

The search operators cover the basic functionality that could be encountered in a typical search operation. A search can be decomposed into indivisible units consisting of the operators below, and their interaction can form a workflow producing the net result delivered to the requester. The external source search and the service invocation services provide some extensibility for future operators, by offering a method for invoking a service "unknown" to the Search framework and importing its results into the search operator workflow. The currently available search operators are listed below.
== Example Code ==
[[media:SearchClients.tar.gz|Search Operators Usage Examples]]

== Operators ==
=== BooleanOperator ===
==== Description ====
The Boolean Operator is used in conditional execution and, more specifically, in evaluating the condition. It thereby offers the ability to select between alternative execution plans. For example, one can follow a plan (let's say a projection on a given field of a set of data) if a given precondition holds; otherwise, she may follow the alternative plan (e.g. a projection on another field of the same set of data, followed by a sort on that field). Validating the precondition is the responsibility of this service.

The condition is a Boolean expression. Basically, it involves comparisons using the operations: equal, not_equal, greater_than, lower_than, greater_equal, lower_equal. The compared parts are either literals (date, string, integer and double literals are supported) or aggregate functions applied to the results of a search service execution. These aggregate functions include max, min, average, size and sum; they can be applied to a given field of the result set of a search service execution, by referring to that field through XPath expressions.
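A minimal sketch of this kind of condition evaluation follows: apply an aggregate function to a list of numeric field values (in the real service, extracted from a ResultSet via XPath), then compare the result against a literal. The aggregate and operator names mirror the text; the implementation is illustrative only.

```java
import java.util.Collections;
import java.util.List;

// Illustrative sketch of BooleanOperator-style condition evaluation.
// Aggregate and comparison operation names follow the text above.
public class ConditionSketch {

    // Apply one of the named aggregate functions to the extracted field values.
    static double aggregate(String fn, List<Double> values) {
        double sum = 0;
        for (double v : values) sum += v;
        if ("sum".equals(fn)) return sum;
        if ("size".equals(fn)) return values.size();
        if ("average".equals(fn)) return sum / values.size();
        if ("max".equals(fn)) return Collections.max(values);
        if ("min".equals(fn)) return Collections.min(values);
        throw new IllegalArgumentException("unknown aggregate: " + fn);
    }

    // Evaluate one of the named comparison operations.
    static boolean compare(double left, String op, double right) {
        if ("equal".equals(op)) return left == right;
        if ("not_equal".equals(op)) return left != right;
        if ("greater_than".equals(op)) return left > right;
        if ("lower_than".equals(op)) return left < right;
        if ("greater_equal".equals(op)) return left >= right;
        if ("lower_equal".equals(op)) return left <= right;
        throw new IllegalArgumentException("unknown operation: " + op);
    }
}
```

For example, `compare(aggregate("average", prices), "greater_than", 10.0)` would decide which of two alternative plans to follow.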
==== Dependencies ====
*jdk 1.5
*gCore
*ResultSetClientLibrary
*SearchLibrary
=== FilterByXPathOperator ===
==== Description ====
The role of the FilterByXPath Operator is to filter a ResultSet through an expression evaluated against its XML structure; such an expression could be an XPath query. The ResultSet against which the expression is evaluated has previously been constructed by another operator or by a complete search execution. The result of the operation is a new ResultSet, whose endpoint reference is returned to the caller.
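The core of such a filter can be sketched with the standard javax.xml.xpath API. The real operator works over ResultSet references; this standalone example evaluates an XPath expression against a small in-memory document and collects the matching text values.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Illustrative sketch of XPath-based filtering, in the spirit of the
// FilterByXPath operator; the record schema here is a made-up example.
public class XPathFilterSketch {

    // Evaluate an XPath expression against an XML string and return the
    // text content of each matching node.
    static List<String> filter(String xml, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }
}
```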
==== Dependencies ====
*jdk 1.5
*gCore
*ResultSetClientLibrary
*SearchLibrary
=== JoinInnerOperator ===
==== Description ====
The role of the JoinResultSetService is to perform a join operation on a specific field, using a set of ResultSets whose end point references are provided. This operation produces a new ResultSet, leaving the input untouched. The newly created ResultSet is wrapped around a WS-Resource and its endpoint reference is returned to the caller. An in-memory hash-join algorithm has been implemented to perform the join functionality.
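The in-memory hash join can be sketched as follows. Per the inner-join semantics described earlier (only the join key from the left result set is kept and joined with the right payload), the left input is reduced to its key set; build and probe phases are marked in the comments. The record shapes are illustrative assumptions.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the in-memory hash-join algorithm named above.
public class HashJoinSketch {

    // Inner join: keep only join keys present in both inputs, paired with
    // the right input's payload.
    static Map<String, String> innerJoin(Set<String> leftKeys,
                                         Map<String, String> rightPayloads) {
        // Build phase: hash the left key set.
        Set<String> build = new HashSet<String>(leftKeys);
        // Probe phase: scan the right input, emitting matches in order.
        Map<String, String> joined = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> probe : rightPayloads.entrySet()) {
            if (build.contains(probe.getKey())) {
                joined.put(probe.getKey(), probe.getValue());
            }
        }
        return joined;
    }
}
```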
==== Dependencies ====
*jdk 1.5
*gCore
*ResultSetClientLibrary
*SearchLibrary
=== KeepTopOperator ===
==== Description ====
The role of the KeepTop Operator is to perform a simple filtering operation on its input ResultSet and to produce as output a new ResultSet that holds a defined number of leading records.
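Functionally, this is a truncation of the record stream. A minimal sketch (the real operator works over ResultSet references, not in-memory lists):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of KeepTop: retain only the first n leading records.
public class KeepTopSketch {
    static <T> List<T> keepTop(List<T> records, int n) {
        // subList is clamped so that asking for more records than exist is safe.
        return new ArrayList<T>(records.subList(0, Math.min(n, records.size())));
    }
}
```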
==== Dependencies ====
*jdk 1.5
*gCore
*ResultSetClientLibrary
*SearchLibrary
=== MergeOperator ===
==== Description ====
The role of the Merge Operator is to perform a merge operation using a set of ResultSets whose end point references are provided. This operation produces a new ResultSet leaving the input untouched. The newly created ResultSet is wrapped around a WS-Resource and its endpoint reference is returned to the caller.
==== Dependencies ====
*jdk 1.5
*gCore
*ResultSetClientLibrary
*SearchLibrary
=== GoogleOperator ===
==== Description ====
The role of the GoogleOperator is to redirect a query to the Google search engine through its Web Service interface and to wrap the output produced by the external service in a ResultSet, whose endpoint reference is returned to the caller. The above mentioned functionality is supported by elements residing in the SearchLibrary.
==== Dependencies ====
*jdk 1.5
*gCore
*ResultSetClientLibrary
*SearchLibrary

=== SortOperator ===
==== Description ====
The role of the Sort Operator is to sort the provided ResultSet using as key a specific field. This operation produces a new ResultSet leaving the input untouched. The newly created ResultSet is wrapped around a WS-Resource and its end point reference is returned to the caller. The algorithm used is merge sort. The comparison rules differ depending on the type of the elements to be sorted.
The key of the sort operator can be expressed in one of the ways defined in the method org.diligentproject.searchservice.searchlibrary.resultset.elements.ResultElementGeneric#extractValue.
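The field-keyed sort can be sketched with a comparator over the extracted key values. Collections.sort performs a stable merge sort, matching the algorithm named above; representing records as field maps is an illustrative assumption, and a real implementation would dispatch on the element type for its comparison rules.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the Sort operator: sort records by a named field,
// ascending or descending, leaving the input untouched.
public class SortSketch {
    static List<Map<String, String>> sortByField(List<Map<String, String>> records,
                                                 final String field,
                                                 final boolean ascending) {
        // Copy first: the operator produces a new ResultSet, input is untouched.
        List<Map<String, String>> sorted = new ArrayList<Map<String, String>>(records);
        Collections.sort(sorted, new Comparator<Map<String, String>>() {
            public int compare(Map<String, String> a, Map<String, String> b) {
                int c = a.get(field).compareTo(b.get(field)); // string comparison rule
                return ascending ? c : -c;
            }
        });
        return sorted;
    }
}
```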

==== Dependencies ====
*jdk 1.5
*gCore
*ResultSetClientLibrary
*SearchLibrary
=== TransformByXSLTOperator ===
==== Description ====
The role of the TransformByXSLT Operator is to transform a ResultSet it receives as input from one schema to another through an XSLT transformation. The transformation is supplied directly as input to the service. The output of the transformation, which could be a projection of the initial ResultSet, is a new ResultSet wrapped in a WS-Resource whose endpoint reference is returned to the caller.
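The core transformation step can be sketched with the standard JAXP transformation API (javax.xml.transform): compile the caller-supplied stylesheet and apply it to the XML payload. The real operator streams records from a ResultSet; this standalone sketch transforms a single string.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustrative sketch of applying a caller-supplied XSLT to an XML payload,
// in the spirit of the TransformByXSLT operator.
public class XsltSketch {
    static String transform(String xml, String xslt) throws Exception {
        // Compile the stylesheet supplied as input to the service.
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        // Apply it to the payload and collect the transformed output.
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```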
==== Dependencies ====
*jdk 1.5
*gCore
*ResultSetClientLibrary
*SearchLibrary

= Execution Engines =