Difference between revisions of "Search Framework"

From Gcube Wiki
Line 1: Line 1:
Gamw
+
= Search Intro =
 +
= Query Processing Chain =
 +
= Search from the User Perspective: Querying =
 +
 
 +
== Available Operations ==
 +
 
 +
{| border="1"
 +
! Operation !! Semantics
 +
|-
 +
! project
 +
| Perform projection over a result set on a set of elements. By default, all header data (DocID, RankID, CollID) are kept in the projected result set; that is, they need not be specified. If the projected term set is empty, then only the header data are kept.
 +
|-
 +
! sort
 +
| Perform a sort operation on a result set based on an element of that result set, either ascending (ASC) or descending (DESC).
 +
|-
 +
! merge
 +
| Concatenate two or more result sets. The order in which the records appear in the final result set is non-deterministic.
 +
|-
 +
! join
 +
| Join two result sets on a join key, using one of the available modes (currently only 'inner' is implemented). The semantics of the inner join are similar to those of the corresponding relational algebra operation, except that only the join key from the left result set (RS) is kept and joined with the right payload.
 +
|-
 +
! fielded search
 +
| Keep only those records of a source that satisfy a selection expression. The expression is a relation between a key and a value. The key is an element of the result set and the relation is one of: ==, !=, >, <, <=, >=, contains. The 'contains' relation tests whether a string is a substring of another string. With this relation, one can use the wildcard '*', which matches any sequence of characters (including none). We distinguish these cases:
 +
* '*ad*'. It can match any of these: 'ad', 'add', 'mad', 'ladder'
 +
* '*der'. It can match any of these: 'der', 'ladder', but not 'derm' or 'ladders'.
 +
* 'ad*'. It can match any of these: 'ad', 'additional', but not 'mad' or 'ladder'.
 +
* 'ad'. It can only match 'ad'.
 +
If we search on a text field, then ''contains'' is applied to each of its constituent words. For example, if we search on the field '''title''' whose value is '''the rain in spain stays mainly in the plane''', then the matching criterion '*ain*' matches any of 'rain', 'spain', 'mainly'. If the predicate is '==', then we search for an exact match of the whole field; in the previous example, title == 'stays' does not succeed.
 +
Predicates can be combined with ORs and ANDs (currently under development).
 +
The source of this operation can be either a result set generated by another search operation or a content source. In the latter case, you should use a source string identifier.
 +
|-
 +
! full text search
 +
| Perform a full text search on a given source based on a set of terms. The full text search source must be a string identifier of the content source.
 +
Each full text search term may contain a single word or multiple words. In both cases, all terms are combined with a logical '''AND'''. In the second case, if a term is e.g. 'hello nasty', we search for the words 'hello' and 'nasty', with the latter following the former, as stated in the term; text that does not contain this exact succession of the two words does not match the search criteria. Another feature of fulltextsearch is [http://en.wikipedia.org/wiki/Stemming#Lemmatisation_Algorithms lemmatization]: the terms are processed and a set of ''related'' words is generated and also used in the full text search.
 +
|-
 +
! filter by xpath, xslt, math
 +
| Perform a low-level XPath or XSLT operation on a result set. The math type refers to a mathematical language and is intended for advanced users who are acquainted with that language. For more details about the semantics and syntax of that language, please see the documentation for the ResultSetScanner service, which implements this language.
 +
|-
 +
! keep top
 +
| Keep only a given number of records of a result set.
 +
|-
 +
! retrieve metadata
 +
| Retrieve ALL metadata associated with the search results of a previous search operation.
 +
|-
 +
! read
 +
| Read a result set endpoint reference, in order to process it. This operation can be used for further processing the results of a previous search operation.
 +
|-
 +
! external search (deprecated)
 +
| Perform a search on an external (diligent-disabled) source. Currently, Google, RDBMS and OSIRIS infrastructures can be queried. The form of the query string depends on the source. For Google, the query string must conform to Google's query language. For an RDBMS, the query must have the following form in order to be executed successfully:
 +
 
 +
<pre>
 +
<root>
 +
<driverName>your jdbc driver</driverName>
 +
<connectionString>your jdbc connection string</connectionString>
 +
<query>your sql query</query>
 +
</root>
 +
Finally, in the OSIRIS case, the query string must have the following format:
 +
<root>
 +
<collection>your osiris collection</collection>
 +
<imageURL>your image URL to be searched for similar images</imageURL>
 +
<numberOfResults>the number of results</numberOfResults>
 +
</root>
 +
</pre>
 +
|-
 +
! similarity search
 +
| Perform a similarity search on a source for multimedia content (currently, only images). The query specifies the image URL, the source string identifier and pairs of feature and weight.
 +
|-
 +
! spatial search
 +
| Perform a classic spatial search against a user-defined shape (a polygon, to be exact) and a spatial relation (contains, crosses, disjoint, equals, inside, intersect, overlaps, touches).
 +
|-
 +
! conditional search
 +
| Classic If-Then-Else construct. The hypothesis clause involves the (potentially aggregated) value of one or more fields that are part of the result of previous search operation(s). The central predicate is a comparison of two clauses, which are combinations (using the basic math functions +, -, *, /) of these values.
 +
|}
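
The wildcard cases listed for the ''contains'' relation can be sketched in a few lines of Python. This is only an illustration of the semantics described in the table, not the code used by the search operators. The function names are hypothetical. Two details are assumptions: words are split on non-word characters, and matching ignores case (suggested by Example 2, where 'dorothea' is expected to match 'Dorothea').

```python
import re


def wildcard_to_regex(pattern):
    # Each '*' becomes '.*'; every other character is matched literally.
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    # Assumption: matching is case-insensitive.
    return re.compile(".*".join(literal_parts), re.IGNORECASE)


def word_matches(word, pattern):
    # The absence of '*' at either end acts as a string delimiter,
    # so the whole word has to match the pattern.
    return wildcard_to_regex(pattern).fullmatch(word) is not None


def matching_words(field_value, pattern):
    # Assumption: a text field is split into words on non-word characters.
    words = re.findall(r"\w+", field_value)
    return [w for w in words if word_matches(w, pattern)]


def contains(field_value, pattern):
    # 'contains' succeeds if at least one word of the field matches.
    return bool(matching_words(field_value, pattern))


title = "the rain in spain stays mainly in the plane"
print(matching_words(title, "*ain*"))
print(contains("Hemans, Felicia Dorothea Browne", "dorothea"))
```

Translating the pattern to an anchored regular expression keeps the four cases ('*ad*', '*der', 'ad*', 'ad') in a single code path.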
 +
 
 +
== Syntax ==
 +
 
 +
  <function> ::= <project_fun> | <sort_fun> | <filter_fun> | <merge_fun> | <join_fun> | <keeptop_fun> | <fulltexts_fun> |
 +
      <fieldedsearch_fun> | <extsearch_fun> | <read_fun> | <similsearch_fun> | <spatialsearch_fun> | <retrieve_metadata_fun>
 +
 
 +
----
 +
 
 +
  <read_fun> ::= <read_fun_name> <epr>
 +
  <read_fun_name> ::= 'read'
 +
  <epr> ::= string
 +
 
 +
----
 +
 
 +
  <project_fun> ::= <project_fun_name> <by> <project_key> <project_source>
 +
  <project_fun_name> ::= 'project'
 +
  <project_key> ::= string
 +
  <project_source> ::= <non_leaf_source>
 +
 
 +
----
 +
 
 +
  <sort_fun> ::= <sort_fun_name> <sort_order> <by> <sort_key> <sort_source>
 +
  <sort_fun_name> ::= 'sort'
 +
  <sort_key> ::= string
 +
  <sort_order> ::= 'ASC' | 'DESC'
 +
  <sort_source> ::= <non_leaf_source>
 +
 
 +
----
 +
 
 +
  <filter_fun> ::= <filter_fun_name> <filter_type> <by> <filter_statement> <filter_source>
 +
  <filter_fun_name> ::= 'filter'
 +
  <filter_type> ::= string
 +
  <filter_statement> ::= string
 +
  <filter_source> ::= <non_leaf_source> | <leaf_source>
 +
 
 +
----
 +
 
 +
  <merge_fun> ::= <merge_fun_name> <on> <merge_sources>
 +
  <merge_fun_name> ::= 'merge'
 +
  <merge_sources> ::= <merge_source> <and> <merge_source> <merge_sources2>
 +
  <merge_sources2> ::= <and> <merge_source> <merge_sources2> | φ
 +
  <merge_source> ::= <left_parenthesis> <function> <right_parenthesis>
 +
 
 +
----
 +
 
 +
  <join_fun> ::= <join_fun_name> <join_type> <by> <join_key> <on> <join_source> <and> <join_source>
 +
  <join_fun_name> ::= 'join'
 +
  <join_key> ::= string
 +
  <join_type> ::= 'inner' | 'fullOuter' | 'leftOuter' | 'rightOuter'
 +
  <join_source> ::= <left_parenthesis> <function> <right_parenthesis>
 +
 
 +
----
 +
 
 +
  <keeptop_fun> ::= <keeptop_fun_name> <keeptop_number> <keeptop_source>
 +
  <keeptop_fun_name> ::= 'keeptop'
 +
  <keeptop_number> ::= integer
 +
  <keeptop_source> ::= <non_leaf_source>
 +
 
 +
----
 +
 
 +
  <fulltexts_fun> ::= <fulltexts_fun_name> <by> <fulltexts_term> <fulltexts_terms> <in> <language> <on> <fulltexts_sources>
 +
  <fulltexts_fun_name> ::= 'fulltextsearch'
 +
  <fulltexts_terms> ::= <comma> <fulltexts_term> <fulltexts_terms> | φ
 +
  <fulltexts_sources> ::= <fulltexts_source> <fulltexts_sources_2>
 +
  <fulltexts_sources_2> ::= <comma> <fulltexts_source> <fulltexts_sources_2> | φ
 +
  <fulltexts_source> ::= string
 +
 
 +
----
 +
 
 +
  <fieldedsearch_fun> ::= <fieldedsearch_fun_name> <by> <query> <fieldedsearch_source>
 +
  <fieldedsearch_fun_name> ::= 'fieldedsearch'
 +
  <query> ::= string
 +
  <fieldedsearch_source> ::= <non_leaf_source> | <leaf_source>
 +
 
 +
----
 +
 
 +
  <extsearch_fun> ::= <extsearch_fun_name> <by> <extsearch_query> <on> <extsearch_source>
 +
  <extsearch_fun_name> ::= 'externalsearch'
 +
  <extsearch_query> ::= string
 +
  <extsearch_source> ::= string
 +
 
 +
----
 +
 
 +
  <similsearch_fun> ::= <similsearch_fun_name> <as> <URL> <by> <pair> <pairs> <similarity_source>
 +
  <similsearch_fun_name> ::= 'similaritysearch'
 +
  <URL> ::= string
 +
  <pair> ::= <feature> <equal> <weight>
 +
  <pairs> ::= <and> <pair> <pairs> | φ
 +
  <similarity_source> ::= <leaf_source>
 +
 
 +
----
 +
 
 +
  <if-syntax> ::= <if> <left_parenthesis> <function-st> <compare-sign> <function-st> <right_parenthesis> <then> <search-op> <else> <search-op>
 +
  <compare-sign> ::= '==' | '>' | '<' | '>=' | '<='
 +
  <function-st> ::= <left-op> <math-op> <right-op> | <left-op>
 +
  <math-op> ::= '+' | '-' | '*' | '/'
 +
  <left-op> ::= <function> <left_parenthesis> <left-op> <right_parenthesis> | <literal>
 +
  <function> ::= <max-fun> | <min-fun> | <sum-fun> | <av-fun> | <var-fun> | <size-fun>
 +
  <max-fun> ::= 'max' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
 +
  <min-fun> ::= 'min' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
 +
  <sum-fun> ::= 'sum' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
 +
  <av-fun> ::= 'av' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
 +
  <var-fun> ::= 'var' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
 +
  <size-fun> ::= 'size' <left_parenthesis> <search-op> <right_parenthesis>
 +
  <right-op> ::= <function-st> | <left-op>
 +
  <element> ::= an element of the result set payload (either XML element, or XML attribute)
 +
 
 +
----
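
The conditional construct can be illustrated with a small Python sketch. It evaluates the aggregate functions of the grammar (max, min, sum, av, var, size) over an in-memory list of records and then picks one of two branches. The record layout and function names are hypothetical. Whether 'var' denotes population or sample variance is not stated here; the sketch assumes population variance.

```python
import operator
from statistics import mean, pvariance

# Aggregates named as in the grammar; 'var' is assumed to be population variance.
AGGREGATES = {"max": max, "min": min, "sum": sum, "av": mean, "var": pvariance}

# Comparison signs accepted by <compare-sign>.
COMPARE = {
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def aggregate(name, records, element=None):
    # 'size' takes only a result set; the others also take an element name.
    if name == "size":
        return len(records)
    values = [float(record[element]) for record in records]
    return AGGREGATES[name](values)


def conditional(left, sign, right, then_branch, else_branch):
    # Evaluate the predicate and run exactly one of the two search operations.
    chosen = then_branch if COMPARE[sign](left, right) else else_branch
    return chosen()


# Hypothetical result set of a previous search operation.
results = [{"RankID": "1"}, {"RankID": "2"}, {"RankID": "3"}]
outcome = conditional(
    aggregate("max", results, "RankID"), ">", aggregate("size", results) - 1,
    then_branch=lambda: "run the 'then' search",
    else_branch=lambda: "run the 'else' search",
)
print(outcome)
```

Passing the branches as callables means the search operation that is not selected is never executed.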
 +
 
 +
  <retrieve_metadata_fun> ::= <rm_fun_name> <in> <language> <on> <rm_source> <as> <schema>
 +
  <rm_fun_name> ::= 'retrievemetadata'
 +
  <schema> ::= string
 +
  <rm_source> ::= <left_parenthesis> <function> <right_parenthesis>
 +
 
 +
----
 +
 
 +
  <spatialsearch_fun> ::= <spatialsearch_fun_name> <relation> <geometry> [<timeBoundary>] <spatial_source>
 +
  <spatialsearch_fun_name> ::= 'spatialsearch'
 +
  <relation> ::= {'intersects', 'contains', 'isContained'}
 +
  <geometry> ::= <polygon_name> <left_parenthesis> <points> <right_parenthesis>
 +
  <polygon_name> ::= 'polygon'
 +
  <timeBoundary> ::= 'within' <startTime> <stopTime>
 +
  <startTime> ::= double
 +
  <stopTime> ::= double
 +
  <spatial_source> ::= <leaf_source>
 +
  <points> ::= <point> {<comma> <point>}+
 +
  <point> ::= <x> <y>
 +
  <x> ::= long
 +
  <y> ::= long
 +
 
 +
----
 +
 
 +
  <leaf_source>  ::= [<in> <language>] <on> <source> [<as> <schema>]
 +
  <non_leaf_source>  ::= <on> <left_parenthesis> <function> <right_parenthesis>
 +
 
 +
----
 +
 
 +
  <language>  ::= 'AFRIKAANS' | 'ARABIC' | 'AZERI' | 'BYELORUSSIAN' | 'BULGARIAN' | 'BANGLA' | 'BRETON' | 'BOSNIAN' | 'CATALAN' |
 +
      'CZECH' | 'WELSH' |    'DANISH' | 'GERMAN' | 'GREEK' | 'ENGLISH' | 'ESPERANTO' | 'SPANISH' | 'ESTONIAN' | 'BASQUE' | 'FARSI' |
 +
      'FINNISH' | 'FAEROESE' | 'FRENCH' | 'FRISIAN' | 'IRISH_GAELIC' | 'GALICIAN' | 'HAUSA' | 'HEBREW' | 'HINDI' | 'CROATIAN' |
 +
      'HUNGARIAN' | 'ARMENIAN' | 'INDONESIAN' | 'ICELANDIC' | 'ITALIAN' | 'JAPANESE' | 'GEORGIAN' | 'KAZAKH' | 'GREENLANDIC' | 'KOREAN' |
 +
      'KURDISH' | 'KIRGHIZ' | 'LATIN' | 'LETZEBURGESCH' | 'LITHUANIAN' | 'LATVIAN' | 'MAORI' | 'MONGOLIAN' | 'MALAY' | 'MALTESE' |
 +
      'NORWEGIAN_BOKMAAL' | 'DUTCH' | 'NORWEGIAN_NYNORSK' | 'POLISH' | 'PASHTO' | 'PORTUGUESE' | 'RHAETO_ROMANCE' | 'ROMANIAN' | 'RUSSIAN' |
 +
      'SAMI_NORTHERN' | 'SLOVAK' | 'SLOVENIAN' | 'ALBANIAN' | 'SERBIAN' | 'SWEDISH' | 'SWAHILI' | 'TAMIL' | 'THAI' | 'FILIPINO' | 'TURKISH' |
 +
      'UKRAINIAN' | 'URDU' | 'UZBEK' | 'VIETNAMESE' | 'SORBIAN' | 'YIDDISH' | 'CHINESE_SIMPLIFIED' | 'CHINESE_TRADITIONAL' | 'ZULU'
 +
  <source> ::= string
 +
  <schema>  ::= string
 +
  <left_parenthesis> ::= '('
 +
  <right_parenthesis> ::= ')'
 +
  <comma> ::= ','
 +
  <and> ::= 'and'
 +
  <on> ::= 'on'
 +
  <as> ::= 'as'
 +
  <by> ::= 'by'
 +
  <sort_by> ::= 'sort'
 +
  <from> ::= 'from'
 +
  <if> ::= 'if'
 +
  <then> ::= 'then'
 +
  <else> ::= 'else'
 +
 
 +
== Examples ==
 +
 
 +
{| border="1"
 +
|+ '''Example 1'''
 +
|-
 +
! User Request
 +
| Give me back all documents whose metadata contain the word ''woman'' from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''>
 +
|-
 +
! Actual Query
 +
| <pre>fulltextsearch by 'woman' in 'ENGLISH' on '0a952bf0-fa44-11db-aab8-f715cb72c9ff' as 'dc'</pre>
 +
|-
 +
! Explanation
 +
| We perform the ''fulltextsearch'' operation, using the term ''woman'', on the data source identified by the language ''ENGLISH'', source number ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'' and schema ''dc''.
 +
|}
 +
 
 +
----
 +
 
 +
{| border="1"
 +
|+ '''Example 2'''
 +
|-
 +
! User Request
 +
| Give me back all documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''> that are created by dorothea; that is, the creator's name contains the separate word dorothea, e.g. ''Hemans, Felicia Dorothea Browne''
 +
|-
 +
! Actual Query
 +
| <pre>fieldedsearch by 'creator' contains 'dorothea' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc'</pre>
 +
|-
 +
! Explanation
 +
| We perform the ''fieldedsearch'' operation on the data source identified by the language ''ENGLISH'', source number ''568a5220-fa43-11db-82de-905c553f17c3'' and schema ''dc'', and retrieve only those documents whose creator's name contains the word 'dorothea'. CAUTION: This does not cover creator names such as 'abcdorothea'. In this case, users should use the wildcard '*'. The absence of '*' implies a string delimiter. E.g. '*dorothea' matches 'abcdorothea' but not 'dorotheas'; 'dorothea*' matches 'dorotheas' but not 'abcdorothea'. Another critical issue is the data source identifier. Example 1 and Example 2 refer to the 'A Celebration of Women Writers' collection. However, in Example 1 we refer to the metadata of this collection, whereas in Example 2 we refer to the actual content.
 +
|}
 +
 
 +
----
 +
 
 +
{| border="1"
 +
|+ '''Example 3'''
 +
|-
 +
! User Request
 +
| Give me back the ''creator'' and ''subject'' of the first 10 documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''>, sorted by the ''DocID'' field, whose creator's name contains ''ro''.
 +
|-
 +
! Actual Query
 +
| <pre>project by 'creator', 'subject' on
 +
  (keeptop '10' on
 +
      (sort 'ASC' by 'DocID' on
 +
        (fieldedsearch by 'creator' contains '*ro*' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')))</pre>
 +
|-
 +
! Explanation
 +
| First, we perform the fieldedsearch operation on the data source identified by the language ''ENGLISH'', source number ''568a5220-fa43-11db-82de-905c553f17c3'' and schema ''dc'', and retrieve only those documents whose creator's name contains the string 'ro'. On this result set, we apply the sort operation on the ''DocID'' field. Then, we apply the keep top operation in order to keep only the first 10 sorted documents. Finally, we apply the project operation, keeping only the ''creator'' and ''subject'' fields.
 +
|}
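
The nesting used in Example 3 can be produced programmatically. The following Python sketch composes the query string from small helper functions, one per operation. The helpers are hypothetical and simply follow the quoting shown in the examples; they are not part of any gCube client library.

```python
def leaf_source(language, source, schema):
    # <leaf_source>: language, source identifier and schema, as in the examples.
    return f"in '{language}' on '{source}' as '{schema}'"


def fieldedsearch(key, relation, value, source):
    # The source is a leaf source string built by leaf_source().
    return f"fieldedsearch by '{key}' {relation} '{value}' {source}"


def sort_by(order, key, inner):
    # Non-leaf sources wrap the inner function in parentheses after 'on'.
    return f"sort '{order}' by '{key}' on ({inner})"


def keeptop(count, inner):
    return f"keeptop '{count}' on ({inner})"


def project(keys, inner):
    quoted_keys = ", ".join(f"'{key}'" for key in keys)
    return f"project by {quoted_keys} on ({inner})"


# Rebuild the query of Example 3 from the inside out.
source = leaf_source("ENGLISH", "568a5220-fa43-11db-82de-905c553f17c3", "dc")
query = project(
    ["creator", "subject"],
    keeptop(10, sort_by("ASC", "DocID",
                        fieldedsearch("creator", "contains", "*ro*", source))),
)
print(query)
```

Apart from line breaks and indentation, the printed string is the query of Example 3.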
 +
 
 +
----
 +
 
 +
{| border="1"
 +
|+ '''Example 4'''
 +
|-
 +
! User Request
 +
| Perform a spatial search against the collection identified by the triplet <''ENGLISH'',''6cbe79b0-fbe0-11db-857b-d6a400c8bdbb'',''eiDB''>, defining a search rectangle with corners (0,0), (0,50), (50,50), (50,0).
 +
|-
 +
! Actual Query
 +
| <pre>spatialsearch contains polygon(0 0, 0 50, 50 50, 50 0)
 +
  in 'ENGLISH' on '6cbe79b0-fbe0-11db-857b-d6a400c8bdbb' as 'eiDB'</pre>
 +
|-
 +
! Explanation
 +
| Search in the collection identified by the triplet <''ENGLISH'',''6cbe79b0-fbe0-11db-857b-d6a400c8bdbb'',''eiDB''> for records that define geometric shapes which include the rectangle identified by the points {(0,0), (0,50), (50,50), (50,0)}.
 +
|}
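
A spatial query such as the one in Example 4 can be assembled and checked before it is submitted. The Python sketch below builds the query string and validates the relation against the three values listed in the grammar (the operations table names more relations; which of them are actually accepted is not verified here). The function name is hypothetical, and the optional time boundary is omitted.

```python
# Relations taken from the <relation> rule of the grammar.
RELATIONS = ("intersects", "contains", "isContained")


def spatialsearch(relation, points, language, source, schema):
    if relation not in RELATIONS:
        raise ValueError(f"unsupported relation: {relation}")
    # The <points> rule requires a first point followed by at least one more.
    if len(points) < 2:
        raise ValueError("a polygon needs at least two points")
    # Each point is written as 'x y'; points are separated by commas.
    coordinates = ", ".join(f"{x} {y}" for x, y in points)
    return (f"spatialsearch {relation} polygon({coordinates}) "
            f"in '{language}' on '{source}' as '{schema}'")


rectangle = [(0, 0), (0, 50), (50, 50), (50, 0)]
query = spatialsearch("contains", rectangle, "ENGLISH",
                      "6cbe79b0-fbe0-11db-857b-d6a400c8bdbb", "eiDB")
print(query)
```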
 +
 
 +
----
 +
 
 +
{| border="1"
 +
|+ '''Example 5'''
 +
|-
 +
! User Request
 +
| Give me back all documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''> that are created by dorothea and all documents from the same collection whose title contains the word ''woman''
 +
|-
 +
! Actual Query
 +
| <pre>merge on
 +
    (fieldedsearch by 'creator' contains 'dorothea' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')
 +
and (fieldedsearch by 'title' contains 'woman' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')</pre>
 +
|-
 +
! Explanation
 +
| This is an example of how the user can merge the results of more than one subquery.
 +
|}
 +
 
 +
----
 +
 
 +
{| border="1"
 +
|+ '''Example 6'''
 +
|-
 +
! User Request
 +
| Give me back all documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''> that are created by dorothea AND whose description contains the word ''London''
 +
|-
 +
! Actual Query
 +
| <pre>join inner by 'DocID' on
 +
    (fieldedsearch by 'description' contains 'London' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')
 +
and (fieldedsearch by 'creator' contains 'dorothea' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')</pre>
 +
|-
 +
! Explanation
 +
| This is an example of how the user can perform a join (logical AND) of subqueries. We perform the ''join'' operation on the field ''DocID'', which is the unique document identifier. In this way, only documents that are members of the result sets of both subqueries participate in the final result set.
 +
|}
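
The effect of the inner join on ''DocID'' can be sketched over two in-memory result sets. This is an illustration of the semantics described in the operations table (only the join key of the left result set is used; the payload comes from the right one), assuming that records can be modelled as dictionaries. The function name and sample records are hypothetical.

```python
def inner_join(left, right, key):
    # Only the join key of the left result set is used.
    left_keys = {record[key] for record in left}
    # The payload of the output comes from the right result set.
    return [record for record in right if record[key] in left_keys]


# Hypothetical results of the two subqueries of Example 6.
described_with_london = [
    {"DocID": "d1", "description": "Printed in London"},
    {"DocID": "d2", "description": "London, 1828"},
]
created_by_dorothea = [
    {"DocID": "d2", "creator": "Hemans, Felicia Dorothea Browne"},
    {"DocID": "d3", "creator": "Dorothea Example"},
]

joined = inner_join(described_with_london, created_by_dorothea, "DocID")
print(joined)
```

Only the document that appears in both result sets survives, which is why the join acts as a logical AND of the two subqueries.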
 +
 
 +
----
 +
 
 +
 
 +
= Search Orchestrator (Search Master Service) =
 +
 
 +
The Search Master Service (SMS) is the main entry point to the functionality of the search engine. It contains the elements that will organize the execution of the search operators for the various tasks the Search engine is responsible for.
 +
 
 +
A Query goes through two processing steps before it can be forwarded to the Search engine for the actual execution and evaluation of the results. The first step is the transformation of the abstract Query terms into a series of WS invocations. The output of this step is an enriched execution plan that maps the abstract Query to a workflow of service invocations. These invocations are calls to Search Service operators providing basic functionality, called Search Operators. The second step is the optimization of the calculated execution plan.
 +
 
 +
The SMS is responsible for the first stage of query processing. This stage produces a query execution plan, which in the gCube implementation is a directed acyclic graph of Search Operator invocations. The SMS gathers the whole set of information that is expected to be needed by the various search services and provides it as context to the processed query. In this way, the individual services spend much less time gathering information themselves, which improves responsiveness.
 +
 
 +
 
 +
The information gathered is produced by various components or services of the gCube Infrastructure. They include the gCube Information Service (IS), Content and Metadata Management, the Indexing service etc. Gathering all the needed information is very time consuming. For this reason, the SMS keeps a cache of previously discovered information and state.
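
The caching idea can be sketched as follows. This is a minimal illustration, not the SMS implementation: the class name is hypothetical, and the time-based expiry is an assumption, since the invalidation policy of the real cache is not described here.

```python
import time


class DiscoveryCache:
    """Caches discovered information for a limited time (assumed policy)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable, which simplifies testing
        self._entries = {}       # key -> (time stored, value)

    def get(self, key, fetch):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # still fresh: skip the expensive lookup
        value = fetch()          # e.g. a query to the Information Service
        self._entries[key] = (now, value)
        return value


cache = DiscoveryCache(ttl_seconds=300)
collections = cache.get("collections", lambda: ["collection-1", "collection-2"])
print(collections)
```

The expensive lookup is passed in as a callable, so it runs only on a cache miss or after the entry has expired.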
 +
 
 +
 
 +
The SMS validates the received Query using SearchLibrary elements. It validates the user-supplied query against the elements of the specific Virtual Organisation (VO). This ensures that content collections are available, metadata elements (e.g. fields) are present, operators (i.e. services) are accessible etc. Afterwards, it performs a number of preprocessing steps, invoking functionality offered by services such as the Query Personalisation and the DIR (former Content Source Selection) service, in order to refine the context of the search or inject extra information into the query. These are specializations of the general Query Preprocessor Element. The Query Preprocessor calls must be ordered when they might inject conflicting information; otherwise, a method for weighting the importance of each source of conflicting information is necessary. Furthermore, exceptions may occur during the operation of a preprocessor, as during the normal flow of any operation. The difference is that preprocessors, although useful in the context of gCube, are not necessary for a Search execution. Errors during Query Preprocessing must therefore not stop the search flow of execution.
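
The requirement that preprocessing errors must not stop the search can be sketched as a simple loop that runs the preprocessors in a fixed order and downgrades their failures to warnings. The function and preprocessor names are hypothetical, and modelling the query as a dictionary is an assumption made for the sake of the example.

```python
import logging

logger = logging.getLogger("sms.preprocessing")


def run_preprocessors(query, preprocessors):
    # Preprocessors run in the given order, so later ones win on conflicts.
    for step in preprocessors:
        try:
            # Each step works on a copy, so a failing step cannot leave
            # the query half-modified.
            query = step(dict(query))
        except Exception as error:
            # A preprocessor failure is only a warning: the search goes on
            # with the query as it was before this step.
            name = getattr(step, "__name__", repr(step))
            logger.warning("preprocessor %s failed: %s", name, error)
    return query


def personalise(query):
    # Hypothetical stand-in for the Query Personalisation service.
    query["language"] = "ENGLISH"
    return query


def select_sources(query):
    # Hypothetical stand-in for the DIR service, simulating an outage.
    raise RuntimeError("service unavailable")


enriched = run_preprocessors({"terms": ["woman"]}, [personalise, select_sources])
print(enriched)
```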
 +
 
 +
 
 +
The above statement is a special case of the more general need for a framework for defining fatal errors and warnings. During the entire Search operation, a number of errors and/or warnings may emerge. Depending on the context in which they appear, they may have great or no significance to the overall progress of execution. Currently, these cases are handled separately, but a uniform management may come into play as the specifications of each service’s needs in the grand plan of the execution become more apparent at a low enough level of detail.
 +
 
 +
 
 +
After the above pre-processing steps are completed successfully, the SMS dispatches a QueryPlanner thread to create the Query Execution Plan. Its job is first to map the provided Query, as enriched by the preprocessors, to a concrete workflow of WS invocations. Subsequently, the Query Planner uses the information encapsulated in the provided Query, the information gathered by the SMS about the available gCube environment and a set of transformation rules to come up with a near-optimal plan. When certain conditions are met (e.g. the Query Planner has finished, time has elapsed, all plans have been evaluated), the planner returns the best plan calculated to its caller. If more than one Query Planner is used, the plans calculated by each Query Planner are gathered by the SMS. The SMS then chooses the overall optimal plan and passes it to a suitable execution engine, where execution and scheduling are handled in a generic manner. The actual integration with the available execution engines and the formalization of their interaction with the SMS is accomplished through the introduction of the eXecution ENgine API (XENA), which is thoroughly analyzed in the Search Library section. With this formal methodology, the SMS is able to select among the various available engines, such as the Process Execution Service, the Internal Execution Engine or any other WS-workflow engine. These engines are free to enforce their own optimization strategies, as long as they respect the semantic invariants dictated by the original Execution Plan.
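
The plan selection described above can be sketched in Python. Each planner evaluates candidate plans until it runs out of candidates or time and returns its best plan; the SMS then picks the best plan overall. The class and function names are hypothetical, and ranking plans by a single estimated cost is an assumption.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    description: str
    estimated_cost: float    # assumed metric: lower is better


def best_plan(candidates, time_budget_seconds):
    # One planner: evaluate candidates until they run out or time elapses.
    deadline = time.monotonic() + time_budget_seconds
    best = None
    for plan in candidates:
        if best is None or plan.estimated_cost < best.estimated_cost:
            best = plan
        if time.monotonic() >= deadline:
            break            # time has elapsed: return the best plan so far
    return best


def choose_overall(planner_results):
    # The SMS: gather the planners' results and keep the cheapest plan.
    plans = [plan for plan in planner_results if plan is not None]
    return min(plans, key=lambda plan: plan.estimated_cost, default=None)


planner_a = best_plan([Plan("sort before join", 9.0), Plan("join first", 4.0)], 5.0)
planner_b = best_plan([Plan("merge then filter", 6.0)], 5.0)
print(choose_overall([planner_a, planner_b]))
```

With a time budget of zero, a planner still evaluates its first candidate, so it always has a plan to return when at least one candidate exists.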
 +
 
 +
 
 +
Finally, the SMS receives the final ResultSet from the execution engine and passes its end point reference back to the requestor.
 +
 
 +
 
 +
 
 +
== DL Description ==
 +
Through the Search Master, external services can receive a structured overview of the VO resources that are available and usable during a search operation. An example of this summary is shown below:
 +
 
 +
  <SearchConfig>
 +
    <collections>
 +
      <collection name="Example Collection Name 1" id="1fc1fbf0-fa3c-11db-82de-905c553f17c3">
 +
        <TYPE>DATA</TYPE>
 +
        <ASSOCIATEDWITH>d510a060-fa3c-11db-aa91-f715cb72c9ff</ASSOCIATEDWITH>
 +
        <ASSOCIATEDWITH>g45612f7-dth5-23fg-45df-45dfg5b1r34s</ASSOCIATEDWITH>
 +
      </collection>
 +
      <collection name="Example Collection Name 2" id="c3f685b0-fdb6-11db-a573-e4518f2111ab">
 +
        <TYPE>DATA</TYPE>
 +
        <ASSOCIATEDWITH>7bb87410-fdb7-11db-8476-f715cb72c9ff</ASSOCIATEDWITH>
 +
        <INDEX>FEATURE</INDEX>
 +
      </collection>
 +
      <collection name="Example Collection Name 3" id="d510a060-fa3c-11db-aa91-f715cb72c9ff">
 +
        <TYPE>METADATA</TYPE>
 +
        <LANGUAGE>en</LANGUAGE>
 +
        <SCHEMA>dc</SCHEMA>
 +
        <ASSOCIATEDWITH>1fc1fbf0-fa3c-11db-82de-905c553f17c3</ASSOCIATEDWITH>
 +
        <INDEX>FTS</INDEX>
 +
        <INDEX>XML</INDEX>
 +
      </collection>
 +
      <collection name="Example Collection Name 4" id="g45612f7-dth5-23fg-45df-45dfg5b1r34s">
 +
        <TYPE>METADATA</TYPE>
 +
        <LANGUAGE>en</LANGUAGE>
 +
        <SCHEMA>tei</SCHEMA>
 +
        <ASSOCIATEDWITH>1fc1fbf0-fa3c-11db-82de-905c553f17c3</ASSOCIATEDWITH>
 +
        <INDEX>FTS</INDEX>
 +
        <INDEX>XML</INDEX>
 +
      </collection>
 +
      <collection name="Example Collection Name 5" id="7bb87410-fdb7-11db-8476-f715cb72c9ff">
 +
        <TYPE>METADATA</TYPE>
 +
        <LANGUAGE>en</LANGUAGE>
 +
        <SCHEMA>dc</SCHEMA>
 +
        <ASSOCIATEDWITH>c3f685b0-fdb6-11db-a573-e4518f2111ab</ASSOCIATEDWITH>
 +
        <INDEX>FTS</INDEX>
 +
        <INDEX>XML</INDEX>
 +
      </collection>
 +
    </collections>
 +
  </SearchConfig>
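
A client that receives such a description might use it to find the metadata collections (with their language, schema and indices) that are associated with a given data collection. The Python sketch below does this with the standard library XML parser on a shortened copy of the example above. The function name is hypothetical, and the element names are taken from the example rather than from a published schema.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example description.
SEARCH_CONFIG = """
<SearchConfig>
  <collections>
    <collection name="Example Collection Name 1" id="1fc1fbf0-fa3c-11db-82de-905c553f17c3">
      <TYPE>DATA</TYPE>
      <ASSOCIATEDWITH>d510a060-fa3c-11db-aa91-f715cb72c9ff</ASSOCIATEDWITH>
    </collection>
    <collection name="Example Collection Name 3" id="d510a060-fa3c-11db-aa91-f715cb72c9ff">
      <TYPE>METADATA</TYPE>
      <LANGUAGE>en</LANGUAGE>
      <SCHEMA>dc</SCHEMA>
      <ASSOCIATEDWITH>1fc1fbf0-fa3c-11db-82de-905c553f17c3</ASSOCIATEDWITH>
      <INDEX>FTS</INDEX>
      <INDEX>XML</INDEX>
    </collection>
  </collections>
</SearchConfig>
"""


def metadata_collections_for(config_xml, data_collection_id):
    root = ET.fromstring(config_xml)
    by_id = {c.get("id"): c for c in root.iter("collection")}
    data_collection = by_id[data_collection_id]
    found = []
    # Follow each ASSOCIATEDWITH link and keep the METADATA collections.
    for link in data_collection.findall("ASSOCIATEDWITH"):
        target = by_id.get(link.text)
        if target is not None and target.findtext("TYPE") == "METADATA":
            found.append({
                "id": target.get("id"),
                "language": target.findtext("LANGUAGE"),
                "schema": target.findtext("SCHEMA"),
                "indices": [index.text for index in target.findall("INDEX")],
            })
    return found


print(metadata_collections_for(SEARCH_CONFIG, "1fc1fbf0-fa3c-11db-82de-905c553f17c3"))
```

Links to collections that are missing from the description are skipped rather than treated as errors.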
 +
 
 +
= Query Processing (Search Library) =
 +
== Query Planner ==
 +
== eXecution ENgine API (XENA) ==
 +
== Search Operators Core Library ==
 +
= Search Operators =
 +
= Execution Engines =

Revision as of 14:08, 28 August 2008

Search Intro

Query Processing Chain

Search from the User Perspective: Querying

Available Operations

Operation Semantics
project Perform projection over a result set on a set of elements. By default, all header data (DocID, RankID, CollID) are kept in the projected result set; that is, they need not to be specified. If the projected term set is empty, then only the header data are kept.
sort Perform a sort operation on a result set based on an element of that result set, either ascending (ASC) or descending (DESC).
merge Concatenate two or more result sets. The records are non-deterministically concatenated in the final result set.
join Join two result sets on a join key, using one of the available modes (currently only 'inner' is implemented). The semantics of the inner join is similar to the respective operation found in the relational algebra, with the difference that in our case, only the join key from the left RS is kept and joined with the right payload.
fielded search Keep only those records of a source, which conform to a selection expression. This expression is a relation between a key and a value. The key is an element of the result set and the relation is one of the: ==, !=, >, <, <=, >=, contains. The 'contains' relation refers to whether a string is a substring of another string. Using this comparison function, one can use the wildcard '*', which means any. We discriminate these cases:
  • '*ad*'. It can match any of these: 'ad', 'add', 'mad', 'ladder'
  • '*der'. It can match any of these: 'der', 'ladder', but not 'derm' ot 'ladders'.
  • 'ad*'. It can match any of these: 'ad', 'additional', but not 'mad' or 'ladder'.
  • 'ad'. It can only match 'ad'.

If we search on a text field, then contains refers to any of its consisting words. For example, if we search on the field title which is the rain in spain stays mainly in the plane, then the matching criteria '*ain*' refers to any of the 'rain', 'spain', 'mainly'. If the predicate is '==' then we search for exact match; that is, in the previous example, the title == 'stays', won't succeed. Predicates can be combined with ORs, ANDs (currently under development). The source of this operation can be either a result set generated by another search operation or a content source. In the last case, you should use a source string identifier.

full text search Perform a full text search on a given source based on a set of terms. The full text search source must be a string identifier of the content source.

Each full text search term may contain a single or multiple words. In both cases, all terms are combined with a logical AND. In the second case, is a term is e.g. 'hello nasty', we search for the words 'hello' and 'nasty', with the latter following the former, as stated in the term; Text that does not contain such exact succession of the two words, it won't match the search criteria. Another feature of fulltextsearch is the lemmatization. In a few words, the terms are processed and a set of relative words is generated and also used in the full text search.

filter by xpath, xslt, math Perform a low level xpath or a xslt operation on result set. The math type refers to a mathematical language and is used by advanced users who are acquainted with that language. For more details about the semantics and syntax of that language, please see the documentation for the ResultSetScanner service, which implements this language.
keep top Keep only a given number of records of a result set.
retrieve metadata Retrieve ALL metadata associated to the search results of a previous search operation.
read Read a result set endpoint reference, in order to process it. This operation can be used for further processing the results of a previous search operation.
external search (deprecated) Perform a search on external (diligent-disabled) source. Currently, google, RDBMS and the OSIRIS infrastructures can be queried. Depending on the source, the query string can vary. As far as google is concerned, the query string must conform to the query language of google. In the case of RDBMS, the query must have the following form, in order to be executed successfully:
<root>
<driverName>your jdbc driver</driverName>
<connectionString>your jdbc connection string</connectionString>
<query>your sql queryt</query>
</root>
Finally, in the OSIRIS case, the query string must have the following format:
<root>
<collection>your osiris collection</collection>
<imageURL>your image URL to be searched for similar images</imageURL>
<numberOfResults>the number of results</numberOfResults>
</root>
similarity search Perform a similarity search on a source for a multimedia content (currently, only images). The image URL is defined, along with the source string identifier and pairs of feature and weight.
spatial search Perform a classic spatial search against a used defined shape (polygon, to be exact) and a spatial relation (contains, crosses, disjoint, equals, inside, intersect, overlaps, touches.
conditional search Classic If-Then-Else construct. The hypothesis clause involves the (potentially aggragated) value of one or more fields which are part of the result of previous search operation(s). The central predicate involves a comparison of two clauses, which are combinations (with the basic math functions +, -, *, /) of these values

Syntax

  <function> ::= <project_fun> | <sort_fun> | <filter_fun> | <merge_fun> | <join_fun> | <keeptop_fun> | <fulltexts_fun> | 
     <fieldedsearch_fun> | <extsearch_fun> | <read_fun> | <similsearch_fun> | <spatialsearch_fun> | <retrieve_metadata_fun> 

  <read_fun> ::= <read_fun_name> <epr>
  <read_fun_name> ::= 'read'
  <epr> ::= string

  <project_fun> ::= <project_fun_name> <by> <project_key> <project_source>
  <project_fun_name> ::= 'project'
  <project_key> ::= string
  <project_source> ::= <non_leaf_source>

  <sort_fun> ::= <sort_fun_name> <sort_order> <by> <sort_key> <sort_source>
  <sort_fun_name> ::= 'sort'
  <sort_key> ::= string
  <sort_order> ::= 'ASC' | 'DESC'
  <sort_source> ::= <non_leaf_source>

  <filter_fun> ::= <filter_fun_name> <filter_type> <by> <filter_statement> <filter_source>
  <filter_fun_name> ::= 'filter'
  <filter_type> ::= string
  <filter_statement> ::= string
  <filter_source> ::= <non_leaf_source> | <leaf_source>

  <merge_fun> ::= <merge_fun_name> <on> <merge_sources>
  <merge_fun_name> ::= 'merge'
  <merge_sources> ::= <merge_source> <and> <merge_source> <merge_sources2>
  <merge_sources2> ::= <and> <merge_source> <merge_sources2> | φ
  <merge_source> ::= <left_parenthesis> <function> <right_parenthesis>

  <join_fun> ::= <join_fun_name> <join_type> <by> <join_key> <on> <join_source> <and> <join_source>
  <join_fun_name> ::= 'join'
  <join_key> ::= string
  <join_type> ::= 'inner' | 'fullOuter' | 'leftOuter' | 'rightOuter'
  <join_source> ::= <left_parenthesis> <function> <right_parenthesis>

  <keeptop_fun> ::= <keeptop_fun_name> <keeptop_number> <keeptop_source>
  <keeptop_fun_name> ::= 'keeptop'
  <keeptop_number> ::= integer
  <keeptop_source> ::= <non_leaf_source>

  <fulltexts_fun> ::= <fulltexts_fun_name> <by> <fulltexts_term> <fulltexts_terms> <in> <language> <on> <fulltexts_sources>
  <fulltexts_fun_name> ::= 'fulltextsearch'
  <fulltexts_terms> ::= <comma> <fulltexts_term> <fulltexts_terms> | φ
  <fulltexts_sources> ::= <fulltexts_source> <fulltexts_sources_2>
  <fulltexts_sources_2> ::= <comma> <fulltexts_source> <fulltexts_sources_2> | φ
  <fulltexts_source> ::= string

  <fieldedsearch_fun> ::= <fieldedsearch_fun_name> <by> <query> <fieldedsearch_source>
  <fieldedsearch_fun_name> ::= 'fieldedsearch'
  <query> ::= string
  <fieldedsearch_source> ::= <non_leaf_source> | <leaf_source>

  <extsearch_fun> ::= <extsearch_fun_name> <by> <extsearch_query> <on> <extsearch_source>
  <extsearch_fun_name> ::= 'externalsearch'
  <extsearch_query> ::= string
  <extsearch_source> ::= string

  <similsearch_fun> ::= <similsearch_fun_name> <as> <URL> <by> <pair> <pairs> <similarity_source>
  <similsearch_fun_name> ::= 'similaritysearch'
  <URL> ::= string
  <pair> ::= <feature> <equal> <weight>
  <pairs> ::= <and> <pair> <pairs> | φ
  <similarity_source> ::= <leaf_source>

  <if-syntax> ::= <if> <left_parenthesis> <function-st> <compare-sign> <function-st> <right_parenthesis> <then> <search-op> <else> <search-op>
  <compare-sign> ::= '==' | '>' | '<' | '>=' | '<='
  <function-st> ::= <left-op> <math-op> <right-op> | <left-op>
  <math-op> ::= '+' | '-' | '*' | '/'
  <left-op> ::= <function> <left_parenthesis> <left-op> <right_parenthesis> | <literal>
  <function> ::= <max-fun> | <min-fun> | <sum-fun> | <av-fun> | <var-fun> | <size-fun>
  <max-fun> ::= 'max' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <min-fun> ::= 'min' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <sum-fun> ::= 'sum' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <av-fun> ::= 'av' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <var-fun> ::= 'var' <left_parenthesis> <element> <comma> <search-op> <right_parenthesis>
  <size-fun> ::= 'size' <left_parenthesis> <search-op> <right_parenthesis>
  <right-op> ::= <function-st> | <left-op>
  <element> ::= an element of the result set payload (either XML element, or XML attribute)

  <retrieve_metadata_fun> ::= <rm_fun_name> <in> <language> <on> <rm_source> <as> <schema>
  <rm_fun_name> ::= 'retrievemetadata'
  <schema> ::= string
  <rm_source> ::= <left_parenthesis> <function> <right_parenthesis>

  <spatialsearch_fun> ::= <spatialsearch_fun_name> <relation> <geometry> [<timeBoundary>] <spatial_source>
  <spatialsearch_fun_name> ::= 'spatialsearch'
  <relation> ::= 'intersects' | 'contains' | 'isContained'
  <geometry> ::= <polygon_name> <left_parenthesis> <points> <right_parenthesis>
  <polygon_name> ::= 'polygon'
  <timeBoundary> ::= 'within' <startTime> <stopTime>
  <startTime> ::= double
  <stopTime> ::= double
  <spatial_source> ::= <leaf_source>
  <points> ::= <point> {<comma> <point>}+
  <point> ::= <x> <y>
  <x> ::= long
  <y> ::= long

   <leaf_source>  ::= [<in> <language>] <on> <source> [<as> <schema>]
   <non_leaf_source>  ::= <on> <left_parenthesis> <function> <right_parenthesis>

----

   <language>  ::= 'AFRIKAANS' | 'ARABIC' | 'AZERI' | 'BYELORUSSIAN' | 'BULGARIAN' | 'BANGLA' | 'BRETON' | 'BOSNIAN' | 'CATALAN' | 
      'CZECH' | 'WELSH' |    'DANISH' | 'GERMAN' | 'GREEK' | 'ENGLISH' | 'ESPERANTO' | 'SPANISH' | 'ESTONIAN' | 'BASQUE' | 'FARSI' |
      'FINNISH' | 'FAEROESE' | 'FRENCH' | 'FRISIAN' | 'IRISH_GAELIC' | 'GALICIAN' | 'HAUSA' | 'HEBREW' | 'HINDI' | 'CROATIAN' | 
      'HUNGARIAN' | 'ARMENIAN' | 'INDONESIAN' | 'ICELANDIC' | 'ITALIAN' | 'JAPANESE' | 'GEORGIAN' | 'KAZAKH' | 'GREENLANDIC' | 'KOREAN' |
      'KURDISH' | 'KIRGHIZ' | 'LATIN' | 'LETZEBURGESCH' | 'LITHUANIAN' | 'LATVIAN' | 'MAORI' | 'MONGOLIAN' | 'MALAY' | 'MALTESE' |
      'NORWEGIAN_BOKMAAL' | 'DUTCH' | 'NORWEGIAN_NYNORSK' | 'POLISH' | 'PASHTO' | 'PORTUGUESE' | 'RHAETO_ROMANCE' | 'ROMANIAN' | 'RUSSIAN' | 
      'SAMI_NORTHERN' | 'SLOVAK' | 'SLOVENIAN' | 'ALBANIAN' | 'SERBIAN' | 'SWEDISH' | 'SWAHILI' | 'TAMIL' | 'THAI' | 'FILIPINO' | 'TURKISH' |
      'UKRAINIAN' | 'URDU' | 'UZBEK' | 'VIETNAMESE' | 'SORBIAN' | 'YIDDISH' | 'CHINESE_SIMPLIFIED' | 'CHINESE_TRADITIONAL' | 'ZULU'
   <source> ::= string
   <schema>  ::= string
   <left_parenthesis> ::= '('
   <right_parenthesis> ::= ')'
   <comma> ::= ','
   <and> ::= 'and'
   <on> ::= 'on'
   <in> ::= 'in'
   <as> ::= 'as'
   <by> ::= 'by'
   <sort_by> ::= 'sort'
   <from> ::= 'from'
   <if> ::= 'if'
   <then> ::= 'then'
   <else> ::= 'else'
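
To make the aggregate functions of the conditional search grammar more concrete, the following Python sketch evaluates a hypothesis clause over an in-memory list of records. The record layout, the helper names and the choice of population variance for 'var' are assumptions made for illustration; the grammar above does not specify them.

```python
import operator
from statistics import fmean, pvariance

# Hypothetical stand-in for a result set: one dict of numeric fields per record.
ResultSet = list[dict[str, float]]


def column(element: str, rs: ResultSet) -> list[float]:
    """Collect the values of one payload element across the result set."""
    return [record[element] for record in rs if element in record]


# Aggregate functions, keyed by the terminal names used in the grammar.
AGGREGATES = {
    "max": lambda element, rs: max(column(element, rs)),
    "min": lambda element, rs: min(column(element, rs)),
    "sum": lambda element, rs: sum(column(element, rs)),
    "av": lambda element, rs: fmean(column(element, rs)),
    # Assumption: 'var' means population variance; the source does not say.
    "var": lambda element, rs: pvariance(column(element, rs)),
}


def size(rs: ResultSet) -> int:
    """'size' takes only a result set, no element."""
    return len(rs)


# The comparison signs listed under <compare-sign>.
COMPARE = {"==": operator.eq, ">": operator.gt, "<": operator.lt,
           ">=": operator.ge, "<=": operator.le}


def conditional(left: float, sign: str, right: float, then_op: str, else_op: str) -> str:
    """Pick the 'then' or 'else' search operation based on the predicate."""
    return then_op if COMPARE[sign](left, right) else else_op


scores = [{"score": 1.0}, {"score": 3.0}, {"score": 5.0}]
# if (max(score) > av(score) + 1) then ... else ...
branch = conditional(AGGREGATES["max"]("score", scores), ">",
                     AGGREGATES["av"]("score", scores) + 1,
                     "then-branch", "else-branch")
```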

== Examples ==

{| border="1"
|+ '''Example 1'''
|-
! User Request
| Give me back all documents whose metadata contain the word ''woman'' from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''>
|-
! Actual Query
| <pre>fulltextsearch by 'woman' in 'ENGLISH' on '0a952bf0-fa44-11db-aab8-f715cb72c9ff' as 'dc'</pre>
|-
! Explanation
| We perform the ''fulltextsearch'' operation, using the term ''woman'', on the data source identified by the language ''ENGLISH'', source number ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'' and schema ''dc''.
|}

----

{| border="1"
|+ '''Example 2'''
|-
! User Request
| Give me back all documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''> that are created by dorothea; that is, the creator's name contains the separate word dorothea, e.g. ''Hemans, Felicia Dorothea Browne''
|-
! Actual Query
| <pre>fieldedsearch by 'creator' contains 'dorothea' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc'</pre>
|-
! Explanation
| We perform the ''fieldedsearch'' operation on the data source identified by the language ''ENGLISH'', source number ''568a5220-fa43-11db-82de-905c553f17c3'' and schema ''dc'', and retrieve only those records whose creator's name contains the word 'dorothea'. CAUTION: This does not cover creator names such as 'abcdorothea'; to match those, users should use the wildcard '*'. The absence of '*' implies a string delimiter. E.g. '*dorothea' matches 'abcdorothea' but not 'dorotheas', whereas 'dorothea*' matches 'dorotheas' but not 'abcdorothea'. Another critical issue is the data source identifier. Example 1 and Example 2 both refer to the 'A Celebration of Women Writers' collection. However, in Example 1 we refer to the metadata of this collection, whereas in Example 2 we refer to the actual content.
|}
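
The wildcard rules described in Example 2 can be sketched in Python as follows. This is only an illustration of the matching semantics, not the index implementation: the word tokenization (runs of word characters) and the case-insensitive comparison are assumptions, the latter suggested by the query term 'dorothea' matching the name ''Hemans, Felicia Dorothea Browne''.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a 'contains' pattern: '*' matches any run of characters,
    everything else is taken literally."""
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(literal_parts), re.IGNORECASE)


def contains(field_value: str, pattern: str) -> bool:
    """True if any single word of the field matches the whole pattern.

    Without a leading or trailing '*', the pattern must cover the entire
    word, which is what "the absence of '*' implies a string delimiter" means.
    """
    regex = pattern_to_regex(pattern)
    words = re.findall(r"\w+", field_value)  # assumed tokenization
    return any(regex.fullmatch(word) for word in words)


creator = "Hemans, Felicia Dorothea Browne"
print(contains(creator, "dorothea"))         # whole-word match
print(contains("abcdorothea", "dorothea"))   # no wildcard, so no match
print(contains("abcdorothea", "*dorothea"))  # leading wildcard allows a prefix
```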

----

{| border="1"
|+ '''Example 3'''
|-
! User Request
| Give me back the ''creator'' and ''subject'' of the first 10 documents, sorted by the ''DocID'' field, from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''> whose creator's name contains ''ro''.
|-
! Actual Query
| <pre>project by 'creator', 'subject' on 
   (keeptop '10' on 
      (sort 'ASC' by 'DocID' on 
         (fieldedsearch by 'creator' contains '*ro*' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')))</pre>
|-
! Explanation
| First, we perform the ''fieldedsearch'' operation on the data source identified by the language ''ENGLISH'', source number ''568a5220-fa43-11db-82de-905c553f17c3'' and schema ''dc'', and retrieve only those records whose creator's name contains the substring 'ro'. On this result set, we apply the ''sort'' operation on the ''DocID'' field. Then we apply the ''keeptop'' operation to keep only the first 10 sorted documents. Finally, we apply the ''project'' operation, keeping only the ''creator'' and ''subject'' fields.
|}
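
Nested queries such as the one in Example 3 are easier to compose from small helper functions than to write by hand. The Python sketch below rebuilds that query string; all function names are hypothetical, and because the source does not define how an apostrophe inside a value should be escaped, the sketch does not attempt it.

```python
def quote(value) -> str:
    """Wrap a value in single quotes, as the examples do (no escaping)."""
    return f"'{value}'"


def leaf_source(language: str, source: str, schema: str) -> str:
    """<leaf_source>: in <language> on <source> as <schema>."""
    return f"in {quote(language)} on {quote(source)} as {quote(schema)}"


def non_leaf_source(subquery: str) -> str:
    """<non_leaf_source>: on ( <function> )."""
    return f"on ({subquery})"


def fielded_search(field: str, relation: str, value: str, source: str) -> str:
    return f"fieldedsearch by {quote(field)} {relation} {quote(value)} {source}"


def sort_results(order: str, key: str, subquery: str) -> str:
    return f"sort {quote(order)} by {quote(key)} {non_leaf_source(subquery)}"


def keeptop(number: int, subquery: str) -> str:
    return f"keeptop {quote(number)} {non_leaf_source(subquery)}"


def project(keys: list[str], subquery: str) -> str:
    key_list = ", ".join(quote(key) for key in keys)
    return f"project by {key_list} {non_leaf_source(subquery)}"


source = leaf_source("ENGLISH", "568a5220-fa43-11db-82de-905c553f17c3", "dc")
query = project(
    ["creator", "subject"],
    keeptop(10, sort_results("ASC", "DocID",
                             fielded_search("creator", "contains", "*ro*", source))),
)
print(query)
```

Apart from line breaks and indentation, the printed string is the query of Example 3.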

----

{| border="1"
|+ '''Example 4'''
|-
! User Request
| Perform a spatial search against the collection identified by the triplet <''ENGLISH'',''6cbe79b0-fbe0-11db-857b-d6a400c8bdbb'',''eiDB''>, defining a search rectangle with corners (0,0), (0,50), (50,50), (50,0).
|-
! Actual Query
| <pre>spatialsearch contains polygon(0 0, 0 50, 50 50, 50 0) 
   in 'ENGLISH' on '6cbe79b0-fbe0-11db-857b-d6a400c8bdbb' as 'eiDB'</pre>
|-
! Explanation
| Search the collection identified by the triplet <''ENGLISH'',''6cbe79b0-fbe0-11db-857b-d6a400c8bdbb'',''eiDB''> for records that define geometric shapes which include the rectangle identified by the points {(0,0), (0,50), (50,50), (50,0)}.
|}
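
A spatial query string like the one in Example 4 can be generated from a list of coordinate pairs while checking the input against the grammar. The following Python sketch assumes the three relations listed in the Syntax section and the rule that a polygon has at least two points; the function names are hypothetical.

```python
# Relations accepted by <relation> in the Syntax section.
RELATIONS = {"intersects", "contains", "isContained"}


def polygon(points: list[tuple[int, int]]) -> str:
    """Render points as 'polygon(x y, x y, ...)'."""
    if len(points) < 2:
        # <points> ::= <point> {<comma> <point>}+  requires two or more points
        raise ValueError("a polygon needs at least two points")
    return "polygon(" + ", ".join(f"{x} {y}" for x, y in points) + ")"


def spatial_search(relation: str, points: list[tuple[int, int]],
                   language: str, source: str, schema: str) -> str:
    """Build a spatialsearch query against a leaf source."""
    if relation not in RELATIONS:
        raise ValueError(f"unsupported spatial relation: {relation}")
    return (f"spatialsearch {relation} {polygon(points)} "
            f"in '{language}' on '{source}' as '{schema}'")


rectangle = [(0, 0), (0, 50), (50, 50), (50, 0)]
query = spatial_search("contains", rectangle, "ENGLISH",
                       "6cbe79b0-fbe0-11db-857b-d6a400c8bdbb", "eiDB")
print(query)
```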

----

{| border="1"
|+ '''Example 5'''
|-
! User Request
| Give me back all documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''> that are created by dorothea, together with all documents from the same collection whose title contains the word ''woman''
|-
! Actual Query
| <pre>merge on 
    (fieldedsearch by 'creator' contains 'dorothea' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc') 
and (fieldedsearch by 'title' contains 'woman' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')</pre>
|-
! Explanation
| This is an example of how the user can merge the results of more than one subquery.
|}

----

{| border="1"
|+ '''Example 6'''
|-
! User Request
| Give me back all documents from the collection identified by the triplet <''ENGLISH'', ''0a952bf0-fa44-11db-aab8-f715cb72c9ff'', ''dc''> that are created by dorothea AND whose description contains the word ''London''
|-
! Actual Query
| <pre>join inner by 'DocID' on 
    (fieldedsearch by 'description' contains 'London' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc') 
and (fieldedsearch by 'creator' contains 'dorothea' in 'ENGLISH' on '568a5220-fa43-11db-82de-905c553f17c3' as 'dc')</pre>
|-
! Explanation
| This is an example of how the user can perform a join (logical AND) of subqueries. We perform the ''join'' operation on the field ''DocID'', which is the unique document identifier. In this way, only documents that are members of the result sets of both subqueries participate in the final result set.
|}
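
The difference between ''merge'' (Example 5) and an inner ''join'' (Example 6) can be illustrated on small in-memory result sets. The Python sketch below is a simplification under stated assumptions: records are plain dictionaries, ''merge'' is shown as simple concatenation, and the join keeps the key from the left record together with the payload of the right record, following the description of the join operation under Available Operations.

```python
Record = dict[str, str]


def merge(*result_sets: list[Record]) -> list[Record]:
    """Concatenate result sets; no ordering guarantee is implied."""
    merged: list[Record] = []
    for result_set in result_sets:
        merged.extend(result_set)
    return merged


def inner_join(key: str, left: list[Record], right: list[Record]) -> list[Record]:
    """Keep records whose key value appears in both result sets."""
    right_by_key: dict[str, list[Record]] = {}
    for record in right:
        right_by_key.setdefault(record[key], []).append(record)

    joined: list[Record] = []
    for left_record in left:
        for right_record in right_by_key.get(left_record[key], []):
            # Join key from the left record, payload from the right record.
            payload = {k: v for k, v in right_record.items() if k != key}
            joined.append({key: left_record[key], **payload})
    return joined


# Hypothetical result sets of two subqueries.
left = [{"DocID": "d1", "description": "London fog"},
        {"DocID": "d2", "description": "London bridge"}]
right = [{"DocID": "d2", "creator": "Dorothea"},
         {"DocID": "d3", "creator": "Dorothea"}]

print(inner_join("DocID", left, right))  # only d2 is in both sets
print(len(merge(left, right)))           # all four records are kept
```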

----


= Search Orchestrator (Search Master Service) =

The Search Master Service (SMS) is the main entry point to the functionality of the search engine. It contains the elements that will organize the execution of the search operators for the various tasks the Search engine is responsible for.

Two query processing steps are necessary before a Query can be forwarded to the Search engine for the actual execution and evaluation of the net results. The first step is the transformation of the abstract Query terms into a series of WS invocations. The output of this step is an enriched execution plan that maps the abstract Query to a workflow of service invocations. These invocations are calls to Search Service operators providing basic functionality, called Search Operators. The second step is the optimization of the calculated execution plan.

The SMS is responsible for the first stage of query processing. This stage produces a query execution plan, which in the gCube implementation is a directed acyclic graph of Search Operator invocations. The SMS is also responsible for gathering the whole set of information that the various search services are expected to need, and provides it as context to the processed query. In this manner, the delays that the individual services would incur by gathering this information themselves are significantly reduced, which improves responsiveness.
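
As an illustration of an execution plan expressed as a directed acyclic graph, the following Python sketch derives a valid invocation order for a small plan. The plan contents and node names are hypothetical, and the real plan representation used by the SMS is not shown in this document; the sketch only demonstrates that every operator runs after the operators it depends on.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical plan: each operator maps to the set of operators whose
# output it consumes. Two searches feed a join, which feeds a projection.
plan = {
    "search_description": set(),
    "search_creator": set(),
    "join": {"search_description", "search_creator"},
    "project": {"join"},
}

# static_order() yields every node after all of its dependencies and
# raises graphlib.CycleError if the plan is not acyclic.
order = list(TopologicalSorter(plan).static_order())
print(order)
```

The two searches have no dependency on each other, so an execution engine would be free to run them in parallel.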


The information gathered is produced by various components or services of the gCube Infrastructure, including the gCube Information Service (IS), Content and Metadata Management, the Indexing service etc. Gathering all the needed information proves to be very time consuming. To mitigate this, the SMS keeps a cache of previously discovered information and state.


The SMS validates the received Query using SearchLibrary elements. It validates the user-supplied query against the elements of the specific Virtual Organisation (VO). This ensures that content collections are available, metadata elements (e.g. fields) are present, operators (i.e. services) are accessible etc. Afterwards it performs a number of preprocessing steps, invoking functionality offered by services such as Query Personalisation and DIR (formerly Content Source Selection), in order to refine the context of the search or inject extra information into the query. These are specializations of the general Query Preprocessor element. Where preprocessors might inject conflicting information, the Query Preprocessor calls must be applied in a defined order; otherwise, a method for weighting the importance of each source of conflicting information is necessary. Furthermore, a number of exceptions may occur during the operation of a preprocessor, as during the normal flow of any operation. The difference is that, although useful in the context of gCube, preprocessors are not necessary for a Search execution, so errors during Query Preprocessing must not stop the search flow of execution.
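
The requirement that preprocessor failures must not interrupt the search can be sketched as an ordered chain in which each failure is downgraded to a warning. The Python code below is a minimal illustration; the query representation (a dictionary), the function names and the two sample preprocessors are hypothetical and do not reflect the actual SMS interfaces.

```python
import logging
from typing import Callable

logger = logging.getLogger("sms.preprocessing")  # hypothetical logger name

Query = dict
Preprocessor = Callable[[Query], Query]


def run_preprocessors(query: Query, preprocessors: list[Preprocessor]) -> tuple[Query, list[str]]:
    """Apply preprocessors in the given order; failures become warnings."""
    warnings: list[str] = []
    for preprocessor in preprocessors:
        try:
            # Work on a copy so a failing step cannot leave the query half-modified.
            query = preprocessor(dict(query))
        except Exception as exc:  # non-fatal by design: the search goes on
            message = f"{preprocessor.__name__}: {exc}"
            warnings.append(message)
            logger.warning("preprocessor skipped: %s", message)
    return query, warnings


def personalise(query: Query) -> Query:
    query["user_profile"] = "u1"  # hypothetical injected information
    return query


def broken_source_selection(query: Query) -> Query:
    raise RuntimeError("service unavailable")


result, warnings = run_preprocessors({"terms": ["woman"]},
                                     [broken_source_selection, personalise])
```

The list order doubles as the precedence order: a later preprocessor can overwrite what an earlier one injected.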


This requirement is a special case of the more general need for a framework for defining fatal errors and warnings. During the entire Search operation a number of errors and/or warnings may emerge. Depending on the context in which they appear, they may have great or no significance to the overall progress of execution. Currently, these cases are handled separately, but uniform management may be introduced as the specification of each service's needs in the overall execution plan becomes apparent at a low enough level of detail.


After the above pre-processing steps are completed successfully, the SMS dispatches a QueryPlanner thread to create the Query Execution Plan. Its first job is to map the provided Query, as enriched by the preprocessors, to a concrete workflow of WS invocations. Subsequently, the Query Planner uses the information encapsulated in the provided Query, the information gathered by the SMS about the available gCube environment and a set of transformation rules to come up with a near-optimal plan. When certain conditions are met (e.g. the Query Planner has finished, time has elapsed, all plans have been evaluated), the planner returns the best plan calculated to its caller. If more than one Query Planner is used, the plans calculated by each Query Planner are gathered by the SMS, which then chooses the overall optimal plan and passes it to a suitable execution engine, where execution and scheduling are carried out in a generic manner. The actual integration with the available execution engines and the formalization of their interaction with the SMS is accomplished through the eXecution ENgine API (XENA), which is analyzed thoroughly in the Search Library section. Through this formal interface, the SMS is able to select among the various available engines, such as the Process Execution Service, the Internal Execution Engine or any other WS-workflow engine. These engines are free to enforce their own optimization strategies, as long as they respect the semantic invariants dictated by the original Execution Plan.
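
The step in which the SMS gathers the plans of several Query Planners and keeps the best one can be sketched as a selection by estimated cost. The Python code below is only an illustration: the <code>CandidatePlan</code> type, the cost figures and the use of a single numeric cost as the optimality criterion are assumptions, since the document does not describe the cost model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidatePlan:
    """Hypothetical summary of one planner's best plan."""
    planner: str
    estimated_cost: float
    steps: tuple[str, ...]


def choose_plan(candidates: list[CandidatePlan]) -> CandidatePlan:
    """Return the candidate with the lowest estimated cost."""
    if not candidates:
        raise ValueError("no planner produced a plan")
    return min(candidates, key=lambda plan: plan.estimated_cost)


candidates = [
    CandidatePlan("planner-a", 12.5, ("fieldedsearch", "sort", "keeptop")),
    CandidatePlan("planner-b", 9.0, ("fieldedsearch", "keeptop", "sort")),
]
best = choose_plan(candidates)
print(best.planner)
```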


Finally, the SMS receives the final ResultSet from the execution engine and passes its end point reference back to the requestor.



== DL Description ==
Through the Search Master, external services can receive a structured overview of the VO resources that are available and usable during a search operation. An example of this summary is shown below:

   <SearchConfig>
     <collections>
       <collection name="Example Collection Name 1" id="1fc1fbf0-fa3c-11db-82de-905c553f17c3">
         <TYPE>DATA</TYPE>
         <ASSOCIATEDWITH>d510a060-fa3c-11db-aa91-f715cb72c9ff</ASSOCIATEDWITH>
         <ASSOCIATEDWITH>g45612f7-dth5-23fg-45df-45dfg5b1r34s</ASSOCIATEDWITH>
       </collection>
       <collection name="Example Collection Name 2" id="c3f685b0-fdb6-11db-a573-e4518f2111ab">
         <TYPE>DATA</TYPE>
         <ASSOCIATEDWITH>7bb87410-fdb7-11db-8476-f715cb72c9ff</ASSOCIATEDWITH>
         <INDEX>FEATURE</INDEX>
       </collection>
       <collection name="Example Collection Name 3" id="d510a060-fa3c-11db-aa91-f715cb72c9ff">
         <TYPE>METADATA</TYPE>
         <LANGUAGE>en</LANGUAGE>
         <SCHEMA>dc</SCHEMA>
         <ASSOCIATEDWITH>1fc1fbf0-fa3c-11db-82de-905c553f17c3</ASSOCIATEDWITH>
         <INDEX>FTS</INDEX>
         <INDEX>XML</INDEX>
       </collection>
       <collection name="Example Collection Name 4" id="g45612f7-dth5-23fg-45df-45dfg5b1r34s">
         <TYPE>METADATA</TYPE>
         <LANGUAGE>en</LANGUAGE>
         <SCHEMA>tei</SCHEMA>
         <ASSOCIATEDWITH>1fc1fbf0-fa3c-11db-82de-905c553f17c3</ASSOCIATEDWITH>
         <INDEX>FTS</INDEX>
         <INDEX>XML</INDEX>
       </collection>
       <collection name="Example Collection Name 5" id="7bb87410-fdb7-11db-8476-f715cb72c9ff">
         <TYPE>METADATA</TYPE>
         <LANGUAGE>en</LANGUAGE>
         <SCHEMA>dc</SCHEMA>
         <ASSOCIATEDWITH>c3f685b0-fdb6-11db-a573-e4518f2111ab</ASSOCIATEDWITH>
         <INDEX>FTS</INDEX>
         <INDEX>XML</INDEX>
       </collection>
     </collections>
   </SearchConfig>
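
A client that receives such a description may want to derive the <language, source, schema> triplets it can query. The Python sketch below parses an abbreviated copy of the sample above and lists the metadata collections that carry a given index type. The function name is hypothetical, and the mapping from the <code>LANGUAGE</code> value (e.g. ''en'') to the language names used in the query syntax (e.g. ''ENGLISH'') is not covered here.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample SearchConfig shown above.
SEARCH_CONFIG = """
<SearchConfig>
  <collections>
    <collection name="Example Collection Name 1" id="1fc1fbf0-fa3c-11db-82de-905c553f17c3">
      <TYPE>DATA</TYPE>
      <ASSOCIATEDWITH>d510a060-fa3c-11db-aa91-f715cb72c9ff</ASSOCIATEDWITH>
    </collection>
    <collection name="Example Collection Name 3" id="d510a060-fa3c-11db-aa91-f715cb72c9ff">
      <TYPE>METADATA</TYPE>
      <LANGUAGE>en</LANGUAGE>
      <SCHEMA>dc</SCHEMA>
      <ASSOCIATEDWITH>1fc1fbf0-fa3c-11db-82de-905c553f17c3</ASSOCIATEDWITH>
      <INDEX>FTS</INDEX>
      <INDEX>XML</INDEX>
    </collection>
  </collections>
</SearchConfig>
"""


def metadata_sources(config_xml: str, index_type: str) -> list[tuple[str, str, str]]:
    """Return (language, collection id, schema) for indexed metadata collections."""
    root = ET.fromstring(config_xml.strip())
    sources = []
    for collection in root.iter("collection"):
        if collection.findtext("TYPE") != "METADATA":
            continue
        index_types = [index.text for index in collection.findall("INDEX")]
        if index_type not in index_types:
            continue
        sources.append((collection.findtext("LANGUAGE"),
                        collection.get("id"),
                        collection.findtext("SCHEMA")))
    return sources


print(metadata_sources(SEARCH_CONFIG, "FTS"))
```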

= Query Processing (Search Library) =
== Query Planner ==
== eXecution ENgine API (XENA) ==
== Search Operators Core Library ==
= Search Operators =
= Execution Engines =