Search Operators

Introduction

The Search Operator family of services provides the building blocks of any search operation. Together with services external to Search, these operators handle the production, filtering and refinement of the available data according to user queries. The intermediate steps towards the final search output are handled by Search Operator services. This section describes only the services internal to the Search Service, listed below, although the Search Operator Framework also integrates, at a high level, other services that can be used within a search operation.

The following operators are implemented as stateless services: each receives its input and produces its output within a single invocation, without holding any intermediate state. Whenever data must be transferred, either as input to a service or as output of its processing, the ResultSet Framework is employed.

The search operators cover the basic functionality encountered in a typical search operation. A search can be decomposed into indivisible units, each corresponding to one of these operators, and their interaction forms a workflow that produces the final result delivered to the requester. The external source search and service invocation services provide some extensibility for future operators: they offer a way to invoke a service that is unknown to the Search framework and to import its results into the search operator workflow. The search operators currently available are listed below.

Example Code

Search Operators Usage Examples

Operators

BooleanOperator

Description

The Boolean Operator is used in conditional execution, specifically to evaluate the condition. It thereby makes it possible to select between alternative execution plans. For example, one plan (say, a projection on a given field of a data set) can be followed if a given precondition holds; otherwise an alternative plan is followed (e.g. a projection on another field of the same data set, followed by a sort on that field). Validating the precondition is the responsibility of this service.

The condition is a Boolean expression built from comparisons using the operations equal, not_equal, greater_than, lower_than, greater_equal and lower_equal. The operands are either literals (date, string, integer and double literals are supported) or aggregate functions over the results of a search service execution. The aggregate functions are max, min, average, size and sum; each can be applied to a given field of the result set of a search service execution, with the field referenced by an XPath expression.
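The evaluation of such a condition can be sketched as follows. This is an illustration in Python only, not the interface of the service: records are modelled as plain dictionaries instead of a ResultSet addressed by XPath, and the names aggregate and evaluate are hypothetical.

```python
import operator

# The comparison operations named in the text, mapped to Python operators.
COMPARISONS = {
    "equal": operator.eq,
    "not_equal": operator.ne,
    "greater_than": operator.gt,
    "lower_than": operator.lt,
    "greater_equal": operator.ge,
    "lower_equal": operator.le,
}


def aggregate(name, records, field):
    """Apply max, min, average, size or sum to one field of the records.

    Assumption: the field holds numeric text; records lacking it are skipped.
    """
    values = [float(rec[field]) for rec in records if field in rec]
    if name == "size":
        return len(values)
    if name == "sum":
        return sum(values)
    if not values:
        raise ValueError("aggregate %r needs at least one value" % name)
    if name == "max":
        return max(values)
    if name == "min":
        return min(values)
    if name == "average":
        return sum(values) / len(values)
    raise ValueError("unknown aggregate: %r" % name)


def evaluate(comparison, left, right):
    """Evaluate a single comparison between two already computed operands."""
    return COMPARISONS[comparison](left, right)


# Hypothetical data: choose the first plan only if the best score exceeds 0.8.
records = [{"score": "0.9"}, {"score": "0.4"}, {"score": "0.7"}]
use_first_plan = evaluate("greater_than", aggregate("max", records, "score"), 0.8)
```

The caller would then follow one execution plan or the other depending on use_first_plan.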

FilterResultSetByXPathOperator

Description

The role of the FilterResultSetByXPath Operator is to search by evaluating an expression, such as an XPath query, against an XML structure. The XML structure is a ResultSet previously constructed by another operator or by a complete search execution. The result of the operation is a new ResultSet, whose endpoint reference is returned to the caller.
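The idea can be illustrated with a small Python sketch. The element names (ResultSet, record, title) and the function filter_by_xpath are assumptions made for the example, since the actual ResultSet schema is not described here. Note also that Python's xml.etree.ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET


def filter_by_xpath(result_set_xml, xpath):
    """Return a new root element holding only the records matched by xpath.

    The input document is parsed afresh, so the caller's XML is not modified.
    """
    root = ET.fromstring(result_set_xml)
    filtered = ET.Element(root.tag)
    for record in root.findall(xpath):
        filtered.append(record)
    return filtered


# Hypothetical ResultSet content.
SAMPLE = """<ResultSet>
  <record lang="en"><title>Alpha</title></record>
  <record lang="fr"><title>Beta</title></record>
  <record lang="en"><title>Gamma</title></record>
</ResultSet>"""

english = filter_by_xpath(SAMPLE, "./record[@lang='en']")
titles = [record.findtext("title") for record in english]
```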

JoinInnerOperator

Description

The role of the JoinInner Operator (JoinResultSetService) is to perform a join on a specific field over a set of ResultSets whose endpoint references are provided. The operation produces a new ResultSet and leaves the input untouched. The new ResultSet is wrapped in a WS-Resource and its endpoint reference is returned to the caller. The join is implemented with an in-memory hash join algorithm.
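A generic in-memory hash join, of the kind named above, can be sketched as follows. This is a textbook version in Python for two inputs, not the operator's actual code: records are modelled as dictionaries and the function name hash_join is hypothetical.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner join of two record lists on the field `key`."""
    # Build phase: index the records of the first input by their key value.
    buckets = defaultdict(list)
    for rec in left:
        if key in rec:
            buckets[rec[key]].append(rec)

    # Probe phase: look up each record of the second input in the index.
    joined = []
    for rec in right:
        if key not in rec:
            continue
        for match in buckets.get(rec[key], []):
            merged = dict(match)  # copy, so the inputs stay untouched
            merged.update(rec)
            joined.append(merged)
    return joined


# Hypothetical inputs.
documents = [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}]
authors = [{"id": 2, "author": "X"}, {"id": 3, "author": "Y"}]
result = hash_join(documents, authors, "id")
```

Only records whose key value appears in both inputs reach the output, which is what makes the join an inner join.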

KeepTopOperator

Description

The role of the KeepTop Operator is to perform a simple filtering operation on its input ResultSet and to produce as output a new ResultSet that holds a defined number of leading records.
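In Python terms the operation amounts to taking a prefix of a stream. The sketch below is illustrative only; keep_top is a hypothetical name and records are modelled as any iterable.

```python
from itertools import islice


def keep_top(records, count):
    """Return the first `count` records without reading the rest of the input."""
    return list(islice(records, count))


top_three = keep_top(iter(range(10)), 3)
```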

MergeOperator

Description

The role of the Merge Operator is to merge a set of ResultSets whose endpoint references are provided. The operation produces a new ResultSet and leaves the input untouched. The new ResultSet is wrapped in a WS-Resource and its endpoint reference is returned to the caller.

QueryExtSourceOperatorGoogle

Description

The role of the QueryExtSourceOperatorGoogle is to redirect a query to the Google search engine through its Web Service interface and to wrap the output of the external service in a ResultSet, whose endpoint reference is returned to the caller. This functionality is supported by elements residing in the SearchLibrary.

QueryExtSourceOperatorJDBC

Description

The role of the QueryExtSourceOperatorJDBC is to redirect a query to an external search engine through a JDBC interface. This involves getting the attributes of the external service, submitting the query to it, and wrapping its output in a ResultSet, whose endpoint reference is returned to the caller. This functionality is supported by elements residing in the SearchLibrary.

Query string example:

  <root>
    <driverName>your jdbc driver</driverName>
    <connectionString>your jdbc connection string</connectionString>
    <query>your sql query</query>
  </root>
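A query string of this shape can be assembled programmatically so that characters such as < in the SQL text are escaped and the XML stays well-formed. The Python sketch below is an illustration only: the function name and the sample driver, connection string and query are placeholders, and it is an assumption that the operator accepts standard XML escaping.

```python
import xml.etree.ElementTree as ET


def build_jdbc_query(driver_name, connection_string, query):
    """Serialize the three values into the <root> structure shown above."""
    root = ET.Element("root")
    ET.SubElement(root, "driverName").text = driver_name
    ET.SubElement(root, "connectionString").text = connection_string
    ET.SubElement(root, "query").text = query
    return ET.tostring(root, encoding="unicode")


# Placeholder values for illustration.
query_string = build_jdbc_query(
    "org.example.Driver",
    "jdbc:example://localhost/catalogue",
    "SELECT title FROM items WHERE price < 5",
)
```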

QueryExtSourceOperatorOSIRIS

Description

The role of the QueryExtSourceOperatorOSIRIS is to redirect a query to the external content-based search engine provided by the ISIS/OSIRIS service through its HTTP interface. This involves getting the attributes of the external service, submitting the query to it, and wrapping its output in a ResultSet, whose endpoint reference is returned to the caller. This functionality is supported by elements residing in the SearchLibrary.

Query string example:

  <root>
    <collection>your osiris collection</collection>
    <imageURL>your image URL to be searched for similar images</imageURL>
    <numberOfResults>the number of results</numberOfResults>
  </root>

ScannerOperator

Description

The Scanner Operator provides a generic way of scanning through a result set produced by another search operation service. It can filter records, retrieve and update element and attribute values, and remove selected elements and attributes. The operation to apply to a result set is defined in a formal, function-like mathematical language. Expressions are evaluated by an external package, JEP, a parser for mathematical expressions that also allows new custom functions to be defined. Using this ability, five custom functions have been introduced to provide a full-fledged filtering language:

  • do takes an arbitrary number of arguments and evaluates them.
  • filter receives a boolean expression. If it is true, the respective record is removed from the derived result set.
  • like performs pattern matching and returns the boolean result of the match.
  • in determines whether a variable lies within a given range of numeric values.
  • select selects the specific elements/attributes to be included in the new result set.

Language Semantics: For the evaluator to produce a valid result, the filtering expression must contain at least one select or filter function. Beyond that, it may contain any mathematical expression. More precisely, the evaluator supports the most frequently used operators (!, +, -, *, /, ^, <, >, =, !=) and functions (sin, cos, tan, ln, log, exp, sqrt, abs, rand, mod, ...). Users are also free to define their own temporary variables. However, variables named after leaf elements (XML elements that have no child elements, only plain text values) cannot be redefined, because the evaluator defines them automatically and initializes them to the element values, which can be either strings or numbers (doubles). For further information about the available functions and operators, see org.nfunk.jep. The syntax of the custom functions is the following.

  BNF Syntax:
    <custom_functions> ::= <filter_fun> | <do_fun> | <like_fun> | <in_fun> | <select_fun>
    <filter_fun> ::= <filter_fun_name> <left_parenthesis> <boolean_expression> <right_parenthesis>
    <filter_fun_name> ::= 'filter'
    <do_fun> ::= <do_fun_name> <left_parenthesis> <do_arguments> <right_parenthesis>
    <do_fun_name> ::= 'do'
    <do_arguments> ::= <do_argument> <do_args>
    <do_args> ::= <comma> <do_arguments> | EMPTY
    <do_argument> ::= <expression>
    <like_fun> ::= <like_fun_name> <left_parenthesis> <like_object> <comma> <regular_expression> <right_parenthesis>
    <like_fun_name> ::= 'like'
    <like_object> ::= <element> | <attribute>
    <regular_expression> ::= (see java.util.regex.Pattern)
    <element> ::= String
    <attribute> ::= <element> '_' <attribute_name>
    <attribute_name> ::= String
    <in_fun> ::= <in_fun_name> <left_parenthesis> <object> <comma> <lower_bound> <comma> <upper_bound> <right_parenthesis>
    <in_fun_name> ::= 'in'
    <lower_bound> ::= Numeric
    <upper_bound> ::= Numeric
    <object> ::= <user_defined_variable> | <bound_variable>
    <bound_variable> ::= <attribute> | <element>
    <user_defined_variable> ::= (any instantiated variable, e.g. a=2)
    <select_fun> ::= <select_fun_name> <left_parenthesis> <select_object_list> <right_parenthesis>
    <select_fun_name> ::= 'select'
    <select_object_list> ::= <bound_variable> <select_args>
    <select_args> ::= <comma> <select_object_list> | EMPTY
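The semantics of the like, in, filter and select functions can be sketched in Python as follows. This is not JEP and not the operator's code. Records are modelled as dictionaries of bound variables, with attributes named element_attribute as in the grammar above. Two points the text leaves open are settled by assumption: like must match the whole value, and the bounds of in are inclusive. All function names are hypothetical.

```python
import re


def like(value, pattern):
    """Pattern matching; assumption: the whole value must match."""
    return re.fullmatch(pattern, value) is not None


def in_range(value, lower, upper):
    """Range test; assumption: both bounds are inclusive."""
    return lower <= value <= upper


def scan(records, remove_if, selected):
    """Drop records for which remove_if is true, keep only selected fields.

    remove_if plays the role of filter(...): a true result removes the record.
    selected plays the role of select(...).
    """
    derived = []
    for rec in records:
        if remove_if(rec):
            continue
        derived.append({name: rec[name] for name in selected if name in rec})
    return derived


# Hypothetical records; title_lang is the lang attribute of the title element.
records = [
    {"title": "Grid search", "title_lang": "en", "year": 2006.0},
    {"title": "Old report", "title_lang": "en", "year": 1998.0},
    {"title": "Recherche", "title_lang": "fr", "year": 2008.0},
]

# Remove records outside 2000..2010 or written in French; keep title and year.
result = scan(
    records,
    lambda r: not in_range(r["year"], 2000, 2010) or like(r["title_lang"], "fr"),
    ["title", "year"],
)
```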

SortOperator

Description

The role of the Sort Operator is to sort the provided ResultSet using a specific field as the key. The operation produces a new ResultSet and leaves the input untouched. The new ResultSet is wrapped in a WS-Resource and its endpoint reference is returned to the caller. The algorithm used is merge sort. The comparison rules differ depending on the type of the elements to be sorted. The key of the sort operator can be expressed in any of the ways defined by the method org.diligentproject.searchservice.searchlibrary.resultset.elements.ResultElementGeneric#extractValue.
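A merge sort over records with a pluggable key can be sketched as follows. This is a generic Python version for illustration, not the operator's implementation. The key function stands in for the type-dependent comparison rules, and its name and the sample data are assumptions.

```python
def merge_sort(records, key):
    """Return a new sorted list; the input list is left untouched."""
    if len(records) <= 1:
        return list(records)
    mid = len(records) // 2
    left = merge_sort(records[:mid], key)
    right = merge_sort(records[mid:], key)

    # Merge the two sorted halves, preferring the left on ties (stable).
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if key(right[j]) < key(left[i]):
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


# The same field sorted under numeric and under string comparison rules.
records = [{"n": "10"}, {"n": "9"}, {"n": "100"}]
as_numbers = [r["n"] for r in merge_sort(records, key=lambda r: float(r["n"]))]
as_strings = [r["n"] for r in merge_sort(records, key=lambda r: r["n"])]
```

The two results differ, which shows why the comparison rules have to depend on the type of the key.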

TransformByXSLTOperator

Description

The role of the TransformByXSLT Operator is to transform a ResultSet it receives as input from one schema to another through a transformation technology such as XSL / XSLT. These transformations are directly supplied as input to the service. The output of the transformation, which could be a projection of the initial ResultSet, is a new ResultSet wrapped in a WS-Resource whose endpoint reference is returned to the caller.

SelectOperator

Description

The role of the Select Operator is to receive a ResultSet as input, along with a logical expression and a filter mask, and to return a new ResultSet after applying the corresponding constraints to each record. In this way, the Select Operator can filter out whole records and also rearrange or omit fields. An example of a logical expression that gets evaluated is:

([field_a] > 'value_a' AND [field_b] != 'value_b') OR [field_c] == [field_d]

where field names can be referenced either by name or index number.

An example of filter mask that can be applied is:

[2, 1, 4]

which means that the fields are rearranged in the given order, and any field not listed is omitted.
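The combined effect of the logical expression and the filter mask can be sketched in Python. The expression is represented here as an ordinary predicate function instead of being parsed, records are modelled as lists of field values, and the mask indices are assumed to be zero-based, which the text does not specify. The name select is hypothetical.

```python
def select(records, predicate, mask):
    """Keep records satisfying predicate, projecting fields in mask order."""
    return [[rec[i] for i in mask] for rec in records if predicate(rec)]


# Hypothetical records with five fields each.
records = [
    ["c", "y", "p", "q", "e0"],
    ["a", "y", "p", "p", "e1"],
    ["a", "x", "p", "q", "e2"],
]


def condition(rec):
    # Analogue of ([0] > 'b' AND [1] != 'x') OR [2] == [3]
    return (rec[0] > "b" and rec[1] != "x") or rec[2] == rec[3]


result = select(records, condition, [2, 1, 4])
```

The first record passes the left branch of the condition, the second passes the right branch, and the third is dropped.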

ScriptOperator

Description

The role of the Script Operator is to transform the ResultSet it receives as input by executing a script command. The fields of each record are joined into a single line, delimited by the tab character ('\t'), and passed to the standard input of the script. Conversely, every line of the script's standard output is split on the tab character and becomes a new record in the output ResultSet.
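The round trip through a script can be sketched in Python with the standard subprocess module. This is an illustration under assumptions: one process handles the whole input, records are lists of strings, and the script is a small Python one-liner that upper-cases its input. The function name run_script is hypothetical.

```python
import subprocess
import sys


def run_script(records, command):
    """Feed records to a command as tab-delimited lines and parse its output."""
    stdin_text = "".join("\t".join(fields) + "\n" for fields in records)
    completed = subprocess.run(
        command,
        input=stdin_text,
        stdout=subprocess.PIPE,
        text=True,
        check=True,
    )
    # Each output line becomes one record, split on the tab character.
    return [line.split("\t") for line in completed.stdout.splitlines()]


# Stand-in script: copy standard input to standard output in upper case.
UPPERCASE = "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())"
output = run_script([["a", "b"], ["c", "d"]], [sys.executable, "-c", UPPERCASE])
```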

PartitionOperator

Description

The role of the Partition Operator is to partition the ResultSet it receives as input into several output ResultSets. The operator takes the partition field as input and, for each distinct value of that field, creates a new ResultSet to which the matching records are forwarded. Its output is a single ResultSet containing the locators of all the ResultSets that were created.
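The grouping step can be sketched in Python as follows. Records are modelled as dictionaries, the partitions as lists held in a dictionary, and the dictionary keys stand in for the locators. These modelling choices and the name partition are assumptions made for the example.

```python
def partition(records, field):
    """Group records by the value of `field`, one list per distinct value."""
    parts = {}
    for rec in records:
        parts.setdefault(rec[field], []).append(rec)
    return parts


# Hypothetical input.
records = [
    {"lang": "en", "title": "A"},
    {"lang": "fr", "title": "B"},
    {"lang": "en", "title": "C"},
]
parts = partition(records, "lang")
locators = list(parts)  # stand-in for the ResultSet of locators
```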

GradualMergeOperator

Description

The role of the Gradual Merge Operator is to merge an arbitrary number of ResultSets into a single one. Unlike the Merge Operator, this operator takes as input a ResultSet containing the locators of all the ResultSets to be merged. In addition, the merge starts as soon as the first ResultSet is retrieved, and further incoming ResultSets can be added gradually.

gRS2SplitterOperator

Description

The role of the gRS2Splitter operator is to read a result set containing file fields and to split each file into multiple records, one per line. Each new record contains as many fields as there are values on the line, separated by the specified delimiter character.
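The splitting of one file can be sketched in Python. The file content is modelled as a string, and skipping empty lines is an assumption the text does not address. The name split_file is hypothetical.

```python
def split_file(text, delimiter):
    """Turn each non-empty line into a record of delimiter-separated fields."""
    return [line.split(delimiter) for line in text.splitlines() if line]


records = split_file("a,b\nc,d,e\n", ",")
```

Records may therefore have different numbers of fields, depending on how many values each line holds.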

gRS2AggregatorOperator

Description

The role of the gRS2Aggregator operator is to read a result set and aggregate all its records into a single file that is transmitted through the output Result Set.

DataFusionOperator

Description

The role of the DataFusion operator is to re-rank the results from multiple data sources. More information about this operator can be found here

Dependencies

  • jdk 1.5
  • WS-Core 4.0.4
  • ResultSetClientLibrary
  • SearchLibrary