Data Fusion (New)

Data Fusion

Introduction

Data fusion is an operator that the gCube Search System uses to merge the search results from different datasources and sort them by score. Fusion is enabled when two or more datasources participate in a search query.

Note that the new score of a result (after re-ranking) may be completely different from its initial score. The goal of data fusion is the final ordering of the results, not the preservation of the original scores.

The Data Fusion library is available in our Maven repositories with the following coordinates:

<groupId>org.gcube.search</groupId>
<artifactId>data-fusion</artifactId>
<version>...</version>

Data Fusion Procedure

The following steps are executed in the data fusion procedure:

  1. Execution of the query on all appropriate datasources (like ordinary search).
  2. Collection of the records from the datasources(*).
  3. Re-ranking (re-indexing) of the records based on one or more fields or on the actual content (see Fusion Fields).
  4. Execution of the initial query against the new index.
  5. Optionally, boosting of the (new) score with the position that each record had at the origin datasource.

The Search System invokes data fusion through the operators library when the fuse keyword appears at the end of the CQL query, followed by the search term. For example:

(gDocCollectionID == "ColID") and (title = tuna) project * fuse tuna

(*) Record retrieval: Data fusion comes with a custom iterator that multiplexes ranked and unranked result sets into one, keeps the records sorted by score, and removes duplicates. Unranked records are considered of higher value. Sorting can improve the performance of the fusion process when combined with count.
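The multiplexing described in the note above can be illustrated with a k-way merge. The following is a minimal sketch, not the library's iterator: the class FusionMergeSketch, the Rec type, the merge method and the UNRANKED constant are all hypothetical names. It assumes that each result set is already sorted by descending score, that duplicates are detected by record id, and that "higher value" for unranked records can be modelled by giving them an infinite score.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

final class FusionMergeSketch {

    /** Hypothetical record: an id used for duplicate detection, plus a score. */
    static final class Rec {
        final String id;
        final double score;

        Rec(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    /** Assumption: unranked records carry this score so that they sort first. */
    static final double UNRANKED = Double.POSITIVE_INFINITY;

    /**
     * Merges result sets that are each sorted by descending score into one
     * list sorted by descending score, keeping only the first occurrence
     * of every record id.
     */
    static List<Rec> merge(List<List<Rec>> sources) {
        // Each heap entry is {source index, position inside that source};
        // the entry whose record has the highest score is polled first.
        PriorityQueue<int[]> heads = new PriorityQueue<>((x, y) -> Double.compare(
                sources.get(y[0]).get(y[1]).score,
                sources.get(x[0]).get(x[1]).score));
        for (int s = 0; s < sources.size(); s++) {
            if (!sources.get(s).isEmpty()) {
                heads.add(new int[] {s, 0});
            }
        }

        Set<String> seen = new HashSet<>();
        List<Rec> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            List<Rec> source = sources.get(head[0]);
            Rec rec = source.get(head[1]);
            if (seen.add(rec.id)) {
                merged.add(rec); // first time this id is seen
            }
            if (head[1] + 1 < source.size()) {
                heads.add(new int[] {head[0], head[1] + 1}); // advance this source
            }
        }
        return merged;
    }
}
```

Because the sketch only ever looks at the head of each source, it needs the inputs to be sorted. A consumer that wants only the first few records could stop polling early instead of draining every source.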

Data Fusion Configuration

The Data Fusion component can be configured through the properties file fusion.properties.

Fusion Fields

The field on which the records are re-ranked can be customized by defining a list of candidate fields in the properties file. For example:

snippet-fields=S,description,title

For each record, the selected field is the first field in the list that exists in the record. If none of the listed fields is found, the actual content of the record is retrieved instead.
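The selection rule above can be sketched in a few lines. This is an illustration only: the class SnippetFieldSketch and its methods are hypothetical, and a record is modelled as a plain map from field name to value, which is an assumption about the record structure. Only the snippet-fields property name and the "first listed field that exists" rule come from this page.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

final class SnippetFieldSketch {

    /** Parses properties text, e.g. the content of fusion.properties. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Reads the comma-separated snippet-fields list; empty if the property is absent. */
    static List<String> snippetFields(Properties props) {
        List<String> fields = new ArrayList<>();
        for (String name : props.getProperty("snippet-fields", "").split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                fields.add(trimmed);
            }
        }
        return fields;
    }

    /**
     * Returns the first listed field that exists in the record. An empty
     * result means the caller should fall back to the record's content.
     */
    static Optional<String> selectField(List<String> fields, Map<String, String> record) {
        for (String field : fields) {
            if (record.containsKey(field)) {
                return Optional.of(field);
            }
        }
        return Optional.empty();
    }
}
```

With the example list S,description,title, a record that has both a description and a title would be re-ranked on description, because description comes first in the list.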

Positional Boost

Unranked datasources may still return their records in a meaningful order. In that case we can exploit the initial position of each record to boost its final score. Currently the following Position Score formula is used (experimental and naive):

s(p) = a / (b ^ p), where a = 0.986..., b = 1.025... and p is the position of the record at its origin datasource

This feature can be enabled or disabled by changing the include-position property in the properties file. For example:

include-position=false
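As a worked illustration of the Position Score formula and the include-position switch, the sketch below evaluates s(p) = a / b^p and reads the flag with java.util.Properties. The class and method names are hypothetical. The constants a and b are passed in as parameters because this page gives them only in truncated form, and the sketch makes no claim about whether p starts at 0 or 1, about how the position score is combined with the re-ranked score, or about the default used when the property is absent (it is treated as disabled here purely as an assumption).

```java
import java.util.Properties;

final class PositionBoostSketch {

    /** Evaluates s(p) = a / b^p for a record at position p. */
    static double positionScore(double a, double b, int p) {
        return a / Math.pow(b, p);
    }

    /**
     * Reads the include-position flag.
     * Assumption: the boost is treated as disabled when the property is absent.
     */
    static boolean includePosition(Properties props) {
        return Boolean.parseBoolean(props.getProperty("include-position", "false"));
    }
}
```

For any b greater than 1 the score decays geometrically with the position, so records that appeared earlier at their origin datasource receive a larger boost than later ones.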