GCube Information Retrieval Services

The gCube Information Retrieval Services are the family of components that offer Information Retrieval (IR) facilities to the gCube infrastructure, i.e. they allow searching over data and information using a wide range of techniques. The IR family can be decomposed into three major categories, presented below. They are called “frameworks” because they are not standalone services; rather, they are large collaborating systems built on protocols, specifications and software, and they give the gCube system they empower considerable extensibility:

  • Search Framework: This category includes all services focused on the search-specific aspects of the gCube platform. It consists of the search orchestrator component, the search operators, the query processor components and the data transfer mechanism. A user query is computed as follows. The search orchestrator receives the query from the gCube portal and contacts the gCore IS service to retrieve environment information. The orchestrator then feeds this information, together with the query, to the query processor components, which produce an execution plan. The plan is forwarded to a gCube execution engine (one of which is the Process Management Service), which orchestrates the execution by invoking the search operators as dictated by the plan. Data transfer between the operators is performed by the ResultSet component of the Search Framework; because of its importance, ResultSet is described in a separate section. The final results are then forwarded to the user (portal). The Search Operators cover most of the traditional relational algebra operations, as well as some advanced ones such as geospatial search and similarity search, thus providing a full-fledged set of capabilities to the end user. The Index Management and DIR frameworks provide a major part of the Search Operators and are described in separate sections. For further details regarding the Search Framework, please refer to Section 7.3.
  • Index Management Framework: This category includes all services involved in the creation and management of gCube indices. Management covers all aspects of the index lifecycle, as well as support for search over the indices. gCube employs a rich set of index types (full-text, forward, feature and geospatial), offering a full-fledged set of storage and search capabilities for various data types and models. The services of the Index Management Framework communicate with the Content And Storage Management services to acquire the data set to be indexed and to persist their state. They also use the gCore IS capabilities to publish themselves so that clients can discover and use them.
  • Distributed Information Retrieval Support Framework: This category includes all services that enhance and support the IR system. The framework provides higher-level IR capabilities, including content ranking, source selection and result set fusion (ranked merging of various data sets). Its components communicate with the Index Management services to extract statistics and with the IS service to publish information. The Search Framework uses the advanced capabilities of the DIR framework to refine queries and improve the search results it produces, and thereby to offer a higher level of service.
  • An additional component, the Personalisation Service, does not belong to any of the frameworks above; it acts independently and improves search quality. The Search Framework invokes it indirectly, through an appropriate wrapper, to enhance user queries by injecting additional “personalized” information.
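
The query workflow described under the Search Framework, together with the ranked merging attributed to the DIR framework, can be illustrated with a small, self-contained sketch. This is not gCube code: every name below (`lookup_environment`, `build_plan`, `execute_plan`, `run_query`, the `PlanStep` type and the `(score, text)` record shape) is a hypothetical stand-in chosen for illustration, and in-memory lists play the role of the indexed sources that the real services would discover through the IS.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical record shape: (relevance score, document text).
Record = Tuple[float, str]


@dataclass
class PlanStep:
    """One step of an execution plan: which operator to invoke, with what arguments."""
    operator: str
    args: Tuple = ()


def lookup_environment(sources: Dict[str, List[Record]]) -> List[str]:
    """Stand-in for the IS lookup: list the names of the available sources."""
    return sorted(sources)


def build_plan(query: str, source_names: List[str]) -> List[PlanStep]:
    """Stand-in for the query processor: search every source, then merge."""
    steps = [PlanStep("search", (name, query)) for name in source_names]
    steps.append(PlanStep("merge"))
    return steps


def execute_plan(plan: List[PlanStep], sources: Dict[str, List[Record]]) -> List[Record]:
    """Stand-in for the execution engine: invoke operators as the plan dictates."""
    partial_results: List[List[Record]] = []
    for step in plan:
        if step.operator == "search":
            name, query = step.args
            hits = [rec for rec in sources[name] if query in rec[1]]
            # Each partial result is kept sorted by descending score.
            partial_results.append(sorted(hits, reverse=True))
        elif step.operator == "merge":
            # Ranked merge of already-sorted partial results.
            return list(heapq.merge(*partial_results, reverse=True))
    return []


def run_query(query: str, sources: Dict[str, List[Record]]) -> List[Record]:
    """Stand-in for the search orchestrator: environment lookup, planning, execution."""
    source_names = lookup_environment(sources)
    plan = build_plan(query, source_names)
    return execute_plan(plan, sources)


if __name__ == "__main__":
    demo_sources = {
        "index_a": [(0.9, "gcube search"), (0.2, "gcube index")],
        "index_b": [(0.5, "gcube portal"), (0.1, "other")],
    }
    for score, text in run_query("gcube", demo_sources):
        print(score, text)
```

The sketch keeps planning and execution separate, mirroring the split between the query processor and the execution engine; the simple substring match and score-ordered merge are placeholders for the actual search operators and fusion strategy, which the text above does not specify.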