Search Planner

From Gcube Wiki
Jump to: navigation, search

Search Planner

The outcome of a CQL query is a set of documents that satisfy the criteria defined by the query. Documents are hosted in a gCube infrastructure by Data Sources. Data Sources are able to execute CQL queries, bound to the information they host, and each Data Source supports different CQL capabilities. In most cases the initial CQL query can not be answered solely by a single Data Source. Therefore we need to use the functionality of Search Operators and combine the sets of documents retrieved from various Data Sources. The Search Planner detects which are the Data Sources that must be involved in order to answer the initial query and produces a plan. This plan specifies a)the subquery of the initial query that must be answered by each Data Source and b)how the Data Sources are combined with Search Operators in order to produce the final outcome.

Search Planner takes into account the following facts about the gCube environment:

  • Vertical and Horizontal partitioning of data
  • Data Sources CQL capabilities
  • The costs implied by the execution of a search plan

The information for one document is distributed across multiple Data Sources. We define this information partitioning as the Vertical Data Partitioning. Documents are also divided into collections (see OCMA), forming in such a way a Horizontal Data Partitioning. Multiple Horizontal partitions are hosted by each Data Source. For the example of Figure 1 assume that the documents are divided into collections A, B, C, D, E, which constitute the Horizontal Data Partitioning. The header, type, and location for each document are hosted into different Data Sources, forming the Vertical Data Partitioning. Assume also that DataSource ABh hosts the header information for the documents of collections A and B, source ABt hosts the type information for collections A and B, and ABl hosts the location for documents of A and B. Following the same fashion sources CDh, CDt, CDl, Eh, Et and El host the information for the header, type and location of the documents of collections C, D and E.

Figure 1. CQL query before the first stage of planning

In order to produce the final outcome for a CQL query, different data partitions are implicated. This set of partitions comes of the criteria defined in the CQL query to be answered. To this end, Search planner detects the partitions and the related Data Sources, where the documents satisfying the criteria of the query, belong. The second fact the Search Planner takes into account, is the CQL capabilities supported by each Data Source. Based on the CQL capabilities exposed by Data Sources, the Planner discovers which subqueries of the initial query, can be answered by each Data Source. Taking into account the data partitioning and the CQL capabilities supported by the Data Sources, the Planner ends up with a number of search plans that are sufficient for answering the initial query. The final goal is to find the plan that minimizes the total cost implied by its execution. The parameters affecting the total cost of a search plan are the following:

  • Computational cost for answering a given CQL query by a specific Data Source.
  • Computational cost for combining the outcome of Data Sources by a specific Search Operator.
  • Communication cost for transferring a set of documents from a machine hosting a Data Source or an Operator, to a machine hosting an Operator.
  • Cached results from previous queries that can be used to answer parts of the current query.

The Planning process consists of two stages. In the first stage we apply a set of static rules and we rewrite the initial query. Through this first stage, we aim at detecting the minimum set of partitions that need to be involved in order to answer the query. In the example of Figure 1 the documents coming from the left subtree of the root can be part of collections C, D or E. The right child of the root, will further filter the documents of the left subtree, so that only the documents that belong to collection E are returned as results. The rewriting rules applied in this example will substitute criteria collection=A, collection=B, which are the right children of the two NOT nodes of the query tree. The rewritten query is depicted in Figure 2. In this example the Planner detects, after the rewriting stage, that only partition defined by collection E is involved in this query. The static rewriting rules applied in the first stage of planning, are a combination of a)boolean algebra rules and b)rules related to the horizontal and vertical partitioning of the data.

Figure 2. CQL query after the first stage of planning

In the second stage of the Planning process we search for the optimal plan that answers the rewritten query. The Planner estimates the total cost of each plan, taking into account the four parameters above. The plan signifying the best cost estimation, is considered as the optimal. An exhaustive search of the space containing all the alternative plans is clearly inefficient for large queries. Instead of exhaustive search, we explore a part of the search space using factorization techniques [1] [2] [3] [4]. The rationale of factorization here is to examine queries, equivalent to the initial one, that have corresponding plans which are close to the optimal. Consider again the situation described in the example of Figures 1 and 2. The plan that corresponds to the rewritten query implies the invocation of source Eh with input the CQL query 'header all "marine polution"' and source Et with input the query 'type = "report"'. Then a merge operator with input the sets of documents coming from Eh and Et, provides as output their union. By the same token, a Merge operator keeps the union of the sets generated by source El with input the query 'location within "Europe"' and source Eh with input 'header all "polution"'. On top of the two merge operations a Join operation provides as a final outcome for the query the common subset of the two unions. The Planner using factorization techniques will explore the search space of the equivalent queries and will try to find a near-optimal plan. Assume for this example, that we already have cached results for subquery S1 : '(type = "report") AND (location within "Europe") AND (collection == "E")'. In Figure 3 you see the equivalent query, the Planner will end up with. Note that the subquery 'header all "marine polution"' is used as a common factor. In the plan corresponding to the query of Figure 3, source Eh is invoked with input the CQL query 'header all "marine polution"' and then a merge operation is applied on the output of Eh and the cached results for S1.

Figure 3. CQL query after the second stage of planning

The factorization algorithm we use in the second planning stage is driven by a)the detection of subqueries that can be answered solely by single Data Sources, b)the detection of subqueries with cached results and c)the cost of the corresponding plan for each equivalent query that is being examined.

References

  1. A. Mintz and M.C. Golumbic, "Factoring boolean functions using graph partitioning," in Discrete Appl. Math. 149, 1-3, 131-153, 2005.
  2. S.C. Chang and D.I. Cheng, "Efficient Boolean division and substitution using redundancy addition and removing," in IEEE Trans. Comput. Aided Design, 18:1096-1106, 1999.
  3. T. Stanion and C. Sechen, "Boolean division and factorization using binary decision diagrams," in IEEE Trans. Comput. Aided Design, 13:1179-1184, 1994.
  4. A.R.R. Wang, "Algorithms for multilevel logic optimization," Ph.D. Thesis, University of California, Berkeley, 1989.