Creating Indices at the VO Level

Latest revision as of 18:04, 9 May 2014

Indexing Procedure

The Indexing procedure creates indices for the collections imported into a Virtual Organization. It consists of three steps:

  • Creation of the Rowset XSLT generic resources, which transform collection data into data that can be fed to an Index.
  • Creation of the Index Type generic resources, which define the Index configuration.
  • Definition of an IRBootstrapper job that will perform the steps required to create the Indices.

In the first two steps we create generic resources for the Rowset XSLTs and Index Types through the Resource Management portlet. You can find detailed descriptions for the Rowset data (the output of the Rowset XSLT transformation) in the following section:

  • Full Text Index Rowset (Index Management Framework, RowSet section)

You can find detailed descriptions for the Index Type definition here:

  • Full Text Index Type (Index Management Framework, IndexType section)

The third step requires the definition of an IRBootstrapper job. You can find the details for defining such a job in the IR Bootstrapper section. To complete the Index creation, the administrator must open the IRBootstrapper portlet and run the job. The example that follows illustrates the three steps.

Creating an Index for an OAI-DC collection

DataTransformation Programs

FtsRowset_Transformer

The following transformation program is called to create the fulltext rowset. The transformation unit with id="6" takes multiple XSLTs and applies the final XSLT at the end.

File:FtsRowset Transformer.xml

Index Types

This section presents the IndexType required for the FullText Index.

FullTextIndexType

To extract the fields from the OAI-DC payload and build the FullText Index, the following FullTextIndexType is required:

<Name>IndexType_ft_oai_dc_1.0</Name>
<SecondaryType>FullTextIndexType</SecondaryType>
<Description>Definition of the fulltext index type for the 'oai dc' schema</Description>
<Body>
    <index-type name="default">
        <field-list sort-xnear-stop-word-threshold="2E8">
              <field name="contributor">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="coverage">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="creator">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="date">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="description">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="format">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="identifier">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="language">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="publisher">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="relation">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="rights">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="source">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="subject">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="title">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="type">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="ObjectID">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="gDocCollectionID">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="gDocCollectionLang">
                     <index>yes</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>yes</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
              <field name="S">
                     <index>no</index>
                     <store>yes</store>
                     <return>yes</return>
                     <tokenize>no</tokenize>
                     <sort>no</sort>
                     <boost>1.0</boost>
              </field>
        </field-list>
    </index-type>
</Body>
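
Because the listing above repeats the same six settings for every field, a typo in one of them is easy to miss. The following is a minimal sketch of a sanity check that could be run on an index type definition before it is saved as a generic resource. It is not part of gCube: the function names `read_fields` and `find_problems` are hypothetical, the `<resource>` wrapper exists only so that the fragment parses as one XML document, and the rule that flags take only `yes` or `no` is an assumption drawn from the values shown in the listing.

```python
import xml.etree.ElementTree as ET

# Settings that every <field> in the listing above carries.
REQUIRED_SETTINGS = ("index", "store", "return", "tokenize", "sort", "boost")

# Trimmed copy of the listing above (two fields only). The <resource> wrapper
# is a placeholder added so the fragment parses as a single XML document.
SAMPLE = """
<resource>
  <Name>IndexType_ft_oai_dc_1.0</Name>
  <SecondaryType>FullTextIndexType</SecondaryType>
  <Body>
    <index-type name="default">
      <field-list sort-xnear-stop-word-threshold="2E8">
        <field name="title">
          <index>yes</index><store>yes</store><return>yes</return>
          <tokenize>yes</tokenize><sort>no</sort><boost>1.0</boost>
        </field>
        <field name="S">
          <index>no</index><store>yes</store><return>yes</return>
          <tokenize>no</tokenize><sort>no</sort><boost>1.0</boost>
        </field>
      </field-list>
    </index-type>
  </Body>
</resource>
"""


def read_fields(xml_text):
    """Return {field name: {setting: value}} for every <field> element."""
    root = ET.fromstring(xml_text)
    fields = {}
    for field in root.iter("field"):
        name = field.get("name")
        fields[name] = {child.tag: (child.text or "").strip() for child in field}
    return fields


def find_problems(fields):
    """List missing settings, flags that are not yes/no, and non-numeric boosts."""
    problems = []
    for name, settings in fields.items():
        for key in REQUIRED_SETTINGS:
            if key not in settings:
                problems.append(f"{name}: missing <{key}>")
            elif key == "boost":
                try:
                    float(settings[key])
                except ValueError:
                    problems.append(f"{name}: <boost> is not a number")
            elif settings[key] not in ("yes", "no"):
                # Assumption: flags are yes/no only, as in the listing above.
                problems.append(f"{name}: <{key}> must be yes or no")
    return problems


if __name__ == "__main__":
    fields = read_fields(SAMPLE)
    indexed = [n for n, s in fields.items() if s.get("index") == "yes"]
    print("fields with <index>yes</index>:", indexed)
    print("problems:", find_problems(fields))
```

To check a full definition, pass its text (wrapped in a single root element) to `read_fields` instead of `SAMPLE`. An empty list from `find_problems` means that every field carries all six settings with plausible values.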


Bootstrapper Configuration

The IRBootstrapper portlet requires a Generic Resource to be available on the IS with name: IRBootstrapperConfiguration and secondary type: IRBootstrapperConfig. For more information, refer to the IRBootstrapperConfiguration Generic Resource section (https://gcube.wiki.gcube-system.org/gcube/index.php/IR_Bootstrapper#Bootstrapper_Static_Configuration).

An example of the configuration is the following:

File:Bootstrapper Configuration.xml

Metadata Broker XSLT

  • BrokerXSLT_oai_dc_anylanguage_to_ftRowset_anylanguage

The following XSLT transforms data elements with oai-dc schema to fulltext rowsets:

File:BrokerXSLT oai dc anylanguage to ftRowset anylanguage.xml