Indexer Service

From Gcube Wiki

Latest revision as of 10:47, 2 September 2011

This is a stateful Web Service that serves as a wrapper for the Parallel Indexing application developed by the INSPIRE team.


Notes to administrator:

For the Indexer service to work in a scope, the org.gcube.application.parallelindexing.jar file must be uploaded manually to the Content Management System before an Indexer job is run, because we cannot specify requirements for the software of the execution node through the Hadoop Adaptor. Use the org.gcube.execution.indexerservice.tests.UploadIndexerJarClient client for this purpose. It is invoked as:

java org.gcube.execution.indexerservice.tests.UploadIndexerJarClient <scope>

e.g.

java org.gcube.execution.indexerservice.tests.UploadIndexerJarClient /gcube/devNext

Make sure that the Content Management jars are on your classpath and that the org.gcube.application.parallelindexing.jar file exists under the $GLOBUS_LOCATION/lib directory. A unique collection is created in scope "/gcube/devNext", containing only the jar file "$GLOBUS_LOCATION/lib/org.gcube.application.parallelindexing.jar" from your filesystem. Indexer Service instances locate that file in Content Management at runtime and send it along with the other job parameters when a new Indexer Job is submitted through the Hadoop Adaptor.
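Since the upload client expects the jar under $GLOBUS_LOCATION/lib, it can help to check that precondition before calling it. The following is a minimal sketch of such a check; the class JarPreflightCheck and its methods are hypothetical helpers written for illustration and are not part of the Indexer Service code base.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-flight check, not part of the Indexer Service distribution.
class JarPreflightCheck {

    // Jar name taken from the administrator notes above.
    static final String JAR_NAME = "org.gcube.application.parallelindexing.jar";

    // Builds the path where the upload client is expected to look for the jar.
    static Path expectedJarPath(String globusLocation) {
        return Paths.get(globusLocation, "lib", JAR_NAME);
    }

    // Returns true only if the jar exists as a readable regular file.
    static boolean jarIsReadable(String globusLocation) {
        if (globusLocation == null || globusLocation.isEmpty()) {
            return false; // GLOBUS_LOCATION is not set
        }
        Path jar = expectedJarPath(globusLocation);
        return Files.isRegularFile(jar) && Files.isReadable(jar);
    }

    public static void main(String[] args) {
        String globusLocation = System.getenv("GLOBUS_LOCATION");
        if (jarIsReadable(globusLocation)) {
            System.out.println("Found " + expectedJarPath(globusLocation));
        } else {
            System.err.println("Missing " + JAR_NAME + " under $GLOBUS_LOCATION/lib");
            System.exit(1);
        }
    }
}
```

Running this class before UploadIndexerJarClient reports a missing jar early, instead of leaving the problem to surface when the first Indexer Job is submitted.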


If the Hadoop installation in the scope requires fully qualified names that include the URI of the Hadoop cluster (e.g. it needs hdfs://node1.hadoop.research-infrastructures.eu:8020/user/INSPIRE/smalldata/ instead of /user/INSPIRE/smalldata/), add the following XML element to the $GLOBUS_LOCATION/GHNLabels.xml file of the container on the machine that hosts the Hadoop gateway:

       <Variable>
         <Key>hadoopLocationPrefix</Key>
         <Value>hdfs://node1.hadoop.research-infrastructures.eu:8020</Value>
       </Variable>
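The effect of such a prefix can be illustrated with a small sketch that turns a bare path into a fully qualified name. The source does not describe how the service actually applies hadoopLocationPrefix, so the class HadoopLocationQualifier and its qualify method are assumptions made for illustration only.

```java
import java.net.URI;

// Hypothetical helper: shows one plausible way a location prefix could be applied.
class HadoopLocationQualifier {

    // Prepends the prefix to locations that do not already name a host.
    static String qualify(String prefix, String location) {
        if (prefix == null || prefix.isEmpty()) {
            return location; // no prefix configured, keep the location unchanged
        }
        URI uri = URI.create(location);
        if (uri.getAuthority() != null || uri.getPath() == null) {
            return location; // already fully qualified, or not a path-style URI
        }
        // Avoid a double slash when the prefix ends with one.
        String base = prefix.endsWith("/") ? prefix.substring(0, prefix.length() - 1) : prefix;
        String path = uri.getPath();
        return base + (path.startsWith("/") ? path : "/" + path);
    }

    public static void main(String[] args) {
        String prefix = "hdfs://node1.hadoop.research-infrastructures.eu:8020";
        System.out.println(qualify(prefix, "/user/INSPIRE/smalldata/"));
        System.out.println(qualify(prefix, "hdfs:///tmp/output"));
    }
}
```

Under these assumptions, both a bare path and an hdfs:/// URI without a host are rewritten to include the cluster address, while locations that already contain a host are left untouched.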


Notes to developer:

When the indexer service factory receives a call from a user, it tries to find a Workflow Engine instance in that scope, which it will use to submit a new job through the Hadoop Adaptor. On success, it creates a Web Service resource (WS-resource) for that job, holding information such as the job name, execution id and workflow engine endpoint. A background thread runs periodically: it collects all WS-resources, polls the workflow engine instance for the jobs that are still running, and updates the corresponding WS-resources.
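The periodic polling described above can be sketched as follows. This is not the service's actual implementation: JobResource is a plain in-memory stand-in for a WS-resource, the Predicate stands in for the call to the workflow engine, and all class, field and method names are hypothetical. The description strings are borrowed from the client output shown later on this page.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

// Hypothetical sketch of a background poller that keeps job resources up to date.
class JobStatusPoller {

    // In-memory stand-in for the WS-resource created per submitted job.
    static final class JobResource {
        final String jobName;
        final String executionId;
        final String engineEndpoint;
        volatile boolean running = true;
        volatile String description = "Indexer Job is running";
        volatile long lastPollMillis;

        JobResource(String jobName, String executionId, String engineEndpoint) {
            this.jobName = jobName;
            this.executionId = executionId;
            this.engineEndpoint = engineEndpoint;
        }
    }

    private final Map<String, JobResource> resources = new ConcurrentHashMap<>();
    private final Predicate<JobResource> engineSaysFinished; // stand-in for the workflow engine query

    JobStatusPoller(Predicate<JobResource> engineSaysFinished) {
        this.engineSaysFinished = engineSaysFinished;
    }

    // Called when a job has been submitted successfully.
    void register(JobResource resource) {
        resources.put(resource.executionId, resource);
    }

    // One polling pass: only jobs that are still running are queried and updated.
    void pollOnce() {
        for (JobResource r : resources.values()) {
            if (!r.running) {
                continue;
            }
            boolean finished = engineSaysFinished.test(r);
            r.lastPollMillis = System.currentTimeMillis();
            if (finished) {
                r.running = false;
                r.description = "Indexer job finished with no reported errors";
            }
        }
    }

    // Starts the background thread; the caller should shut the executor down on exit.
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleWithFixedDelay(this::pollOnce, 0, periodSeconds, TimeUnit.SECONDS);
        return executor;
    }

    public static void main(String[] args) {
        JobStatusPoller poller = new JobStatusPoller(r -> r.executionId.endsWith("-done"));
        JobResource job = new JobResource("demo job", "exec-1-done", "http://engine.example/wf");
        poller.register(job);
        poller.pollOnce();
        System.out.println(job.jobName + ": " + job.description);
    }
}
```

Skipping resources that are no longer running keeps each pass proportional to the number of active jobs, which is the reason the sketch stores the running flag on the resource itself.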


Notes to user:

The Indexer service can be consumed through the org.gcube.execution.indexerservice.tests.TestIndexerService client. That client submits a Parallel Indexing job, given the input and output locations and the number of shards, and then polls the status of the job until completion. The output of the job is a directory in HDFS.

Usage:

java org.gcube.execution.indexerservice.tests.TestIndexerService <indexer factory address> <gcube scope> <input location> <output location> <shards number> <optional job name in >=0 words >
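The positional arguments in the usage line can be read as in the sketch below, where every word after the shard number becomes part of the job name. This is an illustration of the argument layout only: the class IndexerClientArgs is hypothetical and does not reproduce the parsing code of TestIndexerService.

```java
import java.util.Arrays;

// Hypothetical holder for the command-line arguments listed in the usage line.
class IndexerClientArgs {
    final String factoryAddress;
    final String scope;
    final String input;
    final String output;
    final int shards;
    final String jobName; // empty when no job name words are given

    private IndexerClientArgs(String factoryAddress, String scope, String input,
                              String output, int shards, String jobName) {
        this.factoryAddress = factoryAddress;
        this.scope = scope;
        this.input = input;
        this.output = output;
        this.shards = shards;
        this.jobName = jobName;
    }

    // Parses five mandatory arguments followed by zero or more job name words.
    static IndexerClientArgs parse(String[] args) {
        if (args.length < 5) {
            throw new IllegalArgumentException(
                "expected: <indexer factory address> <gcube scope> <input location> "
                + "<output location> <shards number> [job name words...]");
        }
        int shards = Integer.parseInt(args[4]); // throws NumberFormatException if not a number
        if (shards < 1) {
            throw new IllegalArgumentException("shards number must be positive");
        }
        String jobName = String.join(" ", Arrays.copyOfRange(args, 5, args.length));
        return new IndexerClientArgs(args[0], args[1], args[2], args[3], shards, jobName);
    }

    public static void main(String[] args) {
        IndexerClientArgs parsed = parse(args);
        System.out.println("Scope: " + parsed.scope + ", shards: " + parsed.shards
            + ", job name: '" + parsed.jobName + "'");
    }
}
```

Treating the trailing words as a single job name is what allows the examples on this page to pass an unquoted, multi-word name after the shard count.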

Example of use when fully qualified input location is needed by hadoop installation:

java org.gcube.execution.indexerservice.tests.TestIndexerService http://jazzman.di.uoa.gr:8080/wsrf/services/gcube/execution/indexerservice/IndexerServiceFactory /d4science.research-infrastructures.eu/INSPIRE hdfs://node1.hadoop.research-infrastructures.eu:8020/user/INSPIRE/smalldata hdfs://node1.hadoop.research-infrastructures.eu:8020/tmp/output 5 Indexing by John in /user/INSPIRE/smalldata/texts

Example of use when fully qualified input location is not needed by hadoop installation:

java org.gcube.execution.indexerservice.tests.TestIndexerService http://jazzman.di.uoa.gr:8080/wsrf/services/gcube/execution/indexerservice/IndexerServiceFactory /gcube/devNext hdfs:///user/INSPIRE/smalldata/texts hdfs:///tmp/output 5 Indexing by John in devNext scope


Example of execution:

java org.gcube.execution.indexerservice.tests.TestIndexerService http://jazzman.di.uoa.gr:8080/wsrf/services/gcube/execution/indexerservice/IndexerServiceFactory   /d4science.research-infrastructures.eu/INSPIRE     hdfs:///user/INSPIRE/smalldata.har  hdfs:///tmp/tmp000444457763211  5  Indexing in Inspire scope
 
Client started with arguments: 
Indexer factory:  http://jazzman.di.uoa.gr:8080/wsrf/services/gcube/execution/indexerservice/IndexerServiceFactory
Scope:            /d4science.research-infrastructures.eu/INSPIRE
Input:            hdfs:///user/INSPIRE/smalldata.har
Output:           hdfs:///tmp/tmp000444457763211
Shard Number:     5
jobName:          Indexing in Insp
 
Getting Indexer Factory stub
Preparing Indexer arguments
Calling submit() for Indexing job
 
Calling status() until completion
 
 
Description:    Indexer Job is running
Last poll date: Thu Sep 01 14:17:27 EEST 2011
 
Description:    Indexer Job is running
Last poll date: Thu Sep 01 14:17:27 EEST 2011
 
Description:    Indexer Job is running
Last poll date: Thu Sep 01 14:17:42 EEST 2011
 
Description:    Indexer Job is running
Last poll date: Thu Sep 01 14:17:42 EEST 2011
 
Description:    Indexer Job is running
Last poll date: Thu Sep 01 14:18:07 EEST 2011
 
Description:    Indexer Job is running
Last poll date: Thu Sep 01 14:18:07 EEST 2011
 
Description:    Indexer Job is running
Last poll date: Thu Sep 01 14:18:07 EEST 2011
 
Description:    Indexer Job is running
Last poll date: Thu Sep 01 14:18:32 EEST 2011
 
Description:    Indexer Job is running
Last poll date: Thu Sep 01 14:18:32 EEST 2011
 
Description:    Indexer Job is running
Last poll date: Thu Sep 01 14:18:57 EEST 2011
 
Description:    Indexer Job is running
Last poll date: Thu Sep 01 14:18:57 EEST 2011
 
Description:    Indexer Job is running
Last poll date: Thu Sep 01 14:18:57 EEST 2011
 
Description:    Indexer job finished with no reported errors
Last poll date: Thu Sep 01 14:19:22 EEST 2011
 
 
Indexing Job has finished:
-----------------------------------------------------
Job Name:        Indexing in Insp
Description:     Indexer job finished with no reported errors
Input:           hdfs:///user/INSPIRE/smalldata.har
Output:          hdfs:///tmp/tmp000444457763211
Submited:        Thu Sep 01 14:17:27 EEST 2011
Last Poll:       Thu Sep 01 14:19:22 EEST 2011
Error:           null
ErrorDetails:    null
output ssid:     cms://5f930a20-a619-11e0-8948-d83cf0a68390/3beef1a0-d48c-11e0-98d7-9759330c054b
stdout ssid:     cms://5f930a20-a619-11e0-8948-d83cf0a68390/0e5275f0-d48c-11e0-98d7-9759330c054b
stderr ssid:     cms://5f930a20-a619-11e0-8948-d83cf0a68390/0e2c2940-d48c-11e0-98d7-9759330c054b