InspireIndexing

INSPIRE Parallel Indexing

The parallel indexing module is designed for indexing very large collections of full-text documents (hundreds of thousands or millions). The implementation uses the Hadoop and Lucene libraries and is meant to be executed on a Hadoop facility. Both the input documents and the computed indexes are located on the Hadoop DFS. An indexing job takes the following arguments:

  • Input directory
  • Output directory
  • Number of workers
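
The three arguments above could be validated before a job is submitted. The following is a minimal sketch in plain Java, not the module's actual code: the class name IndexingJobArgs, the positional argument order, and the rule that the number of workers must be at least 1 are all assumptions made for illustration.

```java
// Hypothetical holder for the three job arguments listed above.
// The class name, the argument order, and the validation rules are
// assumptions for illustration; they are not taken from the INSPIRE module.
final class IndexingJobArgs {
    final String inputDir;   // DFS directory containing the input documents
    final String outputDir;  // DFS directory where the computed indexes are written
    final int numWorkers;    // number of parallel workers

    private IndexingJobArgs(String inputDir, String outputDir, int numWorkers) {
        this.inputDir = inputDir;
        this.outputDir = outputDir;
        this.numWorkers = numWorkers;
    }

    // Parses exactly three positional arguments (assumed order:
    // input directory, output directory, number of workers).
    static IndexingJobArgs parse(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException(
                "Usage: <input directory> <output directory> <number of workers>");
        }
        int workers;
        try {
            workers = Integer.parseInt(args[2]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                "Number of workers must be an integer: " + args[2], e);
        }
        if (workers < 1) {
            throw new IllegalArgumentException(
                "Number of workers must be at least 1: " + workers);
        }
        return new IndexingJobArgs(args[0], args[1], workers);
    }

    public static void main(String[] args) {
        IndexingJobArgs job = parse(args);
        System.out.println("input=" + job.inputDir
            + " output=" + job.outputDir
            + " workers=" + job.numWorkers);
    }
}
```

Rejecting malformed arguments up front means a mistyped worker count fails immediately on the client rather than after the job has been scheduled on the cluster.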