Difference between revisions of "InspireOCR"

From Gcube Wiki

Revision as of 13:45, 29 June 2011

INSPIRE Optical Character Recognition (OCR)

Introduction

Optical character recognition (OCR) is the translation of scanned documents into machine-encoded text. The CERN library and many other digital repositories hold large numbers of scanned documents for which no textual information is available. As a result, these documents cannot be searched for words or phrases, and techniques such as text mining cannot be applied to them. OCR has often been done with commercial services and tools, but powerful open source OCR tools are now available. With these tools, the OCR process can be carried out on a single workstation or divided into many parallel grid computing jobs. The OCR tool used here is OCRopus.

OCRopus is a document analysis and OCR system featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multilingual capabilities. It is released under the Apache License and has a modular, plugin-based design. OCRopus is also well suited to large-scale batch processing, so OCR tasks can be divided into independent grid jobs. A typical OCR process consists of selecting a set of scanned documents in PDF format and then performing document layout analysis, line recognition and character identification. The output is in hOCR format (an HTML document), which can be converted into PDF format. All the tools needed for OCR are available in one package (a tar.gz file) that can be sent with the grid job; alternatively, the tools can be pre-installed on the grid nodes where the jobs are executed.
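
Dividing the OCR work into independent grid jobs can be illustrated with a short sketch. This is not part of the INSPIRE OCR code: the function name `split_into_batches`, the batch size and the file names are hypothetical, and the sketch only shows how a set of scanned PDF files might be grouped so that each group is handled by one job.

```python
from typing import List


def split_into_batches(documents: List[str], batch_size: int) -> List[List[str]]:
    """Group document names into batches; each batch would become one grid job."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [documents[i:i + batch_size]
            for i in range(0, len(documents), batch_size)]


# Hypothetical file names, used only for illustration.
pdfs = [f"scan_{n:03d}.pdf" for n in range(1, 8)]
batches = split_into_batches(pdfs, batch_size=3)

for job_id, batch in enumerate(batches):
    print(job_id, batch)
```

Because the batches share no state, each one could be submitted as a separate job and the results collected afterwards.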

JDLAdaptor Examples

JDLAdaptor input files needed:

  • JDL file
  • resource file
  • OCR code
  • OCR job file


JDL file example:

[
    Type = "Job";
    JobType = "Normal";
    Executable = "ocrjob.sh";
    Arguments = "None";
    StdOutput = "job.out";
    StdError = "job.err";
    VirtualOrganisation = "d4science.research-infrastructures.eu";
    InputSandbox = {"NobelAnnounce.pdf", "ocrjob.sh", "ocropus-0.3.1-i386-JK.tar.gz", "job_d4science_grid.job"};
    OutputSandbox = {"NobelAnnounce.pdf", "NobelAnnounce.pdf.hocr", "job.out", "job.err"};
    Requirements = other.GlueHostOperatingSystemName == "ScientificSL" &&
                   other.GlueHostOperatingSystemRelease == 5.0;
]
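
When many documents are processed, a JDL file like the one above could be generated per document rather than written by hand. The following is a minimal sketch under that assumption; the helper names `jdl_list` and `build_ocr_jdl` are hypothetical, the default values are copied from the example, and the Requirements attribute is left out for brevity.

```python
def jdl_list(items):
    """Format a Python list as a JDL list of quoted strings."""
    return "{" + ", ".join(f'"{item}"' for item in items) + "}"


def build_ocr_jdl(pdf_name,
                  executable="ocrjob.sh",
                  ocr_package="ocropus-0.3.1-i386-JK.tar.gz",
                  job_file="job_d4science_grid.job",
                  vo="d4science.research-infrastructures.eu"):
    """Return JDL text for one OCR job (hypothetical helper, not a gCube API)."""
    input_sandbox = [pdf_name, executable, ocr_package, job_file]
    # The job returns the processed PDF, the hOCR file and the two log files.
    output_sandbox = [pdf_name, pdf_name + ".hocr", "job.out", "job.err"]
    lines = [
        "[",
        '    Type = "Job";',
        '    JobType = "Normal";',
        f'    Executable = "{executable}";',
        '    Arguments = "None";',
        '    StdOutput = "job.out";',
        '    StdError = "job.err";',
        f'    VirtualOrganisation = "{vo}";',
        f"    InputSandbox = {jdl_list(input_sandbox)};",
        f"    OutputSandbox = {jdl_list(output_sandbox)};",
        "]",
    ]
    return "\n".join(lines)


jdl_text = build_ocr_jdl("NobelAnnounce.pdf")
print(jdl_text)
```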


Resource file example:

scope # /d4science.research-infrastructures.eu/INSPIRE
jdl # /home/jklem/d4s_process_engine/wsclient/JDLAdaptorOCR/ocr.jdl
chokeProgressEvents # false
chokePerformanceEvents # false
storePlans # true
NobelAnnounce.pdf#local#/home/jklem/d4s_process_engine/wsclient/JDLAdaptorOCR/NobelAnnounce.pdf
ocrjob.sh#local#/home/jklem/d4s_process_engine/wsclient/JDLAdaptorOCR/ocrjob.sh
ocropus-0.3.1-i386-JK.tar.gz#url#ftp://meteora.di.uoa.gr/d5s/ocropus-0.3.1-i386-JK.tar.gz
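
The resource file above appears to mix two kinds of lines: settings written as `key # value`, and file entries written as `name#type#location`. The sketch below reads a file on that assumption. The format is inferred from this single example, so it may not cover everything the JDLAdaptor accepts, and `parse_resource_file` is a hypothetical helper rather than part of gCube.

```python
def parse_resource_file(text):
    """Split resource file text into settings and file entries.

    Assumes two-field lines are settings and three-field lines are
    file entries (name, type, location), as in the example above.
    """
    settings = {}
    files = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # Split on at most two '#' so the location field is kept whole.
        parts = [part.strip() for part in line.split("#", 2)]
        if len(parts) == 2:
            settings[parts[0]] = parts[1]
        elif len(parts) == 3:
            files.append({"name": parts[0], "type": parts[1], "location": parts[2]})
        else:
            raise ValueError(f"Unrecognised resource line: {line!r}")
    return settings, files


example = """\
scope # /d4science.research-infrastructures.eu/INSPIRE
storePlans # true
ocrjob.sh#local#/home/jklem/d4s_process_engine/wsclient/JDLAdaptorOCR/ocrjob.sh
ocropus-0.3.1-i386-JK.tar.gz#url#ftp://meteora.di.uoa.gr/d5s/ocropus-0.3.1-i386-JK.tar.gz
"""

settings, files = parse_resource_file(example)
print(settings)
print(files)
```

A check like this can catch a mistyped line before the job is submitted, for instance an entry with a missing type field.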


The output is a PDF file with textual content and a file in hOCR format.

GridAdaptor Examples

GridAdaptor input files needed (detailed examples will be added):

  • JDL file
  • resource file
  • OCR code
  • OCR job file

The output is a PDF file with textual content and a file in hOCR format.

Local OCR execution

The OCR process can be tested locally by running the OCR job file script. Local PDF files are processed and textual content is added to them.