InspireOCR

From Gcube Wiki

INSPIRE Optical Character Recognition (OCR)

Introduction

Optical character recognition (OCR) is the translation of scanned documents into machine-encoded text. The CERN library and many other digital repositories hold large numbers of scanned documents for which no textual information is available. As a result, these documents cannot be searched for words or phrases, and techniques such as text mining cannot be applied to them. OCR has often been done with commercial services and tools, but powerful open source OCR tools are now available. With these tools the OCR process can be run on a single workstation or divided into many parallel grid computing jobs. The OCR tool used here is OCRopus.

OCRopus is a document analysis and OCR system featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multilingual capabilities. It is released under the Apache License and has a modular, plugin-based design. OCRopus is also well suited to large-scale batch processing, so OCR tasks can be divided into independent grid jobs. A typical OCR process consists of selecting a set of scanned documents in PDF format and then performing document layout analysis, line recognition and character identification. The output is in hOCR format (an HTML document), which can be converted into PDF format. All the tools needed for OCR are available in one package (a tar.gz file) that can either be sent with the grid job or pre-installed on the grid nodes where the jobs are executed.


JDLAdaptor Examples

JDLAdaptor input files needed:

  • JDL file
  • resource file
  • OCR code (available as a ".tar.gz" file)
  • OCR job script


JDL file example:

[
    Type = "Job";
    JobType = "Normal";
    Executable = "ocrjob.sh";
    Arguments = "None";
    StdOutput = "job.out";
    StdError = "job.err";
    VirtualOrganisation = "d4science.research-infrastructures.eu";
    InputSandbox = {"NobelAnnounce.pdf", "ocrjob.sh", "ocropus-0.3.1-i386-JK.tar.gz", "job_d4science_grid.job"};
    OutputSandbox = {"NobelAnnounce.pdf", "NobelAnnounce.pdf.hocr", "job.out", "job.err"};
    Requirements = other.GlueHostOperatingSystemName == "ScientificSL" &&
                   other.GlueHostOperatingSystemRelease == 5.0;
]
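
When many documents are processed, writing one JDL file per PDF by hand becomes tedious. The following is a minimal sketch that prints a JDL description modelled on the example above for a given PDF file name. The function name make_jdl is hypothetical, the fixed values are simply copied from the example, and the Requirements line is left out; adjust all of these to your own setup.

```shell
#!/bin/sh
# make_jdl PDF_NAME
# Hypothetical helper (not part of JDLAdaptor): prints a JDL description
# modelled on the example above, with PDF_NAME in both sandboxes.
make_jdl() {
    pdf=$1
    cat <<EOF
[
    Type = "Job";
    JobType = "Normal";
    Executable = "ocrjob.sh";
    Arguments = "None";
    StdOutput = "job.out";
    StdError = "job.err";
    VirtualOrganisation = "d4science.research-infrastructures.eu";
    InputSandbox = {"${pdf}", "ocrjob.sh", "ocropus-0.3.1-i386-JK.tar.gz", "job_d4science_grid.job"};
    OutputSandbox = {"${pdf}", "${pdf}.hocr", "job.out", "job.err"};
]
EOF
}

# Example use (writes the description to ocr.jdl):
#   make_jdl NobelAnnounce.pdf > ocr.jdl
```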


Resource file example:

scope # /d4science.research-infrastructures.eu/INSPIRE
jdl # /home/jklem/d4s_process_engine/wsclient/JDLAdaptorOCR/ocr.jdl
chokeProgressEvents # false
chokePerformanceEvents # false
storePlans # true
NobelAnnounce.pdf#local#/home/jklem/d4s_process_engine/wsclient/JDLAdaptorOCR/NobelAnnounce.pdf
ocrjob.sh#local#/home/jklem/d4s_process_engine/wsclient/JDLAdaptorOCR/ocrjob.sh
ocropus-0.3.1-i386-JK.tar.gz#url#ftp://meteora.di.uoa.gr/d5s/ocropus-0.3.1-i386-JK.tar.gz
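
A job submission fails late if one of the local files named in the resource file does not exist. As a sketch, the check below reads a resource file in the layout shown above and reports every name#local#path entry whose path is missing. The function name check_local_files is hypothetical, and the field layout is inferred from the example rather than from a format specification.

```shell
#!/bin/sh
# check_local_files RESOURCE_FILE
# Hypothetical helper: reports entries of the form name#local#path whose
# path does not exist. Returns non-zero if any file is missing.
# Lines such as "scope # ..." or "name#url#..." are skipped.
check_local_files() {
    missing=0
    while IFS='#' read -r name kind path || [ -n "$name" ]; do
        [ "$kind" = "local" ] || continue
        if [ ! -f "$path" ]; then
            echo "missing: $name ($path)" 1>&2
            missing=1
        fi
    done < "$1"
    return $missing
}

# Example use:
#   check_local_files resources.txt || exit 1
```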


The output is a PDF file with textual content and a file in hOCR format.


GridAdaptor Examples

GridAdaptor input files needed:

  • JDL file
  • resource file
  • OCR code (available as a ".tar.gz" file)
  • OCR job script


JDL file example:

[
    Type = "Job";
    JobType = "Normal";
    Executable = "ocrjob.sh";
    Arguments = "None";
    StdOutput = "job.out";
    StdError = "job.err";
    VirtualOrganisation = "d4science.research-infrastructures.eu";
 
    InputSandbox = {"NobelAnnounce.pdf", "ocrjob.sh", "ocropus-0.3.1-i386-JK.tar.gz", "job_d4science_grid.job", "pdfopt"};
    OutputSandbox = {"NobelAnnounce.pdf", "NobelAnnounce.pdf.hocr", "job.out", "job.err"};
 
]


Resource file example:

scope # /d4science.research-infrastructures.eu/INSPIRE
chokeProgressEvents # false
chokePerformanceEvents # false
storePlans # true
timeout # -1
pollPeriod # 60000
jdl # ocr.jdl # /home/jklem/d4s_process_engine/wsclient/GridAdaptorOCR/ocr.jdl
userProxy # userProxy # /home/jklem/d4s_process_engine/wsclient/userProxy
inData#pdfopt#/usr/bin/pdfopt#local
inData#NobelAnnounce.pdf#/home/jklem/gridsubmit/gridsubmit/NobelAnnounce.pdf#local
inData#ocrjob.sh#/home/jklem/d4s_process_engine/wsclient/GridAdaptorOCR/ocrjob.sh#local
inData#ocropus-0.3.1-i386-JK.tar.gz#ftp://meteora.di.uoa.gr/d5s/ocropus-0.3.1-i386-JK.tar.gz#url
inData#job_d4science_grid.job#/path/job_d4science_grid.job#local
outData#NobelAnnounce.pdf
outData#NobelAnnounce.pdf.hocr
outData#job.out
outData#job.err
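
Note that the GridAdaptor entries in the example above use a different field order from the JDLAdaptor ones: inData#name#path#type for inputs and outData#name for outputs. The sketch below prints the entries for one PDF file in that layout. The function name grid_pdf_entries is hypothetical and the layout is inferred from the example only.

```shell
#!/bin/sh
# grid_pdf_entries PDF_PATH
# Hypothetical helper: prints the GridAdaptor resource file entries for one
# local PDF file, following the layout of the example above.
grid_pdf_entries() {
    path=$1
    name=$(basename "$path")
    printf 'inData#%s#%s#local\n' "$name" "$path"
    printf 'outData#%s\n' "$name"
    printf 'outData#%s.hocr\n' "$name"
}

# Example use (appends the entries to an existing resource file):
#   grid_pdf_entries /home/user/scans/NobelAnnounce.pdf >> resources.txt
```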


The output is a PDF file with textual content and a file in hOCR format.


OCR Job Script Examples

OCR job script for one pdf file (ocrjob.sh):

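#!/bin/sh
# -*- coding: utf-8 -*-
# OCR job script for a single PDF file in the job's working directory.

echo "Start."
echo "Start." 1>&2

# Package name as listed in the InputSandbox of the JDL examples.
OCROPUS_PACKAGE=./ocropus-0.3.1-i386-JK.tar.gz
# Directory the package is expected to unpack into.
OCROPUS_PATH=./ocropus-0.3.1-i386
PYTHON_MAIN=$OCROPUS_PATH/run.py

export OCROSCRIPTS=$OCROPUS_PATH/share/ocropus/scripts
export OCRODATA=$OCROPUS_PATH/share/ocropus
export TESSDATA_PREFIX=$OCROPUS_PATH/share/
export LD_LIBRARY_PATH=$OCROPUS_PATH/lib
# Put the bundled Python modules first, keeping any existing PYTHONPATH.
if [ -n "$PYTHONPATH" ] ; then
  export PYTHONPATH=$OCROPUS_PATH/python:$PYTHONPATH
else
  export PYTHONPATH=$OCROPUS_PATH/python
fi

# Unpack the OCR tools and run the conversion.
tar -xzf "$OCROPUS_PACKAGE"
python "$PYTHON_MAIN"

# Remove the unpacked tools and temporary files.
rm -rf "$OCROPUS_PATH" conversion* tmp*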


The OCR job script for many PDF files is shown below. In this case the PDF files are zipped into many_pdfs_in.zip, which replaces the PDF input file in the other examples. The files can be zipped with the command "zip many_pdfs_in.zip *.pdf". The file many_pdfs_out.zip replaces the PDF and hOCR output files in the other examples.

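#!/bin/sh
# -*- coding: utf-8 -*-
# OCR job script for many PDF files delivered in many_pdfs_in.zip.

echo "Start."
echo "Start." 1>&2

# Package name as listed in the InputSandbox of the JDL examples.
OCROPUS_PACKAGE=./ocropus-0.3.1-i386-JK.tar.gz
# Directory the package is expected to unpack into.
OCROPUS_PATH=./ocropus-0.3.1-i386
PYTHON_MAIN=$OCROPUS_PATH/run.py

export OCROSCRIPTS=$OCROPUS_PATH/share/ocropus/scripts
export OCRODATA=$OCROPUS_PATH/share/ocropus
export TESSDATA_PREFIX=$OCROPUS_PATH/share/
export LD_LIBRARY_PATH=$OCROPUS_PATH/lib
# Put the bundled Python modules first, keeping any existing PYTHONPATH.
if [ -n "$PYTHONPATH" ] ; then
  export PYTHONPATH=$OCROPUS_PATH/python:$PYTHONPATH
else
  export PYTHONPATH=$OCROPUS_PATH/python
fi

# Unpack the input PDF files.
unzip many_pdfs_in.zip

# Unpack the OCR tools and run the conversion.
tar -xzf "$OCROPUS_PACKAGE"
python "$PYTHON_MAIN"

# Remove the unpacked tools and temporary files.
rm -rf "$OCROPUS_PATH" conversion* tmp*

# Collect the results into one archive and remove the loose files.
zip many_pdfs_out.zip *.pdf *.hocr
rm *.pdf *.hocr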
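
To divide a large collection into several independent grid jobs, the PDF files can be spread over several input archives, one per job. The sketch below only computes the assignment, with a fixed number of files per archive. The function name assign_batches and the numbered archive names are hypothetical; since the job script above expects an archive called many_pdfs_in.zip, each job would have to receive its own archive under that name.

```shell
#!/bin/sh
# assign_batches SIZE FILE...
# Hypothetical helper: prints "<archive> <file>" for each file, placing
# SIZE files in each archive (many_pdfs_in_1.zip, many_pdfs_in_2.zip, ...).
assign_batches() {
    size=$1
    shift
    i=0
    for f in "$@"; do
        batch=$(( i / size + 1 ))
        printf 'many_pdfs_in_%d.zip %s\n' "$batch" "$f"
        i=$(( i + 1 ))
    done
}

# Example use (creates the archives, 50 PDF files per job):
#   assign_batches 50 *.pdf | while read -r archive file; do
#       zip -q "$archive" "$file"
#   done
```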


Local OCR execution

The OCR process can be tested locally by executing the OCR job script. Local PDF files are processed and textual content is added to them.
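
After a local test run it is useful to confirm that every PDF file actually received an hOCR result. The sketch below assumes the output naming used in the sandbox examples (NobelAnnounce.pdf gives NobelAnnounce.pdf.hocr). The function name check_hocr_outputs is hypothetical.

```shell
#!/bin/sh
# check_hocr_outputs [DIR]
# Hypothetical helper: for every PDF file in DIR (default: current
# directory) checks that a non-empty <name>.pdf.hocr file exists.
# Returns non-zero if any result is missing.
check_hocr_outputs() {
    dir=${1:-.}
    status=0
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue    # pattern matched nothing
        if [ ! -s "$pdf.hocr" ]; then
            echo "no hOCR output for $pdf" 1>&2
            status=1
        fi
    done
    return $status
}

# Example use for a local test of the single-file job script:
#   sh ocrjob.sh > job.out 2> job.err
#   check_hocr_outputs . || echo "OCR run incomplete, see job.err"
```

The check has to run before any step that packs and removes the result files, as the many-files script does at its end.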


OCR languages

The language for OCR processing is selected with the "@ lang = " line in the ".job" file. The software can also try different OCR languages automatically. The languages tested so far include English (eng) and French (fr).

An example ".job" file:

@ lang = fr
@ jobgroup = test
@ script = /path/ocring/ocrjob.sh
@ package = ftp://meteora.di.uoa.gr/d5s/ocropus-0.3.1-i386-JK.tar.gz
@ grid = d4science_grid
@ scope = /d4science.research-infrastructures.eu/INSPIRE
@ vo = d4science.research-infrastructures.eu
@ proxy = /path/gridsubmit/userProxy


JDL requirements, choosing location for job execution

The OCR process needs some executables (such as pdfopt) to be present on the worker node. A suitable worker node can be selected by adding requirements to the JDL file.
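
Independently of the JDL requirements, the job script itself can fail early with a clear message when a needed executable is absent from the worker node. The sketch below is one way to do this; the function name require_cmds is hypothetical, and the list of commands in the usage line is taken from the tools mentioned on this page.

```shell
#!/bin/sh
# require_cmds CMD...
# Hypothetical helper: reports every command that cannot be found on the
# PATH and returns non-zero if at least one is missing.
require_cmds() {
    status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "required executable not found: $cmd" 1>&2
            status=1
        fi
    done
    return $status
}

# Example use near the top of ocrjob.sh:
#   require_cmds pdfopt python tar || exit 1
```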

The following example (to be added to the JDL file) was the recommended setting as of August 2011:

Requirements =  other.GlueCEUniqueID == "cream-ce.research-infrastructures.eu:8443/cream-pbs-d4science";