InspireRefExtract

From Gcube Wiki

INSPIRE Reference Extraction

Reference extraction documentation.


The Reference Extraction tool (Refextract) automatically extracts references from PDF or text documents. Refextract can be paired with a domain-specific knowledge base to identify domain-related information inside references (e.g. High Energy Physics journal titles or report numbers). The output is an XML file that conforms to the MARC standard: each reference is returned as a numbered field, and the content of the reference is held in lettered sub-fields.
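The field and sub-field layout described above can be read back with a few lines of Python. This is only a sketch: the sample XML, the tag `999` with indicators `C`/`5`, and the sub-field codes `o`, `s` and `r` are illustrative assumptions, not values taken from this page, and the reference contents are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample output. The tag, indicators and sub-field codes are
# assumptions made for illustration; check them against real Refextract output.
SAMPLE = """<collection>
  <record>
    <datafield tag="999" ind1="C" ind2="5">
      <subfield code="o">[1]</subfield>
      <subfield code="s">Example J. 1 (2000) 1</subfield>
    </datafield>
    <datafield tag="999" ind1="C" ind2="5">
      <subfield code="o">[2]</subfield>
      <subfield code="r">EXAMPLE-2000-001</subfield>
    </datafield>
  </record>
</collection>"""


def read_references(xml_text):
    """Return one dict per reference, mapping sub-field code to a list of values."""
    root = ET.fromstring(xml_text)
    references = []
    for field in root.iter("datafield"):
        ref = {}
        for sub in field.findall("subfield"):
            # A code may repeat inside one field, so collect values in a list.
            ref.setdefault(sub.get("code"), []).append(sub.text or "")
        references.append(ref)
    return references


refs = read_references(SAMPLE)
for ref in refs:
    print(ref)
```

Real MARC XML files often declare an XML namespace; in that case the element names passed to `iter` and `findall` would need the namespace prefix.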

Refextract can identify DOIs and URLs inside references. It can also identify author names, which are used primarily to aid multi-citation identification, so that a single reference containing several citations is output as separate cites. Author identification does not rely on a text-file knowledge base, but one can be added to include domain-specific `authors' in the extraction process. Author identification relies on a regular expression that attempts to cover the wide variety of author formats found in citations. The process is context sensitive: previously identified domain-specific information improves author recognition, as does information obtained from other authors, on the premise that this domain-specific information has a one-to-one relationship with citations.

Refextract accepts the following command-line options:

       --fulltext              : PDF or text input file from which to extract references
       --xmlfile               : absolute path of the output XML file
       --dictfile              : also output a dictionary of statistical information about the extracted journals
       --raw-references        : treat the input as pure references (skip the reference section identification step)
       --output-raw-refs       : output a file holding the extracted references without the XML mark-up
       --kb-journal            : specify inline the location of the `journal titles' knowledge base
                 (A text file of line-by-line information which should be recognised inside references.)
       --kb-report-number      : specify inline the location of the `report number' knowledge base
                 (A text file of line-by-line information which should be recognised inside references.
                 It can be treated as a separate `group' of information to obtain from references.
                 It differs from the journal knowledge base by including a numeric pattern which can follow a base string.)

Each line in the journal knowledge base has the following format:

       `upper-case, punctuation-stripped search term'---`standardised replace term (output to the XML)'

The left-hand side (LHS) of the above line is compared against the content of an identified reference line. The right-hand side (RHS) holds the standardised version of the LHS term, and it is this term that is written to the XML file. Numeration following the replace term is handled by Refextract, and is also matched if present.
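As a rough illustration of how such a knowledge base could be loaded and applied, the sketch below parses lines in the `search term'---`replace term' format and looks for the search terms in a normalised reference line. It is not Refextract's own matching code: the sample knowledge-base lines, the function names and the choice to replace punctuation with a space before matching are all assumptions made for the example.

```python
import re
import string

# Hypothetical knowledge-base lines in the SEARCH TERM---Replace term format.
KB_LINES = [
    "PHYSICAL REVIEW D---Phys. Rev. D",
    "PHYS REV D---Phys. Rev. D",
    "NUCLEAR PHYSICS B---Nucl. Phys. B",
]

# Matches any single ASCII punctuation character.
_PUNCT = re.compile("[%s]" % re.escape(string.punctuation))


def normalise(text):
    """Upper-case the text, replace punctuation by spaces, collapse whitespace."""
    return " ".join(_PUNCT.sub(" ", text.upper()).split())


def load_kb(lines):
    """Map each normalised search term (LHS) to its replace term (RHS)."""
    kb = {}
    for line in lines:
        line = line.strip()
        if "---" not in line:
            continue  # skip blank or malformed lines
        search, replace = line.split("---", 1)
        kb[normalise(search)] = replace.strip()
    return kb


def find_journals(reference_line, kb):
    """Return the standardised titles whose search term occurs in the line."""
    haystack = " %s " % normalise(reference_line)
    found = []
    # Try longer search terms first so the most specific term is seen first.
    for term in sorted(kb, key=len, reverse=True):
        if " %s " % term in haystack and kb[term] not in found:
            found.append(kb[term])
    return found


kb = load_kb(KB_LINES)
hits = find_journals("A. Author, Phys. Rev. D 1 (2000) 1.", kb)
print(hits)
```

Padding both the search term and the reference line with spaces keeps a term from matching in the middle of a longer word.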

In addition to the above, the report number knowledge base contains numeration patterns which can follow the matched terms. Each numeration pattern applies to a `group' of search terms (denoted ##### name of group ##### in the kb). A pattern is written in a shorthand, similar to a regular expression, which is translated as follows:

       \     -> \\
       9     -> \d                #any single digit
       a     -> [A-Za-z]          #any single letter
       v     -> [Vv]
       mm    -> (0[1-9]|1[0-2])   #month
       yy    -> \d{2}             #two digits
       yyyy  -> [12]\d{3}         #full year
       /     -> \/
       s     -> \s*?              #possible whitespace
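The table above can be turned into a small translator. The following is a minimal sketch, not the routine Refextract itself uses: the function name is hypothetical, longer shorthand tokens are tried first so that `yyyy' is not read as two `yy', and any character not listed in the table (for example `-') is assumed to stand for itself.

```python
import re

# Shorthand tokens from the table, longest first so "yyyy" wins over "yy".
TOKENS = [
    ("yyyy", r"[12]\d{3}"),        # full year
    ("yy", r"\d{2}"),              # two digits
    ("mm", r"(0[1-9]|1[0-2])"),    # month
    ("\\", r"\\"),                 # literal backslash
    ("9", r"\d"),                  # any single digit
    ("a", r"[A-Za-z]"),            # any single letter
    ("v", r"[Vv]"),
    ("/", r"\/"),
    ("s", r"\s*?"),                # possible whitespace
]


def numeration_to_regex(pattern):
    """Translate a numeration shorthand pattern into a regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        for token, regex in TOKENS:
            if pattern.startswith(token, i):
                out.append(regex)
                i += len(token)
                break
        else:
            # Assumption: characters outside the table are matched literally.
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out)


# Hypothetical base string and numeration pattern, for illustration only.
report_re = re.compile(re.escape("EXAMPLE-TH") + numeration_to_regex("s/yy-999"))
print(bool(report_re.fullmatch("EXAMPLE-TH/99-123")))
```

Scanning the pattern once from left to right avoids the problem of sequential string replacement, where the letters inside an already substituted fragment such as `[A-Za-z]` would be translated a second time.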

A feature still in development is the ability for Refextract to automatically identify the authors of a paper, in addition to the authors named inside its references. This builds upon the current work on identifying authors of references. Work is also underway to extract affiliations, enriching the information in the output. A further goal of finding affiliations is to use their properties (position and numeration) to improve the accuracy of author extraction: affiliations and authors share a close relationship inside papers, and leveraging this relationship is a key aim of this feature.

The author-of-paper extraction version of Refextract accepts the following options, in addition to those above:

       --authors       : extract authors from the input documents, paired with any associated affiliations
       --affiliations  : extract affiliations from the input documents