Difference between revisions of "InspireRefExtract"
Jukka.klem (Talk | contribs) (Created page with '=== INSPIRE Reference Extraction === Reference extraction documentation.') |
Revision as of 14:21, 16 June 2011
INSPIRE Reference Extraction
Reference extraction documentation.
The Reference Extraction tool (Refextract) automatically extracts references from PDF or text documents. Refextract can be paired with a domain-specific knowledge base to identify domain-related information inside references (e.g. High Energy Physics journal titles or report numbers). The output is an XML file that conforms to the MARC standard: each reference is returned as a numbered field, and the content of the reference is held in lettered sub-fields.
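As a rough illustration of the output layout described above (one field per reference, lettered sub-fields for its content), the following Python sketch reads such a file back into a list. The sample record is made up for illustration: the tag, indicators, sub-field codes and values are placeholders rather than a statement of what Refextract emits, and real MARCXML may carry an XML namespace that this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample output. Tag, indicators, sub-field codes and values
# are placeholders for illustration only, not Refextract's documented layout.
SAMPLE = """<collection>
  <record>
    <datafield tag="999" ind1="C" ind2="5">
      <subfield code="o">1</subfield>
      <subfield code="s">Example J. 1 (2000) 1</subfield>
    </datafield>
    <datafield tag="999" ind1="C" ind2="5">
      <subfield code="o">2</subfield>
      <subfield code="r">EXAMPLE-REPORT-2000-001</subfield>
    </datafield>
  </record>
</collection>"""


def read_references(xml_text):
    """Return one dict per datafield, keeping sub-fields in document order."""
    root = ET.fromstring(xml_text)
    references = []
    # Assumes un-namespaced element names, as in SAMPLE above.
    for field in root.iter("datafield"):
        subfields = [
            (sf.get("code"), (sf.text or "").strip())
            for sf in field.findall("subfield")
        ]
        references.append({"tag": field.get("tag"), "subfields": subfields})
    return references


if __name__ == "__main__":
    for ref in read_references(SAMPLE):
        print(ref["tag"], ref["subfields"])
```

Sub-fields are kept as an ordered list of pairs rather than a dictionary, because a single reference may repeat the same sub-field letter.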
Refextract can identify DOIs and URLs inside references. It can also identify author names; these are used primarily to detect reference lines that contain several citations, so that each citation can be output separately. Author identification does not depend on a text-file knowledge base, but one can be supplied to include domain-specific `authors' in the extraction process. Author identification relies on a regular expression that attempts to cover a wide variety of author formats found in citations. The process is context sensitive: previously identified domain-specific information (such as a journal title or report number) improves author recognition, as does information obtained from other authors, on the premise that each piece of domain-specific information corresponds to exactly one citation.
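To give a feel for regular-expression-based author matching, here is a deliberately small Python sketch. It is not Refextract's actual expression, which covers far more formats; the pattern and the names `INITIALS`, `SURNAME`, `AUTHOR` and `find_authors` are assumptions made for this example. It recognises only two layouts: initials followed by a surname, and a surname followed by a comma and initials.

```python
import re

# One to three capitalised initials, each followed by a full stop.
INITIALS = r"(?:[A-Z]\.\s?){1,3}"
# A capitalised surname, optionally hyphenated (e.g. "Smith-Jones").
SURNAME = r"[A-Z][a-z]+(?:-[A-Z][a-z]+)?"

# Either "J. R. Ellis" or "Ellis, J. R."; an illustrative simplification.
AUTHOR = re.compile(rf"(?:{INITIALS}\s?{SURNAME}|{SURNAME},\s?{INITIALS})")


def find_authors(text):
    """Return author-like substrings of a citation line, in order."""
    return [m.group(0).strip() for m in AUTHOR.finditer(text)]


if __name__ == "__main__":
    print(find_authors("J. R. Ellis and M. K. Gaillard, Nucl. Phys. B 150 (1979) 141"))
    print(find_authors("Smith, J. and Jones, A. B., Phys. Rev."))
```

Journal abbreviations such as "Nucl. Phys." are not matched here because an initial must be a single capital letter followed directly by a full stop. A production pattern needs many more cases (particles such as "van" or "de", "et al.", non-ASCII names), which is where the context-sensitive checks described above come in.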
Refextract accepts the following command-line options:
--fulltext : PDF or text input file from which to extract references
--xmlfile : absolute path of the output XML file
--dictfile : also output a dictionary of statistical information about the extracted journals
--raw-references : treat the input as pure references (skip the reference section identification step)
--output-raw-refs : output a file holding the extracted references without the XML mark-up
--kb-journal : specify inline the location of the `journal titles' knowledge base
    (A text file with one entry per line, each of which should be recognised inside references.)
--kb-report-number : specify inline the location of the `report number' knowledge base
    (A text file with one entry per line, each of which should be recognised inside references.
    It can be treated as a separate `group' of information to look for in references.
    It differs from the journal knowledge base in that each entry can include a numeric pattern that follows a base string.)
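The options listed above could be assembled into an invocation along the following lines. This is a sketch under stated assumptions: the executable name `refextract`, the helper `build_refextract_args`, and the convention that each path follows its flag as a separate argument are not confirmed by this page, so check them against your installation before running anything.

```python
def build_refextract_args(
    fulltext,
    xmlfile,
    kb_journal=None,
    kb_report_number=None,
    raw_references=False,
    executable="refextract",  # assumed name of the command-line entry point
):
    """Build an argument list from the documented options.

    Assumes each path is passed as a separate argument after its flag.
    """
    args = [executable, "--fulltext", fulltext, "--xmlfile", xmlfile]
    if raw_references:
        # Input already consists of references only.
        args.append("--raw-references")
    if kb_journal:
        args += ["--kb-journal", kb_journal]
    if kb_report_number:
        args += ["--kb-report-number", kb_report_number]
    return args


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(build_refextract_args("/tmp/paper.pdf", "/tmp/refs.xml",
                                kb_journal="/tmp/journals.kb"))
```

The resulting list can be handed to `subprocess.run(args, check=True)`; building a list rather than a shell string avoids quoting problems with file names that contain spaces.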
A feature still in development is the ability for Refextract to identify the authors of a paper itself, in addition to the authors named inside its references. This builds upon the existing work on identifying authors in references. Work is also underway to extract affiliations, which enriches the output. A further goal of finding affiliations is to use their properties (position and numbering) to improve the accuracy of author extraction: affiliations and authors are closely related inside papers, and leveraging this relationship is a key aim of this feature.
For the author-of-paper extraction version of Refextract:
--authors : extract authors from the input documents, paired with any associated affiliations
--affiliations : extract affiliations from the input documents