X-Link
Contents
- 1 Overview
- 2 Key features
- 3 Functionality
- 3.1 Load the GATE component and print the currently supported categories of entities
- 3.2 Create an instance of X-Link, set the accepted (active) categories of entities, and perform entity mining in a Web page
- 3.3 Link the identified entities with semantic resources (i.e. URIs)
- 3.4 Stop the GATE component and print the identified entities, together with the matching URIs
- 3.5 Export the results
- 4 Configuring (programmatically) X-Link
Overview
X-Link is a fully configurable, Linked Data-based Named Entity Extraction (NEE) tool. It allows the user/developer to define the categories of entities that are of interest for the application at hand by exploiting one or more online Semantic Knowledge Bases (Linked Data). The user can also update a category and specify how to semantically link and enrich the identified entities. This configurability allows X-Link to be easily adapted to different contexts and used for building domain-specific applications (e.g. identifying drugs in a medical search system, or annotating and exploring fish species in a marine-related web page).
X-Link is based on GATE ANNIE and supports both gazetteers (lists of names) and natural language processing functions. GATE ANNIE is a ready-made information extraction system which contains several components (e.g. Tokeniser, Gazetteer, Sentence Splitter, Orthographic Coreference, etc.). We have extended GATE ANNIE so that a new category can be created, and an existing one updated (using gazetteers), by exploiting Linked Data.
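To make the gazetteer idea concrete, the following is a minimal sketch of list-based entity lookup in plain Java. It is not the X-Link or GATE implementation: the class `GazetteerSketch` and the method `findEntities` are hypothetical names, and matching is simplified to a case-insensitive, whole-word search.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical illustration of gazetteer-based lookup; not part of X-Link or GATE.
class GazetteerSketch {

    // Returns, per category, the gazetteer names that occur in the text
    // (case-insensitive, whole words only). Categories without hits are omitted.
    static Map<String, Set<String>> findEntities(String text, Map<String, List<String>> gazetteers) {
        Map<String, Set<String>> found = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> category : gazetteers.entrySet()) {
            Set<String> hits = new TreeSet<>();
            for (String name : category.getValue()) {
                Pattern p = Pattern.compile("\\b" + Pattern.quote(name) + "\\b",
                        Pattern.CASE_INSENSITIVE);
                if (p.matcher(text).find()) {
                    hits.add(name);
                }
            }
            if (!hits.isEmpty()) {
                found.put(category.getKey(), hits);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, List<String>> gazetteers = new LinkedHashMap<>();
        gazetteers.put("species", List.of("yellowfin tuna", "swordfish"));
        gazetteers.put("islands", List.of("Lesvos", "Chios"));
        String text = "The Yellowfin tuna is caught near Chios.";
        System.out.println(findEntities(text, gazetteers));
    }
}
```

A production gazetteer would typically work on tokenised text and handle overlapping or ambiguous names; the sketch only shows the basic idea of mapping name lists to categories.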
Currently, X-Link exports the results in XML and CSV.
Key features
Currently X-Link supports the analysis of plain text files, HTML pages, Microsoft Word and PowerPoint files (.doc, .docx, .ppt and .pptx), PDF files, and XML-based files (e.g. XML and RDF files). It first reads the contents of the document and performs a "cleaning" task, i.e. it removes text that is not useful for extraction (e.g. HTML tags in a Web page or meta elements in a Microsoft Word file). Then, it applies NEE to the cleaned contents of the document.
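The cleaning step for a Web page can be pictured with the sketch below. It is an assumption-laden illustration rather than X-Link's actual cleaner: the class `HtmlCleaningSketch` and its `clean` method are hypothetical, and regular expressions are used only to keep the example short (a real implementation would more likely rely on an HTML parser).

```java
// Hypothetical illustration of the "cleaning" step for HTML input; not X-Link code.
class HtmlCleaningSketch {

    // Removes script/style blocks, comments and tags, decodes a few common
    // entities, and collapses whitespace so that only readable text remains.
    static String clean(String html) {
        String text = html
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                .replaceAll("(?s)<!--.*?-->", " ")
                .replaceAll("(?s)<[^>]+>", " ")
                .replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&"); // decoded last to avoid double decoding
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                + "<body><p>Yellowfin <b>tuna</b></p></body></html>";
        System.out.println(clean(html));
    }
}
```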
X-Link starts by reading an initial configuration which is stored in a properties file. It also implements functions that allow the user/developer to configure the system, e.g. through an administrator API. Specifically, the following functions are currently supported:
- Add a new category, using one or more lists of entities, one or more resource classes, or one or more SPARQL queries.
- Update an existing category, using one or more lists of entities, one or more resource classes, or one or more SPARQL queries. The user can either replace the category completely (i.e. remove the old entities and add the new ones) or just add the new entities to it.
- Remove an existing category.
- Change the displayed name of an existing category (i.e. rename it).
- Set/change the underlying Knowledge Bases.
- Set/change how to query the underlying Knowledge Bases for linking the identified entities.
- Set/change how to query the underlying Knowledge Bases for enriching the identified entities.
- Set/change the active categories, i.e. the categories for which X-Link identifies entities.
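The category-related operations in the list above can be modelled with a small in-memory registry. The sketch below is only meant to clarify the semantics (in particular "replace" versus "add" on update); `CategoryRegistrySketch` and all of its methods are hypothetical and do not reflect X-Link's internal data structures.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory model of category management; not X-Link code.
class CategoryRegistrySketch {
    private final Map<String, Set<String>> categories = new TreeMap<>();
    private final Set<String> active = new TreeSet<>();

    void add(String category, Collection<String> names) {
        if (categories.containsKey(category)) {
            throw new IllegalArgumentException("Category already exists: " + category);
        }
        categories.put(category, new TreeSet<>(names));
    }

    // replace == true drops the old entities first; otherwise the new ones are appended.
    void update(String category, Collection<String> names, boolean replace) {
        Set<String> existing = require(category);
        if (replace) {
            existing.clear();
        }
        existing.addAll(names);
    }

    void remove(String category) {
        categories.remove(category);
        active.remove(category);
    }

    void rename(String oldName, String newName) {
        if (categories.containsKey(newName)) {
            throw new IllegalArgumentException("Category already exists: " + newName);
        }
        Set<String> names = require(oldName);
        categories.remove(oldName);
        categories.put(newName, names);
        if (active.remove(oldName)) {
            active.add(newName);
        }
    }

    // Only known categories can become active; unknown names are ignored.
    void setActive(Collection<String> wanted) {
        active.clear();
        for (String category : wanted) {
            if (categories.containsKey(category)) {
                active.add(category);
            }
        }
    }

    Set<String> entitiesOf(String category) {
        return Collections.unmodifiableSet(
                categories.getOrDefault(category, Collections.emptySet()));
    }

    Set<String> activeCategories() {
        return Collections.unmodifiableSet(active);
    }

    private Set<String> require(String category) {
        Set<String> names = categories.get(category);
        if (names == null) {
            throw new IllegalArgumentException("Unknown category: " + category);
        }
        return names;
    }

    public static void main(String[] args) {
        CategoryRegistrySketch registry = new CategoryRegistrySketch();
        registry.add("Greek Islands", List.of("Lesvos", "Chios"));
        registry.update("Greek Islands", List.of("Crete"), false); // append
        registry.setActive(List.of("Greek Islands"));
        System.out.println(registry.entitiesOf("Greek Islands"));
        System.out.println(registry.activeCategories());
    }
}
```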
Functionality
Load the GATE component and print the currently supported categories of entities
EntityMiningComponent emc = new GateEntityMiningComponent("C:/XLinkGateComponent");
emc.startup();
emc.printAvailableCategories();
Create an instance of X-Link, set the accepted (active) categories of entities, and perform entity mining in a Web page
XLink xlink = new XLink();
xlink.setEntityMiningComponent(emc);
HashSet<String> acceptedCategoryNames = new HashSet<String>();
acceptedCategoryNames.add("species");
TextExtractor extractor = new WebPageTextExtractor("http://en.wikipedia.org/wiki/Yellowfin_tuna");
xlink.retrieveEntities(extractor, acceptedCategoryNames);
Link the identified entities with semantic resources (i.e. URIs)
xlink.matchEntities();
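How the linking step queries a Knowledge Base is configurable in X-Link, and the actual query it uses is not shown here. As a rough illustration only, the sketch below fills a SPARQL template with an entity name to look up candidate URIs by label. The class `LinkingQuerySketch`, the `[ENTITY]` placeholder and the template itself are assumptions made for this example.

```java
// Hypothetical illustration of building a linking query; not X-Link code.
class LinkingQuerySketch {

    // Assumed template: finds resources whose label equals the entity name,
    // ignoring case. "[ENTITY]" is a placeholder invented for this sketch.
    static final String TEMPLATE =
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
          + "SELECT DISTINCT ?uri WHERE {\n"
          + "  ?uri rdfs:label ?label .\n"
          + "  FILTER(lcase(str(?label)) = lcase(\"[ENTITY]\"))\n"
          + "} LIMIT 10";

    // Escapes backslashes and double quotes so the name is a valid SPARQL string literal.
    static String escapeLiteral(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    static String buildQuery(String entityName) {
        return TEMPLATE.replace("[ENTITY]", escapeLiteral(entityName));
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("Yellowfin tuna"));
    }
}
```

Escaping the entity name matters because names extracted from arbitrary documents may contain quotes, which would otherwise break the query.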
Stop the GATE component and print the identified entities, together with the matching URIs
emc.shutdown();
// Get the detected entities (together with all their information).
ArrayList<Entity> entities = xlink.getEntities();
System.out.println("# Detected entities: ");
for (Entity entity : entities) {
    // Print the main characteristics of each detected entity.
    System.out.println("Entity name: " + entity.getName());
    System.out.println("Category: " + entity.getCategoryName());
    System.out.println("Matching URIs: " + entity.getMatchingURIs());
    System.out.println("-----");
}
Export the results
ResultExporter exp1 = new XMLExporter("C:/x-link/results/results.xml", entities);
exp1.exportResults();
ResultExporter exp2 = new TXTExporter("C:/x-link/results/results.txt", entities);
exp2.exportResults();
ResultExporter exp3 = new CSVExporter("C:/x-link/results/results.csv", entities);
exp3.exportResults();
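The layout of the exported files is not described on this page. Purely as an illustration of what a CSV export has to take care of, the sketch below writes one row per entity and quotes fields that contain commas, quotes or line breaks. The class `CsvExportSketch` and the column order (name, category, space-separated URIs) are assumptions, not the format produced by `CSVExporter`.

```java
import java.util.List;

// Hypothetical illustration of CSV row generation; not the format of X-Link's CSVExporter.
class CsvExportSketch {

    // Quotes a field if it contains a comma, a double quote or a line break,
    // doubling any embedded double quotes.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Assumed column order: entity name, category, matching URIs separated by spaces.
    static String toCsvLine(String name, String category, List<String> uris) {
        return escape(name) + "," + escape(category) + "," + escape(String.join(" ", uris));
    }

    public static void main(String[] args) {
        System.out.println("name,category,uris");
        System.out.println(toCsvLine("Yellowfin tuna", "species",
                List.of("http://dbpedia.org/resource/Yellowfin_tuna")));
    }
}
```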
Configuring (programmatically) X-Link
Adding a new category of entities
Giving a list of entity names
Category categoryToAdd = new Category("North Aegean Greek Islands");
TreeSet<String> names = new TreeSet<String>();
names.add("Lesvos");
names.add("Chios");
names.add("Samos");
names.add("Limnos");
names.add("Ikaria");
names.add("Samothraki");
names.add("Agios Eustratios");
names.add("Psara");
names.add("Fournoi");
names.add("Oinouses");
names.add("Thymaina");
names.add("Antipsara");
names.add("Pasas");
names.add("Agios Minas");
names.add("Samiopoula");
categoryToAdd.setNamedEntities(names);
emc.addNewCategory(categoryToAdd);
Giving a resource class
Category categoryToAdd = new Category("Fish Species");
categoryToAdd.setResourceClass("http://dbpedia.org/ontology/Fish");
categoryToAdd.setEndpoint("http://dbpedia.org/sparql");
categoryToAdd.retrieveNamedEntitiesByResourceClass();
emc.addNewCategory(categoryToAdd);
Giving a SPARQL query
Category categoryToAdd = new Category("Fish Species");
categoryToAdd.setEndpoint("http://dbpedia.org/sparql");
categoryToAdd.setSparqlQueryFilepathOfEntities("C:/sparql/getFishSpeciesQuery.sparql");
categoryToAdd.retrieveNamedEntitiesByQuery();
emc.addNewCategory(categoryToAdd);
or
Category categoryToAdd = new Category("Fish Species");
categoryToAdd.setEndpoint("http://dbpedia.org/sparql");
categoryToAdd.setSparqlQuery("select distinct str(?name) where { ?uri a <http://dbpedia.org/ontology/Fish> . ?uri rdfs:label ?name }");
categoryToAdd.retrieveNamedEntitiesByQuery();
emc.addNewCategory(categoryToAdd);
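The inline query above projects `str(?name)` directly and uses the `rdfs:` prefix without declaring it. Some endpoints tolerate this (the DBpedia endpoint, for instance, predefines common prefixes), but the SPARQL 1.1 grammar expects `(expression AS ?var)` in the SELECT clause and an explicit PREFIX declaration. The sketch below builds a more portable version of the same query and shows how a query is usually encoded for an HTTP GET request to an endpoint. `EntityQuerySketch`, `buildLabelQuery` and `toRequestUrl` are hypothetical helper names, and whether X-Link accepts the `AS ?label` form has not been verified here.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for building entity-name queries; not part of X-Link.
class EntityQuerySketch {

    // Builds a query returning the labels of all instances of the given class.
    // If lang is non-null, only labels in that language are returned.
    static String buildLabelQuery(String classUri, String lang) {
        StringBuilder q = new StringBuilder();
        q.append("PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n");
        q.append("SELECT DISTINCT (STR(?name) AS ?label) WHERE {\n");
        q.append("  ?uri a <").append(classUri).append("> .\n");
        q.append("  ?uri rdfs:label ?name .\n");
        if (lang != null) {
            q.append("  FILTER(LANG(?name) = \"").append(lang).append("\")\n");
        }
        q.append("}");
        return q.toString();
    }

    // Encodes the query as the "query" parameter of a GET request to the endpoint.
    static String toRequestUrl(String endpoint, String query) {
        return endpoint + "?query=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String query = buildLabelQuery("http://dbpedia.org/ontology/Fish", "en");
        System.out.println(query);
        System.out.println(toRequestUrl("http://dbpedia.org/sparql", query));
    }
}
```

The resulting query string could be passed to `setSparqlQuery` in place of the inline literal, assuming the library forwards the query to the endpoint unchanged.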
Updating a category of entities
Giving a list of entity names
Set<String> newEntities = new TreeSet<String>();
newEntities.add("Patmos");
newEntities.add("Crete");
newEntities.add("Karpathos");
emc.updateCategoryBySet("Greek Islands", newEntities);
Giving a resource class
String endpoint = "http://lod.openlinksw.com/sparql";
String resourceClass = "http://umbel.org/umbel/rc/Fish";
emc.updateCategoryByResourceClass("Fish Species", endpoint, resourceClass);
or (in case we have already specified the SPARQL endpoint for this category):
String resourceClass = "http://umbel.org/umbel/rc/Fish";
emc.updateCategoryByResourceClass("Fish Species", resourceClass);
or (in case we have already specified the SPARQL endpoint and a resource class for this category):
emc.updateCategoryByResourceClass("Fish Species");
Giving a SPARQL query
String endpoint = "http://dbpedia.org/sparql";
String sparqlQuery = "select distinct str(?name) where { ?uri a <http://dbpedia.org/ontology/Fish> . ?uri rdfs:label ?name FILTER(lang(?name)='es') }";
emc.updateCategoryByQuery("water areas", endpoint, sparqlQuery);
or (in case we have already specified the SPARQL endpoint for this category):
String sparqlQuery = "select distinct str(?name) where { ?uri a <http://dbpedia.org/ontology/Fish> . ?uri rdfs:label ?name FILTER(lang(?name)='es') }";
emc.updateCategoryByQuery("water areas", sparqlQuery);
or (in case we have already specified the SPARQL endpoint and a SPARQL query for this category):
emc.updateCategoryByQuery("water areas");