Content Source Description


THIS SECTION OF GCUBE DOCUMENTATION IS OBSOLETE. THE NEW VERSION IS UNDER CONSTRUCTION.

Introduction

The Content Source Description (CSD) service is a digital library service that supports the execution of content-based queries against a number of content sources (such as collections) that are associated with DILIGENT indices.

Implementation Overview

Among the many possible ways of implementing a content source description service, the reference CSD service represents text sources as term histograms. A histogram contains the most representative words and phrases of a content source (i.e. a content collection) together with statistical information about them. To obtain these statistics, the reference CSD service interacts with the index services: it derives statistical information from the full-text DILIGENT indices of internal sources and subscribes for notifications should these indices change (notifications will be available in the beta release of the project).
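
To make the idea of a term histogram concrete, the following is a minimal, self-contained sketch and not the service's implementation. It assumes that "document frequency" means the number of documents containing a term and "source frequency" means the total number of occurrences in the whole source; it counts every word rather than selecting the most representative ones, and it reads documents from plain strings instead of a DILIGENT index. The class and field names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not part of the CSD service API.
class TermHistogramSketch {

  /** Statistics kept per term (meanings are assumptions, see lead-in). */
  static final class TermStats {
    int docFrequency;     // number of documents that contain the term
    int sourceFrequency;  // total occurrences across the whole source
  }

  /** Builds a histogram from documents given as plain strings. */
  static Map<String, TermStats> build(List<String> documents) {
    Map<String, TermStats> histogram = new LinkedHashMap<String, TermStats>();
    for (String document : documents) {
      Set<String> seenInThisDocument = new HashSet<String>();
      for (String token : document.toLowerCase().split("\\W+")) {
        if (token.isEmpty()) {
          continue; // split can yield an empty leading token
        }
        TermStats stats = histogram.get(token);
        if (stats == null) {
          stats = new TermStats();
          histogram.put(token, stats);
        }
        stats.sourceFrequency++;
        // count each document at most once per term
        if (seenInThisDocument.add(token)) {
          stats.docFrequency++;
        }
      }
    }
    return histogram;
  }

  public static void main(String[] args) {
    List<String> docs = Arrays.asList(
        "edizione prima, edizione seconda",
        "document about the edizione");
    for (Map.Entry<String, TermStats> e : build(docs).entrySet()) {
      System.out.println("term=" + e.getKey()
          + " (document frequency=" + e.getValue().docFrequency
          + ", source frequency=" + e.getValue().sourceFrequency + ")");
    }
  }
}
```

In this toy source, "edizione" appears in both documents and three times overall, so its document frequency is 2 and its source frequency is 3.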

The CSD service operates on a number of underlying component packages that provide a coarse-grained division of functionality:

  • Core: This package groups components responsible for generating and exposing content source descriptions.
  • Handlers: This package contains a range of handlers that specify the atomic and possibly stateful processes used within the Content Source Description service, primarily during initialisation and update. This includes the atomic tasks of generating descriptions and publishing them after their generation.
  • Notification: This package groups components that are responsible for monitoring external changes which are relevant to content source descriptions and for reflecting those changes onto the related descriptions in accordance with their update policies.

Dependencies

  • Java JDK 1.5
  • WS-Core
  • DiligentProvider
  • KXML (version 2.3.0)
  • Contentmanagement
  • DIRCommons library
  • Indexservice Generatorservice
  • Indexservice Lookupservice
  • DISHL client
  • DISIP


Usage Example

Creating Descriptor

// necessary imports
import org.diligentproject.CSDservice.impl.core.stubs.DescriptorPortType;
import org.diligentproject.CSDservice.impl.core.stubs.EPR;
import org.diligentproject.CSDservice.impl.core.stubs.DParams;
// assumption: the factory port type lives in the same stubs package as DescriptorPortType
import org.diligentproject.CSDservice.impl.core.stubs.DescriptionFactoryPortType;
import org.diligentproject.CSDservice.impl.core.stubs.service.DescriptorServiceAddressingLocator;
import org.diligentproject.CSDservice.impl.core.stubs.service.DescriptionFactoryServiceAddressingLocator;
import org.apache.axis.message.addressing.Address;
import org.apache.axis.message.addressing.EndpointReferenceType;

// the host where the CSD service runs
String myhost = "bob";

// the id of the collection from which a description is going to be built
String sourceURI = "ARTE_ArtiDellaMemoria";

// address of the factory service that creates description resources
String factoryURI = "http://" + myhost
  + ":8080/wsrf/services/diligentproject/CSDservice/DescriptionFactory";
EndpointReferenceType endpoint = new EndpointReferenceType();
endpoint.setAddress(new Address(factoryURI));
DescriptionFactoryServiceAddressingLocator factoryLocator =
  new DescriptionFactoryServiceAddressingLocator();
DescriptorServiceAddressingLocator descriptorLocator =
  new DescriptorServiceAddressingLocator();
// type name inferred from the locator method name
DescriptionFactoryPortType factory =
  factoryLocator.getDescriptionFactoryPortTypePort(endpoint);

DParams params = new DParams();
params.setSourceURI(sourceURI);
EPR eprWrapper = factory.createResource(params);
EndpointReferenceType epr = eprWrapper.getEndpointReference();
DescriptorPortType descriptor = descriptorLocator.getDescriptorPortTypePort(epr);

Reading Terms and Term Statistics from Description (Histogram)

// necessary imports
import java.text.DateFormat;
import java.util.Calendar;
import org.diligentproject.CSDservice.impl.core.stubs.DescriptorPortType;
import org.diligentproject.CSDservice.impl.core.stubs.StringArray;
import org.diligentproject.CSDservice.impl.core.stubs.VoidType;
import org.diligentproject.CSDservice.impl.histograms.stubs.HistogramTerm;
import org.diligentproject.CSDservice.impl.histograms.stubs.HistogramTermArray;

DescriptorPortType descriptor = descriptorLocator.getDescriptorPortTypePort(epr);
String descriptionType = descriptor.getDescriptionType(new VoidType());
String persistentID = descriptor.getPersistentID(new VoidType());
String descriptionLocalURI = descriptor.getLocalURI(new VoidType());
String sourceURI = descriptor.getSourceURI(new VoidType());
Calendar lastModificationDate = descriptor.getLastModificationDate(new VoidType());
int numberOfDocuments = descriptor.getNumberOfDocuments(new VoidType());
		
StringArray request = new StringArray();
String[] names={"edizione", "document", "doesnotexist"};
request.setItems(names);
HistogramTermArray response = descriptor.getTerms(request);
HistogramTerm[] terms = response.getItems();
		
System.out.println("Description type:" + descriptionType);
System.out.println("Source URI:" + sourceURI);
System.out.println("NumberOfDocuments:" + numberOfDocuments);
System.out.println("Persistent ID:" + persistentID);
System.out.println("Local URI:" + descriptionLocalURI);
System.out.println("Last modification date:" + DateFormat.getDateInstance()
  .format(lastModificationDate.getTime()));

// print terms and statistical information		
if (terms != null) {
  System.out.println("Terms:");
  for (HistogramTerm term : terms){
    System.out.println("   term=" + term.getName() +
      " (document frequency=" + term.getDocFrequency() +
      ", source frequency=" + term.getSourceFrequency() + ")");
  }
}

Accessing Resource Properties

 // necessary imports
 import org.apache.axis.message.addressing.EndpointReferenceType;
 import java.util.ArrayList;
 import java.util.List;
 import javax.xml.namespace.QName;
 import org.diligentproject.CSDservice.impl.core.stubs.DescriptorPortType;
 import org.oasis.wsrf.properties.GetMultipleResourceProperties_Element;
 import org.oasis.wsrf.properties.GetMultipleResourcePropertiesResponse;
 
 // use the "Creating Descriptor" code to obtain EPR
 EndpointReferenceType epr = ...
 
 // list of valid resource properties
 String[] properties= {"DescriptionType","PersistentID", "LocalURI", "SourceURI",
   "NumberOfDocuments","LastModificationDate"};
 
 String NS = "http://diligentproject.org/namespaces/CSDservice";
 List<QName> RP_TYPES = new ArrayList<QName>();
 for (String property:properties) {
   RP_TYPES.add(new QName(NS,property));
 }
   
 DescriptorPortType d = descriptorLocator.getDescriptorPortTypePort(epr);	
 GetMultipleResourceProperties_Element element = 
   new GetMultipleResourceProperties_Element();
 element.setResourceProperty(RP_TYPES.toArray(new QName[0]));
 GetMultipleResourcePropertiesResponse response = d.getMultipleResourceProperties(element);
 
 // output
 for (int i=0; i<response.get_any().length;i++) {
   System.out.println(response.get_any()[i].getLocalName() + ":" + 
     response.get_any()[i].getValue());
 }

Service Configuration

The Content Source Description (CSD) service can be configured to a limited extent through the Java Naming and Directory Interface (JNDI). After the service is deployed, the JNDI configuration file is located at $GLOBUS_LOCATION/etc/org_diligentproject_RefCSDservice/jndi-config.xml. The following entities can be configured for CSD:

Startup Sources

It is possible to define a list of content sources whose descriptions are built when CSD starts up. This is accomplished through the 'startupSources' environment in the DescriptionFactory service (see example):

 <environment 
   name="startupSources" 
   value="ARTE_Medita
     ARTE_Catalogo" 
   type="java.lang.String"
   override="false" />

This means that whenever the CSD service is started, it tries to create descriptions for the collections in this list (in the example, ARTE_Medita and ARTE_Catalogo). The 'override' parameter controls the update mechanism: when it is set to true, an existing content description is updated; when it is set to false, the existing description is reused.
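
The effect of 'override' on the startup list can be sketched as follows. This is only an illustration of the behaviour described above, not the service's code: it assumes that the 'startupSources' value is a whitespace-separated list of collection ids, and the class, the method names and the set of already described sources are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not part of the CSD service API.
class StartupSourcesSketch {

  /** Splits the configured value on whitespace (assumed separator). */
  static List<String> parseSources(String value) {
    List<String> ids = new ArrayList<String>();
    for (String id : value.trim().split("\\s+")) {
      if (!id.isEmpty()) {
        ids.add(id);
      }
    }
    return ids;
  }

  /**
   * Returns the ids whose descriptions would be (re)built at startup:
   * all of them when override is true, otherwise only those that have
   * no existing description yet.
   */
  static List<String> sourcesToBuild(String value, boolean override,
                                     Set<String> alreadyDescribed) {
    List<String> toBuild = new ArrayList<String>();
    for (String id : parseSources(value)) {
      if (override || !alreadyDescribed.contains(id)) {
        toBuild.add(id);
      }
    }
    return toBuild;
  }

  public static void main(String[] args) {
    String value = "ARTE_Medita\n     ARTE_Catalogo";
    Set<String> existing = new HashSet<String>(Arrays.asList("ARTE_Medita"));
    System.out.println("override=false -> " + sourcesToBuild(value, false, existing));
    System.out.println("override=true  -> " + sourcesToBuild(value, true, existing));
  }
}
```

With a description for ARTE_Medita already present, only ARTE_Catalogo is built when override is false, while both are built when it is true.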


Handler Configuration

Although the service is preconfigured with a set of handlers that provide its most common and stable behaviour, it can be adapted by changing the list of handlers in the 'initHandler' environment (see example):

 <environment 
   name="initHandler" 
   value="org.diligentproject.CSDservice.impl.handlers.DefaultGenerator 
     org.diligentproject.CSDservice.impl.handlers.LocalPublisher
     org.diligentproject.CSDservice.impl.handlers.RemotePublisher" 
   type="java.lang.String"
   override="false" />

The following handlers are delivered with the service:

  • DefaultGenerator: Retrieves term statistics from the content source (based on the content source id) and builds a description.
  • LocalGenerator: Same as DefaultGenerator, but retrieves the content source from the local file system instead. Expects the sources in $GLOBUS_LOCATION/etc/org_diligentproject_RefCSDservice/.
  • LocalPublisher: Publishes an XML serialisation of the description in the local file system.
  • RemotePublisher: Publishes an XML serialisation of the description on the DILIGENT platform.
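
One way to picture how a configured handler list might be processed is sketched below. It is not the service's code: it assumes that the 'initHandler' value is a whitespace-separated list of class names, that each handler has a public no-argument constructor, and that handlers run in the order listed. The Handler interface and the two toy handlers are hypothetical stand-ins for the real generator and publisher classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of the CSD service API.
class HandlerChainSketch {

  /** Hypothetical handler contract; the real interface is not shown on this page. */
  public interface Handler {
    void handle(List<String> log);
  }

  /** Stand-in for a generator handler. */
  public static class Generator implements Handler {
    public void handle(List<String> log) { log.add("generated"); }
  }

  /** Stand-in for a publisher handler. */
  public static class Publisher implements Handler {
    public void handle(List<String> log) { log.add("published"); }
  }

  /** Instantiates every class named in the whitespace-separated value. */
  static List<Handler> load(String value) {
    List<Handler> handlers = new ArrayList<Handler>();
    for (String className : value.trim().split("\\s+")) {
      if (className.isEmpty()) {
        continue;
      }
      try {
        Object instance =
            Class.forName(className).getDeclaredConstructor().newInstance();
        handlers.add((Handler) instance);
      } catch (ReflectiveOperationException e) {
        throw new IllegalStateException("Cannot load handler " + className, e);
      }
    }
    return handlers;
  }

  /** Runs the handlers in configuration order and returns what they did. */
  static List<String> run(String value) {
    List<String> log = new ArrayList<String>();
    for (Handler handler : load(value)) {
      handler.handle(log);
    }
    return log;
  }

  public static void main(String[] args) {
    String value = Generator.class.getName() + "\n     " + Publisher.class.getName();
    System.out.println(run(value)); // prints "[generated, published]"
  }
}
```

Under these assumptions, listing a generator before the publishers, as in the example configuration, means a description is generated before it is published.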