Archive Import Service

From Gcube Wiki
Revision as of 14:20, 1 April 2009 by Diego.milano (Talk | contribs)

Aim and Scope of Component

The Archive Import Service (AIS) is dedicated to the "batch" import of resources that are external to the gCube infrastructure into the infrastructure itself. "Importing" here means describing such resources and their logical relationships (e.g. the association between an image and a file containing metadata about it) inside the Information Organization stack of services. While the AIS is not strictly necessary for creating and managing collections of resources in gCube, in practice it makes the creation of large collections, and their automated maintenance, feasible.


Logical Architecture

The task of importing a set of external resources is performed in two steps. First, a description of the resources to import is built in the form of a graph whose nodes and arcs are labelled with key-value pairs, called a graph of resources (GoR). Then this description is processed, carrying out all the steps needed to import the resources it represents. The GoR is based on a custom data model, similar to the Information Object Model, and is built by following a procedural description expressed in a scripting language called the Archive Import Service Language (AISL). Full details about this language and how to write an import script are given in the Administrator's guide.

The actual import task is performed by a chain of pluggable software modules, called "importers", which process the graph in turn and can manipulate it, annotating it with additional knowledge produced during the import. Each importer is dedicated to importing resources by interacting with a specific part of the Information Organization set of services. For instance, the metadata importer is responsible for the import of metadata, which it handles by interacting with the Metadata Management Service. The precise way in which importers handle the import task, and in particular how they define the description of the resources they need to consume inside a GoR, is left to the individual importers. This, together with the pluggable nature of importers, makes it possible to support the import of new kinds of content resources that might be defined in the infrastructure in the future.

The logical architecture of the AIS can then be depicted as follows:

[Figure: logical architecture of the AIS, showing the AISL Parser (AISL Execution), the Graph of Resources Builder, and the Import Engine with its importers (importer 1, importer 2, importer 3, ...)]
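The chain of importers described above can be sketched in Java as follows. This is only an illustration under stated assumptions, not the AIS code: the names GraphNode, Importer, ImportEngine and ImportChainDemo, the property keys "oid" and "describes", and the generated identifiers are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a GoR node: a typed node labelled with key-value pairs.
final class GraphNode {
    final String type;
    final Map<String, String> properties = new LinkedHashMap<>();

    GraphNode(String type) {
        this.type = type;
    }
}

// Hypothetical importer contract: each importer processes the graph in turn
// and may annotate it with knowledge produced during the import.
interface Importer {
    void process(List<GraphNode> graph);
}

// Hypothetical engine that runs the importers as a chain, in registration order.
final class ImportEngine {
    private final List<Importer> chain = new ArrayList<>();

    ImportEngine register(Importer importer) {
        chain.add(importer);
        return this;
    }

    void run(List<GraphNode> graph) {
        for (Importer importer : chain) {
            importer.process(graph);
        }
    }
}

final class ImportChainDemo {
    // Builds a two-node graph and runs a content importer followed by a metadata importer.
    static List<GraphNode> runDemo() {
        List<GraphNode> graph = new ArrayList<>();
        graph.add(new GraphNode("content"));
        graph.add(new GraphNode("metadata"));

        // The first importer records a made-up OID on every content node.
        Importer contentImporter = g -> {
            int next = 1;
            for (GraphNode n : g) {
                if (n.type.equals("content")) {
                    n.properties.put("oid", "oid-" + next++);
                }
            }
        };
        // The second importer reads the annotation left by the first one.
        Importer metadataImporter = g -> {
            String target = null;
            for (GraphNode n : g) {
                if (n.type.equals("content")) {
                    target = n.properties.get("oid");
                }
            }
            for (GraphNode n : g) {
                if (n.type.equals("metadata")) {
                    n.properties.put("describes", target);
                }
            }
        };

        new ImportEngine().register(contentImporter).register(metadataImporter).run(graph);
        return graph;
    }

    public static void main(String[] args) {
        for (GraphNode n : runDemo()) {
            System.out.println(n.type + " " + n.properties);
        }
    }
}
```

The order of registration matters in this sketch: the metadata importer relies on the annotation written by the content importer, which mirrors the idea that later importers in the chain can use knowledge produced by earlier ones.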


Import state and incremental import. During the import of some resources, the corresponding GoR is kept updated with information about the resources actually created, such as their OIDs. The graph of resources is stored persistently by the service, so that a subsequent execution of the same import script is aware of the status of the import and can perform only the differential operations needed to keep the resources up to date. While this solution involves a partial duplication of information inside the infrastructure, it was chosen because it completely decouples the AIS from other gCube services, which are thus not forced to offer


Implementation

Note: this section describes the architecture of the AIS as a stateful gCube service. However, the component is currently released as a standalone tool. Please see the section Current Limitations and Known Issues for further details.

The AIS is a stateful gCube service that follows the factory approach. A stateless factory porttype, the ArchiveImportService porttype, allows the creation of stateful instances of the ArchiveImportTask porttype. Each import task is responsible for the execution of a single import script. It performs the import, internally maintains the status of the import (in the form of an annotated graph of resources), and provides notifications about the status of the task. The resources it is responsible for are kept up to date by re-executing the import script at suitable time intervals.
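The factory approach can be illustrated with a plain Java sketch. It does not use the gCube or WSRF APIs: the classes ImportTaskSketch and ImportTaskFactorySketch, the Status values, and the use of a ScheduledExecutorService for periodic re-execution are assumptions made only for this example.

```java
import java.util.UUID;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a stateful import task bound to a single import script.
final class ImportTaskSketch implements Runnable {
    enum Status { CREATED, RUNNING, COMPLETED }

    final String taskId;
    final String scriptName;
    private Status status = Status.CREATED;
    private int executions = 0;

    ImportTaskSketch(String taskId, String scriptName) {
        this.taskId = taskId;
        this.scriptName = scriptName;
    }

    // One execution of the script. A real task would parse the script,
    // build the graph of resources and run the importers here.
    @Override
    public synchronized void run() {
        status = Status.RUNNING;
        executions++; // placeholder for the actual import work
        status = Status.COMPLETED;
    }

    synchronized Status getStatus() {
        return status;
    }

    synchronized int getExecutions() {
        return executions;
    }

    // Re-executes the script at a fixed interval to keep the imported resources up to date.
    ScheduledFuture<?> scheduleEvery(ScheduledExecutorService scheduler, long period, TimeUnit unit) {
        return scheduler.scheduleAtFixedRate(this, 0, period, unit);
    }
}

// Hypothetical stand-in for the stateless factory: it keeps no state and only creates tasks.
final class ImportTaskFactorySketch {
    ImportTaskSketch createTask(String scriptName) {
        return new ImportTaskSketch(UUID.randomUUID().toString(), scriptName);
    }
}
```

In this sketch all state (status, number of executions) lives in the task, while the factory holds none, which is the separation the factory approach is meant to give.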


Besides acting as a factory for import tasks, the ArchiveImportService porttype also offers functionality for managing import scripts. Import scripts are generic resources inside the infrastructure. The porttype makes it possible to publish new scripts, list and edit existing scripts, and validate them syntactically.


Components and Libraries

Extensibility Features

The functionality of the Archive Import Service can be extended in three main ways: by defining new functions for the AISL, by plugging in software modules that interact with external resources through additional network protocols, and by plugging in software modules that interact with new gCube content-related components.

Defining Functions for the AISL

AISL is a scripting language intended for creating graphs of resources for subsequent import. Its type system and grammar, together with usage examples, are fully described in the Administrator's guide. Its main features are a tight integration with the AIS, in the sense that the creation of model objects is a first-class feature of the language, and the ability to handle tasks that are frequent during import, such as accessing files at remote locations, as transparently as possible for the user. Besides avoiding the complexity of full-fledged programming languages like Java, its limited expressivity, tailored to import tasks only, prevents security issues related to the execution of code on remote systems. The language does not allow new functions or object types to be defined inside a program, but it can easily be extended with new functionality by defining new functions as plugin modules. Adding a new function takes two steps:

  1. Creating a new Java class implementing the AISLFunction interface. This is most easily done by subclassing the AbstractAISLFunction class. See below for further details.
  2. Registering the function in the "Functions" class. This step will be removed in later releases, which will implement automatic plugin-like registration.

The AISLFunction interface provides a way to specify a number of signatures (number and types of arguments, and return type) for an AISL function. It is designed to allow for overloaded functions. The number and types of the parameters are used to perform static checks on each invocation of the function. The evaluate method contains the main code to be evaluated when the function is invoked in an AISL script. In the case of overloaded functions, this method should redirect to the appropriate methods based on the number and types of the arguments.

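public interface AISLFunction {
	// Name under which the function is invoked in AISL scripts.
	public String getName();

	// Signatures (argument types and return type) supported by the function;
	// more than one definition means the function is overloaded.
	public void setFunctionDefinitions(FunctionDefinition... defs);
	public FunctionDefinition[] getFunctionDefinitions();

	// Main code executed when the function is invoked in an AISL script.
	public Object evaluate(Object[] args) throws Exception;

	// A single signature: number and types of the arguments, plus the return type.
	public interface FunctionDefinition {
		Class<?>[] getArgTypes();
		Class<?> getReturnType();
	}
}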
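How declared signatures might drive a static check on an invocation can be sketched as follows. This is not the check implemented by the AIS, whose details are not given here. The FunctionDefinition interface below mirrors the nested interface of AISLFunction, while SimpleDefinition and SignatureChecker are hypothetical helpers introduced only for this example.

```java
// Mirrors the nested FunctionDefinition interface of AISLFunction.
interface FunctionDefinition {
    Class<?>[] getArgTypes();
    Class<?> getReturnType();
}

// Hypothetical immutable implementation used only by this sketch.
final class SimpleDefinition implements FunctionDefinition {
    private final Class<?> returnType;
    private final Class<?>[] argTypes;

    SimpleDefinition(Class<?> returnType, Class<?>... argTypes) {
        this.returnType = returnType;
        this.argTypes = argTypes.clone();
    }

    public Class<?>[] getArgTypes() {
        return argTypes.clone();
    }

    public Class<?> getReturnType() {
        return returnType;
    }
}

final class SignatureChecker {
    // Returns the first declared signature that accepts the given argument
    // types, or null if the invocation matches none of them.
    static FunctionDefinition resolve(FunctionDefinition[] defs, Class<?>... actual) {
        for (FunctionDefinition def : defs) {
            Class<?>[] expected = def.getArgTypes();
            if (expected.length != actual.length) {
                continue; // wrong number of arguments
            }
            boolean compatible = true;
            for (int i = 0; i < expected.length && compatible; i++) {
                compatible = expected[i].isAssignableFrom(actual[i]);
            }
            if (compatible) {
                return def;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        FunctionDefinition[] defs = {
            new SimpleDefinition(Boolean.class, String.class, String.class)
        };
        FunctionDefinition hit = resolve(defs, String.class, String.class);
        System.out.println(hit == null ? "no match" : "returns " + hit.getReturnType().getSimpleName());
    }
}
```

Because each signature is checked independently, the same routine covers overloaded functions: the caller simply passes all the definitions a function declares.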

A partial implementation of the AISLFunction interface is provided by the AbstractAISLFunction class. A developer can simply extend this class, provide an appropriate constructor, and implement the evaluate method. An example is given below. The function match returns a boolean value indicating whether a string value matches a given regular expression pattern. Its signature is thus:

boolean match(string str, string pattern)

The class Match below is an implementation of this function. In the constructor, the function declares its name and its parameters. The method evaluate(Object[] args), which must be implemented to comply with the AISLFunction interface, casts the parameters and then redirects the evaluation to another evaluate method. (Note that in this case, as the function is not overloaded, there is no actual need for a separate evaluate method; it has been added here for clarity.)

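public class Match extends AbstractAISLFunction {

	public Match() {
		// Declare the function name and its single signature.
		setName("match");
		setFunctionDefinitions(
			new FunctionDefinitionImpl(Boolean.class, String.class, String.class)
		);
	}

	// Entry point required by AISLFunction: casts the arguments and delegates.
	public Object evaluate(Object[] args) throws Exception {
		return evaluate((String) args[0], (String) args[1]);
	}

	// Typed implementation: true if str matches the regular expression pattern.
	private Boolean evaluate(String str, String pattern) {
		return str.matches(pattern);
	}
}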
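For an overloaded function, the generic evaluate method has to choose among several typed implementations. The sketch below shows one way to do this by dispatching on the number of arguments. The join function is invented for this example and is not part of the AISL. AbstractAISLFunctionStub is a minimal stand-in so that the code compiles on its own: it is not the real AbstractAISLFunction class, and it leaves out the declaration of signatures.

```java
// Minimal stand-in for AbstractAISLFunction (hypothetical, for this sketch only).
abstract class AbstractAISLFunctionStub {
    private String name;

    protected void setName(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    public abstract Object evaluate(Object[] args) throws Exception;
}

// Hypothetical overloaded function:
//   string join(string a, string b)
//   string join(string a, string b, string separator)
final class Join extends AbstractAISLFunctionStub {

    Join() {
        setName("join");
    }

    // Generic entry point: redirects to the typed method that fits the
    // number of arguments. The override declares no checked exception,
    // which Java allows when narrowing an inherited throws clause.
    @Override
    public Object evaluate(Object[] args) {
        switch (args.length) {
            case 2:
                return evaluate((String) args[0], (String) args[1]);
            case 3:
                return evaluate((String) args[0], (String) args[1], (String) args[2]);
            default:
                throw new IllegalArgumentException(
                    "join expects 2 or 3 arguments, got " + args.length);
        }
    }

    private String evaluate(String a, String b) {
        return a + b;
    }

    private String evaluate(String a, String b, String separator) {
        return a + separator + b;
    }
}
```

When two overloads take the same number of arguments, the same switch can be refined with instanceof tests on the individual arguments.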

Writing RemoteFile Adapters

When writing AISL scripts, the details of the interaction with remote resources available on the network are hidden from the user and encapsulated in the facilities of a native data type of the language, the file type. The intention is to shield the user (almost) completely from such details, and to present resources available through heterogeneous protocols via a homogeneous access mechanism.

A network resource is made available as a file by invoking the getFile() function of the language. The function takes as its argument a locator, which is a string (and optionally some parameters needed for authentication), and determines from the form of the locator which protocol to use and how to access the resource. To avoid excessive resource consumption, remote resources are not downloaded straight away. Instead, a file object acts as a placeholder, and content is made available on demand. Other properties of the resource, such as its length, last modification date or hash signature, are instead gathered and cached so as to limit network usage. Of course, the availability of this information depends on the capabilities offered by the network protocol at hand. Once downloaded, content is also cached.
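The placeholder behaviour described above can be sketched in Java as follows. This is an illustration only, not the AIS file type or its adapter API: the class LazyFile, its methods, and the rule that a locator without a scheme is treated as a local file are assumptions of this example, and the download itself is replaced by a caller-supplied function.

```java
import java.util.Locale;
import java.util.function.Supplier;

// Hypothetical placeholder for a remote resource: the content is fetched
// on first access only and then served from a cache.
final class LazyFile {
    private final String locator;
    private final Supplier<byte[]> downloader;
    private byte[] cachedContent; // null until the first download

    LazyFile(String locator, Supplier<byte[]> downloader) {
        this.locator = locator;
        this.downloader = downloader;
    }

    String getLocator() {
        return locator;
    }

    // Derives the protocol from the form of the locator.
    // Assumption of this sketch: no scheme means a local file.
    String getProtocol() {
        int idx = locator.indexOf("://");
        return idx < 0 ? "file" : locator.substring(0, idx).toLowerCase(Locale.ROOT);
    }

    // Downloads the content on the first call and reuses it afterwards.
    synchronized byte[] getContent() {
        if (cachedContent == null) {
            cachedContent = downloader.get();
        }
        return cachedContent.clone();
    }

    synchronized boolean isDownloaded() {
        return cachedContent != null;
    }
}
```

Creating a LazyFile costs nothing on the network; only the first call to getContent() triggers the supplied download function, and later calls are answered from the cache.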


Writing Importers

The data model handled by AISL features three main types of constructs:

  • Collection
  • Resource
  • Relationship

A set of objects of these three main types, built by an AISL script, forms a so-called Graph of Resources (GoR). A graph of resources is composed of nodes (resources) and edges (relationships). Furthermore, the collection construct allows nodes (resources) to be grouped into sets. All constructs of the model can be annotated with properties, which are name/value pairs.
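The shape of this model can be sketched with a few Java classes. They are not the classes used by the AIS: the names Annotated, ResourceNode, RelationshipEdge, ResourceGroup and GraphOfResourcesSketch, as well as the property names used in the example, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical base for all constructs: each can be annotated with name/value properties.
abstract class Annotated {
    final Map<String, String> properties = new LinkedHashMap<>();
}

// A node of the graph.
final class ResourceNode extends Annotated {
    final String label;

    ResourceNode(String label) {
        this.label = label;
    }
}

// A directed edge between two nodes.
final class RelationshipEdge extends Annotated {
    final ResourceNode from;
    final ResourceNode to;

    RelationshipEdge(ResourceNode from, ResourceNode to) {
        this.from = from;
        this.to = to;
    }
}

// A set of nodes grouped together.
final class ResourceGroup extends Annotated {
    final List<ResourceNode> members = new ArrayList<>();
}

final class GraphOfResourcesSketch {
    final List<ResourceNode> resources = new ArrayList<>();
    final List<RelationshipEdge> relationships = new ArrayList<>();
    final List<ResourceGroup> collections = new ArrayList<>();

    ResourceNode addResource(String label) {
        ResourceNode node = new ResourceNode(label);
        resources.add(node);
        return node;
    }

    RelationshipEdge relate(ResourceNode from, ResourceNode to) {
        RelationshipEdge edge = new RelationshipEdge(from, to);
        relationships.add(edge);
        return edge;
    }

    ResourceGroup group(ResourceNode... members) {
        ResourceGroup g = new ResourceGroup();
        g.members.addAll(Arrays.asList(members));
        collections.add(g);
        return g;
    }

    // Builds a tiny graph: two nodes, one edge, one collection, all annotated.
    static GraphOfResourcesSketch example() {
        GraphOfResourcesSketch graph = new GraphOfResourcesSketch();
        ResourceNode image = graph.addResource("image");
        ResourceNode description = graph.addResource("description");
        image.properties.put("format", "jpeg");
        graph.relate(description, image).properties.put("role", "describes");
        graph.group(image).properties.put("name", "images");
        return graph;
    }
}
```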

The three main types of constructs cannot be instantiated directly. Instead, objects of specific subtypes of these constructs must be instantiated. These subtypes are defined by specific plugins called importers, which manage the import of different kinds of resources into the gCube infrastructure. The precise semantics of the properties attached to types, and the precise use of the constructs, are not fixed in advance but are determined by the specific importer that defines and manages a given subtype. In particular, there is no direct correspondence between constructs in the GoR and the way resources are represented inside the gCube Information Organization facilities.

Besides being annotated with properties, each construct of the model must be assigned an external identifier. An external identifier is a string that uniquely identifies a certain external resource and the model object that refers to it. This identification must hold across multiple, different invocations of the AIS.

For instance, consider the two files:

  1. http://mydomain.org/myimage.jpg
  2. http://myotherdomain.org/mymetadata.jpg

representing respectively an image and a file of metadata describing it. Here, we have three external resources to import: the files 1) and 2) and the association between the two. This set of resources can be represented by instantiating two model objects of type "resource", with subtypes "content" and "metadata" respectively, and a model object of type "relationship" and subtype "metadata". Furthermore, the GoR should contain two model objects of type "collection", with subtypes "content" and "metadata" respectively, which will contain the objects referring to resources 1) and 2).

When all these resources are created, they must be assigned an external identifier, which specifies their identity in the real world. So, for instance, it would be possible to choose the string "http://mydomain.org/myimage.jpg" to identify the resource 1). The first time an import task that specifies this GoR is run, the AIS will create an internal resource referring to the external resource 1). This internal resource will receive an (internal) object identifier. If another import task is executed, and another resource with external identifier "http://mydomain.org/myimage.jpg" is created, the AIS will treat this as the same resource, and so it will not create another internal resource but will modify, if needed, the one already created.
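The role of the external identifier across repeated imports can be sketched as follows. The classes ImportState and ImportOutcome and the "oid-N" identifier format are assumptions of this example and say nothing about how the AIS actually stores its state.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Result of importing one resource (hypothetical).
final class ImportOutcome {
    final String oid;
    final boolean created; // false if an existing internal resource was reused

    ImportOutcome(String oid, boolean created) {
        this.oid = oid;
        this.created = created;
    }
}

// Hypothetical import state: maps external identifiers to internal object identifiers.
// A real service would keep this mapping in persistent storage between runs.
final class ImportState {
    private final Map<String, String> oidByExternalId = new LinkedHashMap<>();
    private int nextOid = 1;

    // Creates an internal resource only the first time an external identifier
    // is seen. Later imports reuse the same OID, so the existing resource
    // can be updated instead of duplicated.
    ImportOutcome importResource(String externalId) {
        String existing = oidByExternalId.get(externalId);
        if (existing != null) {
            return new ImportOutcome(existing, false);
        }
        String oid = "oid-" + nextOid++;
        oidByExternalId.put(externalId, oid);
        return new ImportOutcome(oid, true);
    }

    int knownResources() {
        return oidByExternalId.size();
    }
}
```

Importing "http://mydomain.org/myimage.jpg" twice with the same ImportState yields the same OID both times, with created set to true only on the first call.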

The AISL Language

Importers

The Archive Import Service performs the import of external resources by representing them in a Graph of Resources and passing this graph to a chain of software modules called "importers". Each importer is responsible for handling a specific kind of resource (e.g. metadata), and is essentially the bridge between the Archive Import Service and the services of the Information Organization stack responsible for managing certain kinds of internal resources (collections, metadata, documents, etc.). The precise way in which an importer performs the import thus depends on the specific subsystem the importer interacts with. Similarly, different importers need different information about the resources to import. For instance, to import a document it is necessary to have its content or a URL at which the content can be accessed. To create a metadata collection, it is necessary to specify some properties, such as the location of the schema of the objects contained in the collection. These values are passed to an importer by annotating objects in a graph of resources with appropriate properties. In order to constrain the kinds of properties that the model objects it manipulates must have, an importer must define a set of subtypes of the model object types. For instance, the metadata importer (described below) defines a subtype for each basic type of the Resource Model:

  • collection::metadata,
  • resource::metadata and
  • relationship::metadata

Each of these subtypes has specific properties that are understood, used and manipulated by that importer. The way subtyping is accomplished is described in more detail in the section "Writing Importers". The types defined by an importer and their properties must be publicly available, as AISL script developers must know which properties are available to them and what their semantics are. Furthermore, subsequent importers in the chain may also need to access some properties. For example, an importer for metadata also needs to access model objects representing content to get their object id (internal identifier). Notice that in general an AISL script will not necessarily assign all properties defined by a subtype. Some of these properties may be conditionally needed, while some will only be written by an importer at import time. For example, when a new content object is imported, the importer must record the OID of the newly created object in the GoR object. For this reason, the specification of the subtypes defined by importers must also state which properties are mandatory (i.e. they must be assigned during the creation of a GoR) and which properties are private (i.e. they should NOT be assigned during the creation of the GoR).


Built-in importers

The AIS already comes with the capability to import documents and metadata. This is provided by two importers called the content importer and the metadata importer. The types defined by these importers are described below:


Content Collection Importer

This importer defines one subtype.

  • collection::content

Its properties, their type and semantics are as follows:

[collection::content]
collectionName : string  : mandatory - The name of the collection
isUser         : boolean : mandatory - Denotes whether the collection is a user collection
collectionId   : string  : private   - The id assigned to the collection by the collection management service
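A specification of this kind lends itself to a simple check of the properties assigned by a script. The sketch below encodes the three properties of collection::content and reports missing mandatory properties and assigned private ones. The classes PropertyKind and SubtypeSpec and the wording of the error messages are hypothetical, and the AIS may perform this check differently or not at all.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

enum PropertyKind { MANDATORY, OPTIONAL, PRIVATE }

// Hypothetical description of a subtype: property names and how each may be used.
final class SubtypeSpec {
    final String name;
    private final Map<String, PropertyKind> properties = new LinkedHashMap<>();

    SubtypeSpec(String name) {
        this.name = name;
    }

    SubtypeSpec declare(String property, PropertyKind kind) {
        properties.put(property, kind);
        return this;
    }

    // Checks the properties assigned while building the GoR:
    // mandatory ones must be present, private ones must be absent.
    List<String> validate(Map<String, ?> assigned) {
        List<String> errors = new ArrayList<>();
        for (Map.Entry<String, PropertyKind> entry : properties.entrySet()) {
            boolean present = assigned.containsKey(entry.getKey());
            if (entry.getValue() == PropertyKind.MANDATORY && !present) {
                errors.add("missing mandatory property: " + entry.getKey());
            }
            if (entry.getValue() == PropertyKind.PRIVATE && present) {
                errors.add("private property must not be assigned: " + entry.getKey());
            }
        }
        return errors;
    }

    // The collection::content subtype as listed above.
    static SubtypeSpec contentCollection() {
        return new SubtypeSpec("collection::content")
            .declare("collectionName", PropertyKind.MANDATORY)
            .declare("isUser", PropertyKind.MANDATORY)
            .declare("collectionId", PropertyKind.PRIVATE);
    }
}
```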

Document Importer

This importer defines three subtypes. They are:

  • resource::content
  • relationship::partof
  • relationship::alternativerepresentationof


[resource::content]
isVirtualImport        : boolean : mandatory
documentName           : string  : mandatory
hasMaterializedContent : boolean : mandatory
contentSourceLocator   : string  : 
content                : file    : 
documentId             : string  : private  -  The id assigned to the document by the storage management service
isLargeFile            : boolean :

Note: the fields contentSourceLocator and content are alternatives. Which of the two is used depends on the value of the field hasMaterializedContent.

[relationship::partof]
position               : int      : 
[relationship::alternativerepresentationof]

Metadata Importer

This importer defines three subtypes, one for each basic construct in the Resource Model. They are:

  • collection::metadata
  • resource::metadata
  • relationship::metadata

[collection::metadata]
relatedContentCollection : collection : mandatory - The content collection containing the objects to which this metadata collection refers
collectionName           : string     : mandatory - The name of the collection
collectionDescription    : string     : mandatory - A description of the collection
isUser                   : boolean    : mandatory - Indicates whether this is a user collection
isIndexable              : boolean    : mandatory - Indicates whether this collection is indexable
metadataName             : string     : mandatory - Name of the metadata schema in this collection
metadataLanguage         : string     : mandatory - Language of the metadata in this collection
metadataSchemaURI        : string     : mandatory - URI of the schema of the metadata in this collection
collectionId             : string     : private   - The id assigned to the metadata collection during the import

[resource::metadata]
content                  : string     : mandatory - The content of this metadata object
objectID                 : string     : private   - The id assigned to the metadata object during the import

[relationship::metadata](resource::metadata, resource::content)

This subtype does not define any property. It denotes an edge from a metadata resource object to a content resource object.



Current Limitations and Known Issues

Note: The content of this section is temporary, and will be completely removed or changed in subsequent releases of the service.

The AIS is currently released as a standalone client. The software is available to developers from the SVN repository. The class org.gcube.contentmanagement.contentlayer.archiveimportservice.impl.AISLClient contains a client that performs the steps needed for the import: parsing and execution of the script, generation of the graph of resources, and import of the graph of resources. It accepts one or two arguments. The first is the location (on the local file system) of a file containing an AISL script. The second argument is a boolean value. If it is set to true, the client will create the graph of resources but will not start the import. This is to ease debugging.

After the graph of resources is created, the client generates a dump of the graph in a file named resourcegraph.dump. The graph is serialized in an XML-like format. This is only for visualization and debugging purposes, and the format is currently not guaranteed to be valid (or even well-formed) XML.