Difference between revisions of "Archive Import Service"

From Gcube Wiki
== Introduction ==
+
== Aim and Scope of Component ==
  
The role of the AISL Service is to import resources which are external to the gCube infrastructure into the infrastructure itself. "Importing", here and in the following, refers to the description
+
The Archive Import Service (AIS) is dedicated to the '''"batch" import of resources''' which are external to the gCube infrastructure into the infrastructure itself. The term ''"importing"'' refers to the description of such resources and their logical relationships (e.g. the association between an image and a file containing metadata referring to it) inside the [[GCube_Information_Organisation_Services|Information Organization stack of services]]. While the AIS is not strictly necessary for the creation and management of collections of resources in gCube, in practice it makes the creation of large collections, and their automated maintenance, feasible.
of such resources inside the Information Organization stack of services, and not necessarily to the fact that the content of such resources is actually stored within facilities that belong to the
+
infrastructure. Similarly, the kinds of resources that can be imported are not necessarily objects that exist physically. The association between an image and a file containing metadata referring to it might not exist physically, but can still be considered a
+
resource that can be imported. The same holds for a collection of images. The term "external resource" is used to denote any resource outside the gCube infrastructure (regardless of whether it has already been imported).
+
The term "internal resource" is used to denote the entity that represents an external resource inside the gCube infrastructure. This is normally an object or a relationship of the info-object model,
+
and is identified, inside the gCube infrastructure, by an Object Identifier. The service should support the import by providing a way to specify which resources should be imported and how, and offer facilities to automate this task whenever possible.
+
  
 +
== Logical Architecture ==
 +
The task of importing a set of external resources is performed by building a description of the resources to import, in the form of a graph labelled on nodes and arcs with key-value pairs, called a '''''graph of resources (GoR)''''', and then processing this description accomplishing all steps needed to import the resources represented therein.
 +
The GoR is based on a custom data model, similar to the Information Object Model, and is built by following a procedural description expressed in a scripting language, called the '''Archive Import Service Language (AISL)'''. Full details about this language and how to write an import script are given in the [[Content_Import|Administrator's guide]]. The actual import task is performed by a chain of pluggable software modules, called "importers", that process the graph in turn and can manipulate it, annotating it with additional knowledge produced during the import. Each importer is dedicated to importing resources by interacting with a specific subpart of the Information Organization set of Services. For instance, the ''metadata importer'' is responsible for the import of metadata, and it handles their import by interacting with the Metadata Management Service. The precise way in which importers handle the import task, and in particular how they define a specific description of the resources they need to consume inside a GoR, is left to the individual importers. This, together with the pluggable nature of importers, makes it possible to enable the import of new kinds of content resources which might be defined in the infrastructure in the future.
  
 +
The logical architecture of the AIS can then be depicted as follows:
  
== Import procedure ==
+
[[Image:AISDiagram.jpeg]]
  
The task of importing a set of external resources is articulated in two major steps. First, a description of the resources to import is built. This description is based on a custom data model,
+
=== Import state and incremental import ===
which is described later on in this document. The resulting description is essentially a graph labelled on nodes and arcs with key-value pairs. This description is called a graph of resources (GoR).
+
During the import of some resources, the corresponding GoR is kept updated with information regarding the actual resources created, such as their OIDs. The Graph of Resources is stored persistently by the service, so that a subsequent execution of the same import script is aware of the status of the import and can perform only the differential operations needed to keep the resources up to date. While this solution involves a partial duplication of information inside the infrastructure, it has been chosen because it completely decouples the AIS from other gCube services, which are thus not forced to offer additional information needed for incremental import in their interfaces.
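The incremental behaviour described above can be illustrated with a minimal Python sketch. It is not the AIS implementation: the <tt>ImportState</tt> class, its JSON file format and the <tt>run_import</tt> helper are hypothetical names introduced only for illustration. The one idea taken from the text is that a persisted mapping from external identifiers to OIDs lets a second execution of the same script update a resource instead of creating it again.

```python
import json
import os
import tempfile


class ImportState:
    """Hypothetical persistent store mapping external identifiers to OIDs."""

    def __init__(self, path):
        self.path = path
        self.oids = {}
        # Reload the state left behind by a previous execution, if any.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.oids = json.load(fh)

    def import_resource(self, external_id, create, update):
        """Create the internal resource on first sight, update it afterwards."""
        if external_id in self.oids:
            update(self.oids[external_id])
            return "updated"
        self.oids[external_id] = create()
        return "created"

    def save(self):
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump(self.oids, fh)


def run_import(state_path, external_ids):
    """Simulate one execution of the same import script (illustrative only)."""
    state = ImportState(state_path)
    next_oid = len(state.oids)
    outcome = {}

    def create():
        # Stand-in for the call that would create an object in the infrastructure.
        nonlocal next_oid
        next_oid += 1
        return "oid-%d" % next_oid

    for ext_id in external_ids:
        outcome[ext_id] = state.import_resource(ext_id, create, lambda oid: None)
    state.save()
    return outcome


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "gor_state.json")
        print(run_import(path, ["http://mydomain.org/myimage.jpg"]))
        print(run_import(path, ["http://mydomain.org/myimage.jpg"]))
```

Keeping the mapping on the importing side is what allows the other services to stay unaware of incremental import, at the cost of the partial duplication mentioned above.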
Second, the GoR built during the first phase is processed by a chain of software modules called "importers". Each importer is dedicated to import resources interacting with a specific subpart of
+
the Information Organization set of Services. For instance, the metadata importer is responsible for the import of metadata, and it handles their import by interacting with the Metadata Management Service.
+
The precise way in which importers handle the import task, and in particular how they define a specific description of the resources they need to consume inside a GoR is left to the single importers.
+
Details about the importers which are already included in the Archive Import Service are given below in this document. The creation of a graph of resources is done through the execution of a script written in the AISL language,
+
which is described later on in this document.
+
  
 +
== Service Implementation ==
 +
'''Note:''' this section refers to the architecture of the AIS as a stateful gCube service. The component is, however, currently a standalone tool. Please see the section on current limitations and known issues for further details.
  
 +
The AIS is a '''''stateful gCube service''''', following the factory approach. A stateless factory porttype, the '''<tt>ArchiveImportService</tt> porttype''', creates stateful instances of the '''<tt>ArchiveImportTask</tt> porttype'''. Each import task is responsible for the execution of a single import script. It performs the import, maintains internally the status of the import (in the form of an annotated graph of resources), and provides notification about the status of the task. The resources it is responsible for are kept up to date by re-executing the import script at suitable time intervals. Besides acting as a factory for import tasks, the ArchiveImportService porttype also offers additional functionality related to the management of import scripts. Import scripts are generic resources inside the infrastructure. The porttype makes it possible to publish new scripts, list and edit existing scripts, and validate them from a syntactic point of view. The semantics of the methods offered by the two porttypes are described in the following.
  
 +
=== Archive Import Service Porttype===
 +
* '''<tt>String[] list()</tt>''': returns a list of import scripts identifiers for scripts currently available in the VRE (as generic resources);
 +
* '''<tt>Script load(String scriptId)</tt>''': returns the import script corresponding to the given identifier;
 +
* '''<tt>void save(Script script)</tt>''': saves an import script, by creating or updating a corresponding generic resource;
 +
* '''<tt>void delete(String scriptId)</tt>''': deletes the import script corresponding to the given identifier;
 +
* '''<tt>ValidationReport validate(String scriptSpecification)</tt>''': this method performs validation of the given input string, which is treated as an AISL script and undergoes parsing and other syntactic validation steps;
 +
* '''<tt>EndpointReferenceType getTask(String importScriptIdentifier)</tt>''': this method gets an instance of the ArchiveImportTask service dedicated to a given script.
  
== Data Model ==
+
The Complex Types '''<tt>Script</tt>''' and '''<tt>ValidationReport</tt>''' used in the methods above represent respectively an AISL script (characterized by a scriptId, a description and a content, i.e. the script itself) and a syntactic validation report (containing a boolean validity flag, a message and other details such as the row and column of an error).
  
The data model handled by AISL features three main types of constructs:
+
=== Archive Import Task Porttype ===
 +
* '''<tt>void start(ImportOptions options)</tt>''': this method starts the import task with the given options. Options include a run mode (validate, build, simulation import, import);
 +
* '''<tt>void stop()</tt>''': this method  stops the import task;
 +
* '''<tt>TaskStatus getStatus()</tt>''': this method returns detailed information about the progress of an import task, including information on errors that occurred during its execution. The Complex Type TaskStatus contains information about the task progress, e.g. the execution state (new, running, stopped, failed), the execution phase (parsing, building, importing) and, where applicable, the import phase (document, metadata...), as well as the number and types of graph objects currently created, imported and failed. For failed objects, detailed error information is also available.
  
* Collection
+
=== Import Task Status Notification ===
* Resource
+
The ArchiveImportTask service maintains part of its state as a WS-resource. The properties of this resource are used to notify interested services about:
* Relationship
+
* the current phase of the import (e.g. graph building, importing etc.);
 +
* the current number of objects created in the GoR;
 +
* the current number of objects imported successfully;
 +
* the current number of objects whose import failed.
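As an illustration of the information listed above, the following Python sketch models a task status record. It is only an assumption-laden sketch: the class, enum and method names (<tt>TaskStatus</tt> fields, <tt>record_created</tt>, <tt>record_import</tt>) are hypothetical and do not reproduce the actual complex type or the WS-resource properties of the service; only the states, phases and counters come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    NEW = "new"
    RUNNING = "running"
    STOPPED = "stopped"
    FAILED = "failed"


class Phase(Enum):
    PARSING = "parsing"
    BUILDING = "building"
    IMPORTING = "importing"


@dataclass
class TaskStatus:
    """Hypothetical status record exposing the counters named in the text."""
    state: State = State.NEW
    phase: Phase = Phase.PARSING
    created: int = 0      # objects created in the GoR
    imported: int = 0     # objects imported successfully
    failed: int = 0       # objects whose import failed
    errors: list = field(default_factory=list)

    def record_created(self):
        self.created += 1

    def record_import(self, external_id, error=None):
        """Count a successful import, or keep the error details of a failed one."""
        if error is None:
            self.imported += 1
        else:
            self.failed += 1
            self.errors.append((external_id, error))
```

A client interested in progress would then read the three counters and the current phase rather than inspecting the graph itself.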
  
A set of objects of these three main types, built by an AISL script, forms a so-called Graph of Resources (GoR). A graph of resources is a graph composed of nodes (resources) and edges (relationships). Furthermore,
+
=== Implementation Details ===
the collection construct allows nodes (resources) to be grouped into sets. All constructs of the model can be annotated with properties, which are name/value pairs.
+
The following UML diagram describes concisely how the AIS is organized from an implementation point of view. The service itself depends on a library whose packages are dedicated each to a specific functionality.  
  
The three main types of constructs cannot be instantiated directly. Instead, objects of specific subtypes of these constructs must be instantiated. These subtypes are defined by specific plugins called importers that manage
+
[[Image:AISUML.jpg]]
the import of different kinds of resources inside the gCube infrastructure. The precise semantics of the properties attached to types and the precise use of the constructs are not fixed in advance, but are determined by the specific importer that defines and manages a specific
+
subtype. In particular, there is no direct correspondence between constructs in the GoR and how the resources are represented inside the gCube Information Organization facilities.
+
  
Besides being annotated with properties, each construct of the model must be assigned an external identifier. An external identifier is a string that uniquely identifies a certain external resource, and the model object that refers to it. This identification must hold across multiple, different invocations of the AIS.
+
In particular:
 +
* '''ais''' contains the classes related to the main functionality of the service, like defining the logical flow of a task execution
 +
* '''language''' contains the classes related to parsing and executing AISL scripts
 +
* '''datamodel''' contains the classes used in the description of graph of resources.
 +
* '''importing''' contains the classes used for the actual import, e.g. the so called importers.
 +
* '''remotefile''' contains all classes used to access remote resources.
 +
* '''util''' is a package collecting several different utility classes, like those used to manage the persistence of the GoR, caching, manipulating XML and HTML etc.
  
For instance, consider the two files:
+
Each of these packages is further divided into multiple subpackages dedicated to specific functionality. These are not shown here to avoid visual cluttering.
# <nowiki>http://mydomain.org/myimage.jpg</nowiki>
+
# <nowiki>http://myotherdomain.org/mymetadata.jpg</nowiki>
+
  
representing respectively an image and a file of metadata describing it. Here, we have three external resources to import: the files 1) and 2) and the association between the two.
+
== Extensibility Features ==
This set of resources can be represented by instantiating two model objects of type "resource", having specific type respectively equal to "content" and "metadata", and a model object of
+
The functionality of the Archive Import Service can be extended in three main ways. It is possible to define new functions for the AISL, and to plug in software modules to interact with external resources through additional network protocols and to interact with new gCube content-related components.  
type "relationship" and subtype "metadata". Furthermore, the GoR should contain two model objects of type collection and subtype respectively "content" and "metadata" which will contain the objects referring to resources 1) and 2).
+
  
When all these resources are created, they must be assigned an external identifier, which specifies their identity in the real world. So, for instance, it would be possible to choose
+
=== Defining Functions for the AISL ===
the string <nowiki>"http://mydomain.org/myimage.jpg"</nowiki> to identify the resource 1). The first time an import task that specifies this GoR is run, the AIS will create an internal resource referring to
+
AISL is a scripting language intended to create graphs of resources for subsequent import. Its type system and grammar, together with usage examples, are fully described in the [[Content_Import|Administrator's guide]]. Its main features are a tight integration with the AIS, in the sense that the creation of model objects is a first-class feature of the language, and the ability to handle some tasks which are frequent during import, like accessing files at remote locations, as transparently as possible for the user. Besides avoiding the complexity of full-fledged programming languages like Java, its limited expressivity - tailored to import tasks only - prevents security issues related to the execution of code in remote systems. The language does not allow the definition of new functions/types of objects inside a program, but can be easily extended with new functionality by defining new functions as plugin modules. Adding a new function amounts to two steps:
the external resource 1). This internal resource will receive an (internal) object identifier. If another import task is executed, and another resource with external identifier <nowiki>"http://mydomain.org/myimage.jpg"</nowiki> is created, the AIS will treat this as the same resource, and so it will not create another internal resource but modify, if needed, the one already created.
+
# Creating a new Java class implementing the AISLFunction class. This is more easily done by subclassing the <tt>AbstractAISLFunction</tt> class. See below for further details.
 
+
== The AISL Language ==
+
 
+
AISL is a scripting language intended to create graphs of resources for subsequent import. Its main features are a tight integration with the AIS, in the sense that the creation of model objects is a first-class feature of the language, and the ability to handle some tasks which are frequent during import, like accessing files at remote locations, as transparently as possible for the user. Besides avoiding the complexity of full-fledged programming languages like Java, its limited expressivity - tailored to import tasks only - prevents security issues related to the execution of code in remote systems. The language does not allow the definition of new functions/types of objects inside a program, but can be easily extended with new functionality by defining new functions as plugin modules. This feature is detailed later on in this document.
+
 
+
 
+
 
+
=== Type System ===
+
 
+
The language is currently untyped. Variables can be assigned any kind of object from the Java type system. However, it is planned to enforce at least partial static type checking in the future, to allow early detection of errors in the scripts. For this reason, the grammar already requires variables to be declared before their use. Even though the language processor does not currently check for type compliance, it is strongly suggested that implementors try to use the appropriate types, in order to reduce the effort of converting scripts later on. The types supported by the language (i.e. those for which the language allows an explicit variable declaration) are:
+
 
+
* Primitive types:
+
** integer
+
** float
+
** boolean
+
** string
+
** list
+
** file
+
 
+
* Model object types:
+
** collection
+
** resource
+
** relationship
+
 
+
Notice that AISL is not an object-oriented language. Even if some of the types correspond to Java types, there are neither methods nor fields. Access to properties of objects is instead possible through appropriate functions. So, for example, the size of a list value can be obtained by invoking the function listsize() on it. However, model object types have "properties" that resemble fields in object-oriented programming languages and that can be accessed through the familiar dot (".") notation.
+
 
+
Values of the primitive types are treated as in the Java language. As in Java, strings can be created by supplying a value between double quotes, and can be concatenated using the "+" operator. Values of the "list" type are similar to lists in the Java language, but array-like selection of their elements is supported, and the language provides a built-in constructor for lists (see the section on type constructors below). As in Java, list indexes are zero-based. Lists are currently heterogeneous (i.e. they can contain values of different types). In future releases, it is planned to provide facilities to enforce list-type checking. The "file" type represents local or remote external resources in a way that is independent of the specific protocol used to access these resources.
+
 
+
The types collection, resource and relationship correspond to the model object constructs introduced in the previous sections. Variables of these types are used to hold instances of objects of the resource graph data model.
+
 
+
 
+
 
+
=== Syntax ===
+
 
+
This section describes briefly the syntax of AISL and semantics of AISL constructs, focusing especially on aspects in which the language differs from the Java programming language syntax, to which it is close.
+
A formal description of the syntax can be found in the appendix following this document.
+
 
+
An AISL script is a sequence of instructions, which are variable declarations, conditional statements, loop statements or certain kinds of expressions such as variable assignments and function invocations. Values for the various types in AISL can be built through appropriate type constructors. Values can then be manipulated and composed using expressions of various kinds, which are in most cases similar to the corresponding expressions in the Java programming language.
+
 
+
 
+
 
+
 
+
==== Type constructors ====
+
 
+
 
+
===== Primitive type constructors =====
+
====== integer ======
+
Integer values are sequences of digits starting with a non-zero digit, or a single '0' digit. In other words, they match the production DECIMAL_LITERAL: "0"|(["1"-"9"] (["0"-"9"])*)
+
 
+
 
+
====== float ======
+
Floating point values consist of one or more digits, a decimal point and, optionally, further digits: FLOATING_POINT_LITERAL: (["0"-"9"])+ "." (["0"-"9"])*
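The two literal productions can be checked mechanically. The following Python sketch is an illustration only, not the AISL parser: it transcribes the productions DECIMAL_LITERAL and FLOATING_POINT_LITERAL into regular expressions, and the helper name <tt>classify_literal</tt> is hypothetical.

```python
import re

# DECIMAL_LITERAL: "0" | (["1"-"9"] (["0"-"9"])*)
DECIMAL_LITERAL = re.compile(r"0|[1-9][0-9]*")

# FLOATING_POINT_LITERAL: (["0"-"9"])+ "." (["0"-"9"])*
FLOATING_POINT_LITERAL = re.compile(r"[0-9]+\.[0-9]*")


def classify_literal(text):
    """Return "integer", "float" or None according to the two productions."""
    if DECIMAL_LITERAL.fullmatch(text):
        return "integer"
    if FLOATING_POINT_LITERAL.fullmatch(text):
        return "float"
    return None
```

Read literally, the productions reject an integer with leading zeros such as <tt>007</tt> and a float without digits before the point such as <tt>.5</tt>, while a trailing point as in <tt>3.</tt> is accepted.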
+
 
+
 
+
====== boolean ======
+
The words true and false are reserved words in AISL, and they are interpreted as the corresponding boolean values.
+
 
+
 
+
====== string ======
+
String literals are sequences of characters between double quotes. Special characters like newline and tab are escaped and treated as in the Java programming language.
+
 
+
 
+
====== list ======
+
Lists are built by enclosing a list of expressions separated by commas in curly brackets. For example:
<pre>
list myList = {3+4, 56, "a"};
</pre>
+
 
+
====== file ======
+
File objects are built by invoking the constructor functions getFile(string locator) and getFile(string locator, list<string> accessinformation). A locator is a string encoding a location and a protocol to be used to access the file.
+
For instance the instruction:
+
 
+
<pre>
file f = getFile("ftp://ftp.example.org/pub/share/myfile.xml");
</pre>
+
 
+
builds a file object that accesses its content through the ftp protocol at the given location. The format of the locator string is not defined in advance, as it depends on the specific protocol used. Currently supported protocols are ftp, http and file. They all accept a URL as locator. Different formats may be provided by different subtypes of the AISLFile class. This is described in more detail later on in this document. The two-argument constructor allows login information that might be needed to access remote resources to be passed in.
+
 
+
===== Model Object Constructors =====
+
These constructors create elements of the resource graph which is later used for import by the service. Once created, the properties of an object can be modified but the object itself cannot be deleted. In other words, it is sufficient to invoke one of these constructors for the object to be in the final graph of resources. In general, all constructors require:
+
 
+
* the type of construct to be created (i.e. collection, resource or relationship)
+
* the specific subtype of the object. This subtype should be defined in the context of a specific importer, by appropriately subclassing the class defining the basic construct. This allows checks to be performed during the parsing of the script, e.g. on the properties of constructs. For example, the type collection::content is a subtype of the type collection that defines the properties collectionName, isVirtual and isUser.
+
* a unique '''external identifier'''. This string value uniquely identifies a certain construct, so that it can be recognized during subsequent import phases.
+
* in the case of resources, it is possible to supply to the constructor a list of collections to which the resource must belong
+
* in the case of relationships, the resources that the relationship links must be specified.
+
 
+
Furthermore, the body of the constructor allows one or more of the properties defined by the construct, if any, to be initialized. The names, types and precise semantics of these properties are described in the section about importers.
+
 
+
 
+
Examples of constructors are as follows:
+
 
+
<pre>
collection metadatacollection = collection::metadata["medspiration_test_metadata"]{
    collectionName = "medspiration_test_metadata",
    collectionDescription = "test for the AIS with medspiration data",
    relatedContentCollection = contentcollection,
    isUser = true,
    isIndexable = false,
    metadataName = "dc",
    metadataLanguage = "en",
    metadataSchemaURI = "http://www.opendlib.com/resources/schemas/metadata_dc.xsd"
};
</pre>
+
 
+
 
+
This constructor defines a metadata collection. Here the type is "collection", the subtype is "metadata", and the external identifier is given by a static string ("medspiration_test_metadata"). The body of the constructor initializes a number of properties specific to the "metadata" subtype. Notice that the object created by the constructor is then assigned to a variable of the appropriate type (collection).
+
 
+
 
+
<pre>
resource::content[url] in ccoll{
    isVirtualImport = false,
    contentSourceLocator = url,
    documentName = name,
    hasMaterializedContent = false
};
</pre>
+
 
+
This constructor defines a resource of type content. The external identifier here is a variable (url) that must evaluate to a string. Furthermore, the object is specified to belong to a specific
+
collection again using a variable (ccoll) holding an instance of a (content) collection.
+
 
+
<pre>
relationship rel = relationship::metadata(metadata, content)["metadatarel"+url]{};
</pre>
+
 
+
This constructor specifies a relationship of subtype metadata. The pair of variables (metadata, content) specifies the resources to and from which the relationship holds. The external identifier is computed as an expression evaluating to a string.
+
 
+
==== Expressions ====
+
 
+
===== Arithmetic Expressions =====
+
Numeric types (integer and float) can be combined using the same operators available in the Java programming language, i.e. the unary operators + and - and the binary operators +, -, /, * and %. These operators have the same precedence and semantics as in Java. If the operands of a binary operator have different types, the type of the result is always "float".
+
 
+
===== Relational Expressions =====
+
The relational operators ==, !=, <, <=, >, >= have the same precedence and semantics as in Java. They all evaluate to a boolean value and they can all be applied to numeric values. Furthermore, the operators == and != can be applied to all other types.
+
 
+
===== Boolean Expressions =====
+
Boolean expressions are built from boolean values by applying the unary operator ! (not) and the binary operators | (or), & (and), ^ (exclusive or), with the same precedence and semantics as in Java. Notice that, unlike Java, AISL does not support the conditional boolean operators || and &&.
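The practical consequence of having only the non-conditional operators is that both operands are always evaluated, which is how <tt>&</tt> and <tt>|</tt> behave on booleans in Java. The Python sketch below illustrates the difference, assuming AISL follows Java here as the text states; the helper names are hypothetical and the code is not part of the AIS.

```python
def non_short_circuit_and(left, right):
    """Both operands are always evaluated, as with `&` on booleans."""
    return left() & right()


def short_circuit_and(left, right):
    """The right operand is skipped when the left one is false, as with `&&`."""
    return left() and right()


def traced(name, value, log):
    """Return a zero-argument callable that records when it is evaluated."""
    def operand():
        log.append(name)
        return value
    return operand
```

A script author should therefore not rely on the left operand guarding the right one, for example when the right operand indexes into a list that may be empty.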
+
 
+
===== Selectors =====
+
The elements of list-typed values can be obtained with the same syntax that in Java is used to access the elements of arrays. E.g.:
+
<pre>
list myList = {3+4, 56, "a"};
integer myInt = myList[1];
</pre>
+
Lists can be nested, and selectors can be combined:
+
<pre>
list myList = {3+4, {45, 10}, "a"};
integer myInt = myList[1][0];
</pre>
+
 
+
The properties of values of model object types can be accessed by name with a dot notation, e.g.:
+
 
+
<pre>
resource myContentResource = resource::content[url] in ccoll{
    isVirtualImport = false,
    contentSourceLocator = url,
    documentName = name,
    hasMaterializedContent = false
};

myContentResource.documentName = "test";
</pre>
+
 
+
==== Variable Declarations ====
+
Variable Declarations contain a specification of an AISL type, a variable identifier and an optional initializer. E.g.:
+
<pre>
list myList = {3+4, 56, "a"};
</pre>
+
 
+
==== Functions ====
+
Function invocation in AISL is analogous to function invocation in Java, except that all functions have global visibility and there are no objects or classes through which to invoke a function.
+
An example is:
+
<pre>
...
string mystring = "test";
boolean matches = match(mystring, "t.*t");
print(matches);
...
</pre>
+
 
+
This code snippet contains two function invocations, namely of the functions match and print (it prints "true"). AISL comes with a set of predefined functions, described below. New functions can be added to the language. This is described later on in this document.
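For readers more familiar with other regular expression APIs, the behaviour of <tt>match</tt> in the snippet above can be approximated in Python as follows. This is a sketch under an assumption: the text does not say whether the pattern must cover the whole string or only part of it, and the version below assumes a whole-string match, as with Java's <tt>String.matches</tt>. Both readings give "true" for the example.

```python
import re


def match(text, pattern):
    """Assumed semantics: the pattern must match the whole string."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    mystring = "test"
    print(match(mystring, "t.*t"))
```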
+
 
+
 
+
===== Predefined AISL Functions =====
+
 
+
====== Functions on file ======
+
These predefined functions provide access to the properties of objects of type file.
+
 
+
<table valign="top" width="80%">
<tr><td>string</td><td>'''filename'''</td><td width="100">(file f)</td><td>returns the name of the file</td></tr>
<tr><td>integer</td><td>'''filesize'''</td><td>(file f)</td><td>returns the size of the file</td></tr>
<tr><td>boolean</td><td>'''isdirectory'''</td><td>(file f)</td><td>returns true if and only if this file is a directory</td></tr>
<tr><td>boolean</td><td>'''isfile'''</td><td>(file f)</td><td>returns true if this file is a regular file</td></tr>
<tr><td>list<file></td><td>'''children'''</td><td>(file f)</td><td>returns a list containing a file object for each of the children of f. The list returned is empty if this file is a regular file (i.e. not a directory) or if the protocol used to access the file does not model a hierarchical filesystem.</td></tr>
<tr><td>list<file></td><td>'''descendants'''</td><td>(file f)</td><td>returns a list containing a file object for each of the descendants of f, obtained by recursively exploring all subdirectories. Notice that the list contains all files in the subtree rooted at f, not only its leaves (i.e. it also contains all directories that are descendants of f). The list returned is empty if this file is a regular file (i.e. not a directory) or if the protocol used to access the file does not model a hierarchical filesystem.</td></tr>
</table>
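The difference between '''children''' and '''descendants''' can be illustrated on a local directory tree. The Python sketch below uses <tt>pathlib</tt> as a stand-in for the AISL file type; it only mirrors the behaviour described in the table (direct children versus the whole subtree, directories included, and an empty list for regular files) and says nothing about how the AIS implements remote protocols. The results are sorted only to make them deterministic.

```python
import tempfile
from pathlib import Path


def children(f):
    """Direct children of a directory; empty for a regular file."""
    return sorted(f.iterdir()) if f.is_dir() else []


def descendants(f):
    """Every file and directory below f, found recursively; empty for a regular file."""
    return sorted(f.rglob("*")) if f.is_dir() else []


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "2006" / "01").mkdir(parents=True)
        (root / "2006" / "01" / "a.xml").write_text("<m/>")
        (root / "readme.txt").write_text("hello")
        print([p.name for p in children(root)])
        print([p.relative_to(root).as_posix() for p in descendants(root)])
```

Because intermediate directories are part of the result, a script iterating over descendants usually needs an <tt>isdirectory</tt> check before treating an element as a document.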
+
 
+
====== Functions on string ======
+
* boolean '''match'''(string str, string pattern) returns true if the given string matches the given regular expression pattern, false otherwise.
+
* boolean '''print'''(string str) prints str.
+
 
+
====== Functions on list ======
+
* integer '''listsize'''(list l) returns the size of the list l
+
 
+
====== Functions on dom objects ======
+
* list<dom> '''xpath'''(dom file, string xpathexpression)
+
* dom '''xslt'''(dom file, string xslt transformation)
+
* string '''toString'''(dom file) converts a given dom object into a string (i.e. to its XML serialization).
+
 
+
==== Control Flow Statements ====
+
AISL contains three control flow statements: if, switch and foreach. The major syntactic difference between these statements and the corresponding ones in the Java language is that instructions inside the constructs must be enclosed in curly brackets (even when they contain a single instruction). Notice that these statements are not terminated by a ";" character. Otherwise, if and switch statements behave like their Java counterparts, while the foreach statement has a special syntax.
+
 
+
===== Conditional Statements =====
+
 
+
====== If Statement ======
+
This statement has the same syntax and semantics as in Java, and takes the two forms:
+
<pre>
if( conditional expression ){
    ...
}
</pre>
+
and
+
<pre>
if( conditional expression ){
    ...
}
else{
    ...
}
</pre>
+
 
+
====== Switch Statement ======
+
This statement has the same syntax and semantics as in Java:
+
<pre>
switch( expression ){
    case expression1:
        ...
        break;

    ...

    case expressionN:
        ...
        break;

    default:
        ...
        break;
}
</pre>
+
 
+
===== Loop Statements =====
+
The AISL language avoids unbounded loops as much as possible. For this reason it does not have a while statement, and its foreach statement only allows bounded loops. In particular, foreach iterates over a range of integer values, with a fixed increment (or decrement).
+
 
+
<pre>
foreach loopvariable in [ expression to expression by expression ]{
    ...
}
</pre>
+
The three expressions appearing in the statement correspond to the minimum and maximum value of the range and to the increment. If no increment is given, its value is assumed to be one. The variable '''loopvariable''' is defined inside the foreach loop block only, and its value can be read but not assigned. For example:
+
<pre>
foreach i in [0 to listsize(mylist)-1]{
    print(mylist[i]);
}
</pre>
+
This code snippet prints the value of every object in the list "mylist".
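The range semantics of foreach can be sketched in Python as follows. Two points are assumptions rather than statements from the text: the bounds are taken to be inclusive (which the <tt>listsize(mylist)-1</tt> upper bound in the example suggests), and a decrement is modelled as a negative increment counting down from the first value. The function name <tt>foreach_range</tt> is hypothetical.

```python
def foreach_range(first, last, step=1):
    """Inclusive integer range, mirroring `[first to last by step]` (assumed semantics)."""
    if step == 0:
        raise ValueError("increment must be non-zero")
    values = []
    current = first
    # The loop is bounded: every iteration moves current towards last.
    while (step > 0 and current <= last) or (step < 0 and current >= last):
        values.append(current)
        current += step
    return values


if __name__ == "__main__":
    mylist = ["a", "b", "c"]
    for i in foreach_range(0, len(mylist) - 1):
        print(mylist[i])
```

Under these assumptions an empty list yields the range <tt>[0 to -1]</tt>, which contains no values, so the loop body is simply never executed.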
+
 
+
 
+
 
+
=== Example Script ===
+
The following is a complete example of AISL script:
+
<pre>
collection contentcollection = collection::content["contentcollection"]{
    collectionName = "test_content_collection",
    isUser=false
};

collection metadatacollection = collection::metadata["metadatacollection"]{
    collectionName = "test_metadata_collection",
    collectionDescription = "test for the AIS with medspiration data",
    relatedContentCollection=contentcollection,
    isUser=true,
    isIndexable=false,
    metadataName="dc",
    metadataLanguage="en",
    metadataSchemaURI="http://www.opendlib.com/resources/schemas/metadata_dc.xsd"
};

file f=getFile("ftp://ftp.ifremer.fr/ifremer/medspiration/metadata/l2p/eurdac/ats_nr_2p/2006");
list l=descendantfiles(f);
print("Going to create a resource graph from "+listsize(l)+" elements");

foreach k in [0 to listsize(l)-1]{
    if(!isdirectory(l[k])){
        string name=tostring(xpath(dom(l[k]),"//File_Name/text()")[0]);
        string url=tostring(xpath(dom(l[k]),"//URL/text()")[0]);

        resource content = resource::content[url] in contentcollection{
            isVirtualImport=false,
            contentSourceLocator=url,
            documentName= name,
            hasMaterializedContent = false
        };
        print("Created content object "+k+" with content "+url);
        resource metadata = resource::metadata["metadata"+url] in metadatacollection{
            content = tostring(dom(l[k]))
        };

        relationship rel= relationship::metadata(metadata, content)["metadatarel"+url]{
        };
    }
}
</pre>
+
 
+
The script first creates a content collection and a metadata collection. Then it creates a file object representing a remote location (a directory accessible through FTP). Using the function descendantfiles(), it then creates file objects for all the files found under that location. These files are XML files that contain metadata and a reference to content objects. The script then loops over all elements of the list. For each iteration, it creates a content object, a metadata object and a relationship between the two. The content of the metadata object is the content of the file considered at each iteration, serialized as XML. To create the content object, the dom() function is used to parse each file and the xpath() function is used to extract some text from the file using an XPath expression. Notice that an external identifier is given for each model object created. In this simple example the identifiers are derived from the location (url) of the files represented by content resources.
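For readers more familiar with Java than with AISL, the parse-and-extract step performed by <tt>dom()</tt> and <tt>xpath()</tt> can be approximated with the standard JAXP classes. This is only a sketch of the idea, not the AIS implementation: the class name and the sample record are made up, and only the element names <tt>File_Name</tt> and <tt>URL</tt> come from the script above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the AIS: parse XML text and read the first XPath match.
class XPathExtractionSketch {

    /** Returns the value of the first node matched by the expression, or null if none matches. */
    static String firstMatch(String xml, String expression) {
        try {
            // Roughly what dom(file) does: build a DOM tree from the XML content.
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Roughly what xpath(dom, expr) does: evaluate the expression to a node list.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, dom, XPathConstants.NODESET);
            return nodes.getLength() == 0 ? null : nodes.item(0).getNodeValue();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up metadata record used only for illustration.
        String xml = "<Record><File_Name>sample.nc</File_Name>"
                   + "<URL>ftp://example.org/data/sample.nc</URL></Record>";
        System.out.println(firstMatch(xml, "//File_Name/text()"));
        System.out.println(firstMatch(xml, "//URL/text()"));
    }
}
```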
+
 
+
 
+
 
+
== Importers ==
+
The Archive Import Service performs the import of external resources by representing them in a Graph of Resources and passing this graph to a chain of software modules called "importers". Each importer is responsible for treating a specific kind of resource (e.g. metadata), and essentially is the bridge between the Archive Import Service and the services of the Information Organization stack responsible for managing certain kinds of internal resources (collections, metadata, documents etc). The precise way in which the importer performs the import is thus dependent on the specific subsystem the importer will interact with. Similarly, different importers will need to obtain different information about the resources to import. For instance, to import a document it is necessary to have its content or a URL at which the content can be accessed. To create a metadata collection, it is necessary to specify some properties like the location of the schema of the objects contained in the collection. These values are passed to an importer by annotating objects in a graph of resources with appropriate properties. In order to constrain the kind of properties that the model objects it manipulates must have, an importer must define a set of subtypes of the model object types. For instance, the metadata importer (described below) defines a subtype for each basic type of the Resource Model:
+
 
+
* collection::metadata,
+
* resource::metadata and
+
* relationship::metadata
+
 
+
Each of these subtypes has specific properties that are understood, used and manipulated by that importer. The way subtyping is accomplished is described in more detail later in the section "writing new importers". The types defined by an importer and their properties must be publicly available, as AISL script developers must know which properties are available to them and what their semantics are. Furthermore, subsequent importers in the chain may also need to access some properties. For example, an importer for metadata also needs to access model objects representing content to get their object id (internal identifier). Notice that in general an AISL script will not necessarily assign all properties defined by a subtype. Some of these properties may be conditionally needed, while some will only be written by an importer at import time. For example, when a new content object is imported, the importer must record into the GoR object the OID of the newly created object. For this reason, the specification of the subtypes defined by importers must also provide information about which properties are mandatory (i.e. they must be assigned during the creation of a GoR) and which properties are private (i.e. they should NOT be assigned during the creation of the GoR).
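The distinction between mandatory and private properties lends itself to a simple check on the properties a script assigns. The Java sketch below shows one way such a check might look. It is an illustration under assumptions, not the way the AIS validates scripts: the class, the enum and the method names are hypothetical, and only the property names of <tt>collection::content</tt> are taken from the Content Importer section of this page.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the AIS: checks script-assigned properties against a spec.
class PropertySpecSketch {

    enum Kind { MANDATORY, OPTIONAL, PRIVATE }

    /** Returns a description of each problem found; the list is empty when the assignment is valid. */
    static List<String> validate(Map<String, Kind> spec, Map<String, Object> assigned) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Kind> entry : spec.entrySet()) {
            boolean present = assigned.containsKey(entry.getKey());
            if (entry.getValue() == Kind.MANDATORY && !present) {
                problems.add("missing mandatory property: " + entry.getKey());
            }
            if (entry.getValue() == Kind.PRIVATE && present) {
                // Private properties are written by the importer, never by the script.
                problems.add("private property assigned by script: " + entry.getKey());
            }
        }
        for (String name : assigned.keySet()) {
            if (!spec.containsKey(name)) {
                problems.add("unknown property: " + name);
            }
        }
        return problems;
    }

    /** The collection::content properties as listed in the Content Importer section. */
    static Map<String, Kind> contentCollectionSpec() {
        Map<String, Kind> spec = new LinkedHashMap<>();
        spec.put("collectionName", Kind.MANDATORY);
        spec.put("isUser", Kind.MANDATORY);
        spec.put("collectionId", Kind.PRIVATE);
        return spec;
    }

    public static void main(String[] args) {
        Map<String, Object> assigned = new LinkedHashMap<>();
        assigned.put("collectionName", "test_content_collection");
        assigned.put("collectionId", "42"); // not allowed: private property
        for (String problem : validate(contentCollectionSpec(), assigned)) {
            System.out.println(problem);
        }
    }
}
```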
+
 
+
 
+
=== Built-in importers ===
+
The AIS comes already with the capability to import documents and metadata. This is provided by two importers called the content importer and the metadata importer. The types defined by these importers are described below:
+
 
+
 
+
==== Content Importer ====
+
This importer defines two subtypes. In particular, it defines a new collection type and a new resource type:
+
* collection::content
+
* resource::content
+
 
+
The properties of these subtypes, their type and semantics are as follows:
+
 
+
<pre>
[collection::content]
collectionName : string  : mandatory - The name of the collection
isUser         : boolean : mandatory - Indicates whether this is a user collection
collectionId   : string  : private   - The id assigned to the collection by the collection management service
</pre>
+
 
+
<pre>
[resource::content]
isVirtualImport        : boolean : mandatory
documentName           : string  : mandatory
hasMaterializedContent : boolean : mandatory
contentSourceLocator   : string  :
content                : file    :
documentId             : string  : private - The id assigned to the document by the storage management service
</pre>
Note: the fields contentSourceLocator and content are alternatives. Which one is used depends on the value of the field hasMaterializedContent.
+
 
+
==== Metadata Importer ====
+
This importer defines three subtypes, one for each basic construct in the Resource Model. They are:
+
 
+
* collection::metadata
+
* resource::metadata
+
* relationship::metadata
+
 
+
<pre>
[collection::metadata]
relatedContentCollection : collection : mandatory - The content collection containing the objects to which this metadata collection refers
collectionName           : string     : mandatory - The name of the collection
collectionDescription    : string     : mandatory - A description of the collection
isUser                   : boolean    : mandatory - Indicates whether this is a user collection
isIndexable              : boolean    : mandatory - Indicates whether this collection is indexable
metadataName             : string     : mandatory - Name of the metadata schema in this collection
metadataLanguage         : string     : mandatory - Language of the metadata in this collection
metadataSchemaURI        : string     : mandatory - URI of the schema of the metadata in this collection
collectionId             : string     : private   - The id assigned to the metadata collection during the import
</pre>
+
 
+
<pre>
[resource::metadata]
content  : string : mandatory - The content of this metadata object
objectID : string : private   - The id assigned to the metadata object during the import
</pre>

<pre>
[relationship::metadata](resource::metadata, resource::content)
</pre>
This subtype does not define any property. It denotes an edge from a metadata resource object to a content resource object.
+
 
+
== Extension Mechanisms ==
+
The Archive Import Service can be extended in a number of ways, mostly via plugin-based mechanisms. In particular, it is possible to extend the AISL language by adding new functions and by adding support for different protocols managed by the file built-in type, and it is possible to add to the AISL new importers that deal with new kinds of data managed by gCube Information Organization Services.
+
 
+
 
+
=== Extending the AISL ===
+
 
+
==== Creating new Functions ====
+
+
The language can be extended by adding functions. Adding a new function involves two steps:
# Creating a new Java class implementing the AISLFunction interface. This is more easily done by subclassing the AbstractAISLFunction class. See below for further details.
# Registering the function in the "Functions" class. This step will be removed in later releases, which will implement automatic plugin-like registration.
  
 
+
The AISLFunction interface provides a way to specify a number of signatures (number and type of arguments and return type) for an AISL function. It is designed to allow for overloaded functions. The number and types of the parameters are used to perform a number of static checks on the invocation of the function. The method <tt>evaluate</tt> provides the main code to be evaluated during an invocation of the function in an AISL script. In the case of overloaded functions, this method should redirect to appropriate methods based on the number and types of the arguments.
+
 
<pre>
public interface AISLFunction {
...
}
</pre>
A partial implementation of the <tt>AISLFunction</tt> interface is provided by the <tt>AbstractAISLFunction</tt> class. A developer can simply extend this class and then provide an appropriate constructor and implement the appropriate <tt>evaluate</tt> method. An example is given below. The function <tt>match</tt> returns a boolean value according to the match of a string value with a given regular expression pattern. Its signature is thus:
+
  
 
<pre>
...
</pre>
  
The class <tt>Match.class</tt> below is an implementation of this function. In the constructor, the function declares its name and its parameters. The method <tt>evaluate(Object[] args)</tt>, which must be implemented to comply with the interface <tt>AISLFunction</tt>, performs some casting of the parameters and then redirects the evaluation to another evaluate function. (Note that in this case, as the function is not overloaded, there is no actual need for a separate evaluate method; it has been added here for clarity.)
+
  
 
<pre>
...
}
</pre>
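How a single <tt>evaluate(Object[] args)</tt> entry point can redirect to the right overload, based on the number and types of the arguments, can be sketched in plain Java. The sketch deliberately does not use the real <tt>AISLFunction</tt> or <tt>AbstractAISLFunction</tt> types, whose exact methods may differ: <tt>SketchFunction</tt> is a hypothetical stand-in, and the three-argument form of <tt>match</tt> is invented here purely to have a second signature to dispatch to.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of overload dispatch; not the AIS classes.
class OverloadDispatchSketch {

    /** Stand-in for a function plugin interface; the real AISLFunction may look different. */
    interface SketchFunction {
        String name();
        Object evaluate(Object[] args);
    }

    /** match(string, pattern) plus an invented match(string, pattern, ignoreCase) overload. */
    static class Match implements SketchFunction {

        public String name() {
            return "match";
        }

        /** Generic entry point: inspects the arguments, casts them and redirects. */
        public Object evaluate(Object[] args) {
            boolean twoStrings = args.length >= 2
                    && args[0] instanceof String && args[1] instanceof String;
            if (twoStrings && args.length == 2) {
                return evaluate((String) args[0], (String) args[1], false);
            }
            if (twoStrings && args.length == 3 && args[2] instanceof Boolean) {
                return evaluate((String) args[0], (String) args[1], (Boolean) args[2]);
            }
            throw new IllegalArgumentException("no signature of match accepts the given arguments");
        }

        /** Typed implementation shared by both signatures. */
        private Boolean evaluate(String value, String regex, boolean ignoreCase) {
            int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
            return Pattern.compile(regex, flags).matcher(value).matches();
        }
    }

    public static void main(String[] args) {
        SketchFunction match = new Match();
        System.out.println(match.evaluate(new Object[] {"data_2006.xml", ".*\\.xml"}));
        System.out.println(match.evaluate(new Object[] {"DATA.XML", ".*\\.xml", Boolean.TRUE}));
    }
}
```

Keeping the typed logic in a separate method, as the page recommends for clarity, means the generic entry point only has to decide which signature applies.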
==== Extending Support for Access To Remote Resources ====
 
To be added
 
  
 +
=== Writing RemoteFile Adapters ===
 +
When writing AISL scripts, the details of interaction with remote resources available on the network are hidden from the user and encapsulated in the facilities related to a native data type of the language, the file type. The intention is to shield the user (almost) completely from such details, and to present resources available through heterogeneous protocols via a homogeneous access mechanism.
  
=== Adding new Importers ===
+
A network resource is made available as file by invoking the <tt>getFile()</tt> function of the language. The function gets as argument a ''locator'', which is a string (and optionally some parameters needed for authentication), and resolves, based on the form of the locator, which protocol to use and how to access the resource. To allow for extensibility, the format of the locator is not fixed in advance, but depends on the specific remote file type, which must be able to recognize such format (see below). To avoid excessive resource consumption, remote resources are not downloaded straight away. Instead, a file object acts as a placeholder, and content is made available on demand. Other properties of the resource, like for instance its length, last modification date or hash signature, are instead gathered (and possibly cached so to limit network usage). Of course, the availability of this information is related to the capabilities offered by the network protocol at hand. Once downloaded, content is also cached.
To be added
+
  
 +
In order to '''''make a new protocol available to the AIS''''', it is sufficient to implement the <tt>RemoteFile</tt> interface and register the class with the <tt>RemoteFileFactory</tt> class. The class implementing the type must be able to "recognize" the format of locators that it can deal with. More in detail, the class must have a single-argument constructor taking a locator as parameter, and the constructor must throw an exception if the format of the locator is not recognized. The function <tt>getFile()</tt> passes the locator to the <tt>RemoteFileFactory</tt>, which tries to instantiate a remote file of the most specific type by trying all registered remote file types (''late type binding'').
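The "try every registered type until one accepts the locator" idea can be sketched as follows. This is a simplified illustration, not the actual <tt>RemoteFileFactory</tt>: all type names are hypothetical stand-ins, the two example types recognize locators only by their prefix, and types are tried in registration order, which is an assumption about how "most specific" could be approximated.

```java
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of late type binding; not the AIS RemoteFileFactory.
class RemoteFileFactorySketch {

    /** Stand-in for the remote file abstraction. */
    interface SketchRemoteFile {
        String locator();
    }

    /** Accepts only ftp:// locators; the constructor rejects anything else. */
    static class FtpFile implements SketchRemoteFile {
        private final String locator;
        public FtpFile(String locator) {
            if (!locator.startsWith("ftp://")) {
                throw new IllegalArgumentException("not an ftp locator: " + locator);
            }
            this.locator = locator;
        }
        public String locator() { return locator; }
    }

    /** Accepts only http:// locators. */
    static class HttpFile implements SketchRemoteFile {
        private final String locator;
        public HttpFile(String locator) {
            if (!locator.startsWith("http://")) {
                throw new IllegalArgumentException("not an http locator: " + locator);
            }
            this.locator = locator;
        }
        public String locator() { return locator; }
    }

    private final List<Class<? extends SketchRemoteFile>> registered = new ArrayList<>();

    void register(Class<? extends SketchRemoteFile> type) {
        registered.add(type);
    }

    /** Tries each registered type in turn; the first constructor that accepts the locator wins. */
    SketchRemoteFile getFile(String locator) {
        for (Class<? extends SketchRemoteFile> type : registered) {
            try {
                return type.getConstructor(String.class).newInstance(locator);
            } catch (InvocationTargetException rejected) {
                // The constructor threw: this type does not recognize the locator, try the next one.
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot instantiate " + type.getName(), e);
            }
        }
        throw new IllegalArgumentException("no registered type recognizes " + locator);
    }

    public static void main(String[] args) {
        RemoteFileFactorySketch factory = new RemoteFileFactorySketch();
        factory.register(FtpFile.class);
        factory.register(HttpFile.class);
        System.out.println(factory.getFile("http://example.org/a.xml").getClass().getSimpleName());
    }
}
```

Because recognition lives in each constructor, adding a protocol only requires registering one more class; the factory itself does not change.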
  
 +
Some network protocols allow for a hierarchical, directory-style structuring of resources. This means basically that it is possible, from a given resource, to get a list of its children. For hierarchical resources, it is possible to implement the <tt>HierarchicalRemoteFile</tt> interface. If basic caching capabilities are acceptable, it is possible (but not mandatory) to extend instead the <tt>AbstractRemoteFile</tt> and <tt>AbstractHierarchicalRemoteFile</tt> classes. These classes already provide a standard implementation for a number of methods defined by the corresponding interfaces.
  
== Using the Archive Import Service ==
+
=== Writing Importers ===
Note: The content of this section is temporary, and will be completely removed or changed in subsequent releases of the service.
+
Importers are software modules that process the graph of resources and decide about import actions, interfacing with some gCube component for content management, like for instance the [[Collection_Management|Collection Management Service]], the [[Content_Management|Content Management Service]] and the [[Metadata_Management|Metadata Management Service]]. Each importer is responsible for treating a specific kind of resource (e.g. metadata), and essentially is the bridge between the AIS and the services of the Information Organization stack responsible for managing that kind of resource. The precise way in which the importer performs the import is thus dependent on the specific subsystem the importer will interact with. Similarly, different importers will need to obtain different information about the resources to import. For instance, to import a document it is necessary to have its content or a URL at which the content can be accessed. To create a metadata collection, it is necessary to specify some properties like the location of the schema of the objects contained in the collection.
 +
The Archive Import Service already includes importers dedicated to the creation of content and metadata collections and to the creation of complex documents and metadata objects. Thus, the creation of a new importer is an activity which is only needed if a new kind of content model is defined over the InfoObjectModel (see [[Storage_Management|Storage Management]]) and facilities for its manipulation are offered by some new gCube component.
  
The AISL is currently released as a standalone client. The software is available to the developers from the svn.
+
==== Defining Importer-Specific Types ====
The class org.gcube.contentmanagement.contentlayer.archiveimportservice.impl.AISLClient contains a client that performs the steps needed for the import: parsing and execution of the script, generation of the graph of resources, import of the graph of resources. It accepts one or two arguments. The first one is the location (on the local file system) of a file containing an AISL script. The second argument is a boolean value. If it is set to true, the client will perform the creation of the graph of resources but will not start the importing. This is to ease debugging.
+
Writing a new importer requires knowing how to interact with such a component, and how to manipulate a Graph of Resources. The data model handled by AISL features three main types of constructs:
  
After the graph of resources is created, the client generates a dump of the graph in a file named resourcegraph.dump. The graph is serialized in an XML-like format. This is only for visualization and debugging purposes, and this format is not currently guaranteed to be valid (or even well formed) xml.
+
* Resource
 +
* Relationship
 +
* Collection
 +
 
 +
A graph of resources is a graph composed of nodes (resources) and edges (relationships). Furthermore, nodes (resources) can be organized into sets (collections), which can in turn be connected using relationships. All constructs of the model can be annotated with properties, which are name/value pairs. The constructs above correspond internally to the three classes <tt>Resource</tt>, <tt>ResourceCollection</tt> and <tt>ResourceRelationship</tt>. In order to constrain the kind of properties that the model objects it manipulates must have, an importer must define a set of subtypes of the model object types. This can be done by subclassing the above mentioned classes. Which subtypes to implement, and the precise semantics of their properties, depend on the specific importer. For instance, the MetadataCollection importer declares only one new type, the collection::metadata type, which specializes the type collection to allow for specific metadata collection-related properties. Notice that importers can also manipulate objects belonging to subclasses defined by other importers. For instance, the MetadataCollection importer needs to access properties of the ContentCollection subtype, defined by the ContentCollection importer, in order to be able to create metadata collections.
  
 +
In order to define their own subtypes, importers must:
 +
#Subclass the basic types as needed;
 +
#Register the classes in the <tt>GraphOfResources</tt> class. This automatically extends the language with the new types;
 +
#Publish the properties allowed for the new subtypes;
  
== Appendix - AISL Grammar formal specification ==
+
Regarding the last point, notice that the types defined by an importer and their properties must be publicly available, as AISL script developers must know which properties are available to them and what their semantics are. Furthermore, subsequent importers in the chain may also need to access some properties. For example, an importer for metadata also needs to access model objects representing content to get their object id (internal identifier). Notice that in general an AISL script will not necessarily assign all properties defined by a subtype. Some of these properties may be conditionally needed, while some will only be written by an importer at import time. For example, when a new content object is imported, the importer must record into the GoR object the OID of the newly created object. For this reason, the specification of the subtypes defined by importers must also provide information about which properties are mandatory (i.e. they must be assigned during the creation of a GoR) and which properties are private (i.e. they should NOT be assigned during the creation of the GoR).
+
The following EBNF rules define the syntax of the AISL scripting language
+
  
# Program ::= ( Instruction )*
+
==== Defining Importer Logics ====
# Instruction ::= ( VariableDeclaration ";" | Statement )
+
The actual logic of the import for a new importer is contained in a class that must simply implement the <tt>Importer</tt> interface, which is as follows:
# Statement ::= StatementExpression ";"  |  SwitchStatement  |  IfStatement  |  ForeachStatement
+
<pre>
# SwitchStatement ::= "switch" "(" Expression ")" "{" ( SwitchBlock )* "}"
+
public interface Importer{
# SwitchBlock ::= ( "case" Expression ":" ( Instruction )* "break;" | "default" ":" ( Instruction )* "break;" )
+
public String getName();
# IfStatement ::= "if" "(" Expression ")" IfBlock ( "else" ElseBlock )?
+
public void importRepresentationGraph(GraphOfResources graph) throws RemoteException, ExecutionInterruptedException;
# ElseBlock ::= "{" ( Instruction )* "}"
+
}
# IfBlock ::= "{" ( Instruction )* "}"
+
</pre>
# ForeachStatement ::= "foreach" <IDENTIFIER> "in" ( Expression | ForRange ) ForBlock
+
 
# ForRange ::= "[" Expression "to" Expression ( "," Expression )? "]"
+
The first method must provide a human-readable name for the importer (for logging and status notification purposes). The second method will be passed, during operation, a GraphOfResources object, and must contain the logic needed for manipulating the objects in the graph, selecting the ones of interest and performing the actual import tasks.
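The way a chain of importers might walk a graph, pick the objects of interest and annotate them can be sketched as follows. The sketch is intentionally detached from the real classes: <tt>Node</tt> and <tt>SketchImporter</tt> are hypothetical, heavily simplified stand-ins for the graph objects and for the <tt>Importer</tt> interface (no remote calls, no exceptions), and the identifiers written to the nodes are made up rather than returned by a gCube service.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of an importer chain; not the AIS classes.
class ImporterChainSketch {

    /** Simplified stand-in for a graph object: a subtype name plus its properties. */
    static class Node {
        final String type;
        final Map<String, Object> properties = new LinkedHashMap<>();
        Node(String type) { this.type = type; }
    }

    /** Simplified stand-in for the Importer interface. */
    interface SketchImporter {
        String getName();
        void importGraph(List<Node> graph);
    }

    /** Handles nodes of one subtype and records a made-up identifier in a private property. */
    static class TypeImporter implements SketchImporter {
        private final String name;
        private final String type;
        private final String idProperty;
        private int counter;

        TypeImporter(String name, String type, String idProperty) {
            this.name = name;
            this.type = type;
            this.idProperty = idProperty;
        }

        public String getName() { return name; }

        public void importGraph(List<Node> graph) {
            for (Node node : graph) {
                // Select only the objects of interest; skip those that already carry an id.
                if (node.type.equals(type) && !node.properties.containsKey(idProperty)) {
                    node.properties.put(idProperty, type + "-" + (++counter));
                }
            }
        }
    }

    /** Passes the same graph through every importer, in order. */
    static void run(List<SketchImporter> chain, List<Node> graph) {
        for (SketchImporter importer : chain) {
            importer.importGraph(graph);
        }
    }

    public static void main(String[] args) {
        List<Node> graph = new ArrayList<>();
        graph.add(new Node("resource::content"));
        graph.add(new Node("resource::metadata"));

        List<SketchImporter> chain = new ArrayList<>();
        chain.add(new TypeImporter("content importer", "resource::content", "documentId"));
        chain.add(new TypeImporter("metadata importer", "resource::metadata", "objectID"));
        run(chain, graph);

        for (Node node : graph) {
            System.out.println(node.type + " " + node.properties);
        }
    }
}
```

Running the content importer before the metadata importer mirrors the dependency described above: later importers can read the identifiers that earlier ones wrote into the graph.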
# ForBlock ::= "{" ( Instruction )* "}"
+
 
# VariableDeclaration ::= Type VariableDeclarator ( "," VariableDeclarator )*
+
== Current Limitations and Known Issues ==
# VariableDeclarator ::= <IDENTIFIER> ( "=" Expression )?
+
 
# Type ::= BuiltinType
+
The AIS is currently released as a '''''standalone client'''''. The class <tt>org.gcube.contentmanagement.contentlayer.archiveimportservice.impl.AISLClient</tt> contains a client that performs the steps needed for the import: parsing and execution of the script, generation of the graph of resources, import of the graph of resources. It accepts one or two arguments. The first one is the location (on the local file system) of a file containing an AISL-based script. The second argument is a boolean value. If it is set to true, the client will perform the creation of the graph of resources but will not start the importing. This is to ease debugging.
# BuiltinType ::= ( "boolean" | "int" | "float" | "list" ( "<" Type ">" )? | "file" | "string" | "collection" | "resource" | "relationship" )
+
 
# StatementExpression ::= PrimaryExpression ( "=" Expression )?
+
After the graph of resources is created, the client generates a dump of the graph in a file named resourcegraph.dump. The graph is serialized in an XML-like format. This is only for visualization and debugging purposes, and this format is not currently guaranteed to be valid (or even well formed) xml.
# PrimaryExpression ::= ( Literal | Function | Variable | Constructor ) ( Selection )*
+
# Literal ::= ( <INTEGER_LITERAL> | <FLOATING_POINT_LITERAL> | <STRING_LITERAL> | BooleanLiteral )
+
# BooleanLiteral ::= ( "true" | "false" )
+
# Variable ::= Name
+
# Name ::= <IDENTIFIER>
+
# Function ::= Name Arguments
+
# Arguments ::= "(" ( Expression )? ( "," Expression )* ")"
+
# Constructor ::= ModelObjectConstructor  |  ListConstructor
+
# ModelObjectConstructor ::= CollectionConstructor |  ResourceConstructor    | RelationshipConstructor
+
# CollectionConstructor ::= "collection" "::" <IDENTIFIER> "[" Expression "]" "{" ( PropertyAssignment ( "," PropertyAssignment )* )? "}"
+
# ResourceConstructor ::= "resource" "::" <IDENTIFIER> "[" Expression "]" "in" CollectionsList "{" ( PropertyAssignment ( "," PropertyAssignment )* )? "}"
+
# CollectionsList ::= Expression ( "," Expression )*
+
# RelationshipConstructor ::= "relationship" "::" <IDENTIFIER> "(" Expression "," Expression ")" "[" Expression "]" "{" ( PropertyAssignment ( "," PropertyAssignment )* )? "}"
+
# PropertyAssignment ::= <IDENTIFIER> "=" Expression
+
# ListConstructor ::= "{" ( Expression )? ( "," Expression )* "}"
+
# Selection ::= PropertySelection  |  ElementSelection
+
# PropertySelection ::= "." <IDENTIFIER>
+
# ElementSelection ::= "[" Expression "]"
+
# Expression ::= OrExpression
+
# OrExpression ::= ExclusiveOrExpression ( "|" ExclusiveOrExpression )*
+
# ExclusiveOrExpression ::= AndExpression ( "^" AndExpression )*
+
# AndExpression ::= EqualityExpression ( "&" EqualityExpression )*
+
# EqualityExpression ::= RelationalExpression ( ( "==" | "!=" ) RelationalExpression )*
+
# RelationalExpression ::= AdditiveExpression ( ( "<" | ">" | "<=" | ">=" ) AdditiveExpression )*
+
# AdditiveExpression ::= MultiplicativeExpression ( ( "+" | "-" ) MultiplicativeExpression )*
+
# MultiplicativeExpression ::= UnaryExpression ( ( "*" | "/" | "%" ) UnaryExpression )*
+
# UnaryExpression ::= ( ( "+" | "-" | "!" ) PrimaryExpression | PrimaryExpression )
+

Latest revision as of 11:41, 18 June 2009

Aim and Scope of Component

The Archive Import Service (AIS) is dedicated to "batch" import of resources which are external to the gCube infrastructure into the infrastructure itself. The term "importing" refers to the description of such resources and their logical relationships (e.g. the association between an image and a file containing metadata referring to it) inside the Information Organization stack of services. While the AIS is not strictly necessary for the creation and management of collections of resources in gCube, it makes the creation of large collections, and their automated maintenance, possible in practice.

Logical Architecture

The task of importing a set of external resources is performed by building a description of the resources to import, in the form of a graph labelled on nodes and arcs with key-value pairs, called a graph of resources (GoR), and then processing this description accomplishing all steps needed to import the resources represented therein. The GoR is based on a custom data model, similar to the Information Object Model, and is built following a procedural description of how to build it expressed in a scripting language, called the Archive Import Service Language (AISL). Full details about this language and how to write an import script are given in the Administrator's guide. The actual import task is performed by a chain of pluggable software modules, called "importers", that in turn process the graph and can manipulate it, annotating it with additional knowledge produced during the import. Each importer is dedicated to import resources interacting with a specific subpart of the Information Organization set of Services. For instance, the metadata importer is responsible for the import of metadata, and it handles their import by interacting with the Metadata Management Service. The precise way in which importers handle the import task, and in particular how they define a specific description of the resources they need to consume inside a GoR is left to the single importers. This, and the pluggable nature of importers, makes possible to enable the import of new kind of content resources which might be defined in the infrastructure in the future.

The logical architecture of the AIS can be then depicted as follows:

AISDiagram.jpeg

Import state and incremental import

During the import of some resources, the corresponding GoR is kept updated with information regarding the actual resources created, such as their OIDs. The Graph of Resources is stored persistently by the service, so that a subsequent execution of the same import script is aware of the status of the import and can perform only the differential operations needed to maintain the status of the resources up-to-date. While this solution involves a partial duplication of information inside the infrastructure, it has been chosen because it introduces a complete decoupling between the AIS and other gCube services, which are thus not forced to offer additional information needed for incremental import in their interfaces.

Service Implementation

Note: this section refers to the architecture of the AIS as a stateful gCube service. The status of the component is however currently that of a standalone tool. Please see the section "Current Limitations and Known Issues" for further details.

The AIS is a stateful gCube service, following the factory approach. A stateless factory porttype, the ArchiveImportService porttype, allows to create stateful instances of the ArchiveImportTask porttype. Each import task is responsible for the execution of a single import script. It performs the import, maintains internally the status of the import (under the form of an annotated graph of resources), and provides notification about the status of the task. The resources it is responsible for are kept up to date by re-executing the import script at suitable time intervals. Beside acting as a factory for import tasks, the ArchiveImportService porttype also offers additional functionality related to the management of import scripts. Import scripts are generic resources inside the infrastructure. The porttype allows to publish new scripts, list and edit existing scripts, and validate them from a syntactic point of view. The semantics of the methods offered by the two porttypes are described in the following.

Archive Import Service Porttype

  • String[] list(): returns a list of import scripts identifiers for scripts currently available in the VRE (as generic resources);
  • Script load(String scriptId): returns the import script corresponding to the given identifier;
  • void save(Script script): saves an import script, by creating or updating a corresponding generic resource;
  • void delete(String scriptId): deletes the import script corresponding to the given identifier;
  • ValidationReport validate(String scriptSpecification): this method performs validation of the given input string, which is treated as an AISL script and undergoes parsing and other syntactic validation steps;
  • EndpointReferenceType getTask(String importScriptIdentifier): this method gets an instance of the ArchiveImportTask service dedicated to a given script.

The Complex Types Script and ValidationReport used in the methods above represent respectively an AISL Script (characterized by a scriptId, a description and a content, i.e. the script itself) and a syntactic validation report (containing a boolean validity flag, a message and other details like error row and column).

Archive Import Task Porttype

  • void start(ImportOptions options): this method starts the import task with the given options. Options include a run mode (validate, build, simulation import, import);
  • void stop(): this method stops the import task;
  • TaskStatus getStatus(): this method returns detailed information about the progress of an import task, including information on errors that occurred during its execution. The Complex Type TaskStatus reports the execution state (new, running, stopped, failed), the execution phase (parsing, building, importing), where applicable the import phase (document, metadata...), and the number and types of graph objects currently created, imported and failed. For failed objects, detailed error information is also available.
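The information carried by TaskStatus can be pictured with a small data type. The following is only an illustrative sketch: the class SketchTaskStatus, its field names, the two enums and the helper methods are hypothetical and do not reproduce the actual complex type of the service. Treating "stopped" and "failed" as the states in which a task is no longer running is also an assumption.

```java
// Hypothetical enums mirroring the states and phases listed above.
enum ExecutionState { NEW, RUNNING, STOPPED, FAILED }

enum ExecutionPhase { PARSING, BUILDING, IMPORTING }

// Hypothetical, simplified view of the data reported by getStatus().
class SketchTaskStatus {
    final ExecutionState state;
    final ExecutionPhase phase;
    final int created;   // graph objects created so far
    final int imported;  // graph objects imported successfully
    final int failed;    // graph objects whose import failed

    SketchTaskStatus(ExecutionState state, ExecutionPhase phase,
                     int created, int imported, int failed) {
        this.state = state;
        this.phase = phase;
        this.created = created;
        this.imported = imported;
        this.failed = failed;
    }

    // Assumption: a task in state STOPPED or FAILED is no longer running.
    boolean isFinished() {
        return state == ExecutionState.STOPPED || state == ExecutionState.FAILED;
    }

    // Objects created in the graph but not yet imported or failed.
    int pending() {
        return created - imported - failed;
    }
}
```

A client polling the task could use isFinished() to decide when to stop polling and pending() to display progress.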

Import Task Status Notification

The ArchiveImportTask service maintains part of its state as a WS-resource. The properties of this resource are used to notify interested services about:

  • the current phase of the import (e.g. graph building, importing etc.);
  • the current number of objects created in the GoR;
  • the current number of objects imported successfully;
  • the current number of objects whose import failed.

Implementation Details

The following UML diagram describes concisely how the AIS is organized from an implementation point of view. The service itself depends on a library whose packages are dedicated each to a specific functionality.

AISUML.jpg

In particular:

  • ais contains the classes related to the main functionality of the service, like defining the logical flow of a task execution;
  • language contains the classes related to parsing and executing AISL scripts;
  • datamodel contains the classes used in the description of graphs of resources;
  • importing contains the classes used for the actual import, e.g. the so-called importers;
  • remotefile contains all classes used to access remote resources;
  • util is a package collecting several different utility classes, like those used to manage the persistence of the GoR, caching, manipulating XML and HTML etc.

Each of these packages is further divided into multiple subpackages dedicated to specific functionality. These are not shown here to avoid visual cluttering.

Extensibility Features

The functionality of the Archive Import Service can be extended in three main ways. It is possible to define new functions for the AISL, to plug in software modules that access external resources through additional network protocols, and to plug in software modules that interact with new gCube content-related components.

Defining Functions for the AISL

AISL is a scripting language intended to create graphs of resources for subsequent import. Its type system and grammar, together with usage examples, are fully described in the Administrator's guide. Its main features are a tight integration with the AIS, in the sense that the creation of model objects is a first-class construct of the language, and the ability to handle, as transparently as possible for the user, tasks that are frequent during import, like accessing files at remote locations. Besides avoiding the complexity of full-fledged programming languages like Java, its limited expressivity (tailored to import tasks only) prevents security issues related to the execution of code in remote systems. The language does not allow the definition of new functions or types of objects inside a program, but it can be easily extended by defining new functions as plugin modules. Adding a new function amounts to two steps:

  1. Creating a new Java class implementing the AISLFunction interface. This is more easily done by subclassing the AbstractAISLFunction class. See below for further details.
  2. Registering the function in the "Functions" class. This step will be removed in later releases, which will implement automatic plugin-like registration.

The AISLFunction interface provides a way to specify a number of signatures (number and type of arguments and return type) for an AISL function. It is designed to allow for overloaded functions. The number and types of the parameters are used to perform static checks on the invocation of the function. The method evaluate contains the code to be evaluated when the function is invoked in an AISL script. In the case of overloaded functions, this method should redirect to appropriate methods based on the number and types of the arguments.

public interface AISLFunction {
	public String getName();
	
	public void setFunctionDefinitions(FunctionDefinition ... defs);
	public FunctionDefinition[] getFunctionDefinitions();

	public  Object evaluate(Object[] args) throws Exception;
	
	public interface FunctionDefinition{
		Class<?>[] getArgTypes();
		Class<?> getReturnType();
	}
	
}

A partial implementation of the AISLFunction interface is provided by the AbstractAISLFunction class. A developer can simply extend this class and then provide an appropriate constructor and implement the appropriate evaluate method. An example is given below. The function match returns a boolean value according to the match of a string value with a given regular expression pattern. Its signature is thus:

boolean match(string str, string pattern)

The class Match below is an implementation of this function. In the constructor, the function declares its name and its parameters. The method evaluate(Object[] args), which must be implemented to comply with the AISLFunction interface, casts the parameters and then redirects the evaluation to another evaluate method. (In this case the function is not overloaded, so there is no actual need for a separate evaluate method; it has been added here for clarity.)

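public class Match extends AbstractAISLFunction{

	public Match(){
		// Name under which the function is invoked in AISL scripts
		setName("match");
		// One signature: boolean match(string, string).
		// The Boolean entry is presumably the return type, followed by the argument types.
		setFunctionDefinitions(
			new FunctionDefinitionImpl(Boolean.class, String.class, String.class)
		);
	}

	// Entry point required by AISLFunction: casts the arguments and delegates
	public Object evaluate(Object[] args) throws Exception{
		return evaluate((String)args[0], (String)args[1]);
	}

	// Typed implementation: true if the whole string matches the regular expression
	private Boolean evaluate(String str, String pattern){
		return str.matches(pattern);
	}

}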
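For an overloaded function, evaluate(Object[] args) has to choose the typed method based on the number and types of the arguments. The following self-contained sketch illustrates one way of doing this. The class SketchFunction is a hypothetical, reduced stand-in for AbstractAISLFunction (it is not the class shipped with the service), and the three-argument variant of match with a case-insensitivity flag is invented purely for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for AbstractAISLFunction, reduced to what this sketch needs.
abstract class SketchFunction {
    private String name;
    private Class<?>[][] signatures;

    protected void setName(String name) { this.name = name; }
    public String getName() { return name; }

    // Each signature lists argument types only; return types are omitted for brevity.
    protected void setSignatures(Class<?>[]... signatures) { this.signatures = signatures; }

    // Checks whether some declared signature matches the given arguments.
    public boolean accepts(Object[] args) {
        for (Class<?>[] sig : signatures) {
            if (sig.length != args.length) continue;
            boolean ok = true;
            for (int i = 0; i < sig.length; i++) {
                if (!sig[i].isInstance(args[i])) { ok = false; break; }
            }
            if (ok) return true;
        }
        return false;
    }

    public abstract Object evaluate(Object[] args);
}

class OverloadedMatch extends SketchFunction {
    OverloadedMatch() {
        setName("match");
        setSignatures(
            new Class<?>[] { String.class, String.class },
            new Class<?>[] { String.class, String.class, Boolean.class });
    }

    // Dispatches to the typed method selected by the number of arguments.
    @Override
    public Object evaluate(Object[] args) {
        if (!accepts(args)) {
            throw new IllegalArgumentException("no matching signature for " + getName());
        }
        if (args.length == 2) {
            return evaluate((String) args[0], (String) args[1]);
        }
        return evaluate((String) args[0], (String) args[1], (Boolean) args[2]);
    }

    private Boolean evaluate(String str, String pattern) {
        return str.matches(pattern);
    }

    // Hypothetical overload: optionally ignores case while matching.
    private Boolean evaluate(String str, String pattern, Boolean ignoreCase) {
        int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
        return Pattern.compile(pattern, flags).matcher(str).matches();
    }
}
```

In the real service the static checks are performed from the declared FunctionDefinition objects; the accepts method above only imitates that idea at run time.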

Writing RemoteFile Adapters

When writing AISL scripts, the details of the interaction with remote resources available on the network are hidden from the user and encapsulated in the facilities of a native data type of the language, the file type. The intention is to shield the user (almost) completely from such details, and to present resources available through heterogeneous protocols via a homogeneous access mechanism.

A network resource is made available as a file by invoking the getFile() function of the language. The function takes as argument a locator, which is a string (and optionally some parameters needed for authentication), and determines, based on the form of the locator, which protocol to use and how to access the resource. To allow for extensibility, the format of the locator is not fixed in advance, but depends on the specific remote file type, which must be able to recognize such a format (see below). To avoid excessive resource consumption, remote resources are not downloaded straight away. Instead, a file object acts as a placeholder, and content is made available on demand. Other properties of the resource, such as its length, last modification date or hash signature, are instead gathered immediately (and possibly cached so as to limit network usage). Of course, the availability of this information depends on the capabilities offered by the network protocol at hand. Once downloaded, content is also cached.

In order to make a new protocol available to the AIS, it is sufficient to implement the RemoteFile interface and register the class with the RemoteFileFactory class. The class implementing the type must be able to "recognize" the format of the locators that it can deal with. In more detail, the class must have a single-argument constructor taking a locator as parameter, and the constructor must throw an exception if the format of the locator is not recognized. The function getFile() passes the locator to the RemoteFileFactory, which tries to instantiate a remote file of the most specific type by trying all registered remote file types (late type binding).
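The late binding performed by the factory can be sketched as follows. All names in this example (SketchRemoteFile, HttpSketchFile, FtpSketchFile, SketchRemoteFileFactory) are hypothetical stand-ins and not the actual RemoteFile and RemoteFileFactory classes. The sketch also assumes that types are tried in registration order and that an unrecognized locator is signalled with an IllegalArgumentException; the real factory may select the most specific type differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical, reduced stand-in for the RemoteFile interface.
interface SketchRemoteFile {
    String getLocator();
}

// Recognizes only HTTP(S) locators; the constructor rejects anything else.
class HttpSketchFile implements SketchRemoteFile {
    private final String locator;

    HttpSketchFile(String locator) {
        if (!locator.startsWith("http://") && !locator.startsWith("https://")) {
            throw new IllegalArgumentException("not an HTTP locator: " + locator);
        }
        this.locator = locator;
    }

    public String getLocator() { return locator; }
}

// Recognizes only FTP locators.
class FtpSketchFile implements SketchRemoteFile {
    private final String locator;

    FtpSketchFile(String locator) {
        if (!locator.startsWith("ftp://")) {
            throw new IllegalArgumentException("not an FTP locator: " + locator);
        }
        this.locator = locator;
    }

    public String getLocator() { return locator; }
}

// Tries each registered single-argument constructor until one accepts the locator.
class SketchRemoteFileFactory {
    private final List<Function<String, SketchRemoteFile>> types = new ArrayList<>();

    void register(Function<String, SketchRemoteFile> constructor) {
        types.add(constructor);
    }

    SketchRemoteFile create(String locator) {
        for (Function<String, SketchRemoteFile> constructor : types) {
            try {
                return constructor.apply(locator);
            } catch (IllegalArgumentException e) {
                // Locator not recognized by this type: try the next one.
            }
        }
        throw new IllegalArgumentException("no registered type recognizes: " + locator);
    }
}
```

With this sketch, registering HttpSketchFile::new and FtpSketchFile::new and then calling create("ftp://example.org/a.xml") yields an FtpSketchFile, because the HTTP constructor rejects the locator first.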

Some network protocols allow for a hierarchical, directory-style structuring of resources. This means that it is possible, from a given resource, to get a list of its children. For hierarchical resources, it is possible to implement the HierarchicalRemoteFile interface. If basic caching capabilities are acceptable, it is possible (but not mandatory) to extend the AbstractRemoteFile and AbstractHierarchicalRemoteFile classes instead. These classes already provide a standard implementation for a number of methods defined by the corresponding interfaces.

Writing Importers

Importers are software modules that process the graph of resources and decide about import actions, interfacing with some gCube component for content management, such as the Collection Management Service, the Content Management Service and the Metadata Management Service. Each importer is responsible for treating a specific kind of resource (e.g. metadata), and is essentially the bridge between the AIS and the services of the Information Organization stack responsible for managing that kind of resource. The precise way in which the importer performs the import thus depends on the specific subsystem the importer interacts with. Similarly, different importers need to obtain different information about the resources to import. For instance, to import a document it is necessary to have its content or a URL at which the content can be accessed. To create a metadata collection, it is necessary to specify some properties, like the location of the schema of the objects contained in the collection. The Archive Import Service already includes importers dedicated to the creation of content and metadata collections and to the creation of complex documents and metadata objects. Thus, creating a new importer is only needed if a new kind of content model is defined over the InfoObjectModel (see Storage Management) and facilities for its manipulation are offered by some new gCube component.

Defining Importer-Specific Types

Writing a new importer requires knowing how to interact with such a component, and how to manipulate a Graph of Resources. The data model handled by AISL features three main types of constructs:

  • Resource
  • Relationship
  • Collection

A graph of resources is a graph composed of nodes (resources) and edges (relationships). Furthermore, nodes (resources) can be organized into sets (collections), which can in turn be connected using relationships. All constructs of the model can be annotated with properties, which are name/value pairs. The constructs above correspond internally to the three classes Resource, ResourceRelationship and ResourceCollection. In order to constrain the kind of properties that the model objects it manipulates must have, an importer must define a set of subtypes of the model object types. This can be done by subclassing the above-mentioned classes. Which subtypes to implement, and the precise semantics of their properties, depend on the specific importer. For instance, the MetadataCollection importer declares only one new type, the collection::metadata type, which specializes the type collection to allow for specific metadata collection-related properties. Notice that importers can also manipulate objects belonging to subclasses defined by other importers. For instance, the MetadataCollection importer needs to access properties of the ContentCollection subtype, defined by the ContentCollection importer, in order to be able to create metadata collections.

In order to define their own subtypes, importers must:

  1. Subclass the basic types as needed;
  2. Register the classes in the GraphOfResources class. This automatically extends the language with the new types;
  3. Publish the properties allowed for the new subtypes.

Regarding the last point, notice that the types defined by an importer and their properties must be publicly documented, as AISL script developers must know which properties are available to them and what their semantics are. Furthermore, subsequent importers in the chain may also need to access some properties. For example, an importer for metadata also needs to access model objects representing content to get their object id (internal identifier). Notice that in general an AISL script will not necessarily assign all properties defined by a subtype. Some of these properties may be conditionally needed, while some will only be written by an importer at import time. For example, when a new content object is imported, the importer must record the OID of the newly created object in the GoR object. For this reason, the specification of the subtypes defined by importers must also state which properties are mandatory (i.e. they must be assigned during the creation of a GoR) and which properties are private (i.e. they should NOT be assigned during the creation of the GoR).
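The distinction between mandatory and private properties can be made concrete with a small check run on the properties that a script has assigned to an object. The sketch below is purely illustrative: the class SubtypeSpec and the property names schemaLocation and oid are hypothetical, and the service itself may publish and enforce these constraints in a different way.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical description of the properties published for an importer-defined subtype.
class SubtypeSpec {
    final String typeName;
    final Set<String> mandatory;     // must be assigned while the GoR is built
    final Set<String> privateProps;  // written only by the importer at import time

    SubtypeSpec(String typeName, Collection<String> mandatory, Collection<String> privateProps) {
        this.typeName = typeName;
        // Sorted sets keep the order of reported problems deterministic.
        this.mandatory = new TreeSet<>(mandatory);
        this.privateProps = new TreeSet<>(privateProps);
    }

    // Returns a list of problems found in the properties assigned by a script.
    List<String> validate(Map<String, Object> assigned) {
        List<String> problems = new ArrayList<>();
        for (String name : mandatory) {
            if (!assigned.containsKey(name)) {
                problems.add("missing mandatory property: " + name);
            }
        }
        for (String name : privateProps) {
            if (assigned.containsKey(name)) {
                problems.add("private property assigned by script: " + name);
            }
        }
        return problems;
    }
}
```

For example, a specification for collection::metadata with schemaLocation as mandatory and oid as private would report two problems for an object on which the script assigned only oid.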

Defining Importer Logics

The actual logic of the import for a new importer is contained in a class that must simply implement the Importer interface, which is as follows:

public interface Importer{
	public String getName();
	public void importRepresentationGraph(GraphOfResources graph) throws RemoteException, ExecutionInterruptedException;
}

The first method must provide a human-readable name for the importer (for logging and status notification purposes). The second method is passed, during operation, a GraphOfResources object, and must contain the logic needed for manipulating the objects in the graph, selecting the ones of interest and performing the actual import tasks.
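The typical shape of an importer (select the objects of interest, import them, record the resulting identifier in the graph) can be sketched as follows. The classes SketchResource and SketchGraph and the interface SketchImporter are hypothetical simplifications of Resource, GraphOfResources and Importer. The type name "document", the property name "oid" and the generated identifiers are assumptions made for illustration; a real importer would call the relevant gCube content management component instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model object: a typed node with name/value properties.
class SketchResource {
    final String type;
    final Map<String, Object> properties = new HashMap<>();

    SketchResource(String type) { this.type = type; }
}

// Hypothetical, simplified graph: only the list of nodes is modelled here.
class SketchGraph {
    final List<SketchResource> resources = new ArrayList<>();
}

// Reduced version of the Importer interface, without the checked exceptions.
interface SketchImporter {
    String getName();
    void importRepresentationGraph(SketchGraph graph);
}

class DocumentSketchImporter implements SketchImporter {
    private int counter = 0;

    public String getName() { return "document sketch importer"; }

    public void importRepresentationGraph(SketchGraph graph) {
        for (SketchResource resource : graph.resources) {
            // Select only the objects this importer is responsible for.
            if (!"document".equals(resource.type)) continue;
            // Skip objects that already carry an identifier from a previous run.
            if (resource.properties.containsKey("oid")) continue;
            // Placeholder for the call to the content management component:
            // here an identifier is simply generated locally.
            String oid = "oid-" + (++counter);
            // Record the identifier so that later importers can refer to the object.
            resource.properties.put("oid", oid);
        }
    }
}
```

Skipping objects that already have an identifier is one simple way to make re-execution of the same script idempotent; whether the actual importers work this way is not stated here.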

Current Limitations and Known Issues

The AIS is currently released as a standalone client. The class org.gcube.contentmanagement.contentlayer.archiveimportservice.impl.AISLClient contains a client that performs the steps needed for the import: parsing and execution of the script, generation of the graph of resources, and import of the graph of resources. It accepts one or two arguments. The first one is the location (on the local file system) of a file containing an AISL script. The second argument is a boolean value. If it is set to true, the client creates the graph of resources but does not start the import. This is to ease debugging.

After the graph of resources is created, the client generates a dump of the graph in a file named resourcegraph.dump. The graph is serialized in an XML-like format. This is only for visualization and debugging purposes, and this format is not currently guaranteed to be valid (or even well-formed) XML.