Difference between revisions of "GCube Document Library (2.0)"

From Gcube Wiki
(Advanced Projections)
(Utilities & F.A.Q.)
 
(67 intermediate revisions by 2 users not shown)
Line 1: Line 1:
The '''gCube Document Library (gDL)''' is a client library for storing, updating, deleting and retrieving document description in a gCube infrastructure.  
+
The '''gCube Document Library (gDL)''' is a client library for adding, updating, deleting and retrieving document descriptions to, in, and from remote collections in a gCube infrastructure.  
  
 
The gDL is a high-level component of the subsystem of [[GCube_Information_Organisation_Services_(NEW)|gCube Information Services]] and it interacts with lower-level components of the subsystem to support document management processes within the infrastructure:
 
The gDL is a high-level component of the subsystem of [[GCube_Information_Organisation_Services_(NEW)|gCube Information Services]] and it interacts with lower-level components of the subsystem to support document management processes within the infrastructure:
  
 
* the [[GCube_Document_Model#Overview|gCube Document Model]] (gDM) defines the basic notion of document and the [[GCube_Document_Model#Implementation|gCube Model Library]] (gML) implements that notion into objects;
 
* the [[GCube_Document_Model#Overview|gCube Document Model]] (gDM) defines the basic notion of document and the [[GCube_Document_Model#Implementation|gCube Model Library]] (gML) implements that notion into objects;
* the objects of the gML can be exchanged in the infrastructure as edge-labelled trees, and the [[Content_Manager_Library|Content Manager Library]] (CML) can model such trees as objects and dispatch them to the read and write operations of the [[Content_Manager_(NEW)|Content Manager]] (CM) service;
+
* the objects of the gML can be exchanged in the infrastructure as edge-labelled trees, and the [[Content_Manager_Library|Content Manager Library]] (CML) can dispatch them to the read and write operations of the [[Content_Manager_(NEW)|Content Manager]] (CM) service;
* the CM implements these operations by translating trees to and from the content models of diverse repository back-ends.  
+
* the CM implements its operations by translating trees to and from the content models of diverse repository back-ends.  
  
 
The gDL builds on the gML and the CML to implement a local interface of <code>CRUD</code> operations that lift those of the CM to the domain of documents, efficiently and effectively.
 
The gDL builds on the gML and the CML to implement a local interface of <code>CRUD</code> operations that lift those of the CM to the domain of documents, efficiently and effectively.
Line 11: Line 11:
 
= Preliminaries =
 
= Preliminaries =
  
The core functionality of the gDL lies in its operations to read and write documents. The operations trigger interactions with remote services and the movement of potentially large volumes of data across the infrastructure. This may have a non-trivial and combined impact on the responsiveness of clients and the overall load of the infrastructure. The operations have been designed to minimise this impact. In particular:
+
The core functionality of the gDL lies in its operations to read and write document descriptions. The operations trigger interactions with the [[Content_Manager_(NEW)|Content Manager]] service and the movement of potentially large volumes of data across the infrastructure. This may have a non-trivial and combined impact on the responsiveness of clients and the overall load of the infrastructure. The operations have been designed to minimise this impact. In particular:
  
* when reading, clients can qualify the documents that are relevant to their queries, and indeed what properties of relevant documents should be actually retrieved. These retrieval directives are captured in the gDL by the notion of '''document projections'''.
+
* when reading, clients can qualify the documents that are relevant to their queries, and indeed what properties of those documents should be actually retrieved. These retrieval directives are captured in the gDL by the notion of [[#Projections|document projections]].
  
* when reading and writing, clients can move large numbers of documents across the infrastructure. The gDL ''streams'' these I/O movements so as to make efficient use of local and remote resources. It then defines facilities with which clients can conveniently consume input streams, produce output streams, and more generally filter one stream into another regardless of their origin. The facilities are collected into the '''stream DSL''', an embedded domain-specific language for stream processing.
+
* when reading and writing, clients can move large numbers of documents across the infrastructure. The gDL ''streams'' these I/O movements so as to make efficient use of local and remote resources. It then defines facilities with which clients can conveniently consume input streams, produce output streams, and more generally convert one stream into another regardless of its origin. These facilities are collected into the [[#Streams|stream DSL]], an Embedded Domain-Specific Language (EDSL) for stream conversion and processing.
  
Understanding document projections and the stream DSL is key to reading and writing documents effectively. We discuss these preliminary concepts first, and then consider their use as inputs and outputs of the operations of the gDL.
+
Understanding document projections and the stream DSL is key to reading and writing documents effectively with the gDL. We discuss these preliminary concepts first, and then consider their use as inputs and outputs of the read and write operations of the library.
  
 
== Projections ==
 
== Projections ==
  
A projection is a set of constraints over the properties of documents in the gDM. It can be used to ''match'' documents, i.e. identify documents whose properties satisfy the constraints of the projection.
+
A projection is a set of constraints over the properties of document descriptions. It can be used in the [[#Reading Documents|read operations]] of the gDL to:
<br>
+
Projections and matching are used in the [[#Reading Documents|read operations]] of the gDL:
+
  
* as a means to characterise relevant documents (''projections as types'');
+
* characterise relevant descriptions as those that ''match'' the constraints (''projections as types'');
* as a means to specify what parts of relevant documents should be retrieved (''projections as retrieval directives'').
+
* specify what properties of relevant descriptions should be retrieved (''projections as retrieval directives'').
  
The constraints of a projection take accordingly two forms:
+
Constraints take accordingly two forms:
  
 
* '''include constraints''' apply to properties that must be matched ''and'' retrieved;
 
* '''include constraints''' apply to properties that must be matched ''and'' retrieved;
 
* '''filter constraints''' apply to properties that must be matched but ''not'' retrieved.  
 
* '''filter constraints''' apply to properties that must be matched but ''not'' retrieved.  
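
The difference between the two forms of constraint can be illustrated with a small, self-contained sketch. It does not use the gDL API: the class <code>ToyProjection</code>, its methods <code>include()</code> and <code>filter()</code>, and the map-based document are hypothetical names chosen for illustration only. Constraints are reduced to plain existence checks, and the sketch covers only the case in which at least one include constraint is present.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model, NOT the gDL API: a document is a map of property names to values.
class ToyProjection {

    private final Set<String> included = new LinkedHashSet<>(); // matched and retrieved
    private final Set<String> filtered = new LinkedHashSet<>(); // matched but not retrieved

    // Adds include constraints: the properties must exist and are retrieved.
    ToyProjection include(String... properties) {
        for (String p : properties) included.add(p);
        return this;
    }

    // Adds filter constraints: the properties must exist but are not retrieved.
    ToyProjection filter(String... properties) {
        for (String p : properties) filtered.add(p);
        return this;
    }

    // A document matches if it has every included and every filtered property.
    boolean matches(Map<String, String> document) {
        return document.keySet().containsAll(included)
            && document.keySet().containsAll(filtered);
    }

    // Returns only the included properties of a matching document, or null if there is no match.
    Map<String, String> retrieve(Map<String, String> document) {
        if (!matches(document)) return null;
        Map<String, String> result = new LinkedHashMap<>();
        for (String p : included) result.put(p, document.get(p));
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("name", "report");
        doc.put("language", "en");
        doc.put("bytestream", "...");

        ToyProjection p = new ToyProjection().include("name").filter("bytestream");
        System.out.println(p.matches(doc));  // the document has both constrained properties
        System.out.println(p.retrieve(doc)); // only the included property is returned
    }
}
```

In this toy model the bytestream takes part in matching but never travels back to the caller, which is the point of a filter constraint.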
 
   
 
   
'''note''': in both cases, the constraints take the form of 'predicates' of the [[Content_Manager_Library|Content Manager Library] (CML)]]. The projection itself converts into a complex predicate which is amenable for processing by the Content Manager service in the execution of retrieval operations. In this sense, projections are a key part of the document-oriented layer that the gDL defines over lower-level components of the [[GCube_Information_Organisation_Services_(NEW)|service subsystem]] for content management.
+
'''note''': in both cases, the constraints take the form of ''predicates'' of the [[Content_Manager_Library|Content Manager Library]] (CML). The projection itself converts into a complex predicate which is amenable for processing by the Content Manager service in the execution of its retrieval operations. In this sense, projections are a key part of the document-oriented layer that the gDL defines over lower-level components of the gCube [[GCube_Information_Organisation_Services_(NEW)|subsystem dedicated]] to content management.
  
As a simple example, a projection may define an include constraint over the name of metadata elements and a filter constraint over the time of their last update.  
+
As a first example, a projection may specify an include constraint over the name of metadata elements and a filter constraint over the time of last update. It may then be used to:
<br>
+
It may then be used to:
+
  
* characterise documents with metadata elements that match both constraints;  
+
* characterise document descriptions with at least one metadata element that matches both constraints;  
* retrieve, for those documents, only the name of matching metadata elements, excluding any other document property, including other inner elements and their properties.
+
* retrieve, for those descriptions, only the name of matching metadata elements, excluding the time of last update, any other metadata property, and any other document property, including other inner elements and their properties.
  
All projections in the gDL have the <code>Projection</code> interface, which can be used in element-generic computations to access their constraints. To build projections, however, clients deal with one of the following implementations of the interface:
+
Projections have the <code>Projection</code> interface, which can be used to access their constraints in element-generic computations. To build projections, however, clients deal with one of the following implementations of the interface:
  
 
* <code>DocumentProjection</code>
 
* <code>DocumentProjection</code>
Line 54: Line 50:
 
* <code>PropertyProjection</code>
 
* <code>PropertyProjection</code>
  
allows clients to express constraints on the generic properties of any of the elements of the gDM.
+
allows clients to express constraints on the generic properties of documents and their inner elements.
  
=== Simple Projections ===
+
[[GDL_Projections_(2.0)|read more...]]
  
Clients create projections with the factory methods of the <code>Projections</code> companion class (a static import improves legibility and is recommended):
+
== Streams ==
  
<source lang="java5" highlight="1">
+
In some of its operations, the gDL relies on streams to model, process, and transfer large-scale data collections. Streams may consist of document descriptions, document identifiers, and document updates. More generally, they may consist of the outcomes of operations that take in turn large-scale collections in input. Streamed processing makes efficient use of both local and remote resources, from local memory to network bandwidth, promoting the overall responsiveness of clients and services through reduced latencies.
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document();
+
  
MetadataProjection mp = metadata();
+
Clients that use these operations will need to route streams towards and across the operations of the gDL, converting between different stream interfaces, often injecting application logic in the process. As a common example, a client may need to:
  
AnnotationProjection annp = annotation();
+
* route a remote result set of document identifiers towards the read operations of the gDL;
 +
* process the document descriptions returned by the read operations, e.g. in order to update some of their properties;
 +
* feed the modified document descriptions to the write operations of the gDL, so as to commit the changes;
 +
* inspect commit outcomes, so as to report or otherwise handle the failures that may have occurred in the process.
  
PartProjection pp = part();
+
Throughout the workflow, it is important that the client remains within the paradigm of streamed processing, avoiding the accumulation of data in memory in all cases but where strictly required. Document identifiers will be streaming from the remote location of the original result set ''as'' document descriptions will be flowing back from yet another remote location, ''as'' updated document descriptions will be leaving towards the same remote location, and ''as'' failures will be steadily coming back for handling.
  
AlternativeProjection altp = alternative();
+
Streaming raises significant opportunities for clients, as well as non-trivial challenges. In recognition of the difficulties, the gDL includes a set of general-purpose facilities for stream conversion that simplify the tasks of filtering, transforming, or otherwise processing streams. These facilities are cast as the sentences of the '''Stream DSL''', an Embedded Domain-Specific Language (EDSL).
  
</source>
+
[[GDL_Streams_(2.0)|read more...]]
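
The streamed workflow described in this section can be approximated with plain Java iterators. The sketch below is not the Stream DSL of the gDL: the class <code>LazyPipeline</code> and its <code>map()</code> method are hypothetical, and the read, update, and write steps are stand-in functions over strings. It only shows how each element can pass through every step before the next one is pulled, so that nothing accumulates in memory.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Function;

// Hypothetical sketch, NOT the gDL stream DSL: lazy conversion of one stream into another.
class LazyPipeline {

    // Wraps an iterator so that each element is transformed only when it is consumed.
    static <A, B> Iterator<B> map(final Iterator<A> source, final Function<A, B> step) {
        return new Iterator<B>() {
            public boolean hasNext() { return source.hasNext(); }
            public B next() { return step.apply(source.next()); }
        };
    }

    public static void main(String[] args) {
        // Stand-in for a remote result set of document identifiers.
        Iterator<String> ids = Arrays.asList("doc-1", "doc-2", "doc-3").iterator();

        // Stand-ins for the read, update, and write steps of the workflow.
        Iterator<String> read = map(ids, id -> id + ":description");
        Iterator<String> updated = map(read, d -> d + ":updated");
        // The write step reports an outcome per element; doc-2 is made to fail on purpose.
        Iterator<Boolean> outcomes = map(updated, d -> !d.startsWith("doc-2"));

        // Elements flow through all steps one at a time as the outcomes are inspected.
        int failures = 0;
        while (outcomes.hasNext()) {
            if (!outcomes.next()) failures++;
        }
        System.out.println("failures: " + failures);
    }
}
```

No step runs until the final loop asks for an outcome, which is what keeps the client within streamed processing.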
  
The projections above do not specify any include or filter constraints on the elements of the corresponding type. For example, <code>dp</code> matches all documents, regardless of their properties, inner elements, and properties of their inner elements. Similarly, <code>mp</code> matches all metadata elements of any document, regardless of their properties, and <code>pp</code> matches all the parts of any document, regardless of their properties. Thus the factory methods of the <code>Projections</code> class return ''empty projections''.
+
= Operations =
  
Clients may add include constraints to a projection with the method <code>with()</code> declared by all projection classes. For document projections, for example:
+
The operations of the gDL allow clients to add, update, delete, and retrieve document descriptions to, in, and from remote collections within the infrastructure. These <code>CRUD</code> operations target (instances of) a specific back-end within the infrastructure, the [[Content_Manager_(NEW)|Content Manager]] (CM) service. It is a direct implication of the CM that the document descriptions may be stored in different forms within repositories which are inside or outside the strict boundaries of the infrastructure. While the gDL operations clearly expose the remote nature of document descriptions, the actual location of document descriptions, hosting repositories, and Content Manager instances is hidden from their clients.
  
<source lang="java5">
+
In what follows, we discuss first ''read operations'', i.e. operations that localise document descriptions from remote collections. We then discuss ''write operations'', i.e. operations that persist in remote collections document descriptions which have been created or else modified locally. In all cases, operations are overloaded to work with different forms of inputs and outputs. In particular, we distinguish between:
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document().with(NAME);
+
</source>
+
  
With the above, the client adds the simplest form of constraint, an ''existence constraint'' that requires the target elements to have given properties, here the document to have a name. Since this is an include constraint, the client is expressing an interest only in this property, regardless of the existence and values of other properties. Used as a parameter in the [[#Reading Documents|read operations]] of the gDL, this projection is translated into a directive to retrieve ''only'' the names of document(s) that have one.
+
* '''singleton operations''': these are operations that read, add, or change individual document descriptions. Singleton operations are used for punctual interactions with the infrastructure, most noticeably those required by front-end clients to implement user interfaces. All singleton operations that target existing document descriptions require the specifications of their identifiers;
 +
 +
* '''bulk operations''': these are operations that read, add, or change multiple document descriptions in a single interaction with the infrastructure. Bulk operations can be used for batch interactions with the infrastructure, most noticeably those required by back-end clients to implement workflows. They can also be used for real-time interactions with the infrastructure, such as those required by front-end clients that process user queries. Bulk operations may be further classified in:
 +
** '''by-value operations''' are defined over in-memory collections of document descriptions. Accordingly, these operations are indicated for small-scale data transfer scenarios. As we shall see, they may also be used to move segments of larger data collections, when the creation of such fragments is a functional requirement.
 +
** '''by-reference operations''' are defined over [[#Streams|streams]] of document descriptions. These operations are indicated for medium-scale to large-scale data transfer scenarios, where the streamed processing promotes the responsiveness of clients and the effective use of network resources.
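
The three forms of operation can be pictured as overloads of a single read method. The sketch below is an assumption-laden illustration, not the gDL API: the class <code>ToyReader</code>, its <code>get()</code> overloads, and the in-memory map that stands in for a remote collection are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, NOT the gDL API: one read operation overloaded by input form.
class ToyReader {

    private final Map<String, String> collection; // stands in for a remote collection

    ToyReader(Map<String, String> collection) {
        this.collection = collection;
    }

    // Singleton operation: reads one description by identifier.
    String get(String id) {
        return collection.get(id);
    }

    // Bulk by-value operation: input and output are in-memory collections.
    List<String> get(Collection<String> ids) {
        List<String> result = new ArrayList<>();
        for (String id : ids) result.add(get(id));
        return result;
    }

    // Bulk by-reference operation: input and output are streams, consumed lazily.
    Iterator<String> get(final Iterator<String> ids) {
        return new Iterator<String>() {
            public boolean hasNext() { return ids.hasNext(); }
            public String next() { return ToyReader.this.get(ids.next()); }
        };
    }

    public static void main(String[] args) {
        Map<String, String> remote = new HashMap<>();
        remote.put("1", "first description");
        remote.put("2", "second description");
        ToyReader reader = new ToyReader(remote);

        System.out.println(reader.get("1"));                      // singleton
        System.out.println(reader.get(Arrays.asList("1", "2")));  // bulk, by value
        Iterator<String> stream = reader.get(Arrays.asList("2", "1").iterator());
        while (stream.hasNext()) System.out.println(stream.next()); // bulk, by reference
    }
}
```

The by-value overload builds the whole result before returning, whereas the by-reference overload produces each description only when the caller asks for it.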
  
'''note''': properties are conveniently represented by constants in the <code>Projections</code> class. The constants are not strings, however, but dedicated <code>Property</code> objects that are specific to the type of projection. Trying to use properties that are undefined for the type of elements targeted by the projection is illegal and the error is detected statically.
+
Read and write operations work with document descriptions that align with the [[GCube_Document Model|gCube document model]] (gDM) and its implementation in the [[GCube_Document_Model#Implementation|gCube Model Library]] (gML). In the terminology of the gML, in particular, operations that create document descriptions expect [[GCube_Document_Model#New_Elements_and_Element_Proxies|new elements]], all the others take or produce [[GCube_Document_Model#New_Elements_and_Element_Proxies|element proxies]].
  
Existence constraints may be expressed at once on multiple properties, e.g.:
+
Finally, read and write operations build on the facilities of the [[Content_Manager_Library| Content Manager Library]] (CML) to interact with the Content Manager service, including the adoption of [[Content_Manager_Library#High-Level Calls|best-effort strategies]] to discover and interact with instances of the service. These facilities are thus indirectly available to gDL clients as well.
  
<source lang="java5">
+
[[GDL_Operations_(2.0)|read more...]]
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document().with(NAME,LANGUAGE,BYTESTREAM);
+
</source>
+
  
 +
= Views =
  
Besides inclusion constraints, clients may specify filter constraints with the method <code>where()</code> on projections, e.g:
+
Some clients interact with remote collections to work exclusively with subsets of document descriptions that share certain properties, e.g. are in a given language, have changed in the last month, have metadata in a given schema, have parts of a given type, and so on. Their queries and updates are always resolved within these subsets, rather than the whole collection. Essentially, such clients have their own ''view'' of the collection.
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document().where(NAME,LANGUAGE);
+
</source>
+
 
+
Now, the client still requires documents to have a name and a language, but retains an interest in the other properties of matching documents. Used as a parameter in the [[#Reading Documents|read operations]] of the gDL, this projection is translated into a directive to retrieve ''all'' the properties of documents with a name and a language.
+
 
+
 
+
Include and filter constraints can be combined, and the projections classes follow a builder pattern to add readability to the combinations. In particular, <code>with()</code> and <code>where()</code> return the very projection on which they are invoked. They may then be used as follows:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document().with(NAME,SCHEMA_URI)
+
                                  .where(BYTESTREAM);
+
</source>
+
 
+
Here, the client requires documents to have a name and embed a bytestream that conforms to a schema, but he has an interest in processing only document names and schema URIs (e.g. for display purposes). Used as a parameter in the [[#Reading Documents|read operations]] of the gDL, this projection retrieves the requested information but avoids the transmission of bytestreams.
+
 
+
=== Optional Modifiers ===
+
 
+
Moving now beyond the simple existence of properties, another common requirement is to indicate the optionality of properties. Clients may wish to include certain properties, or equivalently filter by certain properties, if and only if these actually exist. In this case, clients can use the <code>opt()</code> method of the <code>Projections</code> class as a constraint ''modifier'', as this example illustrates:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document().with(NAME,opt(SCHEMA_URI))
+
                                  .where(BYTESTREAM);
+
</source>
+
 
+
This projection differs from the previous one only because of the optionality constraint on the existence of a schema for the document's bytestream. Used as a parameter in the [[#Reading Documents|read operations]] of the gDL, this projection retrieves the name of all documents that include a bytestream, but also their schema URI ''if'' they happen to have one.
+
 
+
A common use of optional modifiers is with bytestreams, which clients may wish either to find included in the document or else referred to with a URL:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document().with(opt(BYTESTREAM),opt(URL));
+
</source>
+
 
+
Used as a parameter in the [[#Reading Documents|read operations]] of the gDL, this projection retrieves at most the bytestream and its URL for those documents that have both, only one of the two if the other is missing, and nothing at all if they are both missing.
+
 
+
'''note:''' The API allows optional modifiers in filter constraints too, but their application is rather pointless in this context (they will never exclude elements from retrieval).
+
 
+
=== Deep Projections ===
+
 
+
In the examples above, we have considered existence constraints on simple element properties. The examples generalise easily to repeated structured properties, such as generic properties for all elements and inner element properties for documents.
+
 
+
Consider the following example:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document().with(PART, opt(METADATA), PROPERTY);
+
</source>
+
 
+
Here the client adds three include constraints to the projection, all three for the existence of repeated properties. Documents that match this projection have ''at least'' one part, ''at least'' one generic property, and zero or more metadata elements. Used as a parameter in the [[#Reading Documents|read operations]] of the gDL, this projection retrieves ''all'' the parts and ''all'' the generic properties of documents that have at least one of each, as well as ''all'' of their metadata elements if they happen to have some.
+
 
+
Repeated properties such as generic properties and inner elements are also structured, i.e. have properties of their own. Clients that wish to constrain those properties too can use ''deep projections'', i.e. embed within the projection of a given type one or more projections built for the structured properties of elements of that type. The following example illustrates the concept for metadata elements:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
MetadataProjection mp = metadata().with(LANGUAGE).where(BYTESTREAM);
+
 
+
DocumentProjection dp = document().with(NAME, PART)
+
                                  .with(METADATA,mp);
+
 
+
</source>
+
 
+
The first projection constrains the existence of language and bytestream for metadata elements. The second projection constrains the existence of name and parts for documents, as well as the existence of metadata elements that match the constraints of the first projection. The usual implications of include constraints and filter constraints apply. Used as a parameter in the [[#Reading Documents|read operations]] of the gDL, this projection retrieves the name, parts, and metadata elements of documents that have a name, at least one part, and at least one metadata element that includes a bytestream. For the metadata elements, in particular, it retrieves only the language property.
+
 
+
Note that optionality constraints apply to deep projections as well as they apply to flat projections, as the following example shows:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
MetadataProjection mp = metadata().with(LANGUAGE).where(BYTESTREAM);
+
 
+
DocumentProjection dp = document().with(NAME, PART)
+
                                  .with(opt(METADATA,mp));
+
 
+
</source>
+
 
+
This projection differs from the previous one only because the existence of metadata elements that match the inner projection is optional. Documents that have a name and at least one part match the outer projection even if they have ''no'' metadata elements that match the inner projection (or no metadata elements at all).
+
 
+
=== Projections over Generic Properties ===
+
 
+
Generic properties are repeated and structured properties common to all elements. As for other properties with these characteristics, clients may wish to build deep projections that constrain their inner properties. For this purpose, the class <code>Projections</code> includes a dedicated factory method <code>property()</code>, as well as specialised methods to express constraints. The following example illustrates the approach:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
 
+
PropertyProjection pp = property().withKey("somekey").with(PROPERTY_TYPE);
+
 
+
DocumentProjection dp = document().with(NAME, PART)
+
                                  .with(PROPERTY,pp);
+
 
+
</source>
+
 
+
Here, the client creates a document projection and embeds in it an inner projection that constrains its generic properties.
+
The inner projection uses the method <code>with()</code> to add an include constraint for the existence of a type for the generic property, as usual.
+
It also adds an include constraint to specify an exact value for the key of a generic property of interest. This relies on a method <code>withKey()</code> which is specific to projections over generic properties of elements. The reason for this specific construct is that, differently from other constrainable properties of elements, the key of a generic property serves as its identifier.
+
 
+
For the rest, property projections behave like other projections (e.g. can be used with optional modifiers). Used as a parameter in the [[#Reading Documents|read operations]] of the gDL, the projection above matches documents with a name, at least one part, and a property with key <code>somekey</code> and some type.
+
 
+
=== Advanced Projections ===
+
 
+
In more advanced forms of projections, clients may wish to specify constraints on properties other than mere existence.
+
In these cases, they can use overloads of <code>with()</code> and <code>where()</code> that take as parameters <code>Predicate</code>s that capture the desired constraints.
+
As mentioned above, predicates are defined in the [[Content_Manager_Library|CML]] and gDL clients need to become acquainted with the range of available predicates and how to [[Content_Manager_(NEW)#Building_Predicates| build them]].
+
 
+
'''note''': Deep projections already make use of this customisability. When clients embed a projection into another, they constrain the corresponding structured property with the predicate into which the inner projection translates.
+
 
+
Commonly, clients may wish to constrain the value of a property, as in the following example:
+
 
+
<source lang="java5" highlight="2,3">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
import static org.gcube.contentmanagement.contentmanager.stubs.model.constraints.Constraints.*;
+
import static org.gcube.contentmanagement.contentmanager.stubs.model.predicates.Predicates.*;
+
...
+
DocumentProjection p = document().with(LANGUAGE,text(is("it")));
+
</source>
+
 
+
The client uses here the predicate <code>text(is("it"))</code> to constrain the language of documents to match the ISO639 code for the Italian language. As documented in the [[Content_Manager_(NEW)#Building_Predicates| CML]], the client builds the predicate with the static methods of the <code>Predicates</code> and <code>Constraints</code> classes, which he previously imports.
+
 
+
'''note''': in building predicate expressions with the API of the CML, clients take responsibility for associating properties with predicates that are compatible with their type. In the example above, the language of an element is a textual property and thus only <code>text()</code>-based predicates can successfully match it. The gDL relinquishes the ability to ensure the correct construction of projections so as to allow clients to use the full expressiveness of the predicate language of the CML.
+
 
+
The type of constraints that can be expressed on properties is thus bound by the expressiveness of the predicate language of the CML. We include here another example to illustrate some of the possibilities:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
import static org.gcube.contentmanagement.contentmanager.stubs.model.constraints.Constraints.*;
+
import static org.gcube.contentmanagement.contentmanager.stubs.model.predicates.Predicates.*;
+
...
+
Calendar from = ...
+
Calendar to = ....
+
DocumentProjection p = document().with(URL,uri(matches("^ftp.*")))
+
                                .where(CREATION_TIME,date(all(after(from),before(to))));
+
</source>
+
 
+
This projection is matched by documents that have been created at some point in between two dates, and with a bytestream available at some <code>ftp</code> server.  Used as a parameter in the [[#Reading Documents|read operations]] of the gDL, the projection would retrieve only the URL of (the bytestream of) matching documents.
+
 
+
== Streams ==
+
 
+
=== Local and Remote Iterators ===
+
 
+
=== Stream Language ===
+
 
+
=== Pipes and Filters ===
+
 
+
=== Grouping and Unfolding ===
+
 
+
= Operations =
+
 
+
== Reading Documents ==
+
 
+
== Adding Documents ==
+
 
+
== Updating Documents ==
+
 
+
== Deleting Documents ==
+
 
+
= Views =
+
  
== Transient Views ==
+
The gDL offers support for working with two types of view:
  
== Persistent Views ==
+
* '''local views''': these are views defined by individual clients as the context for a number of subsequent queries and updates. Local views may have arbitrarily long lifetimes, and may even outlive the client that created them, but they are never used by multiple clients. Thus local views are commonly transient and, if their definitions are somehow persisted, they are persisted locally to the 'owning' client and remain under its direct responsibility.
  
== Creating Views ==
+
* '''remote views''': these are views defined by some clients and used by many others within the system. Remote views outlive all such clients and persist in the infrastructure, typically for as long as the collection does. They are defined through the [[View_Manager|View Manager]] service (VM), which materialises them as WS-Resources. Each VM resource encapsulates the definition of the view as well as its descriptive properties, and it is responsible for managing its lifetime, e.g. keeping track of its cardinality and notifying interested clients of changes to its contents. However, VM resources are [[View_Manager#Motivations|''passive'']], i.e. they do not mediate access to those content resources.
  
== Discovering Views ==
+
Naturally, the gDL uses [[#Projections|projections]] as view definitions. It then offers specialised <code>Reader</code>s that encapsulate such projections to implicitly resolve all their operations in the scope of the view. This yields view-based access to collections and allows clients to work with local views. In addition, the gDL provides local proxies of VM resources with which clients can create, discover, and inspect remote views. As these proxies map remote view definitions onto projections, remote views can be accessed with the same reading mechanisms available for local views.
  
== Using Views ==
+
[[GDL_Views_(2.0)|read more...]]
  
= Advanced Topics =
+
= Utilities & F.A.Q. =
  
== Caches ==
+
The gCube Document Library offers utility classes to manage the collections and the views in the system.
  
== Buffers ==
+
[[GDL_Utilities_%26_F.A.Q._(2.0)|read more...]]

Latest revision as of 12:21, 21 March 2011

The gCube Document Library (gDL) is a client library for adding, updating, deleting and retrieving document descriptions to, in, and from remote collections in a gCube infrastructure.

The gDL is a high-level component of the subsystem of gCube Information Services and it interacts with lower-level components of the subsystem to support document management processes within the infrastructure:

  • the gCube Document Model (gDM) defines the basic notion of document and the gCube Model Library (gML) implements that notion into objects;
  • the objects of the gML can be exchanged in the infrastructure as edge-labelled trees, and the Content Manager Library (CML) can dispatch them to the read and write operations of the Content Manager (CM) service;
  • the CM implements its operations by translating trees to and from the content models of diverse repository back-ends.

The gDL builds on the gML and the CML to implement a local interface of CRUD operations that lift those of the CM to the domain of documents, efficiently and effectively.

Preliminaries

The core functionality of the gDL lies in its operations to read and write document descriptions. The operations trigger interactions with the Content Manager service and the movement of potentially large volumes of data across the infrastructure. This may have a non-trivial and combined impact on the responsiveness of clients and the overall load of the infrastructure. The operations have been designed to minimise this impact. In particular:

  • when reading, clients can qualify the documents that are relevant to their queries, and indeed what properties of those documents should be actually retrieved. These retrieval directives are captured in the gDL by the notion of document projections.
  • when reading and writing, clients can move large numbers of documents across the infrastructure. The gDL streams these I/O movements so as to make efficient use of local and remote resources. It then defines facilities with which clients can conveniently consume input streams, produce output streams, and more generally convert one stream into another regardless of its origin. These facilities are collected into the stream DSL, an Embedded Domain-Specific Language (EDSL) for stream conversion and processing.

Understanding document projections and the stream DSL is key to reading and writing documents effectively with the gDL. We discuss these preliminary concepts first, and then consider their use as inputs and outputs of the read and write operations of the library.

Projections

A projection is a set of constraints over the properties of document descriptions. It can be used in the read operations of the gDL to:

  • characterise relevant descriptions as those that match the constraints (projections as types);
  • specify what properties of relevant descriptions should be retrieved (projections as retrieval directives).

Accordingly, constraints take two forms:

  • include constraints apply to properties that must be matched and retrieved;
  • filter constraints apply to properties that must be matched but not retrieved.

Note: in both cases, the constraints take the form of predicates of the Content Manager Library (CML). The projection itself converts into a complex predicate which is amenable to processing by the Content Manager service in the execution of its retrieval operations. In this sense, projections are a key part of the document-oriented layer that the gDL defines over lower-level components of the gCube subsystem dedicated to content management.

As a first example, a projection may specify an include constraint over the name of metadata elements and a filter constraint over the time of last update. It may then be used to:

  • characterise document descriptions with at least one metadata element that matches both constraints;
  • retrieve, of those descriptions, only the name of matching metadata elements, excluding the time of last update, any other metadata property, and any other document property, including other inner elements and their properties.
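
The distinction between include and filter constraints can be illustrated with a minimal, self-contained sketch. It does not use the gDL API: the class `ProjectionSketch`, its methods, and the representation of a metadata element as a map of string properties are all assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names belong to the gDL or the CML.
final class ProjectionSketch {

    // Constraints keyed by property name; an element must satisfy all of them to match.
    private final Map<String, Predicate<String>> includes = new LinkedHashMap<>();
    private final Map<String, Predicate<String>> filters = new LinkedHashMap<>();

    // Include constraint: the property must match and is retrieved.
    ProjectionSketch include(String property, Predicate<String> constraint) {
        includes.put(property, constraint);
        return this;
    }

    // Filter constraint: the property must match but is not retrieved.
    ProjectionSketch filter(String property, Predicate<String> constraint) {
        filters.put(property, constraint);
        return this;
    }

    // Returns the retrieved properties if the element matches, or empty otherwise.
    Optional<Map<String, String>> apply(Map<String, String> element) {
        Map<String, String> retrieved = new LinkedHashMap<>();
        for (Map.Entry<String, Predicate<String>> e : includes.entrySet()) {
            String value = element.get(e.getKey());
            if (value == null || !e.getValue().test(value)) {
                return Optional.empty();
            }
            retrieved.put(e.getKey(), value);
        }
        for (Map.Entry<String, Predicate<String>> e : filters.entrySet()) {
            String value = element.get(e.getKey());
            if (value == null || !e.getValue().test(value)) {
                return Optional.empty();
            }
        }
        return Optional.of(retrieved);
    }
}
```

With an include constraint on `name` and a filter constraint on a hypothetical `lastUpdate` property, a matching element yields only its name, while the update time and every unconstrained property are left out of the result.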

Projections have the Projection interface, which can be used to access their constraints in element-generic computations. To build projections, however, clients deal with one of the following implementations of the interface:

  • DocumentProjection
  • MetadataProjection
  • AnnotationProjection
  • PartProjection
  • AlternativeProjection

A further implementation of the interface:

  • PropertyProjection

allows clients to express constraints on the generic properties of documents and their inner elements.

read more...

Streams

In some of its operations, the gDL relies on streams to model, process, and transfer large-scale data collections. Streams may consist of document descriptions, document identifiers, and document updates. More generally, they may consist of the outcomes of operations that take in turn large-scale collections in input. Streamed processing makes efficient use of both local and remote resources, from local memory to network bandwidth, promoting the overall responsiveness of clients and services through reduced latencies.

Clients that use these operations will need to route streams towards and across the operations of the gDL, converting between different stream interfaces, often injecting application logic in the process. As a common example, a client may need to:

  • route a remote result set of document identifiers towards the read operations of the gDL;
  • process the document descriptions returned by the read operations, e.g. in order to update some of their properties;
  • feed the modified document descriptions to the write operations of the gDL, so as to commit the changes;
  • inspect commit outcomes, so as to report or otherwise handle the failures that may have occurred in the process.

Throughout the workflow, it is important that the client remains within the paradigm of streamed processing, avoiding the accumulation of data in memory in all cases but where strictly required. Document identifiers will be streaming from the remote location of the original result set while document descriptions flow back from yet another remote location, updated document descriptions leave towards the same remote location, and failures steadily come back for handling.
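
The workflow above can be sketched with plain Java iterators. This is not the Stream DSL of the gDL: `StreamPipelineSketch`, `pipe`, and the `read`, `update`, and `write` stand-ins are hypothetical names, and strings stand in for identifiers, descriptions, and outcomes. The point of the sketch is only that each element is pulled through every stage on demand, so nothing but the failures accumulates in memory.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: a lazy pipeline over iterators, not the gDL Stream DSL.
final class StreamPipelineSketch {

    // Lazily applies a step to each element, only when the consumer pulls it.
    static <A, B> Iterator<B> pipe(Iterator<A> source, Function<A, B> step) {
        return new Iterator<B>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public B next() { return step.apply(source.next()); }
        };
    }

    // Stand-ins for remote read, local update, and remote write.
    static String read(String id) { return "doc:" + id; }
    static String update(String doc) { return doc + ":updated"; }
    static String write(String doc) { return doc.contains("bad") ? "FAILED " + doc : "OK " + doc; }

    // Routes identifiers through read, update, and write, and collects only the failures.
    static List<String> failures(Iterator<String> ids) {
        Iterator<String> docs = pipe(ids, StreamPipelineSketch::read);
        Iterator<String> updated = pipe(docs, StreamPipelineSketch::update);
        Iterator<String> outcomes = pipe(updated, StreamPipelineSketch::write);
        List<String> failed = new ArrayList<>();
        while (outcomes.hasNext()) {
            String outcome = outcomes.next();
            if (outcome.startsWith("FAILED")) {
                failed.add(outcome);
            }
        }
        return failed;
    }
}
```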

Streaming raises significant opportunities for clients, as well as non-trivial challenges. In recognition of the difficulties, the gDL includes a set of general-purpose facilities for stream conversion that simplify the tasks of filtering, transforming, or otherwise processing streams. These facilities are cast as the sentences of the Stream DSL, an Embedded Domain-Specific Language (EDSL).

read more...

Operations

The operations of the gDL allow clients to add, update, delete, and retrieve document descriptions to, in, and from remote collections within the infrastructure. These CRUD operations target (instances of) a specific back-end within the infrastructure, the Content Manager (CM) service. It is a direct implication of the CM that the document descriptions may be stored in different forms within repositories which are inside or outside the strict boundaries of the infrastructure. While the gDL operations clearly expose the remote nature of document descriptions, the actual location of document descriptions, hosting repositories, and Content Manager instances is hidden from their clients.

In what follows, we discuss first read operations, i.e. operations that localise document descriptions from remote collections. We then discuss write operations, i.e. operations that persist in remote collections document descriptions which have been created or else modified locally. In all cases, operations are overloaded to work with different forms of inputs and outputs. In particular, we distinguish between:

  • singleton operations: these are operations that read, add, or change individual document descriptions. Singleton operations are used for punctual interactions with the infrastructure, most noticeably those required by front-end clients to implement user interfaces. All singleton operations that target existing document descriptions require the specification of their identifiers;
  • bulk operations: these are operations that read, add, or change multiple document descriptions in a single interaction with the infrastructure. Bulk operations can be used for batch interactions with the infrastructure, most noticeably those required by back-end clients to implement workflows. They can also be used for real-time interactions with the infrastructure, such as those required by front-end clients that process user queries. Bulk operations may be further classified into:
    • by-value operations are defined over in-memory collections of document descriptions. Accordingly, these operations are indicated for small-scale data transfer scenarios. As we shall see, they may also be used to move segments of larger data collections, when the creation of such fragments is a functional requirement.
    • by-reference operations are defined over streams of document descriptions. These operations are indicated for medium-scale to large-scale data transfer scenarios, where the streamed processing promotes the responsiveness of clients and the effective use of network resources.
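
The three forms of overloading can be pictured with a small sketch over an in-memory map. The class `ReaderSketch` and its `get` overloads are hypothetical and do not reproduce the signatures of the gDL; strings stand in for identifiers and document descriptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one read operation overloaded over three forms of input.
final class ReaderSketch {

    private final Map<String, String> collection = new HashMap<>();

    ReaderSketch(Map<String, String> documents) {
        collection.putAll(documents);
    }

    // Singleton: reads one description, which must be identified.
    String get(String id) {
        String doc = collection.get(id);
        if (doc == null) {
            throw new IllegalArgumentException("unknown document: " + id);
        }
        return doc;
    }

    // Bulk, by-value: input and output are in-memory collections.
    List<String> get(List<String> ids) {
        List<String> docs = new ArrayList<>();
        for (String id : ids) {
            docs.add(get(id));
        }
        return docs;
    }

    // Bulk, by-reference: input and output are streams, resolved one element at a time.
    Iterator<String> get(Iterator<String> ids) {
        return new Iterator<String>() {
            @Override public boolean hasNext() { return ids.hasNext(); }
            @Override public String next() { return ReaderSketch.this.get(ids.next()); }
        };
    }
}
```

The by-value overload materialises its whole result before returning, which suits small transfers. The by-reference overload resolves nothing until the caller pulls the next element.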

Read and write operations work with document descriptions that align with the gCube document model (gDM) and its implementation in the gCube Model Library (gML). In the terminology of the gML, in particular, operations that create document descriptions expect new elements, all the others take or produce element proxies.

Finally, read and write operations build on the facilities of the Content Manager Library (CML) to interact with the Content Manager service, including the adoption of best-effort strategies to discover and interact with instances of the service. These facilities are thus indirectly available to gDL clients as well.

read more...

Views

Some clients interact with remote collections to work exclusively with subsets of document descriptions that share certain properties, e.g. are in a given language, have changed in the last month, have metadata in a given schema, have parts of a given type, and so on. Their queries and updates are always resolved within these subsets, rather than the whole collection. Essentially, such clients have their own view of the collection.

The gDL offers support for working with two types of view:

  • local views: these are views defined by individual clients as the context for a number of subsequent queries and updates. Local views may have arbitrarily long lifetimes, and may even outlive the client that created them, but they are never used by multiple clients. Thus local views are commonly transient and, if their definitions are somehow persisted, they are persisted locally to the 'owning' client and remain under its direct responsibility.
  • remote views: these are views defined by some clients and used by many others within the system. Remote views outlive all such clients and persist in the infrastructure, typically for as long as the collection does. They are defined through the View Manager service (VM), which materialises them as WS-Resources. Each VM resource encapsulates the definition of the view as well as its descriptive properties, and it is responsible for managing its lifetime, e.g. keeping track of its cardinality and notifying interested clients of changes to its contents. However, VM resources are passive, i.e. do not mediate access to those content resources.

Naturally, the gDL uses projections as view definitions. It then offers specialised Readers that encapsulate such projections to implicitly resolve all their operations in the scope of the view. This yields view-based access to collections and allows clients to work with local views. In addition, the gDL provides local proxies of VM resources with which clients can create, discover, and inspect remote views. As these proxies map remote view definitions onto projections, remote views can be accessed with the same reading mechanisms available for local views.
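
The idea of a reader that resolves every operation in the scope of a view can be sketched as follows. This is an illustration under stated assumptions rather than the gDL's own Reader: `ViewReaderSketch` is a hypothetical class, a plain predicate stands in for the projection that defines the view, and documents are reduced to maps of string properties.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch: a reader that implicitly scopes every query to a view.
final class ViewReaderSketch {

    private final List<Map<String, String>> collection;
    private final Predicate<Map<String, String>> view;

    // The view definition is fixed once, when the reader is created.
    ViewReaderSketch(List<Map<String, String>> collection, Predicate<Map<String, String>> view) {
        this.collection = collection;
        this.view = view;
    }

    // Each query is combined with the view, so clients never see documents outside it.
    List<Map<String, String>> get(Predicate<Map<String, String>> query) {
        return collection.stream()
                .filter(view.and(query))
                .collect(Collectors.toList());
    }
}
```

Because the view is applied inside the reader, a client that works with, say, English documents only states that condition once and then issues ordinary queries.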

read more...

Utilities & F.A.Q.

The gCube Document Library offers utility classes to manage the collections and the views in the system.

read more...