Difference between revisions of "GCube Document Library (2.0)"

(Local Views)
(Utilities & F.A.Q.)
 
(14 intermediate revisions by 2 users not shown)
Line 13: Line 13:
 
The core functionality of the gDL lies in its operations to read and write document descriptions. The operations trigger interactions with the [[Content_Manager_(NEW)|Content Manager]] service and the movement of potentially large volumes of data across the infrastructure. This may have a non-trivial and combined impact on the responsiveness of clients and the overall load of the infrastructure. The operations have been designed to minimise this impact. In particular:
 
  
* when reading, clients can qualify the documents that are relevant to their queries, and indeed what properties of those documents should be actually retrieved. These retrieval directives are captured in the gDL by the notion of [[#Projections|'document projections]].
+
* when reading, clients can qualify the documents that are relevant to their queries, and indeed what properties of those documents should be actually retrieved. These retrieval directives are captured in the gDL by the notion of [[#Projections|document projections]].
  
 
* when reading and writing, clients can move large numbers of documents across the infrastructure. The gDL ''streams'' these I/O movements so as to make efficient use of local and remote resources. It then defines facilities with which clients can conveniently consume input streams, produce output streams, and more generally convert one stream into another regardless of its origin. These facilities are collected into the [[#Streams|stream DSL]], an Embedded Domain-Specific Language (EDSL) for stream conversion and processing.
 
Line 52: Line 52:
 
allows clients to express constraints on the generic properties of documents and their inner elements.
 
  
=== Empty Projections ===
+
[[GDL_Projections_(2.0)|read more...]]
 
+
Clients create projections with the factory methods of the <code>Projections</code> companion class. A static import improves legibility and is recommended:
+
 
+
<source lang="java5" highlight="1">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document();
+
 
+
MetadataProjection mp = metadata();
+
 
+
AnnotationProjection annp = annotation();
+
 
+
PartProjection pp = part();
+
 
+
AlternativeProjection altp = alternative();
+
 
+
</source>
+
 
+
The projections above do not specify any include or filter constraints on the elements of the corresponding type. For example, <code>dp</code> matches all document descriptions, regardless of their properties, inner elements, and properties of their inner elements. Similarly, <code>mp</code> matches all metadata elements of any document description, regardless of their properties, and <code>pp</code> matches all the parts of any document description, regardless of their properties. In this sense, the factory methods of the <code>Projections</code> class return ''empty projections''.
+
 
+
=== Include Constraints ===
+
 
+
Clients may add include constraints to a projection with the method <code>with()</code>. For document projections, for example:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document().with(NAME);
+
</source>
+
 
+
With the above, the client adds the simplest form of constraint, an ''existence constraint'' that requires matching document descriptions to have given properties, here only a name. Since this is an include constraint, the client is expressing an interest only in this property, regardless of the existence and values of other properties. Used as a parameter in the [[#Reading Documents|read operations]] of the gDL, this projection is interpreted into a directive to retrieve '''only''' the names of document descriptions that have one. To reiterate this important point: any other descriptive property will ''not'' be retrieved. Thus, <code>with()</code> allows clients to specify precisely what they need to work with, no more and no less.
+
 
+
'''note''': properties are conveniently represented by constants in the <code>Projections</code> class. The constants are not strings, however, but dedicated <code>Property</code> objects specific to the type of projection. Trying to use properties that are undefined for the type of elements targeted by the projection is illegal and the error is detected statically.
+
 
+
Note that existence constraints may be expressed at once on multiple properties, e.g.:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document().with(NAME,LANGUAGE,BYTESTREAM);
+
</source>
+
 
+
=== Filter Constraints ===
+
 
+
Along with include constraints, clients may specify filter constraints with the method <code>where()</code>. Projection classes follow a builder pattern, i.e. their methods can be chained for increased readability, e.g.:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document().with(NAME,LANGUAGE)
+
                                  .where(BYTESTREAM);
+
</source>
+
 
+
As in the previous example, the client requires document descriptions to have a name, a language, and to embed a bytestream. Used as a parameter in the [[#Reading Documents|read operations]] of the gDL, however, the projection is interpreted into a different retrieval directive: of the matching descriptions, retrieve only the name and the language, '''not''' their bytestream. Thus <code>where()</code> allows clients to specify what they need to be true but do not need to work with.
+
 
+
'''note''': As for <code>with()</code>, <code>where()</code> accepts multiple properties as parameters.
+
 
+
'''note''': Constraining the same property more than once, whether within the parameter list of a single <code>with()</code> or <code>where()</code> call or across calls, has a destructive effect: the constraint specified last overrides those specified earlier on the same property. This allows clients to stage the construction of a projection across multiple components, where a component may wish to override the constraints set by an upstream component. Clients should be careful to avoid this repetition in all other scenarios.
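The "last constraint wins" rule can be pictured with a small, self-contained model. This is only a sketch under stated assumptions: <code>ProjectionSketch</code>, its <code>Mode</code> enum and the plain-string property names are hypothetical stand-ins, not the projection classes of the gDL.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a projection builder; NOT the gDL implementation.
final class ProjectionSketch {

    enum Mode { INCLUDE, FILTER }

    // One constraint per property: a later call replaces an earlier one.
    private final Map<String, Mode> constraints = new LinkedHashMap<>();

    // Include constraint: the property must exist and is retrieved.
    ProjectionSketch with(String... properties) {
        for (String property : properties) {
            constraints.put(property, Mode.INCLUDE);
        }
        return this;
    }

    // Filter constraint: the property must exist but is not retrieved.
    ProjectionSketch where(String... properties) {
        for (String property : properties) {
            constraints.put(property, Mode.FILTER);
        }
        return this;
    }

    // Returns the constraint currently in force, or null if there is none.
    Mode modeOf(String property) {
        return constraints.get(property);
    }
}
```

In this model, <code>new ProjectionSketch().with("name", "language").where("language")</code> leaves an include constraint on the name and a filter constraint on the language, because the later <code>where()</code> replaces the earlier <code>with()</code> on the same property.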
+
 
+
Filter constraints are typically used in combination with include constraints, as in the example above. However, a projection ''may'' include only filter constraints, e.g.:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document().where(NAME);
+
</source>
+
 
+
The projection requires document descriptions to have a name but it also indicates that none of their properties should be actually retrieved. Projections of this kind can be used to verify the existence of matching descriptions, or else to count the number of matching descriptions, whilst moving the minimum amount of data over the network.
+
 
+
=== Optional Modifiers ===
+
 
+
Another common requirement is to indicate the optionality of constraints. Clients may wish to retrieve certain properties only if they satisfy given constraints. In this case, clients can use the <code>opt()</code> method of the <code>Projections</code> class as a constraint ''modifier''. Consider this variation on a previous example:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document().with(NAME,opt(LANGUAGE))
+
                                  .where(BYTESTREAM);
+
</source>
+
 
+
This projection differs from the previous one only in the optional modifier on (the existence of) a language. Used as a parameter in the [[#Reading Documents|read operations]] of the gDL, this projection retrieves the name of all document descriptions that include a bytestream, but also their language '''if''' they do have one. If they do not have a language, only the name will be retrieved. In other words, name and bytestream are conditions that descriptions must match to be relevant, whereas the language is only optional.
+
 
+
A common use of the optional modifier is with bytestreams, which clients may wish either to find included in the document description or else referred to from within it:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document().with(opt(BYTESTREAM),opt(BYTESTREAM_URI));
+
</source>
+
 
+
Used as a parameter in the [[#Reading Documents|read operations]] of the gDL, this projection retrieves at most the bytestream and its URI for those document descriptions that have both, only one of the two if the other is missing, and nothing at all if they are both missing.
+
 
+
'''note:''' Using optional modifiers in filter constraints, i.e. as arguments to the <code>where()</code> method, is nonsensical since an optional constraint does not discriminate any document description. Worse, optional filters can slow down the execution of retrieval as the service back-end may not be able to optimise them away.
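The interplay of include constraints, filter constraints, and optional modifiers described in this section can be summarised with a toy model. The sketch below assumes that a document description is a flat map from property names to values; <code>RetrievalSketch</code> and <code>withOpt()</code> are hypothetical names that stand in for <code>with(opt(...))</code> and are not part of the gDL.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical model of matching and retrieval; NOT the gDL implementation.
final class RetrievalSketch {

    private final Set<String> required = new LinkedHashSet<>(); // with(X)
    private final Set<String> optional = new LinkedHashSet<>(); // with(opt(X))
    private final Set<String> filters = new LinkedHashSet<>();  // where(X)

    RetrievalSketch with(String property) { required.add(property); return this; }
    RetrievalSketch withOpt(String property) { optional.add(property); return this; }
    RetrievalSketch where(String property) { filters.add(property); return this; }

    // Returns the retrieved properties, or empty if the description does not match.
    Optional<Map<String, String>> apply(Map<String, String> description) {
        Set<String> present = description.keySet();
        // Required includes and filters must all exist for the description to match.
        if (!present.containsAll(required) || !present.containsAll(filters)) {
            return Optional.empty();
        }
        Map<String, String> retrieved = new LinkedHashMap<>();
        for (String property : required) {
            retrieved.put(property, description.get(property));
        }
        // Optional includes are retrieved only when they happen to exist.
        for (String property : optional) {
            if (present.contains(property)) {
                retrieved.put(property, description.get(property));
            }
        }
        // Filtered properties are checked above but never retrieved.
        return Optional.of(retrieved);
    }
}
```

With <code>with("name").withOpt("language").where("bytestream")</code>, a description that lacks a language still matches and yields only its name, while a description that lacks a bytestream does not match at all.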
+
 
+
=== Catch-All Constraints ===
+
 
+
Clients may combine include constraints, filter constraints, and optional modifiers to build ''any'' projection that can be possibly built with the gDL. With these constructs, they can pinpoint exactly ''what'' properties are to be retrieved and ''when'' they should be retrieved. This accuracy is a main goal of the gDL, but it may be inconvenient when clients wish to express existence constraints on a number of properties at once. Some common projections, in particular, cannot be conveniently built with these constructs alone.
+
 
+
For example, clients may wish to constrain only a few properties but retrieve them all. Consider the following example:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document().with(NAME, opt(LANGUAGE), opt(BYTESTREAM), opt(METADATA), ...);
+
</source>
+
 
+
Here, the client requires document descriptions to have a name but wishes to retrieve any other property that they may have. To express this, the client must explicitly add optional existence constraints on all these properties.  Clearly, this is cumbersome and will break if the model is extended in the future.
+
 
+
To improve matters, clients may use the method <code>etc()</code>, which adds such existence constraints automatically:
+
 
+
<source lang="java5" highlight="3">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document().with(NAME).etc();
+
</source>
+
 
+
Similarly, clients may wish to add ''catch-all'' existence constraints on all properties but for a few ones, which they do not wish to retrieve. For this, they can use the method <code>allexcept()</code>:
+
 
+
<source lang="java5" highlight="3">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document().with(NAME).allexcept(BYTESTREAM,PART);
+
</source>
+
 
+
Here, bytestreams and parts are excluded from retrieval, if they exist.
+
 
+
Note that explicit <code>with()</code> and <code>where()</code> constraints have precedence over those automatically generated by <code>etc()</code> and <code>allexcept()</code>, regardless of the order of method invocation. The following example illustrates the point:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp1 = document().with(NAME).etc().where(LANGUAGE);
+
DocumentProjection dp2 = document().etc().with(NAME).where(LANGUAGE);
+
DocumentProjection dp3 = document().with(NAME).where(LANGUAGE).etc();
+
 
+
assert(dp1.equals(dp2));
+
assert(dp2.equals(dp3));
+
</source>
+
 
+
A similar example could be repeated for <code>allexcept()</code>.
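One way to picture why the order of invocation does not matter is to keep explicit constraints and the catch-all request in separate places and to consult the explicit ones first. The sketch below is an assumption about a possible design, not a description of how the gDL is implemented; <code>CatchAllSketch</code> and its string-based constraints are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical model of catch-all precedence; NOT the gDL implementation.
final class CatchAllSketch {

    // Explicit constraints: property name -> "include" or "filter".
    private final Map<String, String> explicit = new LinkedHashMap<>();
    // Set by etc(); kept apart from explicit constraints so call order is irrelevant.
    private boolean catchAll;

    CatchAllSketch with(String property) { explicit.put(property, "include"); return this; }
    CatchAllSketch where(String property) { explicit.put(property, "filter"); return this; }
    CatchAllSketch etc() { catchAll = true; return this; }

    // Explicit constraints win; etc() only covers properties left unconstrained.
    String constraintOn(String property) {
        if (explicit.containsKey(property)) {
            return explicit.get(property);
        }
        return catchAll ? "optional include" : "none";
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CatchAllSketch)) {
            return false;
        }
        CatchAllSketch other = (CatchAllSketch) o;
        return catchAll == other.catchAll && explicit.equals(other.explicit);
    }

    @Override
    public int hashCode() {
        return Objects.hash(explicit, catchAll);
    }
}
```

In this model the three call orders <code>with().etc().where()</code>, <code>etc().with().where()</code> and <code>with().where().etc()</code> all produce equal objects, and an empty projection remains distinct from one on which <code>etc()</code> has been invoked.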
+
 
+
On the other hand, <code>etc()</code> and <code>allexcept()</code> are intended as mutually exclusive alternatives and should ''not'' be used together in a projection. Doing so may produce undesired effects:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
//silly projection: allexcept() has no effect
+
DocumentProjection dp1 = document().etc().allexcept(NAME); 
+
 
+
assert(dp1.equals(document().etc()));
+
 
+
 
+
//silly projection: etc() reintroduces name...
+
DocumentProjection dp2 = document().allexcept(NAME).etc();
+
 
+
assert(dp2.equals(document().etc()));
+
 
+
</source>
+
 
+
Finally, note that <code>document()</code> and <code>document().etc()</code> are different projections even if they have the same implications for retrieval:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
//different projections with the same implications for retrieval
+
DocumentProjection dp1 = document(); 
+
DocumentProjection dp2 = document().etc();
+
 
+
assert(!dp1.equals(dp2));
+
 
+
</source>
+
 
+
The difference is in fact substantial: <code>document()</code> adds no constraints on properties, while <code>document().etc()</code> adds an optional constraint on each and every property.
+
Clients should prefer empty projections in all cases, as they travel faster over the network and are more likely to be executed faster by remote content manager services.
+
 
+
=== Deep Projections ===
+
 
+
In the examples above, we have considered existence constraints on simple element properties. The examples generalise easily to repeated structured properties, such as generic properties for all elements and inner element properties for document descriptions.
+
 
+
Consider the following example:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection dp = document().with(PART, opt(METADATA), PROPERTY);
+
</source>
+
 
+
Here the client adds three include constraints to the projection, all three for the existence of repeated properties. Document descriptions that match this projection have ''at least'' one part, ''at least'' one generic property, and zero or more metadata elements. Used as a parameter in the [[#Reading Documents|read operations]] of the gDL, this projection retrieves ''all'' the parts and ''all'' the generic properties of descriptions that have at least one of each, as well as ''all'' of their metadata elements if they happen to have some.
+
 
+
Repeated properties such as generic properties and inner elements are also structured, i.e. have properties of their own. Clients that wish to constrain those properties too can use ''deep projections'', i.e. embed within the projection of a given type one or more projections built for the structured properties of elements of that type. The following example illustrates the concept for metadata elements:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
MetadataProjection mp = metadata().with(LANGUAGE).where(BYTESTREAM);
+
 
+
DocumentProjection dp = document().with(NAME, PART)
+
                                  .with(METADATA,mp);
+
 
+
</source>
+
 
+
The first projection constrains the existence of language and bytestream for metadata elements. The second projection constrains the existence of name and parts for document descriptions, as well as the existence of metadata elements that match the constraints of the first projection. The usual implications of include constraints and filter constraints apply. Used as a parameter in the [[#Reading Documents|read operations]] of the gDL, this projection retrieves the name, parts, and metadata elements of document descriptions that have a name, at least one part, and at least one metadata element that has a language and includes a bytestream. For the metadata elements, in particular, it retrieves only the language property.
+
 
+
Note that optionality constraints apply to deep projections as well as they apply to flat projections, as the following example shows:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
MetadataProjection mp = metadata().with(LANGUAGE).where(BYTESTREAM);
+
 
+
DocumentProjection dp = document().with(NAME, PART)
+
                                  .with(opt(METADATA,mp));
+
 
+
</source>
+
 
+
This projection differs from the previous one only because the existence of metadata elements that match the inner projection is optional. Document descriptions that have a name and at least one part match the outer projection even if they have ''no'' metadata elements that match the inner projection (or no metadata elements at all).
+
 
+
=== Projecting over Generic Properties ===
+
 
+
Generic properties are repeated and structured properties common to all elements. As for other properties with these characteristics, clients may wish to build deep projections that constrain their inner properties. For this purpose, the class <code>Projections</code> includes a dedicated factory method <code>property()</code>, as well as specialised methods to express constraints. The following example illustrates the approach:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
 
+
PropertyProjection pp = property().withKey("somekey").with(PROPERTY_TYPE);
+
 
+
DocumentProjection dp = document().with(NAME, PART)
+
                                  .with(PROPERTY,pp);
+
 
+
</source>
+
 
+
Here, the client creates a document projection and embeds in it an inner projection that constrains its generic properties.
+
The inner projection uses the method <code>with()</code> to add an include constraint for the existence of a type for the generic property, as usual.
+
It also adds an include constraint to specify an exact value for the key of a generic property of interest. This relies on a method <code>withKey()</code> which is specific to projections over generic properties of elements. The reason for this specific construct is that, unlike other constrainable properties of elements, the key of a generic property serves as its identifier.
+
 
+
For the rest, property projections behave like other projections (e.g. can be used with optional modifiers). Used as a parameter in the [[#Reading Documents|read operations]] of the gDL, the projection above matches document descriptions with a name, at least one part, and a property with key <code>somekey</code> and some type.
+
 
+
=== Equivalence Constraints ===
+
 
+
In all the examples above, we have relied on simple existence constraints to present the mechanisms available for building projections. Moving beyond the existence of properties, another common type of constraint is based on text equivalences over simple element properties, which form the majority of properties of document descriptions and inner elements. The gDL offers dedicated mechanisms to specify this type of constraint in the form of overloads of the <code>with()</code> and <code>where()</code> methods of projection classes. The following example illustrates usage:
+
 
+
<source lang="java5" highlight="3,4">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
...
+
DocumentProjection p = document().with(LANGUAGE,"it")
+
                                .where(BYTESTREAM_URI,new URI("...."));
+
</source>
+
 
+
Here the client asks for the language of document descriptions only if this matches the ISO 639 code for the Italian language and only if the descriptions have a bytestream with a given URI. As the example shows, the overloads of <code>with()</code> and <code>where()</code> accept arbitrary objects and use their <code>toString()</code> serialisations for the equivalence (the serialisations are in fact refined in special cases such as dates, so clients should always pass objects rather than invoke <code>toString()</code> upon them).
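The idea of comparing property values against the string serialisation of an arbitrary object can be sketched with standard Java alone. <code>EquivalenceSketch</code> and its method <code>is()</code> are hypothetical names chosen for illustration; the sketch uses the plain <code>toString()</code> serialisation and does not reproduce the refinements that the gDL applies in special cases such as dates.

```java
import java.util.function.Predicate;

// Hypothetical model of a text-equivalence constraint; NOT the gDL implementation.
final class EquivalenceSketch {

    // Builds a predicate that accepts exactly the serialised form of 'expected'.
    static Predicate<String> is(Object expected) {
        // String.valueOf() delegates to toString() for non-null objects.
        String serialised = String.valueOf(expected);
        return serialised::equals;
    }
}
```

For instance, <code>EquivalenceSketch.is(java.net.URI.create("ftp://host/file"))</code> accepts the property value <code>"ftp://host/file"</code> and rejects any other text.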
+
 
+
=== Complex Constraints ===
+
 
+
In more advanced forms of projections, clients may wish to specify constraints on properties other than existence and equivalence.
+
For these cases, gDL offers overloads of <code>with()</code> and <code>where()</code> that take as parameters <code>Predicate</code>s that capture the desired constraints.
+
As mentioned above, predicates are defined in the [[Content_Manager_Library|CML]] and gDL clients need to become acquainted with the range of available predicates and how to [[Content_Manager_(NEW)#Building_Predicates| build them]].
+
 
+
As an example of the possibilities, consider the following:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
import static org.gcube.contentmanagement.contentmanager.stubs.model.constraints.Constraints.*;
+
import static org.gcube.contentmanagement.contentmanager.stubs.model.predicates.Predicates.*;
+
...
+
Calendar from = ...
+
Calendar to = ....
+
DocumentProjection p = document().with(BYTESTREAM_URI,uri(matches("^ftp.*")))
+
                                .where(CREATION_TIME,date(all(after(from),before(to))));
+
</source>
+
 
+
This projection is matched by document descriptions that have been created at some point in between two dates, and with a bytestream available at some <code>ftp</code> server. Used as a parameter in the [[#Reading Documents|read operations]] of the gDL, the projection would retrieve only the URI of (the bytestream of) matching descriptions. As documented in the [[Content_Manager_(NEW)#Building_Predicates| CML]], the client builds the predicate with the static methods of the <code>Predicates</code> and <code>Constraints</code> classes, which it has previously imported.
+
 
+
'''note''': in building predicate expressions with the API of the CML, clients take responsibility for associating properties with predicates that are compatible with their type. In the example above, the creation time of an element is a temporal property and thus only <code>date()</code>-based predicates can successfully match it. The gDL relinquishes the ability to ensure the correct construction of projections so as to allow clients to use the full expressiveness of the predicate language of the CML.
+
 
+
'''note''': Deep projections and overloads for equivalence constraints already make use of this customisability. When clients embed a projection into another, they constrain the corresponding structured property with the predicate into which the inner projection translates. Similarly, when they specify equivalence constraints they are implicitly using predicates of the form <code>text(is(...))</code>.
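For readers who have not yet met the predicate language of the CML, the composition used in this section (a regular-expression match on a URI, and a conjunction of two temporal bounds) can be mimicked with the standard <code>java.util.function.Predicate</code> type. This is a self-contained analogy only: <code>PredicateSketch</code> and its methods are hypothetical and merely borrow the names <code>matches</code>, <code>all</code>, <code>after</code> and <code>before</code> from the example above; they are not the CML API.

```java
import java.time.Instant;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical analogue of composable predicates; NOT the CML API.
final class PredicateSketch {

    // Accepts instants strictly later than the given bound.
    static Predicate<Instant> after(Instant bound) {
        return instant -> instant.isAfter(bound);
    }

    // Accepts instants strictly earlier than the given bound.
    static Predicate<Instant> before(Instant bound) {
        return instant -> instant.isBefore(bound);
    }

    // Conjunction: accepts a value only if every given predicate accepts it.
    @SafeVarargs
    static <T> Predicate<T> all(Predicate<T>... predicates) {
        return value -> {
            for (Predicate<T> predicate : predicates) {
                if (!predicate.test(value)) {
                    return false;
                }
            }
            return true;
        };
    }

    // Accepts strings that match the regular expression in full.
    static Predicate<String> matches(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return text -> pattern.matcher(text).matches();
    }
}
```

In this analogy, <code>all(after(from), before(to))</code> plays the role of the creation-time constraint and <code>matches("^ftp.*")</code> that of the URI constraint. As in the CML, nothing stops a client from pairing a predicate with a value of the wrong kind other than the types of the predicates themselves.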
+
  
 
== Streams ==
 
Line 344: Line 69:
 
Streaming raises significant opportunities for clients, as well as non-trivial challenges. In recognition of the difficulties, the gDL includes a set of general-purpose facilities for stream conversion that simplify the tasks of filtering, transforming, or otherwise processing streams. These facilities are cast as the sentences of the '''Stream DSL''', an Embedded Domain-Specific Language (EDSL).
 
  
=== Standard and Remote Iterators ===
+
[[GDL_Streams_(2.0)|read more...]]
 
+
As all the sentences of the Stream DSL take and return streams, we begin by looking at how streams are represented in the gDL.
+
 
+
Streams have the interface of ''iterators'', i.e. yield elements on demand and are typically  consumed within loops. There are two such interfaces:
+
 
+
* <code>Iterator&lt;T&gt;</code>, the standard Java interface for iterations.
+
* <code>RemoteIterator&lt;T&gt;</code>, a variation over <code>Iterator&lt;T&gt;</code> which marks explicitly the remote origin of the stream.
+
 
+
In particular, a <code>RemoteIterator</code> differs from a standard <code>Iterator</code> in two respects:
+
 
+
* the method <code>next()</code> may throw a checked <code>Exception</code>. This reflects the fact that iterating over the stream involves fallible I/O operations;
+
* there is a method <code>locator()</code> that returns a reference to the remote stream as a plain <code>String</code> in some implementation-specific syntax.
+
 
+
Locators aside, the key difference between the two interfaces is in their assumptions about the possibility of iteration failures. A standard <code>Iterator</code> does not present failures to its clients other than for requests made past the end of the stream (an unchecked <code>NoSuchElementException</code>). This may be because failures do not occur at all, e.g. the iteration is over an in-memory collection; it may also be because the iterator knows how to handle failures when these occur. In this sense, <code>Iterator&lt;T&gt;</code> may well be defined over external, even remote collections, but it assumes that all failure handling policies are the responsibility of its implementations.
+
 
+
In contrast, <code>RemoteIterator&lt;T&gt;</code> makes it clear that:
+
 
+
* failures are likely to occur;
+
* ''clients'' are expected to handle them.
+
 
+
The operations of the gDL make use of both interfaces:
+
 
+
* when they ''take'' streams, they expect them as standard <code>Iterator</code>s;
+
* when they ''return'' streams, they provide them as <code>RemoteIterator</code>s.
+
 
+
This choice emphasises two points:
+
 
+
* streams that are provided by clients are of unknown origin, those provided by the library originate in remote services of the gCube Content Management infrastructure.
+
* all fault handling policies are in the hands of clients, where they should be. When clients provide an <code>Iterator</code> to the library, they will have embedded a fault handling policy in its implementation. When they receive a <code>RemoteIterator</code> from the library, they will apply a fault handling policy when consuming the stream.
+
 
+
=== Simple Conversions ===
+
 
+
The sentences of the DSL begin with ''verbs'', which can be statically imported from the <code>Streams</code> class:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
...
+
</source>
+
 
+
The verb <code>convert</code> introduces the simplest of sentences, those that convert between <code>Iterator</code>s and <code>RemoteIterator</code>s. The following example shows the conversion of an <code>Iterator</code> into a <code>RemoteIterator</code>:
+
 
+
<source lang="java5" highlight="4">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
...
+
Iterator<SomeType> it = ...
+
RemoteIterator<SomeType> rit = convert(it);
+
</source>
+
 
+
The result is a <code>RemoteIterator</code> that promises to return failures but never does. The implementation is just a wrapper around the standard <code>Iterator</code> which returns <code>it.toString()</code> as the locator of the underlying collection.
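A wrapper of this kind is easy to picture. The sketch below is an assumption about its general shape rather than the library's code: <code>RemoteStream</code> is a hypothetical stand-in for <code>RemoteIterator</code>, reduced to the methods discussed in this section, and <code>LocalAsRemote</code> is a hypothetical name for the wrapper.

```java
import java.util.Iterator;

// Hypothetical stand-in for RemoteIterator<T>; NOT the gDL interface.
interface RemoteStream<T> {
    boolean hasNext();
    T next() throws Exception; // checked: iteration may involve fallible I/O
    String locator();
}

// Wraps a standard Iterator so that it can be used where a remote stream is expected.
final class LocalAsRemote<T> implements RemoteStream<T> {

    private final Iterator<T> local;

    LocalAsRemote(Iterator<T> local) {
        this.local = local;
    }

    @Override
    public boolean hasNext() {
        return local.hasNext();
    }

    // The interface allows a checked exception, but this wrapper never raises one.
    @Override
    public T next() {
        return local.next();
    }

    // The textual form of the wrapped iterator serves as the locator.
    @Override
    public String locator() {
        return local.toString();
    }
}
```

Clients that hold the wrapper through the <code>RemoteStream</code> type must still handle the checked exception declared by <code>next()</code>, even though the wrapper itself never throws it.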
+
 
+
Converting a <code>RemoteIterator</code> to an <code>Iterator</code> is more interesting because it requires the encapsulation of a fault handling policy. The following example shows the possibilities:
+
 
+
<source lang="java5" highlight="6,9,12,14">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
...
+
RemoteIterator<SomeType> rit = ...
+
 
+
//iterator will ignore any fault raised by the remote iterator
+
Iterator<SomeType> it1 = convert(rit).with(IGNORE_POLICY);
+
 
+
//iterator will stop at the first fault raised by the remote iterator
+
Iterator<SomeType> it2 = convert(rit).with(FAILFAST_POLICY);
+
 
+
//iterator will handle fault as specified by given policy
+
FaultPolicy policy = new FaultPolicy() {...};
+
 
+
Iterator<SomeType> it3 = convert(rit).with(policy);
+
</source>
+
 
+
In this example, the clause <code>with()</code> introduces the fault handling policy to encapsulate in the resulting <code>Iterator</code>. Two common policies are predefined and can be named directly, as shown for <code>it1</code> and <code>it2</code> above:
+
 
+
* <code>IGNORE_POLICY</code>: any faults raised by the <code>RemoteIterator</code> are discarded by the resulting <code>Iterator</code>, which will ensure that <code>hasNext()</code> and <code>next()</code> behave as if they had not occurred;
+
* <code>FAILFAST_POLICY</code>: the first fault raised by the <code>RemoteIterator</code> halts the resulting <code>Iterator</code>, which will ensure that <code>hasNext()</code> and <code>next()</code> behave as if the stream had reached its natural end.
+
 
+
Custom policies can be defined by implementing the interface <code>FaultPolicy</code>:
+
 
+
<source lang="java5" highlight="3">
+
public interface FaultPolicy ... {
+
+
boolean onFault(Exception e, int count);
+
 
+
}
+
</source>
+
 
+
In <code>onFault()</code>, clients are passed the fault raised by the <code>RemoteIterator</code>, as well as the count of faults raised so far during the iteration (this will be greater than <code>1</code> only if the policy has tolerated some previous faults during the iteration). Clients apply the policy and return <code>true</code> if the fault should be tolerated and the iteration should continue, <code>false</code> if they instead wish the iteration to stop. Here's an example of a fault handling policy that tolerates only the first error and uses two aliases for the boolean values to improve the legibility of the policy (<code>CONTINUE</code> and <code>STOP</code>, also defined in the <code>Streams</code> class and statically imported):
+
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
...
+
FaultPolicy policy = new FaultPolicy() {
+
   
+
      public boolean onFault(Exception e, int count) {
+
            if (count == 1) {
+
                  // ... deal with the fault ...
+
                  return CONTINUE;
+
            }
+
            else
+
                  return STOP;
+
        }
+
};
+
</source>
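To make the role of the policy concrete, the following sketch shows one way a conversion to a standard <code>Iterator</code> could consult a policy on each fault. It is not the gDL implementation: <code>RemoteIter</code>, <code>Policy</code> and <code>PolicyIterator</code> are hypothetical stand-ins for <code>RemoteIterator</code>, <code>FaultPolicy</code> and the result of <code>convert(rit).with(policy)</code>. The sketch also assumes that a failed call to <code>next()</code> still advances the remote stream, so that a tolerated fault skips one element.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for RemoteIterator<T>; NOT the gDL interface.
interface RemoteIter<T> {
    boolean hasNext();
    T next() throws Exception;
    String locator();
}

// Hypothetical stand-in for FaultPolicy: true to continue, false to stop.
interface Policy {
    boolean onFault(Exception e, int count);
}

// Standard Iterator that hides remote faults behind a policy.
final class PolicyIterator<T> implements Iterator<T> {

    private final RemoteIter<T> remote;
    private final Policy policy;
    private T lookahead;     // element fetched ahead of time by hasNext()
    private boolean ready;   // true while lookahead holds an unconsumed element
    private boolean stopped; // true once the policy has halted the iteration
    private int faults;      // number of faults raised so far

    PolicyIterator(RemoteIter<T> remote, Policy policy) {
        this.remote = remote;
        this.policy = policy;
    }

    @Override
    public boolean hasNext() {
        // Keep fetching until an element arrives, the stream ends, or the policy stops us.
        while (!ready && !stopped && remote.hasNext()) {
            try {
                lookahead = remote.next();
                ready = true;
            } catch (Exception fault) {
                faults++;
                if (!policy.onFault(fault, faults)) {
                    stopped = true; // behave as if the stream had ended
                }
            }
        }
        return ready;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        ready = false;
        T element = lookahead;
        lookahead = null;
        return element;
    }
}
```

In this model, a policy that always returns <code>true</code> behaves like <code>IGNORE_POLICY</code>, one that always returns <code>false</code> behaves like <code>FAILFAST_POLICY</code>, and the policy of the example above corresponds to the lambda <code>(e, count) -> count == 1</code>.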
+
 
+
Note also that the <code>IGNORE_POLICY</code> is the default policy for conversion to standard iterators. Clients can use the clause <code>withDefaults()</code> to avoid naming it.
+
 
+
<source lang="java5" highlight="6">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
...
+
RemoteIterator<SomeType> rit = ...
+
 
+
//iterator will handle faults with the default policy: IGNORE_POLICY
+
Iterator<SomeType> it = convert(rit).withDefaults();
+
</source>
+
 
+
Finally, note that stream conversions may also be applied between <code>RemoteIterator</code>s, so as to change their <code>FaultPolicy</code>:
+
 
+
<source lang="java5" highlight="6">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
...
+
RemoteIterator<SomeType> rit1 = ...
+
 
+
//iterator will handle faults with the given policy: IGNORE_POLICY
+
RemoteIterator<SomeType> rit2 = convert(rit1).withRemote(IGNORE_POLICY);
+
</source>
+
 
+
Here, the clause <code>withRemote()</code> introduces a fault policy for the <code>RemoteIterator</code> in output.  Fault policies for <code>RemoteIterator</code> are a superset of those that can be configured on standard <code>Iterator</code>s. In particular, they implement the interface <code>RemoteFaultPolicy</code>:
+
 
+
<source lang="java5" highlight="1,3">
+
public interface RemoteFaultPolicy ... {
+
+
boolean onFault(Exception e, int count) throws Exception;
+
 
+
}
+
</source>
+
 
+
Note that the only difference between a <code>FaultPolicy</code> and a <code>RemoteFaultPolicy</code> is that the latter has the additional option to raise a fault of its own in <code>onFault()</code>. Thus, when a fault occurs during iteration, the <code>RemoteIterator</code> can continue iterating, stop the iteration, but also ''re-throw'' the same or another fault to the iterating client, which is indeed what makes a <code>RemoteIterator</code> different from a standard <code>Iterator</code>.
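The three options open to a remote policy (continue, stop, re-throw) can be sketched in isolation. The interface below is a local stand-in that mirrors the signature shown above; the two policies and the helper <code>outcome()</code> are hypothetical names used only to illustrate the idea, not part of the gDL.

```java
class RemoteFaultPolicySketch {

    // Hypothetical stand-in that mirrors the signature of the gDL interface.
    interface RemoteFaultPolicy {
        boolean onFault(Exception e, int count) throws Exception;
    }

    // Tolerates the first two faults, then re-throws the next one to the iterating client.
    static final RemoteFaultPolicy TOLERATE_TWO = (e, count) -> {
        if (count <= 2) return true;
        throw e;
    };

    // Re-throws every fault as soon as it occurs.
    static final RemoteFaultPolicy RETHROW_ALWAYS = (e, count) -> { throw e; };

    // Reports what a policy decides when it is shown a sample fault with the given count.
    static String outcome(RemoteFaultPolicy policy, int count) {
        try {
            boolean tolerated = policy.onFault(new IllegalStateException("boom"), count);
            return tolerated ? "continue" : "stop";
        } catch (Exception rethrown) {
            return "rethrown: " + rethrown.getMessage();
        }
    }
}
```

A standard <code>FaultPolicy</code> could only ever produce the first two outcomes, since its <code>onFault()</code> declares no checked exception.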
+
 
+
In particular, the Stream DSL predefines a third policy which is available only for <code>RemoteIterator</code>s:
+
 
+
* <code>RETHROW_POLICY</code>: any faults raised during iteration will be immediately propagated to clients;
+
 
+
This is in fact the default policy for <code>RemoteIterator</code>s and clients can use the clause <code>withRemoteDefaults()</code> to avoid naming it:
+
 
+
<source lang="java5" highlight="5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
...
+
RemoteIterator<SomeType> rit1 = ...
+
 
+
RemoteIterator<SomeType> rit2 = convert(rit1).withRemoteDefaults();
+
</source>
+
 
+
 
+
In summary, the Stream DSL allows clients to formulate the following sentences for simple stream conversion:
+
 
+
* <code>convert(Iterator)</code>: converts a standard <code>Iterator</code> into a <code>RemoteIterator</code>;
+
* <code>convert(RemoteIterator).with(FaultPolicy)</code>: converts a <code>RemoteIterator</code> into a standard <code>Iterator</code> that encapsulates a given <code>FaultPolicy</code>;
+
* <code>convert(RemoteIterator).withDefaults()</code>: converts a <code>RemoteIterator</code> into a standard <code>Iterator</code> that encapsulates the <code>IGNORE_POLICY</code> for faults;
+
* <code>convert(RemoteIterator).withRemote(RemoteFaultPolicy)</code>: converts a <code>RemoteIterator</code> into another <code>RemoteIterator</code> that encapsulates a given <code>RemoteFaultPolicy</code>;
+
* <code>convert(RemoteIterator).withRemoteDefaults()</code>: converts a <code>RemoteIterator</code> into another <code>RemoteIterator</code> that encapsulates the <code>RETHROW_POLICY</code> for faults;
+
 
+
 
+
==== ResultSet Conversions ====
+
 
+
A different but very common form of conversion is between gCube [[GCube_ResultSet_(gRS)|result sets]] and <code>RemoteIterator</code>s. While result sets are the preferred way of modelling remote streams within the system, their iterators do not natively implement the <code>RemoteIterator&lt;T&gt;</code> interface, which has been independently defined in the [[Content_Manager_Library|CML]] as an abstraction over an underlying result set mechanism. The CML defines an initial set of [[Content_Manager_Library#Iterators_and_Collections|facilities]] to perform the conversion from result sets of untyped string payloads to <code>RemoteIterator</code>s of typed objects. The Stream DSL builds on these facilities to cater for a few common conversions:
+
 
+
 
+
* <code>payloadsIn(RSLocator)</code>: converts an arbitrary result set into a <code>RemoteIterator<String></code> defined over the record payloads in the result set;
+
* <code>urisIn(RSLocator)</code>: converts a result set of <code>URI</code> serialisations into a <code>RemoteIterator<URI></code>;
+
* <code>documentsIn(RSLocator)</code>: converts a result set of <code>GCubeDocument</code> serialisations into a <code>RemoteIterator&lt;GCubeDocument&gt;</code>;
+
* <code>metadataIn(RSLocator)</code>: converts a result set of <code>GCubeDocument</code> serialisations into a <code>RemoteIterator&lt;GCubeMetadata&gt;</code> defined over the metadata elements of the <code>GCubeDocument</code>s in the result set;
+
* <code>annotationsIn(RSLocator)</code>: converts a result set of <code>GCubeDocument</code> serialisations into a <code>RemoteIterator&lt;GCubeAnnotation&gt;</code> defined over the annotations of the <code>GCubeDocument</code>s in the result set;
+
* <code>partsIn(RSLocator)</code>: converts a result set of <code>GCubeDocument</code> serialisations into a <code>RemoteIterator&lt;GCubePart&gt;</code> defined over the parts of the <code>GCubeDocument</code>s in the result set;
+
* <code>alternativesIn(RSLocator)</code>: converts a result set of <code>GCubeDocument</code> serialisations into a <code>RemoteIterator&lt;GCubeAlternative&gt;</code> defined over the alternatives of the <code>GCubeDocument</code>s in the result set;
+
 
+
 
+
Essentially, <code>documentsIn()</code> adapts the result set to a <code>RemoteIterator&lt;T&gt;</code> that parses documents as it iterates over their serialisations. The methods that follow it in the list do the same, but extract the corresponding <code>GCubeElement</code>s from the <code>GCubeDocument</code>s obtained from parsing. All the methods are based on the first one, <code>payloadsIn()</code>, which is also immediately useful to feed result sets of <code>GCubeDocument</code> identifiers to the [[#Reading_Documents|read operations]] of the gDL that perform stream-based document lookups.
+
 
+
'''note''': all the conversions above produce <code>RemoteIterator</code>s that return the locator of the original result set from invocations of <code>locator()</code>. Clients can use the locator to process the stream with standard set-based APIs, as usual.
+
 
+
The usage pattern is straightforward and combines with the previous conversions. The following example illustrates:
+
 
+
<source lang="java5" highlight="4">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
...
+
RSLocator rs = ...
+
Iterator<GCubeDocument> it = convert(documentsIn(rs)).with(FAILFAST_POLICY);
+
</source>
+
 
+
=== Piped Conversions ===
+
 
+
The conversions introduced [[#Stream_Conversions|above]] do not alter the original streams, i.e. the output iterators produce the same elements as the input iterators. The exception is with result set-based conversions: <code>documentsIn()</code> parses the untyped payloads of the input result sets into typed objects, while methods such as <code>metadataIn()</code> extract <code>GCubeMetadata</code> elements from <code>GCubeDocument</code>s. Parsing and extraction are only examples of the kind of post-processing that clients may wish to apply to the elements of an existing stream to produce a new stream of post-processed elements. All the remaining sentences of the Stream DSL are dedicated precisely to this kind of conversion.
+
 
+
Sentences introduced by the verb <code>pipe</code> take a stream and return a second stream that applies an arbitrary ''filter'' to the elements of the first stream, encapsulating a fault handling policy in the process. The following example illustrates basic usage:
+
 
+
<source lang="java5" highlight="5,7,12">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
...
+
Iterator<GCubeDocument> it1 = ...
+
 
+
Filter<GCubeDocument,String> filter = new Filter<GCubeDocument,String>() {
+
 
+
                  public String apply(GCubeDocument doc) throws Exception {
+
                          return doc.name();
+
                  }
+
};
+
 
+
Iterator<String> it2 = pipe(it1).through(filter).withDefaults();
+
</source>
+
 
+
Here, a standard <code>Iterator</code> of <code>GCubeDocument</code>s is piped through a filter that extracts the names of <code>GCubeDocument</code>s. The result is another standard <code>Iterator</code> that produces document names from the original stream. The clause <code>through()</code> introduces the filter on the output stream and, as already discussed for conversion methods, the clause <code>withDefaults()</code> configures the default <code>IGNORE_POLICY</code> for it.
+
 
+
As shown in the example, filters are implementations of the <code>Filter&lt;FROM,TO&gt;</code> interface. The method <code>apply()</code> is self-explanatory: clients are given the elements of the unfiltered stream as the filtered stream is being iterated over, and they have the onus to produce and return an element of the filtered stream. The only point worth stressing is that <code>apply()</code> can throw a fault if it cannot produce an element of the filtered stream. The filtered stream passes these faults to the <code>FaultPolicy</code> configured for it. In this example, faults clearly cannot occur. If they did, however, the configured policy would simply ignore them, i.e. the problematic elements of the input stream would not contribute to the contents of the filtered stream.
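The behaviour of a filtered stream under an ignore-all policy can be approximated with plain Java. This is a separate, minimal sketch that assumes nothing about the library's internals: <code>Filter</code> is redeclared locally with the same <code>apply()</code> method, and <code>pipeIgnoringFaults()</code> and <code>demo()</code> are hypothetical names. Unlike the example above, the sample filter here can fail, so that the skipping of problematic elements is visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class PipeSketch {

    // Hypothetical stand-in that mirrors the apply() method of the gDL interface.
    interface Filter<FROM, TO> {
        TO apply(FROM element) throws Exception;
    }

    // Lazily applies the filter to the input and drops any element that raises a fault.
    static <FROM, TO> Iterator<TO> pipeIgnoringFaults(Iterator<FROM> input, Filter<FROM, TO> filter) {
        return new Iterator<TO>() {
            private TO lookahead;
            private boolean ready = false;

            public boolean hasNext() {
                while (!ready && input.hasNext()) {
                    try {
                        lookahead = filter.apply(input.next());
                        ready = true;
                    } catch (Exception fault) {
                        // ignored: the element does not contribute to the filtered stream
                    }
                }
                return ready;
            }

            public TO next() {
                if (!hasNext()) throw new NoSuchElementException();
                ready = false;
                return lookahead;
            }
        };
    }

    // Parses a stream of strings into integers; the unparsable entry is silently skipped.
    static List<Integer> demo() {
        Iterator<String> input = Arrays.asList("1", "x", "22").iterator();
        Filter<String, Integer> parse = Integer::parseInt;
        Iterator<Integer> filtered = pipeIgnoringFaults(input, parse);
        List<Integer> seen = new ArrayList<>();
        while (filtered.hasNext()) seen.add(filtered.next());
        return seen;
    }
}
```

The filter is applied only when the client asks for the next element, which keeps the two streams in phase.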
+
 
+
In the example the input stream and the filtered one are both standard <code>Iterator</code>s. The construct, however, is generic and can be used to filter any form of stream into any other. In this sense, the construct embeds stream conversions within its clauses. As an example, consider the common case in which a <code>RemoteIterator</code> is filtered into a standard <code>Iterator</code>:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
...
+
RemoteIterator<GCubeDocument> rit = ...
+
 
+
Filter<GCubeDocument,SomeType> filter = ....;
+
 
+
Iterator<SomeType> it = pipe(rit).through(filter).with(FAILFAST_POLICY);
+
</source>
+
 
+
Here, <code>filter</code> is applied to the elements of a <code>RemoteIterator</code> to produce a standard <code>Iterator</code> that stops as soon as the input stream raises a fault.
+
Conversely, consider the following example:
+
 
+
<source lang="java5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
...
+
RemoteIterator<GCubeDocument> rit1 = ...
+
 
+
Filter<GCubeDocument,SomeType> filter = ....;
+
 
+
RemoteIterator<SomeType> rit2 = pipe(rit1).through(filter).withRemote(IGNORE_POLICY);
+
</source>
+
 
+
Here, <code>filter</code> is applied to the elements of a <code>RemoteIterator</code> to produce yet another <code>RemoteIterator</code> that ignores any fault raised by the input iterator.
+
 
+
 
+
To conclude with <code>pipe</code>-based sentences, note that the Stream DSL includes <code>Processor&lt;T&gt;</code>, a base implementation of <code>Filter&lt;FROM,TO&gt;</code> that simplifies the declaration of filters that simply mutate the input and return it. The following example illustrates usage:
+
 
+
<source lang="java5" highlight="5,7">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
...
+
RemoteIterator<GCubeDocument> rit1 = ...
+
 
+
Processor<GCubeDocument> processor = new Processor<GCubeDocument>() {
+
 
+
            public void process(GCubeDocument doc) throws Exception {
+
                      doc.setName(doc.name()+"-modified");
+
            }
+
} ;
+
 
+
RemoteIterator<GCubeDocument> rit2 = pipe(rit1).through(processor).withRemoteDefaults();
+
</source>
+
 
+
Here, the <code>processor</code> simply updates the <code>GCubeDocument</code>s in the input stream by changing their name. The output stream thus returns the same elements of the input stream, albeit updated. During iteration, its policy is simply to re-throw any fault that may be raised by the input iterator.
+
 
+
 
+
In summary, the Stream DSL allows clients to formulate the following sentences for piped stream conversion:
+
 
+
* <code>pipe(Iterator|RemoteIterator).through(Filter|Processor).with(FaultPolicy)</code>: uses a given <code>Filter</code> or <code>Processor</code> to convert a standard <code>Iterator</code> or a <code>RemoteIterator</code> into a standard <code>Iterator</code> that encapsulates a given <code>FaultPolicy</code>;
+
* <code>pipe(Iterator|RemoteIterator).through(Filter|Processor).withDefaults()</code>: uses a given <code>Filter</code> or <code>Processor</code> to convert a standard <code>Iterator</code> or a <code>RemoteIterator</code> into a standard <code>Iterator</code> that encapsulates the <code>IGNORE_POLICY</code> for faults;
+
* <code>pipe(Iterator|RemoteIterator).through(Filter|Processor).withRemote(RemoteFaultPolicy)</code>: uses a given <code>Filter</code> or <code>Processor</code> to convert a standard <code>Iterator</code> or a <code>RemoteIterator</code> into a <code>RemoteIterator</code> that encapsulates a given <code>RemoteFaultPolicy</code>;
+
* <code>pipe(Iterator|RemoteIterator).through(Filter|Processor).withRemoteDefaults()</code>: uses a given <code>Filter</code> or <code>Processor</code> to convert a standard <code>Iterator</code> or a <code>RemoteIterator</code> into a <code>RemoteIterator</code> that encapsulates the <code>RETHROW_POLICY</code> for faults;
+
 
+
=== Folding Conversions ===
+
 
+
With <code>pipe</code>-based sentences, clients can filter the elements of a stream into the elements of another stream. While the elements of the two streams can vary arbitrarily in type, the correspondence between elements of the two streams is fairly strict: for each element of the input stream there may be at most one element of the output stream (elements that raise iteration failures in the input stream may have no counterpart in the output stream, i.e. may be discarded). In this sense, the streams are always consumed ''in phase''.
+
 
+
In some cases, however, clients may wish to:
+
 
+
* ''fold'' a stream, i.e. produce another stream that has one <code>List</code> element for each ''N'' elements of the original stream;
+
* ''unfold'' a stream, i.e. produce another stream that has ''N'' elements for each element in the original stream.
+
 
+
Conceptually, these requirements are still within the scope of filtering, but the fact that the consumption of the filtered stream is  ''out of phase'' with respect to the unfiltered stream requires a rather different treatment. For this reason, the Stream DSL offers two dedicated classes of sentences:
+
 
+
* <code>group</code>-based sentences for stream folding;
+
* <code>unfold</code>-based sentences for stream unfolding.
+
 
+
To fold a stream, clients indicate how many elements of the stream should be grouped into elements of the folded stream, what filter should be applied to each of the elements of the stream and, as usual, what fault handling policy should be used for the folded stream. The following example illustrates usage in the common case in which a <code>RemoteIterator</code> is folded into a standard <code>Iterator</code>:
+
 
+
<source lang="java5" highlight="7">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
...
+
RemoteIterator<GCubeDocument> rit = ...
+
 
+
Filter<GCubeDocument,SomeType> filter = ....;
+
 
+
Iterator<List<SomeType>> it = group(rit).in(10).pipingThrough(filter).withDefaults();
+
</source>
+
 
+
The <code>RemoteIterator</code> is here folded in <code>List</code>s of <code>10</code> elements (or smaller, if the end of the input stream is reached before a <code>List</code> of <code>10</code> elements can be filled). The clause <code>in()</code> indicates the maximum size of the output <code>List</code>s. Each of the <code>GCubeDocument</code>s in the original stream is then passed through <code>filter</code>, which produces one of the <code>List</code> elements for it. The clause <code>pipingThrough()</code> allows the configuration of the filter. Finally, the default <code>IGNORE_POLICY</code> is set on the folded stream with the clause <code>withDefaults()</code>, meaning that any fault raised by the <code>RemoteIterator</code> ''or'' <code>filter</code> will be tolerated and the element that caused the failure will simply not contribute to the accumulation of the next <code>10</code> elements of the folded stream.
+
 
+
'''note''': the example shows the folding of a <code>RemoteIterator</code> into a standard <code>Iterator</code> but, as for all the sentences of the DSL, all combinations of input and output streams are possible, with the usual implications on the fault handling policies that can be set on the folded stream and with the optional choice of <code>Processor</code>s over <code>Filter</code>s in cases where folding simply groups updated elements of the stream.
+
 
+
It is a common requirement to fold a stream without transforming or otherwise altering its elements. In this case, the clause <code>pipingThrough()</code> can be omitted altogether from the sentence:
+
 
+
<source lang="java5" highlight="5">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
...
+
RemoteIterator<GCubeDocument> rit = ...
+
 
+
Iterator<List<GCubeDocument>> it = group(rit).in(10).withDefaults();
+
</source>
+
 
+
Effectively, the stream is here being filtered with a ''pass-through'' filter that simply returns the elements of the unfolded stream. As we shall see, this kind of folding is particularly useful to 'slice' a stream in small in-memory collections that can be used with the [[#Adding_Documents|write operations]] of the gDL that work in bulk and by-value.
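The effect of pass-through folding can be sketched with standard Java iterators alone. The method <code>groupIn()</code> and the class around it are hypothetical names chosen for illustration; the sketch ignores fault handling altogether and only shows how groups of at most ''N'' elements are produced on demand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class GroupSketch {

    // Folds the input into lists of at most 'size' elements, built lazily as the client iterates.
    static <T> Iterator<List<T>> groupIn(Iterator<T> input, int size) {
        if (size < 1) throw new IllegalArgumentException("size must be positive");
        return new Iterator<List<T>>() {

            public boolean hasNext() {
                return input.hasNext();
            }

            public List<T> next() {
                if (!input.hasNext()) throw new NoSuchElementException();
                List<T> group = new ArrayList<>(size);
                // the last group may be smaller if the input ends early
                while (group.size() < size && input.hasNext()) group.add(input.next());
                return group;
            }
        };
    }

    // Folds five elements into groups of two.
    static List<List<Integer>> demo() {
        Iterator<List<Integer>> folded = groupIn(Arrays.asList(1, 2, 3, 4, 5).iterator(), 2);
        List<List<Integer>> groups = new ArrayList<>();
        while (folded.hasNext()) groups.add(folded.next());
        return groups;
    }
}
```

Only one group is held in memory at a time, which is what makes this kind of slicing suitable for bulk, by-value operations.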
+
 
+
 
+
In summary, the Stream DSL allows clients to formulate the following sentences for folding stream conversion:
+
 
+
* <code>group(Iterator|RemoteIterator).in(N).pipingThrough(Filter|Processor).with(FaultPolicy)</code>: uses a given <code>Filter</code> or <code>Processor</code> to <code>N</code>-fold a standard <code>Iterator</code> or a <code>RemoteIterator</code> into a standard <code>Iterator</code> that encapsulates a given <code>FaultPolicy</code>;
+
* <code>group(Iterator|RemoteIterator).in(N).pipingThrough(Filter|Processor).withDefaults()</code>: uses a given <code>Filter</code> or <code>Processor</code> to <code>N</code>-fold a standard <code>Iterator</code> or a <code>RemoteIterator</code> into a standard <code>Iterator</code> that encapsulates the <code>IGNORE_POLICY</code> for faults;
+
* <code>group(Iterator|RemoteIterator).in(N).pipingThrough(Filter|Processor).withRemote(RemoteFaultPolicy)</code>: uses a given <code>Filter</code> or <code>Processor</code> to <code>N</code>-fold a standard <code>Iterator</code> or a <code>RemoteIterator</code> into a <code>RemoteIterator</code> that encapsulates a given <code>RemoteFaultPolicy</code>;
+
* <code>group(Iterator|RemoteIterator).in(N).pipingThrough(Filter|Processor).withRemoteDefaults()</code>: uses a given <code>Filter</code> or <code>Processor</code> to <code>N</code>-fold a standard <code>Iterator</code> or a <code>RemoteIterator</code> into a <code>RemoteIterator</code> that encapsulates the <code>RETHROW_POLICY</code> for faults;
+
* <code>group(Iterator|RemoteIterator).in(N).with(FaultPolicy)</code>: uses a ''pass-through'' filter to <code>N</code>-fold a standard <code>Iterator</code> or a <code>RemoteIterator</code> into a standard <code>Iterator</code> that encapsulates a given <code>FaultPolicy</code>;
+
* <code>group(Iterator|RemoteIterator).in(N).withDefaults()</code>: uses a ''pass-through'' filter to <code>N</code>-fold a standard <code>Iterator</code> or a <code>RemoteIterator</code> into a standard <code>Iterator</code> that encapsulates the <code>IGNORE_POLICY</code> for faults;
+
* <code>group(Iterator|RemoteIterator).in(N).withRemote(RemoteFaultPolicy)</code>: uses a ''pass-through'' filter to <code>N</code>-fold a standard <code>Iterator</code> or a <code>RemoteIterator</code> into a <code>RemoteIterator</code> that encapsulates a given <code>RemoteFaultPolicy</code>;
+
* <code>group(Iterator|RemoteIterator).in(N).withRemoteDefaults()</code>: uses a ''pass-through'' filter to <code>N</code>-fold a standard <code>Iterator</code> or a <code>RemoteIterator</code> into a <code>RemoteIterator</code> that encapsulates the <code>RETHROW_POLICY</code> for faults.
+
 
+
 
+
=== Unfolding Conversions ===
+
 
+
Unfolding a stream follows a similar pattern, as shown in the following example:
+
 
+
<source lang="java5" highlight="7">
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
...
+
RemoteIterator<GCubeDocument> rit = ...
+
 
+
Filter<GCubeDocument,List<SomeType>> filter = ....;
+
 
+
Iterator<SomeType> it = unfold(rit).pipingThrough(filter).withDefaults();
+
</source>
+
 
+
This time we cannot dispense with using a <code>Filter</code>, which is necessary to map a single element of the stream into a <code>List</code> of elements that the unfolded stream, a standard <code>Iterator</code> in this example, will then yield one at a time at the client's demand. As usual, all combinations of standard <code>Iterator</code>s, <code>RemoteIterator</code>s, and fault handling policies are allowed. Using <code>Processor</code>s is instead disallowed here, as it's in the nature of unfolding to convert an element into a number of different elements. Unfolding and updates, in other words, do not interact well.
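As a rough illustration of the mechanics, unfolding can be sketched with plain Java iterators. <code>Filter</code> is redeclared locally with the same <code>apply()</code> method, while <code>unfoldIgnoringFaults()</code> and <code>demo()</code> are hypothetical names; the sketch hard-wires an ignore-all treatment of faults and is not meant to reflect how the library is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class UnfoldSketch {

    // Hypothetical stand-in that mirrors the apply() method of the gDL interface.
    interface Filter<FROM, TO> {
        TO apply(FROM element) throws Exception;
    }

    // Maps each input element to a list and yields the list elements one at a time.
    static <FROM, TO> Iterator<TO> unfoldIgnoringFaults(Iterator<FROM> input, Filter<FROM, List<TO>> filter) {
        return new Iterator<TO>() {
            private Iterator<TO> current = Collections.emptyIterator();

            public boolean hasNext() {
                // move to the next input element only when the current list is exhausted
                while (!current.hasNext() && input.hasNext()) {
                    try {
                        current = filter.apply(input.next()).iterator();
                    } catch (Exception fault) {
                        // ignored: the element contributes nothing to the unfolded stream
                    }
                }
                return current.hasNext();
            }

            public TO next() {
                if (!hasNext()) throw new NoSuchElementException();
                return current.next();
            }
        };
    }

    // Unfolds lines into words; the empty line raises a fault and is skipped.
    static List<String> demo() {
        Iterator<String> lines = Arrays.asList("a b", "", "c").iterator();
        Filter<String, List<String>> words = line -> {
            if (line.isEmpty()) throw new Exception("nothing to unfold");
            return Arrays.asList(line.split(" "));
        };
        Iterator<String> unfolded = unfoldIgnoringFaults(lines, words);
        List<String> seen = new ArrayList<>();
        while (unfolded.hasNext()) seen.add(unfolded.next());
        return seen;
    }
}
```

The output stream is consumed out of phase with the input: several calls to <code>next()</code> may be served by a single input element.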
+
 
+
The most common application of unfolding is for the extraction of inner elements from documents, e.g. unfold a stream of <code>GCubeDocument</code>s into a stream of <code>GCubeMetadata</code> elements, where each element in the unfolded stream belongs to some <code>GCubeDocument</code> in the document stream. Accordingly, the Stream DSL predefines a comprehensive number of these unfoldings. We have seen some of them [[#ResultSet_Conversions|already]], where the document input stream was in the form of a result set (e.g. <code>metadataIn(RSLocator)</code>). Similar unfoldings are directly available on <code>RemoteIterator<GCubeDocument></code>s.
+
 
+
 
+
In summary, the Stream DSL allows clients to formulate the following sentences for unfolding stream conversion:
+
 
+
* <code>unfold(Iterator|RemoteIterator).pipingThrough(Filter).with(FaultPolicy)</code>: uses a given <code>Filter</code> to unfold a standard <code>Iterator</code> or a <code>RemoteIterator</code> into a standard <code>Iterator</code> that encapsulates a given <code>FaultPolicy</code>;
+
* <code>unfold(Iterator|RemoteIterator).pipingThrough(Filter).withDefaults()</code>: uses a given <code>Filter</code> to unfold a standard <code>Iterator</code> or a <code>RemoteIterator</code> into a standard <code>Iterator</code> that encapsulates the <code>IGNORE_POLICY</code> for faults;
+
* <code>unfold(Iterator|RemoteIterator).pipingThrough(Filter).withRemote(RemoteFaultPolicy)</code>: uses a given <code>Filter</code> to unfold a standard <code>Iterator</code> or a <code>RemoteIterator</code> into a <code>RemoteIterator</code> that encapsulates a given <code>RemoteFaultPolicy</code>;
+
* <code>unfold(Iterator|RemoteIterator).pipingThrough(Filter).withRemoteDefaults()</code>: uses a given <code>Filter</code> to unfold a standard <code>Iterator</code> or a <code>RemoteIterator</code> into a <code>RemoteIterator</code> that encapsulates the <code>RETHROW_POLICY</code> for faults;
+
* <code>metadataIn(Iterator<GCubeDocument>|RemoteIterator<GCubeDocument>)</code>: unfolds a standard <code>Iterator&lt;GCubeDocument&gt;</code> or a <code>RemoteIterator&lt;GCubeDocument&gt;</code> into, respectively, a <code>Iterator&lt;GCubeMetadata&gt;</code> or a <code>RemoteIterator&lt;GCubeMetadata&gt;</code> defined over the metadata elements of the <code>GCubeDocument</code>s in the original stream;
+
* <code>annotationsIn(Iterator<GCubeDocument>|RemoteIterator<GCubeDocument>)</code>: unfolds a standard <code>Iterator&lt;GCubeDocument&gt;</code> or a <code>RemoteIterator&lt;GCubeDocument&gt;</code> into, respectively, a <code>Iterator&lt;GCubeAnnotation&gt;</code> or a <code>RemoteIterator&lt;GCubeAnnotation&gt;</code> defined over the annotations of the <code>GCubeDocument</code>s in the original stream;
+
* <code>partsIn(Iterator<GCubeDocument>|RemoteIterator<GCubeDocument>)</code>: unfolds a standard <code>Iterator&lt;GCubeDocument&gt;</code> or a <code>RemoteIterator&lt;GCubeDocument&gt;</code> into, respectively, a <code>Iterator&lt;GCubePart&gt;</code> or a <code>RemoteIterator&lt;GCubePart&gt;</code> defined over the parts of the <code>GCubeDocument</code>s in the original stream;
+
* <code>alternativesIn(Iterator<GCubeDocument>|RemoteIterator<GCubeDocument>)</code>: unfolds a standard <code>Iterator&lt;GCubeDocument&gt;</code> or a <code>RemoteIterator&lt;GCubeDocument&gt;</code> into, respectively, a <code>Iterator&lt;GCubeAlternative&gt;</code> or a <code>RemoteIterator&lt;GCubeAlternative&gt;</code> defined over the alternatives of the <code>GCubeDocument</code>s in the original stream;
+
  
 
= Operations =
 
 
Finally, read and write operations build on the facilities of the [[Content_Manager_Library| Content Manager Library]] (CML) to interact with the Content Manager service, including the adoption of [[Content_Manager_Library#High-Level Calls|best-effort strategies]] to discover and interact with instances of the service. These facilities are thus indirectly available to gDL clients as well.
 
  
== Reading Documents ==
+
[[GDL_Operations_(2.0)|read more...]]
 
+
Clients retrieve document descriptions from remote collections with the operations of a <code>DocumentReader</code>.
+
Readers are created in the scope of the target collection, as follows:
+
 
+
<source lang="java5" highlight="4" >
+
GCUBEScope scope = ...
+
String collectionID =...
+
 
+
DocumentReader reader = new DocumentReader(collectionID,scope);
+
</source>
+
 
+
In a secure infrastructure, a security manager is also required:
+
 
+
<source lang="java5" highlight="5" >
+
GCUBEScope scope = ...
+
GCUBESecurityManager manager = ...
+
String collectionID =...
+
 
+
DocumentReader reader = new DocumentReader(collectionID,scope,manager);
+
</source>
+
 
+
Readers expose three <code>get()</code> operations to retrieve document descriptions from target collections, two ''lookup'' operations and one ''query'' operation:
+
 
+
* <code>get(String,Projection)</code>: retrieves the description of a document with a given identifier, where the description matches a given projection and reflects the retrieval directives therein;
+
* <code>get(Iterator<String>,Projection)</code>: retrieves a stream of document descriptions from a stream of their identifiers, where the descriptions match a given projection and reflect the retrieval directives therein;
+
* <code>get(Projection)</code>: returns a stream of document descriptions, where the descriptions match a given projection and reflect the retrieval directives therein.
+
 
+
The operations and their return values can be illustrated as follows:
+
 
+
<source lang="java5" highlight="6,9,12" >
+
DocumentReader reader = ...
+
 
+
DocumentProjection p = ....
+
 
+
String id = ...
+
GCubeDocument doc = reader.get(id,p);
+
 
+
Iterator<String> ids = ...
+
RemoteIterator<GCubeDocument> docs = reader.get(ids,p);
+
 
+
 
+
RemoteIterator<GCubeDocument> allDocs = reader.get(p);
+
</source>
+
 
+
A few points are worth emphasising:
+
 
+
* <code>get(Iterator<String>,Projection)</code> takes a stream of identifiers under the standard <code>Iterator</code> interface. As discussed at length [[#Local_And_Remote_Iterators|above]], this indicates that the operation makes no assumption as to the origin of the stream and that it has no policy of its own to deal with possible iteration failures; clients need to provide one in the implementation of the <code>Iterator</code>. Conversely, <code>get(Projection)</code> returns a <code>RemoteIterator</code> because it can guarantee the remote origin of the stream, though it still has no policy of its own to handle possible iteration failures. Again, clients are responsible for providing one. Clients can use the [[GCube_Document_Library_(2.0)#Piped_Conversions|pipe sentences]] of the [[#Streams|Stream DSL]] to derive the <code>Iterator</code>s in input from other forms of streams and to post-process the <code>RemoteIterator</code>s in output.
+
 
+
* as a convenience, all the retrieval operations can take projections other than <code>DocumentProjection</code>s. Projections over the inner elements of documents are equally accepted, e.g.:
+
 
+
<source lang="java5" highlight="4" >
+
DocumentReader reader = ...
+
MetadataProjection mp = ....
+
 
+
RemoteIterator<GCubeDocument> docs = reader.get(mp);
+
</source>
+
 
+
Here, matched documents are characterised directly with a <code>MetadataProjection</code>. The operation will derive a corresponding <code>DocumentProjection</code> with a single include constraint that requires matching documents to have ''at least'' one metadata element that satisfies the projection. As usual, the output stream will retrieve from such documents no more than what the original <code>MetadataProjection</code> specifies in its include constraints. Again, clients are recommended to use the [[#Streams|Stream DSL]] to extract the metadata elements from the output stream and possibly to process it further, e.g.:
+
 
+
<source lang="java5" highlight="4" >
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
 
+
DocumentReader reader = ...
+
MetadataProjection mp = ....
+
 
+
RemoteIterator<GCubeMetadata> metadata = metadataIn(reader.get(mp));
+
</source>
+
 
+
Similarly, the [[#Streams|Stream DSL]] can be relied upon in the common case in which input streams originate in remote result sets, or when the output streams must be computed over using the result set API. The following example illustrates some of the possibilities:
+
 
+
<source lang="java5" highlight="10,12,15" >
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
 
+
DocumentReader reader = ...
+
MetadataProjection mp = ....
+
 
+
//a result set of document identifiers
+
RSLocator idRS = ....
+
 
+
//extracts identifiers from result set into remote iterator and converts it into a local iterator
+
Iterator<String> ids = convert(payloadsIn(idRS)).withDefaults();
+
 
+
RemoteIterator<GCubeMetadata> metadata = metadataIn(reader.get(ids,mp));
+
 
+
//extract result set locator from remote iterator
+
RSLocator docRS = new RSLocator(metadata.locator());
+
 
+
//use locator with result set API
+
...
+
</source>
+
 
+
Finally, note that the example above does not handle possible failures. Clients may consult the code documentation for a list of the faults that the individual operations may raise.
+
 
+
== Resolving Elements ==
+
 
+
A <code>DocumentReader</code> may also be used to resolve [[Content_Manager_Library#Content_URIs|content URIs]] into individual elements of document descriptions. It offers two operations for this purpose:
+
 
+
* <code>resolve(URI,Projection)</code>: takes a content URI and returns the description of the document element identified by it, where the description matches a given projection and reflects its retrieval directives;
+
* <code>resolve(Iterator<URI>,Projection)</code>: takes a stream of content URIs and returns a stream of descriptions of the document elements identified by them, where the descriptions match a given projection and reflect its retrieval directives.
+
 
+
The operations and their return values can be illustrated as follows:
+
 
+
<source lang="java5" highlight="7,12" >
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
 
+
//a reader over a given collection
+
DocumentReader reader = ...
+
 
+
//a sample content URI to a metadata element of a document
+
URI metadataURI = new URI("cms://somecollection/somedocument/somemetadata");
+
 
+
//a sample projection over metadata elements
+
MetadataProjection mp = metadata().with(BYTESTREAM);
+
 
+
GCubeMetadata element = reader.resolve(metadataURI,mp);
+
</source>
+
 
+
Here the client resolves a metadata element and uses a projection to limit retrieval to its bytestream alone.
+
 
+
Note the following points:
+
 
+
* the URI must point to an element within the target collection of the <code>DocumentReader</code>. In this example, the reader must be bound to <code>somecollection</code> or the operation will fail;
+
 
+
* the resolution is typed, i.e. the client must know the type of element identified by the URI. Providing a projection gives a hint to the reader as to what type of element is expected. Resolution will fail if the URI points to an element of a different type as much as it fails if it points to an unknown element;
+
 
+
* as usual, [[#Empty_Projections| empty projections]] can be used for conventional resolution, i.e. to retrieve the whole element unconditionally;
+
 
+
* clients can resolve content URIs that identify whole documents, in combination with document projections. In this case, <code>resolve()</code> behaves exactly like <code>get()</code> when the latter is invoked with the document identifier inside the URI;
+
 
+
* remember that the [[Content_Manager_Library|CML]] defines a set of [[Content_Manager_Library#Constructing.2C_Deconstructing.2C_and_Converting_URIs|facilities]] to compose and decompose content URIs;
+
 
+
* remember also that the [[GCube_Document_Model#Implementation|gML]] defines a method <code>uri()</code> on documents and their elements. Clients that work with [[GCube_Document_Model#New_Elements_and_Element_Proxies|element proxies]] can use it to obtain their content URI and then store it, move it over the network, etc. until it becomes available to the same or different clients for resolution.
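
As a rough illustration of the first point above, a client could check locally that a content URI targets the collection of its reader before attempting resolution. The sketch below uses only <code>java.net.URI</code> and assumes, based solely on the sample URI shown earlier, that the collection identifier appears as the authority component of the URI. The class <code>ContentUriSketch</code> and its methods are hypothetical names; the CML facilities mentioned above remain the proper way to compose and decompose content URIs.

<source lang="java5">
import java.net.URI;
import java.net.URISyntaxException;

class ContentUriSketch {

    // Assumption: the collection identifier is the authority component,
    // as in cms://somecollection/somedocument/somemetadata
    static String collectionOf(URI uri) {
        return uri.getAuthority();
    }

    // true if the URI points inside the collection the reader is bound to
    static boolean targets(URI uri, String collectionID) {
        return collectionID.equals(collectionOf(uri));
    }

    public static void main(String[] args) throws URISyntaxException {
        URI metadataURI = new URI("cms://somecollection/somedocument/somemetadata");
        System.out.println(targets(metadataURI, "somecollection"));  // prints true
        System.out.println(targets(metadataURI, "othercollection")); // prints false
    }
}
</source>

A check of this kind only anticipates one of the failures listed above; whether the element exists and has the expected type can only be established by the resolution itself.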
+
 
+
As an example of stream-based resolution consider the following:
+
 
+
<source lang="java5" highlight="7,12" >
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
 
+
//a reader over a given collection
+
DocumentReader reader = ...
+
 
+
//an iterator over content URIs of annotations
+
Iterator<URI> annotationURIs = ...;
+
 
+
//a sample projection over annotations
+
AnnotationProjection ap =...
+
 
+
RemoteIterator<GCubeAnnotation> annotations = reader.resolve(annotationURIs,ap);
+
</source>
+
 
+
Like all the stream-based operations of the <code>DocumentReader</code>, <code>resolve()</code> takes streams as standard <code>Iterator</code>s and returns streams as <code>RemoteIterator</code>s. As usual, clients can use the facilities of the [[#Streams|Stream DSL]] to convert to and from these iterators and other models of streams. In particular:
+
 
+
* remember that the method <code>urisIn()</code> of the <code>Streams</code> class can transparently convert a result set of content URI serialisations into an equivalent <code>RemoteIterator&lt;URI&gt;</code>.
+
 
+
== Adding Documents ==
+
 
+
Clients add document descriptions to remote collections with the operations of a <code>DocumentWriter</code>.
+
Writers are created in the scope of the target collection, as follows:
+
 
+
<source lang="java5" highlight="4" >
+
GCUBEScope scope = ...
+
String collectionID =...
+
 
+
DocumentWriter writer = new DocumentWriter(collectionID,scope);
+
</source>
+
 
+
In a secure infrastructure, a security manager is also required:
+
 
+
<source lang="java5" highlight="5" >
+
GCUBEScope scope = ...
+
GCUBESecurityManager manager = ...
+
String collectionID =...
+
 
+
DocumentWriter writer = new DocumentWriter(collectionID,scope,manager);
+
</source>
+
 
+
Writers expose four operations to add document descriptions to the target collections, two singleton operations and two bulk operations. All the operations take [[GCube_Document_Model#New_Elements_and_Element_Proxies|new]] document descriptions built with the [[GCube_Document_Model#Creating_Elements|APIs]] of the gML. In addition, each description must satisfy certain basic criteria, including:
+
 
+
* the consistency between the collection bound to it and the collection bound to the writer;
+
* other constraints specific to its inner elements.
+
 
+
We say that the description must be <em>valid</em> for addition. The operations are the following:
+
 
+
* <code>add(GCubeDocument)</code>: adds a valid document description to the target collection and returns an identifier for it;
+
 
+
* <code>addAndSynchronize(GCubeDocument)</code>: adds a valid document description to the target collection and returns a [[GCube_Document_Model#New_Elements_and_Element_Proxies|proxy]] for it. The proxy is synchronised with the changes applied to the description at the point of addition, including the assignment of identifiers to the whole description and its inner elements;
+
 
+
* <code>add(Iterator<GCubeDocument>)</code>: adds a stream of valid document descriptions to the target collection and returns a <code>Future<?></code> of the completion of the operation.
+
 
+
* <code>add(List<GCubeDocument>)</code>: adds a list of valid document descriptions to the target collection and returns a list of corresponding outcomes, where each outcome is either an identifier for the description or else a processing failure.
+
 
+
The operations and their return values can be illustrated as follows:
+
 
+
<source lang="java5" highlight="5,8,12,16,19" >
+
DocumentWriter writer = ...
+
 
+
//singleton add
+
GCubeDocument doc = ...
+
String id = writer.add(doc);
+
 
+
//singleton with synchronization
+
GCubeDocument synchronizedDoc = writer.addAndSynchronize(doc); //'synchronized' is a reserved word in Java
+
 
+
//bulk by-value
+
List<GCubeDocument>  docs = ...
+
List<AddOutcome> outcomes = writer.add(docs);
+
 
+
//bulk by-ref
+
Iterator<GCubeDocument> docStream = ...
+
Future<?> future = writer.add(docStream);
+
....
+
//wait for completion: get() blocks until the submission ends (isDone() polls without blocking)
+
if (future.get()==null)
+
    ...submission is completed...
+
</source>
+
 
+
A few points are worth emphasising:
+
 
+
* <code>addAndSynchronize(GCubeDocument)</code> requires ''two'' remote interactions, one to add the document description and one to retrieve its synchronised proxy. Clients are responsible for replacing the input description with the proxy in any local structure that may already contain references to the former;
+
 
+
* <code>add(Iterator<GCubeDocument>)</code> follows the same pattern for stream-based operations already discussed for [[#Reading_Documents|read operations]]. Invalid document descriptions found in the input stream are silently discarded. Due to the possibility of these pre-processing failures and its non-blocking nature, the operation cannot guarantee the fidelity of outcome reports. For this reason, the operation returns only a <code>Future<?></code> that clients can poll to know when all the descriptions in input have been submitted for addition. Clients that work with large or remote streams, ''and'' are interested in processing outcomes, are responsible for grouping the elements of the stream in 'chunks' of acceptable size and for using <code>add(List<GCubeDocument>)</code>;
+
 
+
* <code>add(List<GCubeDocument>)</code> is indicated for small input collections and/or when reports on outcomes are important to clients. Invalid document descriptions found in the input list fail the operation ''before'' any attempt is made to add any document description to the target collection. Clients can use the [[GCube_Document_Library_(2.0)#Folding_Conversions|group sentences]] of the [[#Streams|Stream DSL]] to conveniently convert an arbitrarily large stream of <code>GCubeDocument</code>s into a stream of <code>List<GCubeDocument></code>, which can then be fed to the operation element by element. The following example illustrates this processing pattern:
+
 
+
<source lang="java5" highlight="1,7,10,12,13,16" >
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
 
+
DocumentWriter writer = ...
+
Iterator<GCubeDocument> docs = ...
+
 
+
//fold stream in chunks of 30 descriptions
+
Iterator<List<GCubeDocument>> chunks = group(docs).in(30).withDefaults();
+
 
+
while (chunks.hasNext()) {
+
List<AddOutcome> outcomes = writer.add(chunks.next());
+
for (AddOutcome outcome : outcomes) {
+
  if (outcome.getSuccess()!=null) {
+
      ...outcome.getSuccess().getId()...
+
    }
+
    else {
+
      ...outcome.getFailure().getFault()... 
+
}
+
}
+
}
+
</source>
+
 
+
* <code>add(List<GCubeDocument>)</code> and <code>add(Iterator<GCubeDocument>)</code> use result set mechanisms to interact with remote services and thus can be invoked only inside a container;
+
 
+
* the processing model for <code>AddOutcome</code>s is [[Content_Manager_Library#Adding_Documents|already defined]] in the CML.
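
For readers unfamiliar with the grouping step used in the pattern above, the following self-contained sketch shows what folding an iterator into fixed-size chunks amounts to. It is plain Java and not the gDL implementation: the class <code>ChunkSketch</code> and its method <code>chunk()</code> are hypothetical stand-ins for the <code>group(...).in(...)</code> sentence of the Stream DSL, and integers stand in for document descriptions.

<source lang="java5">
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ChunkSketch {

    // lazily folds the source into lists of at most 'size' elements
    static <T> Iterator<List<T>> chunk(final Iterator<T> source, final int size) {
        if (size <= 0)
            throw new IllegalArgumentException("size must be positive");
        return new Iterator<List<T>>() {
            public boolean hasNext() {
                return source.hasNext();
            }
            public List<T> next() {
                if (!source.hasNext())
                    throw new NoSuchElementException();
                List<T> chunk = new ArrayList<T>(size);
                // consume the source only as far as the current chunk requires
                while (chunk.size() < size && source.hasNext())
                    chunk.add(source.next());
                return chunk;
            }
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    public static void main(String[] args) {
        Iterator<Integer> docs = Arrays.asList(1, 2, 3, 4, 5, 6, 7).iterator();
        Iterator<List<Integer>> chunks = chunk(docs, 3);
        while (chunks.hasNext())
            System.out.println(chunks.next()); // [1, 2, 3] then [4, 5, 6] then [7]
    }
}
</source>

Because each chunk is built on demand, only one chunk at a time needs to be held in memory, which is the property that makes the pattern suitable for large or remote streams.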
+
 
+
Finally, note that the code in the examples above does not handle possible failures. Clients may consult the code documentation for an enumeration of the faults that the individual operations may raise.
+
 
+
== Updating Documents ==
+
 
+
A <code>DocumentWriter</code> may also be used to update document descriptions already in the target collection.
+
 
+
It offers four operations for this purpose, two singleton operations and two bulk operations. The operations mirror those that add document descriptions to the target collection, as discussed [[#Adding_Documents|above]]. Like the add operations, the update operations take document descriptions built with the [[GCube_Document_Model#Creating_Elements|APIs]] of the gML. However, the descriptions must be [[GCube_Document_Model#New_Elements_and_Element_Proxies|proxies]] of remote descriptions and each proxy must satisfy certain basic criteria, including:
+
 
+
* the consistency between the collection bound to it and the collection bound to the writer;
+
* the existence of [[GCube_Document_Model#Updating_Elements|tracked changes]] on it;
+
* other constraints that are specific to its inner elements.
+
 
+
We say that the proxy must be <em>valid</em> for update. The operations are the following:
+
 
+
* <code>update(GCubeDocument)</code>: updates a document description in the target collection with the properties of a valid proxy;
+
* <code>updateAndSynchronize(GCubeDocument)</code>: updates a document description in the target collection with the properties of a valid proxy, returning another proxy that is fully synchronised with the description, i.e. reflects all its properties after the update, including the times of last update for the description and its inner elements;
+
* <code>update(Iterator<GCubeDocument>)</code>: updates multiple document descriptions in the target collection with the properties of a stream of valid proxies, returning a <code>Future<?></code> for the future completion of the operation;
+
* <code>update(List<GCubeDocument>)</code>: updates multiple document descriptions in the target collection with the properties of a list of valid proxies, returning a map of processing failures indexed by the identifier of the corresponding description.
+
 
+
The operations and their return values can be illustrated as follows:
+
 
+
<source lang="java5" highlight="5,8,12,16,19" >
+
DocumentWriter writer = ...
+
 
+
//singleton update
+
GCubeDocument proxy = ...
+
writer.update(proxy);
+
 
+
//singleton with synchronization
+
GCubeDocument synchronizedProxy = writer.updateAndSynchronize(proxy); //'synchronized' is a reserved word in Java
+
 
+
//bulk by-value
+
List<GCubeDocument>  proxies = ...
+
Map<String,Throwable> failures = writer.update(proxies);
+
 
+
//bulk by-ref
+
Iterator<GCubeDocument> proxyStream = ...
+
Future<?> future = writer.update(proxyStream);
+
....
+
//wait for completion: get() blocks until the submission ends (isDone() polls without blocking)
+
if (future.get()==null)
+
    ...submission is completed...
+
</source>
+
 
+
A few points are worth emphasising:
+
 
+
* <code>updateAndSynchronize(GCubeDocument)</code> requires ''two'' remote interactions, one to update the document description and one to retrieve its synchronised proxy;
+
 
+
* <code>update(Iterator<GCubeDocument>)</code> follows the same pattern for stream-based operations already discussed for [[#Adding_Documents|add operations]]. Invalid proxies found in the stream are silently discarded. Due to the possibility of these pre-processing failures and its non-blocking nature, the operation cannot guarantee the fidelity of outcome reports. For this reason, the operation returns only a <code>Future<?></code> that clients can poll to know when all the proxies in input have been submitted for update. Clients that work with large or remote streams, ''and'' are interested in processing outcomes, are responsible for grouping the elements of the stream in 'chunks' of acceptable size and for using <code>update(List<GCubeDocument>)</code>;
+
 
+
* <code>update(List<GCubeDocument>)</code> is indicated for small input collections and/or when outcome reports are important to clients.  Invalid proxies found in the input list fail the operation ''before'' any attempt is made to update any document description in the target collection. Clients can use the [[GCube_Document_Library_(2.0)#Folding_Conversions|group sentences]] of the [[#Streams|Stream DSL]] to conveniently convert an arbitrarily large stream of <code>GCubeDocument</code>s into a stream of <code>List<GCubeDocument></code>, which can then be fed to the operation element by element. The following example illustrates this processing pattern:
+
 
+
<source lang="java5" highlight="1,7,9,10" >
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
 
+
DocumentWriter writer = ...
+
Iterator<GCubeDocument> proxies = ...
+
 
+
//fold stream in chunks of 30 proxies
+
Iterator<List<GCubeDocument>> chunks = group(proxies).in(30).withDefaults();
+
 
+
while (chunks.hasNext()) {
+
Map<String,Throwable> failures = writer.update(chunks.next());
+
...process failures...
+
}
+
</source>
+
 
+
* <code>update(List<GCubeDocument>)</code> and <code>update(Iterator<GCubeDocument>)</code> use result set mechanisms to interact with remote services and thus can be invoked only inside a container.
+
 
+
Finally, note that the code in the examples above does not handle possible failures. Clients may consult the code documentation for a list of the faults that the individual operations may raise.
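
The bulk by-ref examples above rely on the <code>Future</code> returned by the stream-based operations. Assuming that this is a standard <code>java.util.concurrent.Future</code>, the sketch below contrasts a non-blocking check with a bounded wait. The background task is a hypothetical stand-in for the submission of a stream, and no gDL type is involved; <code>FuturePollingSketch</code> and <code>submit()</code> are illustrative names only.

<source lang="java5">
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class FuturePollingSketch {

    // submits a stand-in task that completes only when the latch is released
    static Future<?> submit(ExecutorService pool, final CountDownLatch release) {
        return pool.submit(new Callable<Void>() {
            public Void call() throws Exception {
                release.await();
                return null;
            }
        });
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        try {
            Future<?> future = submit(pool, release);
            // non-blocking poll: the task is still waiting on the latch
            System.out.println("done? " + future.isDone());
            release.countDown();
            // bounded wait: blocks until completion or throws a TimeoutException
            Object outcome = future.get(5, TimeUnit.SECONDS);
            System.out.println("done? " + future.isDone() + ", outcome: " + outcome);
        } finally {
            pool.shutdown();
        }
    }
}
</source>

A bounded wait of this kind lets a client give up on a submission that takes too long, whereas an unbounded <code>get()</code> blocks the calling thread until the submission ends.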
+
 
+
== Deleting Documents ==
+
 
+
A <code>DocumentWriter</code> may also be used to delete document descriptions already in the target collection.
+
 
+
It offers three operations for this purpose, one singleton operation and two bulk operations. The operations mirror those that update document descriptions in the target collection, as discussed [[#Updating_Documents|above]]. Like the update operations, the delete operations take [[GCube_Document_Model#New_Elements_and_Element_Proxies|proxies]] of remote descriptions built with the [[GCube_Document_Model#Creating_Elements|APIs]] of the gML. The operations are the following:
+
 
+
* <code>delete(GCubeDocument)</code>: deletes a document description from the target collection using a proxy for it;
+
* <code>delete(Iterator<GCubeDocument>)</code>: deletes multiple document descriptions from the target collection using a stream of proxies for them, returning a <code>Future<?></code> for the future completion of the operation;
+
* <code>delete(List<GCubeDocument>)</code>: deletes multiple document descriptions from the target collection using a list of proxies for them, returning a map of processing failures indexed by the identifier of the corresponding description.
+
 
+
The operations and their return values can be illustrated as follows:
+
 
+
<source lang="java5" highlight="5,8,12,16,19" >
+
DocumentWriter writer = ...
+
 
+
//singleton delete
+
GCubeDocument proxy = ...
+
writer.delete(proxy);
+
 
+
//bulk by-value
+
List<GCubeDocument>  proxies = ...
+
Map<String,Throwable> failures = writer.delete(proxies);
+
 
+
//bulk by-ref
+
Iterator<GCubeDocument> proxyStream = ...
+
Future<?> future = writer.delete(proxyStream);
+
....
+
//wait for completion: get() blocks until the submission ends (isDone() polls without blocking)
+
if (future.get()==null)
+
    ...submission is completed...
+
</source>
+
 
+
A few points are worth emphasising:
+
 
+
* <code>delete(Iterator<GCubeDocument>)</code> follows the same pattern for stream-based operations already discussed for [[#Adding_Documents|add operations]] and [[#Updating_Documents|update operations]]. Document descriptions found in the stream which are not proxies are silently discarded.  Due to the possibility of these pre-processing failures and its non-blocking nature, the operation cannot guarantee the fidelity of outcome reports. For this reason, the operation returns only a <code>Future<?></code> that clients can poll to know when all the proxies in input have been submitted for deletion. Clients that work with large or remote streams, ''and'' are interested in processing outcomes, are responsible for grouping the elements of the stream in 'chunks' of acceptable size and use <code>delete(List<GCubeDocument>)</code>;
+
 
+
* <code>delete(List<GCubeDocument>)</code> is indicated for small input collections and/or when outcome reports are important to clients.  Document descriptions  found in the input list which are not proxies fail the operation ''before'' any attempt is made to delete any document description in the target collection. Clients can use the [[GCube_Document_Library_(2.0)#Folding_Conversions|group sentences]] of the [[#Streams|Stream DSL]] to conveniently convert an arbitrarily large stream of <code>GCubeDocument</code>s into a stream of <code>List<GCubeDocument></code>, which can then be fed to the operation element by element. The following example illustrates this processing pattern:
+
 
+
<source lang="java5" highlight="1,7,9,10" >
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
+
 
+
DocumentWriter writer = ...
+
Iterator<GCubeDocument> proxies = ...
+
 
+
//fold stream in chunks of 30 proxies
+
Iterator<List<GCubeDocument>> chunks = group(proxies).in(30).withDefaults();
+
 
+
while (chunks.hasNext()) {
+
Map<String,Throwable> failures = writer.delete(chunks.next());
+
...process failures...
+
}
+
</source>
+
 
+
* <code>delete(List<GCubeDocument>)</code> and <code>delete(Iterator<GCubeDocument>)</code> use result set mechanisms to interact with remote services and thus can be invoked only inside a container.
+
 
+
Finally, note that the code in the examples above does not handle possible failures. Clients may consult the code documentation for a list of the faults that the individual operations may raise.
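
The chunked patterns above leave the processing of failures to the client. One possible approach, sketched below in plain Java, accumulates the failure maps returned for each chunk into a single report. The <code>BulkOperation</code> interface is a hypothetical stand-in for a call such as <code>writer.delete(List)</code>, and strings stand in for proxies; none of these names belongs to the gDL.

<source lang="java5">
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FailureReportSketch {

    // hypothetical stand-in for a bulk call that reports failures by identifier
    interface BulkOperation {
        Map<String, Throwable> apply(List<String> ids);
    }

    // stand-in operation: fails every identifier that starts with the given prefix
    static BulkOperation failingOn(final String prefix) {
        return new BulkOperation() {
            public Map<String, Throwable> apply(List<String> ids) {
                Map<String, Throwable> failures = new LinkedHashMap<String, Throwable>();
                for (String id : ids)
                    if (id.startsWith(prefix))
                        failures.put(id, new IllegalStateException("cannot process " + id));
                return failures;
            }
        };
    }

    // applies the operation chunk by chunk and merges the failures of all chunks
    static Map<String, Throwable> applyInChunks(List<String> ids, int size, BulkOperation op) {
        if (size <= 0)
            throw new IllegalArgumentException("size must be positive");
        Map<String, Throwable> report = new LinkedHashMap<String, Throwable>();
        for (int from = 0; from < ids.size(); from += size) {
            int to = Math.min(from + size, ids.size());
            report.putAll(op.apply(new ArrayList<String>(ids.subList(from, to))));
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "bad1", "b", "bad2", "c");
        Map<String, Throwable> report = applyInChunks(ids, 2, failingOn("bad"));
        for (Map.Entry<String, Throwable> failure : report.entrySet())
            System.out.println(failure.getKey() + ": " + failure.getValue().getMessage());
    }
}
</source>

Merging the per-chunk maps keeps the reporting logic in one place, at the cost of holding all failures in memory until the last chunk has been processed.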
+
  
 
= Views =
 
 
* '''local views''': these are views defined by individual clients as the context for a number of subsequent queries and updates. Local views may have arbitrarily long lifetimes, and may even outlive the client that created them, but they are never used by multiple clients. Thus local views are commonly transient and if their definitions are somehow persisted, they are persisted locally to the 'owning' client and remain under its direct responsibility.
 
  
* '''remote views''': these are views defined by some clients and used by many others within the system. Remote views outlive all such clients and persist in the infrastructure, typically for as long as the collection does. They are defined through the [[View_Manager|View Manager]] service (VM), which materialises them as WS-Resources. Each VM resource encapsulates the definition of the view as well as its descriptive properties, and it is responsible for managing its lifetime, e.g. keeping track of its cardinality and notifying interested clients of changes to its contents. However, VM resources are [[View_Manager#Motivations|''passive'']], i.e. do not mediate access to those content resources.
  
 
Naturally, the gDL uses [[#Projections|projections]] as view definitions. It then offers specialised <code>Reader</code>s that encapsulate such projections to implicitly resolve all their operations in the scope of the view. This yields view-based access to collections and allows clients to work with local views. In addition, the gDL provides local proxies of VM resources with which clients can create, discover, and inspect remote views. As these proxies map remote view definitions onto projections, remote views can be accessed with the same reading mechanisms available for local views.  
 
  
== Local Views ==
+
[[GDL_Views_(2.0)|read more...]]
 
+
To work with a local view of a remote collection, a gDL client first creates the projection that defines the view. The client then injects the projection into a <code>ViewReader</code>, along with a <code>DocumentReader</code> already configured to access the target collection. Like the <code>DocumentReader</code>, the <code>ViewReader</code> implements the <code>Reader</code> interface, offering all the read operations discussed [[#Reading_Documents|above]]. When any of its operations is called, however, the <code>ViewReader</code> ''merges'' the view definition and the input projection, combining their constraints. It then passes the merged projection to the inner <code>DocumentReader</code>, which executes the operation. Effectively, this resolves the operation in the scope of the view.
+
 
+
The following example illustrates the approach:
+
 
+
<source lang="java5" highlight="7,9,13,14">
+
import static java.util.Locale.*;
+
import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
+
 
+
//a reader configured to access a target collection.
+
DocumentReader reader = ...
+
 
+
//define a local view
+
DocumentProjection view = document().withValue(LANGUAGE,FRENCH);
+
 
+
//inject view and reader into a view reader
+
ViewReader viewReader = new ViewReader(view,reader);
+
 
+
GCubeDocument doc = viewReader.get("...some id...", document().with(NAME));
+
 
+
assert(doc.language()!=null);
+
assert(doc.language().equals(FRENCH));
+
assert(doc.name()!=null);
+
</source>
+
 
+
Here, the view includes only the document descriptions (with a bytestream) in a given language. The lookup operation retrieves the target description only if it has a name ''and'' is in the view. At runtime, the <code>DocumentProjection</code> that defines the view is merged with the projection passed to the lookup operation. This produces the same effect that the following projection would produce if it were executed by a plain <code>DocumentReader</code>:
+
 
+
<source lang="java5">
+
document().withValue(LANGUAGE,FRENCH).with(NAME);
+
</source>
+
 
+
The example above defines the view as a <code>DocumentProjection</code> but ''any'' projection can be used for the purpose (e.g. a <code>MetadataProjection</code>, an <code>AnnotationProjection</code>, etc). In general, clients have the same flexibility in defining views as they do in invoking the operations of  <code>Reader</code>s: any projection that can be used in one context can also be used in the other. Clients will choose <code>DocumentProjection</code>s when the view needs to characterise properties of entire documents and/or inner elements of different types. They will instead prefer more specific projections when the view is predicated on properties of inner elements of the same type. For example, a view that characterises document descriptions based only on the schema of their metadata elements is more conveniently defined with a <code>MetadataProjection</code>:
+
 
+
<source lang="java5" highlight="3">
+
...
+
//define a local view
+
MetadataProjection view = metadata().withValue(SCHEMA_URI,"..some schema..");
+
...
+
GCubeDocument doc = viewReader.get("...some id...", document().with(NAME));
+
...
+
</source>
+
 
+
The operation above would look up the target document description only if it has a name and at least one metadata element in the given schema.
+
 
+
Similarly, clients are free to pass any projection to the operations of the <code>ViewReader</code>, including those that "diverge" arbitrarily from the view:
+
 
+
<source lang="java5" highlight="3,5">
+
...
+
//define a local view
+
DocumentProjection view = document().with(PART);
+
...
+
GCubeDocument doc = viewReader.get("...some id...", metadata().with(BYTESTREAM));
+
...
+
</source>
+
 
+
The operation above would look up the target document description only if it has at least one part and one metadata element with an inlined bytestream.
+
 
+
The freedom in merging view definitions with other projections is limited only by one obvious requirement: the merged projection must ''not'' retrieve documents that are outside the view. The <code>ViewReader</code> will detect projections that ''break'' the view abstraction and refuse them as parameters of its operations. For example:
+
 
+
<source lang="java5" highlight="3,6,9">
+
...
+
//define a local view
+
DocumentProjection view = document().withValue(LANGUAGE,FRENCH);
+
...
+
try {
+
  GCubeDocument doc = viewReader.get("...some id...", document().withValue(LANGUAGE,ENGLISH));
+
  assert(false);
+
}
+
catch(InvalidProjectionException e) {
+
assert(true);
+
}
+
...
+
</source>
+
 
+
This attempt raises an <code>InvalidProjectionException</code>, as document descriptions in English are not part of the view.
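
To make the merging rule more concrete, the following sketch models it over a deliberately simplified notion of projection: a map from property names to required values, where a marker value means that the property must merely be present. This is only an illustration of the idea and not how the <code>ViewReader</code> is implemented; <code>ViewMergeSketch</code>, <code>projection()</code>, <code>merge()</code> and <code>ANY</code> are hypothetical names, and a plain <code>IllegalArgumentException</code> stands in for <code>InvalidProjectionException</code>.

<source lang="java5">
import java.util.LinkedHashMap;
import java.util.Map;

class ViewMergeSketch {

    // marker for "the property must be present, whatever its value"
    static final String ANY = "*";

    // builds a toy projection from property/value pairs
    static Map<String, String> projection(String... pairs) {
        if (pairs.length % 2 != 0)
            throw new IllegalArgumentException("expected property/value pairs");
        Map<String, String> constraints = new LinkedHashMap<String, String>();
        for (int i = 0; i < pairs.length; i += 2)
            constraints.put(pairs[i], pairs[i + 1]);
        return constraints;
    }

    // merges a view with an input projection, rejecting inputs that leave the view
    static Map<String, String> merge(Map<String, String> view, Map<String, String> input) {
        Map<String, String> merged = new LinkedHashMap<String, String>(view);
        for (Map.Entry<String, String> constraint : input.entrySet()) {
            String property = constraint.getKey();
            String required = constraint.getValue();
            String existing = merged.get(property);
            if (existing == null || existing.equals(ANY))
                merged.put(property, required); // new or more specific constraint
            else if (!required.equals(ANY) && !required.equals(existing))
                throw new IllegalArgumentException(
                    property + "=" + required + " is outside the view (" + existing + ")");
            // otherwise the view constraint already implies the input one
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> view = projection("language", "fr");
        System.out.println(merge(view, projection("name", ANY))); // {language=fr, name=*}
        try {
            merge(view, projection("language", "en"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
</source>

In this toy model an input projection may add constraints or narrow existing ones, but it can never contradict a value that the view already requires, which mirrors the requirement stated above.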
+
 
+
== Remote Views ==
+
 
+
=== Publishing Views ===
+
 
+
=== Discovering Views ===
+
 
+
=== Using Views ===
+
  
= Advanced Topics =
+
= Utilities & F.A.Q. =
  
== Caches ==
+
The GCube Document Library offers utility classes to manage the collections and the views in the system.
  
== Buffers ==
+
[[GDL_Utilities_%26_F.A.Q._(2.0)|read more...]]

Latest revision as of 13:21, 21 March 2011

The gCube Document Library (gDL) is a client library for adding, updating, deleting and retrieving document descriptions to, in, and from remote collections in a gCube infrastructure.

The gDL is a high-level component of the subsystem of gCube Information Services and it interacts with lower-level components of the subsystem to support document management processes within the infrastructure:

  • the gCube Document Model (gDM) defines the basic notion of document and the gCube Model Library (gML) implements that notion into objects;
  • the objects of the gML can be exchanged in the infrastructure as edge-labelled trees, and the Content Manager Library (CML) can dispatch them to the read and write operations of the Content Manager (CM) service;
  • the CM implements its operations by translating trees to and from the content models of diverse repository back-ends.

The gDL builds on the gML and the CML to implement a local interface of CRUD operations that lift those of the CM to the domain of documents, efficiently and effectively.

Preliminaries

The core functionality of the gDL lies in its operations to read and write document descriptions. The operations trigger interactions with the Content Manager service and the movement of potentially large volumes of data across the infrastructure. This may have a non-trivial and combined impact on the responsiveness of clients and the overall load of the infrastructure. The operations have been designed to minimise this impact. In particular:

  • when reading, clients can qualify the documents that are relevant to their queries, and indeed what properties of those documents should be actually retrieved. These retrieval directives are captured in the gDL by the notion of document projections.
  • when reading and writing, clients can move large numbers of documents across the infrastructure. The gDL streams these I/O movements so as to make efficient use of local and remote resources. It then defines facilities with which clients can conveniently consume input streams, produce output streams, and more generally convert one stream into another regardless of its origin. These facilities are collected into the stream DSL, an Embedded Domain-Specific Language (EDSL) for stream conversion and processing.

Understanding document projections and the stream DSL is key to reading and writing documents effectively with the gDL. We discuss these preliminary concepts first, and then consider their use as inputs and outputs in the read and write operations of the library.

Projections

A projection is a set of constraints over the properties of document descriptions. It can be used in the read operations of the gDL to:

  • characterise relevant descriptions as those that match the constraints (projections as types);
  • specify what properties of relevant descriptions should be retrieved (projections as retrieval directives).

Constraints take accordingly two forms:

  • include constraints apply to properties that must be matched and retrieved;
  • filter constraints apply to properties that must be matched but not retrieved.

note: in both cases, the constraints take the form of predicates of the Content Manager Library (CML). The projection itself converts into a complex predicate which is amenable for processing by the Content Manager service in the execution of its retrieval operations. In this sense, projections are a key part of the document-oriented layer that the gDL defines over lower-level components of the gCube subsystem dedicated to content management.

As a first example, a projection may specify an include constraint over the name of metadata elements and a filter constraint over the time of last update. It may then be used to:

  • characterise document descriptions with at least one metadata element that matches both constraints;
  • retrieve, from those descriptions, only the name of matching metadata elements, excluding the time of last update, any other metadata property, and any other document property, including other inner elements and their properties.
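
The example above can be sketched in plain Java as follows. This is a minimal, self-contained illustration of the include/filter semantics only, not the gDL API: the class ProjectionSketch, its methods, and the modelling of a metadata element as a map of property names to values are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical sketch: NOT the gDL Projection API. All names are illustrative.
// An element (e.g. a metadata element) is modelled as a map of property names to values.
final class ProjectionSketch {

    // Constraints on properties that must be matched AND retrieved.
    private final Map<String, Predicate<Object>> include = new LinkedHashMap<>();
    // Constraints on properties that must be matched but NOT retrieved.
    private final Map<String, Predicate<Object>> filter = new LinkedHashMap<>();

    ProjectionSketch include(String property, Predicate<Object> constraint) {
        include.put(property, constraint);
        return this;
    }

    ProjectionSketch filter(String property, Predicate<Object> constraint) {
        filter.put(property, constraint);
        return this;
    }

    // Projection as type: does the element satisfy every constraint?
    boolean matches(Map<String, Object> element) {
        return satisfies(include, element) && satisfies(filter, element);
    }

    // Projection as retrieval directive: keep only the included properties of a matching element.
    Optional<Map<String, Object>> project(Map<String, Object> element) {
        if (!matches(element)) {
            return Optional.empty();
        }
        Map<String, Object> retrieved = new LinkedHashMap<>();
        for (String property : include.keySet()) {
            retrieved.put(property, element.get(property));
        }
        return Optional.of(retrieved);
    }

    private static boolean satisfies(Map<String, Predicate<Object>> constraints,
                                     Map<String, Object> element) {
        for (Map.Entry<String, Predicate<Object>> constraint : constraints.entrySet()) {
            Object value = element.get(constraint.getKey());
            if (value == null || !constraint.getValue().test(value)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ProjectionSketch projection = new ProjectionSketch()
                .include("name", v -> "title".equals(v))
                .filter("lastUpdate", v -> ((Long) v) > 1000L);

        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("name", "title");
        metadata.put("lastUpdate", 2000L);
        metadata.put("language", "en");

        System.out.println(projection.project(metadata)); // Optional[{name=title}]
    }
}
```

In the sketch, the time of last update takes part in matching but is dropped from the result, and the unconstrained language property is neither matched nor retrieved.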

Projections have the Projection interface, which can be used to access their constraints in element-generic computations. To build projections, however, clients deal with one of the following implementations of the interface:

  • DocumentProjection
  • MetadataProjection
  • AnnotationProjection
  • PartProjection
  • AlternativeProjection

A further implementation of the interface:

  • PropertyProjection

allows clients to express constraints on the generic properties of documents and their inner elements.

read more...

Streams

In some of its operations, the gDL relies on streams to model, process, and transfer large-scale data collections. Streams may consist of document descriptions, document identifiers, and document updates. More generally, they may consist of the outcomes of operations that in turn take large-scale collections as input. Streamed processing makes efficient use of both local and remote resources, from local memory to network bandwidth, promoting the overall responsiveness of clients and services through reduced latencies.

Clients that use these operations will need to route streams towards and across the operations of the gDL, converting between different stream interfaces, often injecting application logic in the process. As a common example, a client may need to:

  • route a remote result set of document identifiers towards the read operations of the gDL;
  • process the document descriptions returned by the read operations, e.g. in order to update some of their properties;
  • feed the modified document descriptions to the write operations of the gDL, so as to commit the changes;
  • inspect commit outcomes, so as to report or otherwise handle the failures that may have occurred in the process.

Throughout the workflow, it is important that the client remains within the paradigm of streamed processing, avoiding the accumulation of data in memory in all cases but where strictly required. Document identifiers stream in from the remote location of the original result set while document descriptions flow back from yet another remote location, updated document descriptions leave towards that same location, and failures steadily come back for handling.
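
The workflow above can be pictured with plain Java iterators. The sketch below is not the stream DSL of the gDL: the class StreamWorkflowSketch, the helper map, and the read, update, and write stand-ins for remote operations are all hypothetical. It only illustrates how chaining lazy conversions lets each element travel through the whole pipeline before the next one is pulled, so that only failures are accumulated in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: NOT the gDL stream DSL. All names are illustrative.
final class StreamWorkflowSketch {

    // Records the order of the simulated remote calls, to make the interleaving visible.
    static final List<String> LOG = new ArrayList<>();

    // Lazily applies f to each element, only when the consumer pulls it.
    static <A, B> Iterator<B> map(Iterator<A> source, Function<A, B> f) {
        return new Iterator<B>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public B next() { return f.apply(source.next()); }
        };
    }

    // Stand-in for a remote read: resolves an identifier into a description.
    static String read(String id) {
        LOG.add("read " + id);
        return "doc(" + id + ")";
    }

    // Application logic injected in the pipeline: marks the description as changed.
    static String update(String description) {
        return description + "*";
    }

    // Stand-in for a remote write: returns an outcome, failing on "bad" documents.
    static String write(String description) {
        LOG.add("write " + description);
        return description.contains("bad") ? "FAILED " + description : "OK " + description;
    }

    // Routes identifiers through read, update, and write, collecting only the failures.
    static List<String> failures(Iterator<String> ids) {
        Iterator<String> descriptions = map(ids, StreamWorkflowSketch::read);
        Iterator<String> updated = map(descriptions, StreamWorkflowSketch::update);
        Iterator<String> outcomes = map(updated, StreamWorkflowSketch::write);

        List<String> failures = new ArrayList<>();
        while (outcomes.hasNext()) {
            String outcome = outcomes.next();
            if (outcome.startsWith("FAILED")) {
                failures.add(outcome);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<String> failed = failures(Arrays.asList("a", "bad", "c").iterator());
        System.out.println(failed); // [FAILED doc(bad)*]
        System.out.println(LOG);    // reads and writes alternate, one document at a time
    }
}
```

Because every stage is lazy, the first document is written before the second one is read, which is the behaviour the streamed paradigm aims for.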

Streaming raises significant opportunities for clients, as well as non-trivial challenges. In recognition of the difficulties, the gDL includes a set of general-purpose facilities for stream conversion that simplify the tasks of filtering, transforming, or otherwise processing streams. These facilities are cast as the sentences of the Stream DSL, an Embedded Domain-Specific Language (EDSL).

read more...

Operations

The operations of the gDL allow clients to add, update, delete, and retrieve document descriptions to, in, and from remote collections within the infrastructure. These CRUD operations target (instances of) a specific back-end within the infrastructure, the Content Manager (CM) service. It is a direct implication of the CM that the document descriptions may be stored in different forms within repositories which are inside or outside the strict boundaries of the infrastructure. While the gDL operations clearly expose the remote nature of document descriptions, the actual location of document descriptions, hosting repositories, and Content Manager instances is hidden from their clients.

In what follows, we first discuss read operations, i.e. operations that localise document descriptions from remote collections. We then discuss write operations, i.e. operations that persist in remote collections the document descriptions that have been created or modified locally. In all cases, operations are overloaded to work with different forms of inputs and outputs. In particular, we distinguish between:

  • singleton operations: these are operations that read, add, or change individual document descriptions. Singleton operations are used for punctual interactions with the infrastructure, most notably those required by front-end clients to implement user interfaces. All singleton operations that target existing document descriptions require the specification of their identifiers;
  • bulk operations: these are operations that read, add, or change multiple document descriptions in a single interaction with the infrastructure. Bulk operations can be used for batch interactions with the infrastructure, most notably those required by back-end clients to implement workflows. They can also be used for real-time interactions with the infrastructure, such as those required by front-end clients that process user queries. Bulk operations may be further classified into:
    • by-value operations are defined over in-memory collections of document descriptions. Accordingly, these operations are indicated for small-scale data transfer scenarios. As we shall see, they may also be used to move segments of larger data collections, when the creation of such fragments is a functional requirement.
    • by-reference operations are defined over streams of document descriptions. These operations are indicated for medium-scale to large-scale data transfer scenarios, where the streamed processing promotes the responsiveness of clients and the effective use of network resources.
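
The three forms of input and output distinguished above can be pictured as overloads of a single read operation. The sketch below is a hypothetical illustration, not the gDL reader: the class ReaderSketch, its get overloads, the in-memory map that stands in for a remote collection, and the exception thrown for unknown identifiers are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical sketch: NOT the gDL reader. All names are illustrative.
final class ReaderSketch {

    // Stand-in for a remote collection: identifier -> document description.
    private final Map<String, String> collection;

    ReaderSketch(Map<String, String> collection) {
        this.collection = collection;
    }

    // Singleton operation: one identifier in, one description out.
    String get(String id) {
        String description = collection.get(id);
        if (description == null) {
            throw new NoSuchElementException("unknown document: " + id);
        }
        return description;
    }

    // Bulk by-value operation: the whole result is materialised in memory.
    List<String> get(Collection<String> ids) {
        List<String> descriptions = new ArrayList<>();
        for (String id : ids) {
            descriptions.add(get(id));
        }
        return descriptions;
    }

    // Bulk by-reference operation: descriptions are resolved one at a time, on demand.
    Iterator<String> get(Iterator<String> ids) {
        return new Iterator<String>() {
            @Override public boolean hasNext() { return ids.hasNext(); }
            @Override public String next() { return ReaderSketch.this.get(ids.next()); }
        };
    }

    public static void main(String[] args) {
        Map<String, String> collection = new HashMap<>();
        collection.put("d1", "first");
        collection.put("d2", "second");
        ReaderSketch reader = new ReaderSketch(collection);

        System.out.println(reader.get("d1"));                      // first
        System.out.println(reader.get(Arrays.asList("d1", "d2"))); // [first, second]

        Iterator<String> stream = reader.get(Arrays.asList("d2", "d1").iterator());
        while (stream.hasNext()) {
            System.out.println(stream.next());                     // second, then first
        }
    }
}
```

The by-value overload returns only once every description is available, while the by-reference overload hands each description to the client as soon as it is resolved.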

Read and write operations work with document descriptions that align with the gCube document model (gDM) and its implementation in the gCube Model Library (gML). In the terminology of the gML, in particular, operations that create document descriptions expect new elements, while all the others take or produce element proxies.

Finally, read and write operations build on the facilities of the Content Manager Library (CML) to interact with the Content Manager service, including the adoption of best-effort strategies to discover and interact with instances of the service. These facilities are thus indirectly available to gDL clients as well.

read more...

Views

Some clients interact with remote collections to work exclusively with subsets of document descriptions that share certain properties, e.g. are in a given language, have changed in the last month, have metadata in a given schema, have parts of a given type, and so on. Their queries and updates are always resolved within these subsets, rather than the whole collection. Essentially, such clients have their own view of the collection.

The gDL offers support for working with two types of view:

  • local views: these are views defined by individual clients as the context for a number of subsequent queries and updates. Local views may have arbitrarily long lifetimes, and may even outlive the client that created them, but they are never used by multiple clients. Thus local views are commonly transient and, if their definitions are somehow persisted, they are persisted locally to the 'owning' client and remain under its direct responsibility.
  • remote views: these are views defined by some clients and used by many others within the system. Remote views outlive all such clients and persist in the infrastructure, typically for as long as the collection does. They are defined through the View Manager service (VM), which materialises them as WS-Resources. Each VM resource encapsulates the definition of the view as well as its descriptive properties, and it is responsible for managing its lifetime, e.g. keeping track of its cardinality and notifying interested clients of changes to its contents. However, VM resources are passive, i.e. they do not mediate access to those content resources.

Naturally, the gDL uses projections as view definitions. It then offers specialised Readers that encapsulate such projections to implicitly resolve all their operations in the scope of the view. This yields view-based access to collections and allows clients to work with local views. In addition, the gDL provides local proxies of VM resources with which clients can create, discover, and inspect remote views. As these proxies map remote view definitions onto projections, remote views can be accessed with the same reading mechanisms available for local views.
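
The idea of a reader that resolves its operations within a view can be sketched as follows. This is a hypothetical illustration rather than the gDL API: the class ViewReaderSketch and its methods are assumptions, a plain predicate stands in for the projection that defines the view, and an in-memory map stands in for the remote collection.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical sketch: NOT the gDL reader or view API. All names are illustrative.
final class ViewReaderSketch {

    // Stand-in for a remote collection: identifier -> properties of the description.
    private final Map<String, Map<String, String>> collection;
    // Stand-in for the projection that defines the view.
    private final Predicate<Map<String, String>> view;

    ViewReaderSketch(Map<String, Map<String, String>> collection,
                     Predicate<Map<String, String>> view) {
        this.collection = collection;
        this.view = view;
    }

    // Resolves an identifier only if the description belongs to the view.
    Optional<Map<String, String>> get(String id) {
        Map<String, String> description = collection.get(id);
        if (description != null && view.test(description)) {
            return Optional.of(description);
        }
        return Optional.empty();
    }

    // Lists the identifiers of the descriptions that belong to the view.
    List<String> ids() {
        List<String> inView = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : collection.entrySet()) {
            if (view.test(entry.getValue())) {
                inView.add(entry.getKey());
            }
        }
        return inView;
    }

    public static void main(String[] args) {
        Map<String, String> english = new LinkedHashMap<>();
        english.put("language", "en");
        Map<String, String> french = new LinkedHashMap<>();
        french.put("language", "fr");

        Map<String, Map<String, String>> collection = new LinkedHashMap<>();
        collection.put("d1", english);
        collection.put("d2", french);

        // A view over the descriptions that are in English.
        ViewReaderSketch reader =
                new ViewReaderSketch(collection, d -> "en".equals(d.get("language")));

        System.out.println(reader.ids());                   // [d1]
        System.out.println(reader.get("d2").isPresent());   // false
    }
}
```

The client never repeats the view definition in its calls: every operation of the reader is implicitly scoped by the predicate supplied at construction time.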

read more...

Utilities & F.A.Q.

The GCube Document Library offers utility classes to manage the collections and the views in the system.

read more...