GDL Operations (2.0)


The operations of the gDL allow clients to add, update, delete, and retrieve document descriptions to, in, and from remote collections within the infrastructure. These CRUD operations target (instances of) a specific back-end within the infrastructure, the Content Manager (CM) service. Since the CM mediates all access to content, document descriptions may be stored in different forms within repositories that live inside or outside the strict boundaries of the infrastructure. While the gDL operations clearly expose the remote nature of document descriptions, the actual location of document descriptions, hosting repositories, and Content Manager instances is hidden from clients.

In what follows, we first discuss read operations, i.e. operations that retrieve document descriptions from remote collections into the local environment. We then discuss write operations, i.e. operations that persist in remote collections document descriptions which have been created or modified locally. In all cases, operations are overloaded to work with different forms of inputs and outputs. In particular, we distinguish between:

  • singleton operations: these are operations that read, add, or change individual document descriptions. Singleton operations are used for punctual interactions with the infrastructure, most notably those required by front-end clients to implement user interfaces. All singleton operations that target existing document descriptions require the specification of their identifiers;
  • bulk operations: these are operations that read, add, or change multiple document descriptions in a single interaction with the infrastructure. Bulk operations can be used for batch interactions with the infrastructure, most notably those required by back-end clients to implement workflows. They can also be used for real-time interactions with the infrastructure, such as those required by front-end clients that process user queries. Bulk operations may be further classified into:
    • by-value operations are defined over in-memory collections of document descriptions. Accordingly, these operations are indicated for small-scale data transfer scenarios. As we shall see, they may also be used to move segments of larger data collections, when the creation of such segments is a functional requirement.
    • by-reference operations are defined over streams of document descriptions. These operations are indicated for medium-scale to large-scale data transfer scenarios, where the streamed processing promotes the responsiveness of clients and the effective use of network resources.

Read and write operations work with document descriptions that align with the gCube document model (gDM) and its implementation in the gCube Model Library (gML). In the terminology of the gML, in particular, operations that create document descriptions expect new elements, while all the others take or produce element proxies.

Finally, read and write operations build on the facilities of the Content Manager Library (CML) to interact with the Content Manager service, including the adoption of best-effort strategies to discover and interact with instances of the service. These facilities are thus indirectly available to gDL clients as well.

Reading Documents

Clients retrieve document descriptions from remote collections with the operations of a DocumentReader. Readers are created in the scope of the target collection, as follows:

GCUBEScope scope = ...
String collectionID =...
 
DocumentReader reader = new DocumentReader(collectionID,scope);

In a secure infrastructure, a security manager is also required:

GCUBEScope scope = ...
GCUBESecurityManager manager = ...
String collectionID =...
 
DocumentReader reader = new DocumentReader(collectionID,scope,manager);

Note: Since version 2.1.0, DocumentReaders can be created in the current scope, i.e. the scope set for the current thread on the default scope manager (GCUBEScopeManager.DEFAULT):

GCUBEScope scope = ...
GCUBEScopeManager.DEFAULT.setScope(scope).....
 
//in an unsecure infrastructure
String collectionID =...
DocumentReader reader = new DocumentReader(collectionID); 
.....
 
//in a secure infrastructure
GCUBESecurityManager manager = ...
reader = new DocumentReader(collectionID,manager);

Readers expose three get() operations to retrieve document descriptions from target collections, two lookup operations and one query operation:

  • get(String,Projection): retrieves the description of a document with a given identifier, where the description matches a given projection and reflects the retrieval directives therein;
  • get(Iterator<String>,Projection): retrieves a stream of document descriptions from a stream of their identifiers, where the descriptions match a given projection and reflect the retrieval directives therein;
  • get(Projection): returns a stream of document descriptions, where the descriptions match a given projection and reflect the retrieval directives therein.

The operations and their return values can be illustrated as follows:

DocumentReader reader = ...
 
DocumentProjection p = ....
 
String id = ...
GCubeDocument doc = reader.get(id,p); 
Iterator<String> ids = ...
RemoteIterator<GCubeDocument> docs = reader.get(ids,p); 
 
RemoteIterator<GCubeDocument> allDocs = reader.get(p);

A few points are worth emphasising:

  • get(Iterator<String>,Projection) takes a stream of identifiers under the standard Iterator interface. As discussed at length above, this indicates that the operation makes no assumption as to the origin of the stream and that it has no policy of its own to deal with possible iteration failures; clients need to provide one in the implementation of the Iterator (a minimal sketch of one such policy follows the examples below). Conversely, get(Projection) returns a RemoteIterator because it can guarantee the remote origin of the stream, though it still has no policy of its own to handle possible iteration failures. Again, clients are responsible for providing one. Clients can use the pipe sentences of the Stream DSL to derive the Iterators in input from other forms of streams and to post-process the RemoteIterators in output.
  • as a convenience, all the retrieval operations can take projections other than DocumentProjections. Projections over the inner elements of documents are equally accepted, e.g.:
DocumentReader reader = ...
MetadataProjection mp = ....
 
RemoteIterator<GCubeDocument> docs = reader.get(mp);

Here, matched documents are characterised directly with a MetadataProjection. The operation will derive a corresponding DocumentProjection with a single include constraint that requires matching documents to have at least one metadata element that satisfies the projection. As usual, the output stream will retrieve, for each such document, no more than what the original MetadataProjection specifies in its include constraints. Again, clients are recommended to use the Stream DSL to extract the metadata elements from the output stream and possibly to process them further, e.g.:

import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
 
DocumentReader reader = ...
MetadataProjection mp = .... 
RemoteIterator<GCubeMetadata> metadata = metadataIn(reader.get(mp));
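Returning to the first point above, a client-provided policy for iteration failures can be written as a plain wrapper around the input Iterator. The following is a minimal sketch of one possible policy (skip faulty elements and continue); it is not an API of the library, it uses only standard Java types, and it assumes that the underlying iterator advances past an element whose production fails:

import java.util.Iterator;
import java.util.NoSuchElementException;
 
//a client-side failure policy: identifiers whose production fails are skipped
//rather than aborting the whole iteration
class SkipOnFailureIterator implements Iterator<String> {
 
 private final Iterator<String> inner;
 private String lookahead;
 
 SkipOnFailureIterator(Iterator<String> inner) {
  this.inner = inner;
 }
 
 public boolean hasNext() {
  while (lookahead==null && inner.hasNext()) {
   try {
    lookahead = inner.next(); //may fail for a faulty element
   }
   catch (RuntimeException failure) {
    //policy: ignore the faulty element and move on
   }
  }
  return lookahead!=null;
 }
 
 public String next() {
  if (!hasNext()) throw new NoSuchElementException();
  String id = lookahead;
  lookahead = null;
  return id;
 }
 
 public void remove() {
  throw new UnsupportedOperationException();
 }
}

The wrapped iterator can then be passed to the read operation as usual:

Iterator<String> ids = ...
DocumentProjection p = ...
RemoteIterator<GCubeDocument> docs = reader.get(new SkipOnFailureIterator(ids),p);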

More generally, the Stream DSL can be relied upon in the common case in which input streams originate in remote result sets, or when the output streams must be processed with the result set API. The following example illustrates some of the possibilities:

import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
 
DocumentReader reader = ...
MetadataProjection mp = ....
 
//a result set of document identifiers
RSLocator idRS = ....
 
//extracts identifiers from result set into remote iterator and converts it into a local iterator
Iterator<String> ids = convert(payloadsIn(idRS)).withDefaults(); 
RemoteIterator<GCubeMetadata> metadata = metadataIn(reader.get(ids,mp)); 
//extract result set locator from remote iterator
RSLocator docRS = new RSLocator(metadata.locator()); 
//use locator with result set API
...

Finally, note that the example above does not handle possible failures. Clients may consult the code documentation for a list of the faults that the individual operations may raise.
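Purely as an illustration of the shape such handling might take, a client could wrap an invocation defensively, as in the following sketch; the catch clause is deliberately generic, since the concrete fault types depend on the operation and are listed in the code documentation:

DocumentReader reader = ...
DocumentProjection p = ...
String id = ...
 
try {
 GCubeDocument doc = reader.get(id,p);
 ...process the description...
}
catch (Exception failure) {
 ...report or recover, as appropriate for the client...
}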

Resolving Elements

A DocumentReader may also be used to resolve content URIs into individual elements of document descriptions. It offers two operations for this purpose:

  • resolve(URI,Projection): takes a content URI and returns the description of the document element identified by it, where the description matches a given projection and reflects its retrieval directives;
  • resolve(Iterator<URI>,Projection): takes a stream of content URIs and returns a stream of descriptions of the document elements identified by them, where the descriptions match a given projection and reflect its retrieval directives.

The operations and their return values can be illustrated as follows:

import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
 
//a reader over a given collection
DocumentReader reader = ...
 
//a sample content URI to a metadata element of a document
URI metadataURI = new URI("cms://somecollection/somedocument/somemetadata"); 
//a sample projection over metadata elements
MetadataProjection mp = metadata().with(BYTESTREAM);
 
GCubeMetadata element = reader.resolve(metadataURI,mp);

Here the client resolves a metadata element and uses a projection to limit retrieval to its bytestream alone.

Do note the following points:

  • the URI must point to an element within the target collection of the DocumentReader. In this example, the reader must be bound to somecollection or the operation will fail;
  • the resolution is typed, i.e. the client must know the type of element identified by the URI. Providing a projection gives the reader a hint as to what type of element is expected. Resolution will fail if the URI points to an element of a different type, just as it fails if it points to an unknown element;
  • as usual, empty projections can be used for conventional resolution, i.e. to retrieve the whole element unconditionally;
  • remember that the CML defines a set of facilities to compose and decompose content URIs;
  • only inner elements can be resolved. Document URIs can be "resolved" with get(URIs.documentID(uri)), using the URI-manipulation facilities of the CML to extract the document identifiers from URIs;
  • remember also that the gML defines a method uri() on documents and their elements. Clients that work with element proxies can use it to obtain their content URI and then store it, move it over the network, etc. until it becomes available to the same or different clients for resolution.
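Putting the last two points together, a client might store the content URI of an element proxy and resolve it later. The following sketch uses only the facilities quoted above (uri() from the gML, URIs.documentID() from the CML); treat the exact signatures as assumptions to be verified against the respective libraries:

import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
 
DocumentReader reader = ...
 
//an element proxy obtained earlier, e.g. from a read operation
GCubeMetadata element = ...
 
//obtain the content URI of the element and keep it for later resolution
URI elementURI = element.uri(); 
//later, possibly in a different client: resolve the inner element with an empty projection
GCubeMetadata resolved = reader.resolve(elementURI,metadata()); 
//for a document URI, fall back on a lookup by identifier instead
URI documentURI = ...
DocumentProjection dp = ...
GCubeDocument doc = reader.get(URIs.documentID(documentURI),dp);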

As an example of stream-based resolution consider the following:

import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
 
//a reader over a given collection
DocumentReader reader = ...
 
//an iterator over content URIs of annotations
Iterator<URI> annotationURIs = ...; 
//a sample projection over annotations
AnnotationProjection ap =...
 
RemoteIterator<GCubeAnnotation> annotations = reader.resolve(annotationURIs,ap);

Like all the stream-based operations of the DocumentReader, resolve() takes streams as standard Iterators and returns streams as RemoteIterators. As usual, clients can use the facilities of the Stream DSL to convert to and from these iterators and other models of streams. In particular:

  • remember that the method urisIn() of the Streams class can transparently convert a result set of content URI serialisations into an equivalent RemoteIterator<URI>.
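For example, a result set of content URIs could be turned into an input for resolve() along these lines. The sketch combines constructs already shown above; whether convert() accepts the output of urisIn() exactly as written should be checked against the Stream DSL documentation:

import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*;
 
//a reader over a given collection
DocumentReader reader = ...
 
//a result set of content URIs to annotations
RSLocator uriRS = ...
//a sample projection over annotations
AnnotationProjection ap = ...
 
//extract URIs from the result set and convert the remote iterator into a local one
Iterator<URI> uris = convert(urisIn(uriRS)).withDefaults(); 
RemoteIterator<GCubeAnnotation> annotations = reader.resolve(uris,ap);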

Adding Documents

Clients add document descriptions to remote collections with the operations of a DocumentWriter. Writers are created in the scope of the target collection, as follows:

GCUBEScope scope = ...
String collectionID =...
 
DocumentWriter writer = new DocumentWriter(collectionID,scope);

In a secure infrastructure, a security manager is also required:

GCUBEScope scope = ...
GCUBESecurityManager manager = ...
String collectionID =...
 
DocumentWriter writer = new DocumentWriter(collectionID,scope,manager);

Note: Since version 2.1.0, DocumentWriters can be created in the current scope, i.e. the scope set for the current thread on the default scope manager (GCUBEScopeManager.DEFAULT):

GCUBEScope scope = ...
GCUBEScopeManager.DEFAULT.setScope(scope).....
 
//in an unsecure infrastructure
String collectionID =...
DocumentWriter writer = new DocumentWriter(collectionID); 
.....
 
//in a secure infrastructure
GCUBESecurityManager manager = ...
writer = new DocumentWriter(collectionID,manager);

Writers expose four operations to add document descriptions to the target collections, two singleton operations and two bulk operations. All the operations take new document descriptions built with the APIs of the gML. In addition, each description must satisfy certain basic criteria, including:

  • the consistency between the collection bound to it and the collection bound to the writer;
  • other constraints specific to its inner elements.

We say that the description must be valid for addition. The operations are the following:

  • add(GCubeDocument): adds a valid document description to the target collection and returns an identifier for it;
  • addAndSynchronize(GCubeDocument): adds a valid document description to the target collection and returns a proxy for it. The proxy is synchronised with the changes applied to the description at the point of addition, including the assignment of identifiers to the whole description and its inner elements;
  • add(Iterator<GCubeDocument>): adds a stream of valid document descriptions to the target collection and returns a Future<?> for the completion of the operation;
  • add(List<GCubeDocument>): adds a list of valid document descriptions to the target collection and returns a list of corresponding outcomes, where each outcome is either an identifier for the description or else a processing failure.

The operations and their return values can be illustrated as follows:

DocumentWriter writer = ...
 
//singleton add
GCubeDocument doc = ...
String id = writer.add(doc); 
//singleton with synchronization
GCubeDocument synchronizedProxy = writer.addAndSynchronize(doc); 
//bulk by-value
List<GCubeDocument>  docs = ...
List<AddOutcome> outcomes = writer.add(docs); 
//bulk by-ref
Iterator<GCubeDocument> docStream = ...
Future<?> future = writer.add(docStream);....
//poll for completion (see also other polling methods of Futures)
if (future.get()==null)    ...submission is completed...

A few points are worth emphasising:

  • addAndSynchronize(GCubeDocument) requires two remote interactions, one to add the document description and one to retrieve its synchronised proxy. Clients are responsible for replacing the input description with the proxy in any local structure that may already contain references to the former;
  • add(Iterator<GCubeDocument>) follows the same pattern for stream-based operations already discussed for read operations. Invalid document descriptions found in the input stream are silently discarded. Due to the possibility of these pre-processing failures and its non-blocking nature, the operation cannot guarantee the fidelity of outcome reports. For this reason, the operation returns only a Future<?> that clients can poll to know when all the proxies in input have been submitted for addition. Clients that work with large or remote streams, and are interested in processing outcomes, are responsible for grouping the elements of the stream in 'chunks' of acceptable size and use add(List<GCubeDocument>);
  • add(List<GCubeDocument>) is indicated for small input collections and/or when reports on outcomes are important to clients. Invalid document descriptions found in the input list fail the operation before any attempt is made to add any document description to the target collection. Clients can use the group sentences of the Stream DSL to conveniently convert an arbitrarily large stream of GCubeDocuments into a stream of List<GCubeDocument>, which can then be fed to the operation element by element. The following example illustrates this processing pattern:
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*; 
DocumentWriter writer = ...
Iterator<GCubeDocument> docs = ...
 
//fold stream in chunks of 30 descriptions
Iterator<List<GCubeDocument>> chunks = group(docs).in(30).withDefaults(); 
while (chunks.hasNext()) {
 List<AddOutcome> outcomes = writer.add(chunks.next());
 for (AddOutcome outcome : outcomes) {
   if (outcome.getSuccess()!=null) {
      ...outcome.getSuccess().getId()...
   }
   else {
      ...outcome.getFailure().getFault()...
   }
 }
}
  • add(List<GCubeDocument>) and add(Iterator<GCubeDocument>) use result set mechanisms to interact with remote services and thus can be invoked only inside a container;

Finally, note that the code in the examples above does not handle possible failures. Clients may consult the code documentation for an enumeration of the faults that the individual operations may raise.

Updating Documents

A DocumentWriter may also be used to update document descriptions already in the target collection.

It offers four operations for this purpose, two singleton operations and two bulk operations. The operations mirror those that add document descriptions to the target collection, as discussed above. Like the add operations, the update operations take document descriptions built with the APIs of the gML. However, the descriptions must be proxies of remote descriptions and each proxy must satisfy certain basic criteria, including:

  • the consistency between the collection bound to it and the collection bound to the writer;
  • the existence of tracked changes on it;
  • other constraints that are specific to its inner elements.

We say that the proxy must be valid for update. The operations are the following:

  • update(GCubeDocument): updates a document description in the target collection with the properties of a valid proxy;
  • updateAndSynchronize(GCubeDocument): updates a document description in the target collection with the properties of a valid proxy, returning another proxy that is fully synchronised with the description, i.e. reflects all its properties after the update, including the times of last update for the description and its inner elements;
  • update(Iterator<GCubeDocument>): updates multiple document descriptions in the target collection with the properties of a stream of valid proxies, returning a Future<?> for the future completion of the operation;
  • update(List<GCubeDocument>): updates multiple document descriptions in the target collection with the properties of a list of valid proxies, returning a map of processing failures indexed by the identifier of the corresponding description.

The operations and their return values can be illustrated as follows:

DocumentWriter writer = ...
 
//singleton update
GCubeDocument proxy = ...
writer.update(proxy); 
//singleton with synchronization
GCubeDocument synchronizedProxy = writer.updateAndSynchronize(proxy); 
//bulk by-value
List<GCubeDocument>  proxies = ...
Map<String,Throwable> failures = writer.update(proxies); 
//bulk by-ref
Iterator<GCubeDocument> proxyStream = ...
Future<?> future = writer.update(proxyStream);....
//poll for completion (see also other polling methods of Futures)
if (future.get()==null)    ...submission is completed...

A few points are worth emphasising:

  • updateAndSynchronize(GCubeDocument) requires two remote interactions, one to update the document description and one to retrieve its synchronised proxy;
  • update(Iterator<GCubeDocument>) follows the same pattern for stream-based operations already discussed for add operations. Invalid proxies found in the stream are silently discarded. Due to the possibility of these pre-processing failures and its non-blocking nature, the operation cannot guarantee the fidelity of outcome reports. For this reason, the operation returns only a Future<?> that clients can poll to know when all the proxies in input have been submitted for update. Clients that work with large or remote streams, and are interested in processing outcomes, are responsible for grouping the elements of the stream in 'chunks' of acceptable size and use update(List<GCubeDocument>);
  • update(List<GCubeDocument>) is indicated for small input collections and/or when outcome reports are important to clients. Invalid proxies found in the input list fail the operation before any attempt is made to update any document description in the target collection. Clients can use the group sentences of the Stream DSL to conveniently convert an arbitrarily large stream of GCubeDocuments into a stream of List<GCubeDocument>, which can then be fed to the operation element by element. The following example illustrates this processing pattern:
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*; 
DocumentWriter writer = ...
Iterator<GCubeDocument> proxies = ...
 
//fold stream in chunks of 30 proxies
Iterator<List<GCubeDocument>> chunks = group(proxies).in(30).withDefaults(); 
while (chunks.hasNext()) {
 Map<String,Throwable> failures = writer.update(chunks.next());
 ...process failures...
}
  • update(List<GCubeDocument>) and update(Iterator<GCubeDocument>) use result set mechanisms to interact with remote services and thus can be invoked only inside a container.

Finally, note that the code in the examples above does not handle possible failures. Clients may consult the code documentation for a list of the faults that the individual operations may raise.

Deleting Documents

A DocumentWriter may also be used to delete document descriptions already in the target collection.

It offers three operations for this purpose, one singleton operation and two bulk operations. The operations mirror those that update document descriptions in the target collection, as discussed above. Like the update operations, the delete operations take proxies of remote descriptions built with the APIs of the gML. The operations are the following:

  • delete(GCubeDocument): deletes a document description from the target collection using a proxy for it;
  • delete(Iterator<GCubeDocument>): deletes multiple document descriptions from the target collection using a stream of proxies for them, returning a Future<?> for the future completion of the operation;
  • delete(List<GCubeDocument>): deletes multiple document descriptions from the target collection using a list of proxies for them, returning a map of processing failures indexed by the identifier of the corresponding description.

The operations and their return values can be illustrated as follows:

DocumentWriter writer = ...
 
//singleton delete
GCubeDocument proxy = ...
writer.delete(proxy); 
//bulk by-value
List<GCubeDocument>  proxies = ...
Map<String,Throwable> failures = writer.delete(proxies);
 
//bulk by-ref
Iterator<GCubeDocument> proxyStream = ...
Future<?> future = writer.delete(proxyStream);
....
//poll for completion (see also other polling methods of Futures)
if (future.get()==null)    ...submission is completed...

A few points are worth emphasising:

  • delete(Iterator<GCubeDocument>) follows the same pattern for stream-based operations already discussed for add operations and update operations. Document descriptions found in the stream which are not proxies are silently discarded. Due to the possibility of these pre-processing failures and its non-blocking nature, the operation cannot guarantee the fidelity of outcome reports. For this reason, the operation returns only a Future<?> that clients can poll to know when all the proxies in input have been submitted for deletion. Clients that work with large or remote streams, and are interested in processing outcomes, are responsible for grouping the elements of the stream in 'chunks' of acceptable size and use delete(List<GCubeDocument>);
  • delete(List<GCubeDocument>) is indicated for small input collections and/or when outcome reports are important to clients. Document descriptions found in the input list which are not proxies fail the operation before any attempt is made to delete any document description in the target collection. Clients can use the group sentences of the Stream DSL to conveniently convert an arbitrarily large stream of GCubeDocuments into a stream of List<GCubeDocument>, which can then be fed to the operation element by element. The following example illustrates this processing pattern:
import static org.gcube.contentmanagement.gcubedocumentlibrary.streams.dsl.Streams.*; 
DocumentWriter writer = ...
Iterator<GCubeDocument> proxies = ...
 
//fold stream in chunks of 30 proxies
Iterator<List<GCubeDocument>> chunks = group(proxies).in(30).withDefaults(); 
while (chunks.hasNext()) {
 Map<String,Throwable> failures = writer.delete(chunks.next());
 ...process failures...
}
  • delete(List<GCubeDocument>) and delete(Iterator<GCubeDocument>) use result set mechanisms to interact with remote services and thus can be invoked only inside a container.

Finally, note that the code in the examples above does not handle possible failures. Clients may consult the code documentation for a list of the faults that the individual operations may raise.

Caching Documents

Projections and streams are key mechanisms in the gDL to promote retrieval performance. Caching document descriptions is another way to reduce latencies and network interactions. The gDL includes a few abstractions that can serve as the basis for the implementation of caching strategies.

  • CachingReader is an implementation of the Reader interface which interposes a cache between the client and a DocumentReader. Documents in the cache that match the input projections are returned immediately to the client; those that are not found in the cache are retrieved with the DocumentReader and injected into the cache before they are returned to the client.
  • DocumentCache is an interface for bridging to arbitrary cache implementations. Any such implementation can be handed off to a CachingReader as part of its configuration (a minimal implementation sketch appears after this list). The interface is extremely simple:
public interface DocumentCache {
 
 GCubeDocument get(String docId);
 
 void put(GCubeDocument doc);
}
  • SimpleLRUCache is an implementation of DocumentCache that uses a standard LinkedHashMap to implement an in-memory, constant-time access cache with a Least-Recently-Used (LRU) eviction policy. It is the DocumentCache that CachingReaders use when clients do not express a choice. At construction time, clients can specify a size for the cache or else rely on defaults (cf. SimpleLRUCache#DEFAULT_SIZE).
//a cache of default size
SimpleLRUCache defaultCache = new SimpleLRUCache(); 
...
//a cache of 100 elements
SimpleLRUCache sizedCache = new SimpleLRUCache(100);
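To make the bridging role of DocumentCache concrete, the following is a minimal sketch of a custom implementation backed by a LinkedHashMap in access order, in the spirit of what SimpleLRUCache is described to do. The identifier accessor used to key documents (id()) is an assumption to be checked against the gML:

import java.util.LinkedHashMap;
import java.util.Map;
 
//a minimal DocumentCache backed by a LinkedHashMap in access order,
//evicting the least-recently-used entry beyond a fixed capacity
public class LRUDocumentCache implements DocumentCache {
 
 private final Map<String,GCubeDocument> cache;
 
 public LRUDocumentCache(final int capacity) {
  this.cache = new LinkedHashMap<String,GCubeDocument>(16,0.75f,true) {
   protected boolean removeEldestEntry(Map.Entry<String,GCubeDocument> eldest) {
    return size()>capacity;
   }
  };
 }
 
 public synchronized GCubeDocument get(String docId) {
  return cache.get(docId);
 }
 
 public synchronized void put(GCubeDocument doc) {
  cache.put(doc.id(),doc); //assumes the gML exposes the document identifier as id()
 }
}

A CachingReader configured with such a cache is then used exactly as in the examples below.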

The following example illustrates common usage patterns:

DocumentReader dr = ....
DocumentCache cache =...
//uses a specific cache implementation, specifically configured
CachingReader cachingReader = new CachingReader(dr,cache);
...
//uses a default cache of default size
CachingReader defaultReader = new CachingReader(dr);

After creation, CachingReaders are used like other Readers. A couple of points are worth noticing, however:

  • CachingReaders compose well with projections. A document description is stored in the cache because it matches the retrieval directives of some projection used in read operations. However, the same description may not match the directives of later projections. Accordingly, a cache hit is always checked against the current input projection and is not returned to the client if it does not match that projection. For this reason, CachingReaders cannot be configured with Readers that post-process input projections, such as the ViewReaders discussed later on. Thus CachingReaders may only wrap DocumentReaders, though they may be wrapped in turn by other Reader implementations.
  • CachingReaders maintain and use the cache across all the methods of the Reader interface, including URI resolutions and for both singleton and stream-based operations. Notice, however, that the execution of the get(Projection) method can only take limited advantage of the cache. The projection needs to be sent out for remote execution, regardless of the existence of cache hits. Hence, document descriptions that correspond to cache hits will be retrieved nonetheless, though the CachingReader will take care to not return them to the client as "duplicates" of cache hits. Here, caching can only be used to reduce latencies, i.e. offer query results that are locally available before executing the query remotely.