Revision as of 02:40, 9 March 2011

The gCube Document Library (gDL) is a client library for adding, updating, deleting and retrieving document descriptions to, in, and from remote collections in a gCube infrastructure.

The gDL is a high-level component of the subsystem of gCube Information Services and it interacts with lower-level components of the subsystem to support document management processes within the infrastructure:

the gCube Document Model (gDM) defines the basic notion of document and the gCube Model Library (gML) implements that notion into objects;
the objects of the gML can be exchanged in the infrastructure as edge-labelled trees, and the Content Manager Library (CML) can dispatch them to the read and write operations of the Content Manager (CM) service;
the CM implements its operations by translating trees to and from the content models of diverse repository back-ends.

The gDL builds on the gML and the CML to implement a local interface of CRUD operations that lift those of the CM to the domain of documents, efficiently and effectively.

Preliminaries

The core functionality of the gDL lies in its operations to read and write document descriptions. The operations trigger interactions with the Content Manager service and the movement of potentially large volumes of data across the infrastructure. This may have a non-trivial and combined impact on the responsiveness of clients and the overall load of the infrastructure. The operations have been designed to minimise this impact. In particular:

when reading, clients can qualify the documents that are relevant to their queries, and indeed what properties of those documents should be actually retrieved. These retrieval directives are captured in the gDL by the notion of document projections.

when reading and writing, clients can move large numbers of documents across the infrastructure. The gDL streams these I/O movements so as to make efficient use of local and remote resources. It then defines facilities with which clients can conveniently consume input streams, produce output streams, and more generally convert one stream into another regardless of its origin. These facilities are collected into the stream DSL, an Embedded Domain-Specific Language (EDSL) for stream conversion and processing.

Understanding document projections and the stream DSL is key to reading and writing documents effectively with the gDL. We discuss these preliminary concepts first, and then consider their use as inputs and outputs of the read and write operations of the library.

Projections

A projection is a set of constraints over the properties of document descriptions. It can be used in the read operations of the gDL to:

characterise relevant descriptions as those that match the constraints (projections as types);
specify what properties of relevant descriptions should be retrieved (projections as retrieval directives).

Accordingly, constraints take two forms:

include constraints apply to properties that must be matched and retrieved;
filter constraints apply to properties that must be matched but not retrieved.

note: in both cases, the constraints take the form of predicates of the Content Manager Library (CML). The projection itself converts into a complex predicate which is amenable to processing by the Content Manager service in the execution of its retrieval operations. In this sense, projections are a key part of the document-oriented layer that the gDL defines over lower-level components of the gCube subsystem dedicated to content management.

As a first example, a projection may specify an include constraint over the name of metadata elements and a filter constraint over the time of last update. It may then be used to:

characterise document descriptions with at least one metadata element that matches both constraints;
retrieve, of those descriptions, only the name of matching metadata elements, excluding the time of last update, any other metadata property, and any other document property, including other inner elements and their properties.
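
The difference between include and filter constraints can be sketched in plain Java. The sketch below does not use the gDL or CML APIs: `SimpleProjection`, `include`, `filter`, and `apply` are hypothetical names chosen for illustration, and a description is modelled, as a simplifying assumption, as a flat map from property names to values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical stand-in for a projection; NOT the gDL API.
final class SimpleProjection {
    // Properties that must match AND are retrieved.
    private final Map<String, Predicate<String>> includes = new LinkedHashMap<>();
    // Properties that must match but are NOT retrieved.
    private final Map<String, Predicate<String>> filters = new LinkedHashMap<>();

    SimpleProjection include(String property, Predicate<String> constraint) {
        includes.put(property, constraint);
        return this;
    }

    SimpleProjection filter(String property, Predicate<String> constraint) {
        filters.put(property, constraint);
        return this;
    }

    // Returns the retrieved properties if the description matches all
    // constraints, and an empty Optional otherwise.
    Optional<Map<String, String>> apply(Map<String, String> description) {
        if (!matches(includes, description) || !matches(filters, description)) {
            return Optional.empty();
        }
        Map<String, String> retrieved = new LinkedHashMap<>();
        for (String property : includes.keySet()) {
            retrieved.put(property, description.get(property));
        }
        return Optional.of(retrieved);
    }

    // A constraint matches only if the property exists and satisfies the predicate.
    private static boolean matches(Map<String, Predicate<String>> constraints,
                                   Map<String, String> description) {
        for (Map.Entry<String, Predicate<String>> e : constraints.entrySet()) {
            String value = description.get(e.getKey());
            if (value == null || !e.getValue().test(value)) {
                return false;
            }
        }
        return true;
    }
}
```

With an include constraint on a name property and a filter constraint on a last-update property, `apply` returns only the name of a matching description: the filtered property takes part in matching but is left out of the result.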

Projections have the Projection interface, which can be used to access their constraints in element-generic computations. To build projections, however, clients deal with one of the following implementations of the interface:

DocumentProjection
MetadataProjection
AnnotationProjection
PartProjection
AlternativeProjection

A further implementation of the interface:

PropertyProjection

allows clients to express constraints on the generic properties of documents and their inner elements.

Streams

In some of its operations, the gDL relies on streams to model, process, and transfer large-scale data collections. Streams may consist of document descriptions, document identifiers, and document updates. More generally, they may consist of the outcomes of operations that take in turn large-scale collections in input. Streamed processing makes efficient use of both local and remote resources, from local memory to network bandwidth, promoting the overall responsiveness of clients and services through reduced latencies.

Clients that use these operations will need to route streams towards and across the operations of the gDL, converting between different stream interfaces, often injecting application logic in the process. As a common example, a client may need to:

route a remote result set of document identifiers towards the read operations of the gDL;
process the document descriptions returned by the read operations, e.g. in order to update some of their properties;
feed the modified document descriptions to the write operations of the gDL, so as to commit the changes;
inspect commit outcomes, so as to report or otherwise handle the failures that may have occurred in the process.

Throughout the workflow, it is important that the client remains within the paradigm of streamed processing, avoiding the accumulation of data in memory in all cases but where strictly required. Document identifiers will be streaming from the remote location of the original result set as document descriptions will be flowing back from yet another remote location, as updated document descriptions will be leaving towards the same remote location, and as failures will be steadily coming back for handling.
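
The four-step workflow above can be sketched with plain Java iterators. This is a minimal illustration of streamed processing, not the Stream DSL of the gDL: `LazyStreams`, `map`, and `run` are hypothetical names, and the remote endpoints are replaced by in-memory stand-ins that record each step in a trace.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical helpers illustrating streamed processing; NOT the gDL Stream DSL.
final class LazyStreams {
    private LazyStreams() {}

    // Wraps an iterator so that each element is converted only when it is pulled.
    static <A, B> Iterator<B> map(Iterator<A> source, Function<A, B> step) {
        return new Iterator<B>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public B next() { return step.apply(source.next()); }
        };
    }

    // Chains the four steps; 'trace' records the order in which they execute.
    static List<String> run(Iterator<String> ids, List<String> trace) {
        // 1. read one description per identifier (stand-in for a remote read)
        Iterator<String> read = map(ids, id -> {
            trace.add("read " + id);
            return "doc-" + id;
        });
        // 2. process each description, e.g. to update a property
        Iterator<String> updated = map(read, doc -> {
            trace.add("update " + doc);
            return doc + "-v2";
        });
        // 3. write each description and produce a commit outcome
        //    (documents starting with "doc-b" fail, purely for illustration)
        Iterator<String> outcomes = map(updated, doc -> {
            trace.add("write " + doc);
            return doc.startsWith("doc-b") ? "failed:" + doc : "ok:" + doc;
        });
        // 4. inspect outcomes, keeping only the failures
        List<String> failures = new ArrayList<>();
        while (outcomes.hasNext()) {
            String outcome = outcomes.next();
            if (outcome.startsWith("failed:")) {
                failures.add(outcome);
            }
        }
        return failures;
    }
}
```

Because every stage pulls from the previous one on demand, each document travels through read, update, and write before the next identifier is consumed. Only the failures are accumulated in memory.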

Streaming raises significant opportunities for clients, as well as non-trivial challenges. In recognition of the difficulties, the gDL includes a set of general-purpose facilities for stream conversion that simplify the tasks of filtering, transforming, or otherwise processing streams. These facilities are cast as the sentences of the Stream DSL, an Embedded Domain-Specific Language (EDSL).

Operations

The operations of the gDL allow clients to add, update, delete, and retrieve document descriptions to, in, and from remote collections within the infrastructure. These CRUD operations target (instances of) a specific back-end within the infrastructure, the Content Manager (CM) service. It is a direct implication of the CM that the document descriptions may be stored in different forms within repositories which are inside or outside the strict boundaries of the infrastructure. While the gDL operations clearly expose the remote nature of document descriptions, the actual location of document descriptions, hosting repositories, and Content Manager instances is hidden from their clients.

In what follows, we discuss first read operations, i.e. operations that localise document descriptions from remote collections. We then discuss write operations, i.e. operations that persist in remote collections document descriptions which have been created or else modified locally. In all cases, operations are overloaded to work with different forms of inputs and outputs. In particular, we distinguish between:

singleton operations: these are operations that read, add, or change individual document descriptions. Singleton operations are used for punctual interactions with the infrastructure, most notably those required by front-end clients to implement user interfaces. All singleton operations that target existing document descriptions require the specification of their identifiers;

bulk operations: these are operations that read, add, or change multiple document descriptions in a single interaction with the infrastructure. Bulk operations can be used for batch interactions with the infrastructure, most notably those required by back-end clients to implement workflows. They can also be used for real-time interactions with the infrastructure, such as those required by front-end clients that process user queries. Bulk operations may be further classified into:
- by-value operations are defined over in-memory collections of document descriptions. Accordingly, these operations are indicated for small-scale data transfer scenarios. As we shall see, they may also be used to move segments of larger data collections, when the creation of such fragments is a functional requirement.
- by-reference operations are defined over streams of document descriptions. These operations are indicated for medium-scale to large-scale data transfer scenarios, where the streamed processing promotes the responsiveness of clients and the effective use of network resources.
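
The relation between the two kinds of bulk operation can be pictured by cutting a stream into bounded in-memory segments, each small enough to be passed to a by-value operation. The helper below is a sketch under that assumption only: `Segments` and `of` are hypothetical names and are not part of the gDL.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper; NOT part of the gDL.
// Splits a stream into in-memory segments of at most 'size' elements.
final class Segments {
    private Segments() {}

    static <T> Iterator<List<T>> of(Iterator<T> stream, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        return new Iterator<List<T>>() {
            @Override public boolean hasNext() {
                return stream.hasNext();
            }
            @Override public List<T> next() {
                if (!stream.hasNext()) {
                    throw new NoSuchElementException();
                }
                // Only one segment is held in memory at any time.
                List<T> segment = new ArrayList<>(size);
                while (segment.size() < size && stream.hasNext()) {
                    segment.add(stream.next());
                }
                return segment;
            }
        };
    }
}
```

Each segment is built only when requested, so the memory footprint is bounded by the segment size rather than by the size of the whole collection.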

Read and write operations work with document descriptions that align with the gCube document model (gDM) and its implementation in the gCube Model Library (gML). In the terminology of the gML, in particular, operations that create document descriptions expect new elements, all the others take or produce element proxies.

Finally, read and write operations build on the facilities of the Content Manager Library (CML) to interact with the Content Manager service, including the adoption of best-effort strategies to discover and interact with instances of the service. These facilities are thus indirectly available to gDL clients as well.

Views

Some clients interact with remote collections to work exclusively with subsets of document descriptions that share certain properties, e.g. are in a given language, have changed in the last month, have metadata in a given schema, have parts of a given type, and so on. Their queries and updates are always resolved within these subsets, rather than the whole collection. Essentially, such clients have their own view of the collection.

The gDL offers support for working with two types of view:

local views: these are views defined by individual clients as the context for a number of subsequent queries and updates. Local views may have arbitrarily long lifetimes, and may even outlive the client that created them, but they are never used by multiple clients. Thus local views are commonly transient and, if their definitions are somehow persisted, they are persisted locally to the 'owning' client and remain under its direct responsibility.

remote views: these are views defined by some clients and used by many others within the system. Remote views outlive all such clients and persist in the infrastructure, typically for as long as the collection does. They are defined through the View Manager service (VM), which materialises them as WS-Resources. Each VM resource encapsulates the definition of the view as well as its descriptive properties, and it is responsible for managing its lifetime, e.g. keeping track of its cardinality and notifying interested clients of changes to its contents. However, VM resources are 'passive', i.e. they do not mediate access to those content resources.

Naturally, the gDL uses projections as view definitions. It then offers specialised Readers that encapsulate such projections to implicitly resolve all their operations in the scope of the view. This yields view-based access to collections and allows clients to work with local views. In addition, the gDL provides local proxies of VM resources with which clients can create, discover, and inspect remote views. As these proxies map remote view definitions onto projections, remote views can be accessed with the same reading mechanisms available for local views.
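
One way to picture how a reader resolves operations in the scope of a view is to merge the constraints of the view with those of the projection passed to each operation, refusing projections that would reach outside the view. The sketch below is a simplified model, not the gDL implementation: `ViewScope` and `merge` are hypothetical names, a constraint maps a property either to a required value or to `null` for mere existence, and `IllegalArgumentException` stands in for whatever error the library raises.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of view scoping; NOT the gDL implementation.
final class ViewScope {
    private ViewScope() {}

    // Each map associates a property with a required value,
    // or with null when only the existence of the property is required.
    static Map<String, String> merge(Map<String, String> view,
                                     Map<String, String> input) {
        Map<String, String> merged = new LinkedHashMap<>(view);
        for (Map.Entry<String, String> e : input.entrySet()) {
            String property = e.getKey();
            String wanted = e.getValue();
            if (!merged.containsKey(property)) {
                // No overlap: the input simply adds a constraint.
                merged.put(property, wanted);
                continue;
            }
            String required = merged.get(property);
            if (required == null) {
                // The view requires mere existence: the input may refine it.
                merged.put(property, wanted);
            } else if (wanted != null && !wanted.equals(required)) {
                // The input asks for values that the view excludes.
                throw new IllegalArgumentException(
                    "projection on '" + property + "' is outside the view");
            }
            // Otherwise the constraint of the view is preserved as it is.
        }
        return merged;
    }
}
```

In this model, a view that requires a given language merges cleanly with an input projection over names, while an input projection that asks for a different language is refused.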

Difference between revisions of "GCube Document Library (2.0)"

@@ Line 101: / Line 101: @@
 Naturally, the gDL uses [[#Projections|projections]] as view definitions. It then offers specialised <code>Reader</code>s that encapsulate such projections to implicitly resolve all their operations in the scope of the view. This yields view-based access to collections and allows clients to work with local views. In addition, the gDL provides local proxies of VM resources with which clients can create, discover, and inspect remote views. As these proxies map remote view definitions onto projections, remote views can be accessed with the same reading mechanisms available for local views.
-== Local Views ==
+[[GDL_Views_(2.0)|read more...]]
-To work with a local view of a remote collection, a gDL client creates first the projection that defines the view. The client then injects the projection into a <code>ViewReader</code>, along with a <code>DocumentReader</code> already configured to access the target collection. Like the <code>DocumentReader</code>, the <code>ViewReader</code> implements the <code>Reader</code> interface, offering all the read operations discussed [[#Reading_Documents|above]]. When any of its operations is called, however, the <code>ViewReader</code> ''merges'' the view definition and the input projection, combining their constraints. It then passes the merged projection to the inner <code>DocumentReader</code>, which executes the operation. Effectively, this resolves the operation in the scope of the view.
-The following example illustrates the approach:
-<source lang="java5" highlight="7,9,13,14">
-import static java.util.Locale.*;
-import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
-//a reader configured to access a target collection.
-DocumentReader reader = ...
-//define a local view
-DocumentProjection view = document().withValue(LANGUAGE,FRENCH);
-//inject view and reader into a view reader
-ViewReader viewReader = new ViewReader(view,reader);
-GCubeDocument doc = viewReader.get("...some id...", document().with(NAME));
-assert(doc.language()!=null);
-assert(doc.language().equals(FRENCH));
-assert(doc.name()!=null);
-</source>
-Here, the view includes only the document descriptions (with a bytestream) in a given language. The lookup operation retrieves the target description only if it has a name ''and'' is in the view. At runtime, the <code>DocumentProjection</code> that defines the view is merged with the projection passed to the lookup operation. This produces the same effect that the following projection would produce if it was executed by a plain <code>DocumentReader</code>:
-<source lang="java5">
-document().withValue(LANGUAGE,FRENCH).with(NAME);
-</source>
-The example above defines the view as a <code>DocumentProjection</code> but ''any'' projection can be used for the purpose (e.g. a <code>MetadataProjection</code>, an <code>AnnotationProjection</code>, etc). In general, clients have the same flexibility in defining views as they do in invoking the operations of  <code>Reader</code>s: any projection that can be used in one context can also be used in the other. Clients will choose <code>DocumentProjection</code>s when the view needs to characterise properties of entire documents and/or inner elements of different types. They will instead prefer more specific projections when the view is predicated on properties of inner elements of the same type. For example, a view that characterises document descriptions based only on the schema of their metadata elements is more conveniently defined with a <code>MetadataProjection</code>:
-<source lang="java5" highlight="3">
-...
-//define a local view
-MetadataProjection view = metadata().withValue(SCHEMA_URI,"..some schema..");
-...
-GCubeDocument doc = viewReader.get("...some id...", document().with(NAME));
-...
-</source>
-The operation above would look up the target document description only if it has a name and at least one metadata element in the given schema.
-Similarly, clients are free to pass any projection with the operations of the <code>ViewReader</code>, including those that "diverge" arbitrarily from the view:
-<source lang="java5" highlight="3,5">
-...
-//define a local view
-DocumentProjection view = document().with(PART);
-...
-GCubeDocument doc = viewReader.get("...some id...", metadata().with(BYTESTREAM));
-...
-</source>
-The operation above would look up the target document description only if it has at least one part and one metadata element with an inlined bytestream.
-The freedom in merging view definitions with other projections is limited only by the obvious requirement: the merged projection must ''not'' retrieve documents that are outside the view. The <code>ViewReader</code> will detect projections that ''break'' the view abstraction and refuse them as parameters of its operations. For example:
-<source lang="java5" highlight="3,6,9">
-...
-//define a local view
-DocumentProjection view = document().withValue(LANGUAGE,FRENCH);
-...
-try {
-  GCubeDocument doc = viewReader.get("...some id...", document().withValue(LANGUAGE,ENGLISH));
-  assert(false);
-}
-catch(InvalidProjectionException e) {
-  assert(true);
-}
-...
-</source>
-This attempt generates an <code>InvalidProjectionException</code>, as document descriptions in English are not part of the view.
-The possibility of an invalid projection emerges as soon as there is an ''overlap'' between the properties specified in view definitions and those specified in the input projections (as for <code>LANGUAGE</code> above). A crude policy to enforce views would be to prevent overlaps altogether. This policy is too inflexible, for two main reasons:
-* like all projections, view definitions specify retrieval directives (i.e. <code>with()</code>, <code>where()</code>, <code>opt()</code>), i.e. what is to be retrieved and what should not move over the network. While clients can exploit these directives to define ''default directives'' within view definitions, it is important that clients may freely override them on a per-operation basis, e.g. replace a <code>with()</code> in the view definition with a <code>where()</code> in the input projection (e.g. to avoid retrieval of unnecessary data). For this reason, it is important to handle overlaps that signal overriding of defaults, provided that the operation can still be guaranteed to remain within the scope of the view. An example illustrates the point:
-<source lang="java5" highlight="3,5">
-...
-//define a local view
-DocumentProjection view = document().withValue(LANGUAGE,FRENCH);
-...
-GCubeDocument doc = viewReader.get("...some id...", document().whereValue(LANGUAGE,FRENCH).with(NAME));
-...
-</source>
-: Here the input projection is allowed to overlap with the view definition on <code>LANGUAGE</code> for overriding purposes. Notice that overriding is allowed here because the constraints imposed by the view are preserved. In general, it is good practice to use <code>where()</code> directives in view definitions, so as to limit the need for overriding and to allow clients to specify explicitly only what they wish to retrieve. Now that we illustrated this point we will align with this practice in all the examples below.
-* the other reason to deal with overlaps between view definitions and input projections is simply flexibility. In principle, input projections ought to be allowed to ''refine'' the view in their projections. While it is not possible to recognise all refinements statically (and in some cases it is genuinely difficult even when it is possible), <code>ViewReader</code>s recognise and allow a number of common refinements including:
-:* refinements on existence constraints. If the view requires the mere existence of a property, then an input projection can specify a further constraint on it. For example, the view <code>document().where(NAME)</code> can be refined (and overridden) by the input projection <code>document().withValue(NAME,"myname")</code>;
-:* refinements on [[#Deep_projections|deep projections]]. If the view constrains properties of inner elements, then an input projection can constrain other properties of those elements, or even the same properties if the refinement is allowed in turn. For example, the view <code>document().where(METADATA,metadata().with(NAME))</code> can be refined by the input projection <code>document().with(METADATA, metadata().withValue(NAME,"myname"))</code>.
-Flexibility is not only introduced by refinements:
-* a view constraint may also be overridden by a widening input projection. For example, the view <code>document().where(NAME,"myname")</code> can be overridden by the wider <code>document().with(NAME)</code>. The two projections merge into the projection <code>document().with(NAME,"myname")</code>. Thus clients are not required to know the details of the view if they wish to retrieve the names of document descriptions.
-To conclude on the possibilities for view-based access, notice that [[#Empty_projections|empty projections]] continue to retain their simplicity of use under views. For example:
-<source lang="java5" highlight="3,5">
-...
-//define a local view
-PartProjection view = part().withValue(LANGUAGE,FRENCH);
-...
-GCubeDocument doc = viewReader.get("...some id...", part());
-...
-</source>
-continues to retrieve all parts of documents in the view. Here, the view definition and the input projection merge into a projection that has the single constraint of the view and optional include constraints for all the other remaining document properties.
-Similarly, the [[#Catch-All_Constraints|catch-all constraints]] such as <code>etc()</code> and <code>allexcept()</code> retain their semantics under views, and can be used in input projections with the usual expectations (using them in view definitions as defaults for retrieval is also possible, though discouraged in the general case for the reasons discussed previously).
-== Remote Views ==
-When working with remote views, accessing documents in the view is not the only client requirement. ''Publishing'' a remote view, i.e. sharing it with other remote clients, and ''discovering'' existing views with given properties are also common tasks for clients. In the gDL, support for these tasks is mostly found in <code>CollectionView</code>s.
-A <code>CollectionView</code> is a local proxy of a remote view, not dissimilarly from how <code>GCubeDocument</code>s are local proxies of remote document descriptions. More specifically, collection views are document-oriented abstractions over the tree-oriented <code>View</code> proxies of WS-Resources of the View Manager service, as offered by the [[View_Manager#Client_Library|client-library]].
-Most properties of <code>View</code> proxies carry over directly to <code>CollectionView</code>s, including:
-* the ''collection identifier'' (cf. <code>collectionId()</code>);
-* the ''view identifier'' (cf. <code>id()</code>);
-* the ''descriptive name'' of the view, e.g. <code>"myview"</code> (cf. <code>name()</code>);
-* the free-form ''description'' of the view, e.g. <code>"all document descriptions such that..."</code> (cf. <code>description()</code>);
-* the broad ''type'' of the view, e.g. a <code>QName</code> like <code>{http://...}mytype</code> (cf. <code>type()</code>);
-* the time of ''last update'' of the view (cf. <code>lastUpdate()</code>);
-* the ''cardinality'' of the view (cf. <code>cardinality()</code>);
-* the ''generic properties'' of the view (cf. <code>properties()</code>).
-Most importantly, the CM <code>Predicate</code>s used in <code>View</code> proxies to characterise the document descriptions in the view become <code>Projection</code>s over
-<code>GCubeElement</code>s, whether entire document descriptions (i.e. <code>GCubeDocument</code>s) or specific inner elements of such descriptions (e.g. <code>GCubeMetadata</code>).
-<code>CollectionView</code> defines a read-only interface to the properties of the view, which supports generic programming tasks over views. Most clients however will work with concrete implementations of the interface. The current version of the gDL includes the following ones:
-* <code>GenericView</code>: a generic implementation for arbitrary remote views;
-* <code>MetadataView</code>: an implementation for remote views defined over a simple set of metadata element properties;
-* <code>AnnotationView</code>: an implementation for remote views defined over a simple set of annotation properties.
-All <code>CollectionView</code> implementations inherit from the abstract <code>BaseCollectionView</code>, while <code>MetadataView</code> and <code>AnnotationView</code> inherit from the more specific <code>SimpleView</code>.
-'''note''': Custom implementations can be created as well, by specialising <code>GenericView</code> or, more commonly, <code>BaseCollectionView</code>  which has been explicitly designed for open-ended extensibility.
-Across all implementations, <code>CollectionView</code>s may be in either one of two states:
-* '''unbound''': this is the state of instances that are not associated with remote views. In this state, <code>CollectionView</code>s can be used for view publication and view discovery, but not for view-based access to collections.
-* '''bound''': this is the state of instances that are bound to remote views. In this state, <code>CollectionView</code>s can be used for view-based access but not for view publication or view discovery.
-Typically, <code>CollectionView</code>s are created unbound and used for publication or discovery, depending on the general goals of the client. Both operations introduce bindings: a <code>CollectionView</code> becomes bound after publication and all the <code>CollectionView</code>s returned from discovery operations are already bound; these can then be used for view-based access. In less common cases, clients may start with a VM proxy and inject it directly into a new <code>CollectionView</code>, which is thus instantiated in a bound state. All the implementations of <code>CollectionView</code> are responsible for enforcing state-based constraints.
-=== Publishing Views ===
-We use <code>GenericView</code>s to illustrate how clients can publish, discover and use remote views. We then show how working with <code>MetadataView</code>s and <code>AnnotationView</code>s changes the basic usage patterns.
-Publishing a remote view involves creating a proxy for it in a given scope, setting its properties, and invoking the method <code>publish()</code> on it:
-<source lang="java5" highlight="7,18,21">
-import static org.gcube.contentmanagement.gcubedocumentlibrary.projections.Projections.*;
-import static java.util.Locale.*;
-GCubeScope scope = ...
-//creates unbound view in given scope
-GenericView view = new GenericView(scope);
-assert(view.isBound()==false);
-//sets view properties
-view.setId("...");
-view.setCollectionId("...");
-view.setName("...");
-view.setDescription("...");
-//sets view definition
-view.setProjection(document().where(LANGUAGE,FRENCH));
-//publish view
-view.publish();
-assert(view.isBound());
-</source>
-We create a <code>GenericView</code> in a given scope and set its properties, including the projection that defines it. We then publish the view in the scope in which we created it. The test shows that the <code>GenericView</code> is unbound until its publication, after which it is bound.
-A few points to notice about view publication:
-* in the example, we have used a <code>DocumentProjection</code> for documents (with a bytestream) in French. As usual, clients can use the type of projection which is more convenient to express their constraints (e.g. a <code>MetadataProjection</code>). Regardless of the type of the injected projection, a <code>GenericView</code> returns always a <code>DocumentProjection</code> from its <code>projection()</code> method, as in the general case the definition of the view may be acquired during [[#Discovering_Views|discovery]], when its gDL type is statically unknown.
-* views can be published with the method <code>publishAndBroadcast()</code>. As the name suggests, this overload induces the View Manager service to broadcast a record of its creation. This is then used for autonomic state replication across its instances, in line with the scalability mechanisms of the service.
-* in the example, we have provided an explicit identifier for the view. We could have omitted it, in which case the View Manager service would have generated one for it.
-* like identifiers, other properties of the view (e.g. time of last update) are set on the view by the View Manager at the point of creation. During publication these properties are automatically synchronised on the local proxy.
-* we did not set the type of the view before publication. The type of a <code>GenericView</code> is in fact constant (the name of the class itself, <code>GenericView</code>) and clients cannot alter it. This constancy is found in all other implementations of <code>CollectionView</code>, as it ensures that upon [[#Discovering_Views|discovering views]] we can recognise the classes to which we should bind them.
-=== Discovering Views ===
-=== Using Views ===
 = Advanced Topics =