GCube Document Model


The gCube Document Model (gDM) is a descriptive model used in gCube to manage datasets that are primarily associated with authoring and rendering processes.
The gCube Model Library (gML) is a client library that includes an object-based implementation of the gDM.
The gML is a component of the subsystem of gCube Information Organisation Services.

In what follows, we first summarise the gDM in abstract terms and then overview its implementation in the gML.

Model

The model defines five main data structures to describe entities with broad and well-known semantics:

  • the document: a self-contained unit of content within a collection of related units;
  • the metadata: a formal description of a document, in some model other than gDM;
  • the annotation: a subjective observation or record about a document;
  • the part: a component of a document;
  • the alternative representation: a secondary manifestation of a document.

The structures that describe these entities are the elements of the model.

Elements may have properties, where a property is a named value that:

  • may or may not occur within the element (optional vs. mandatory);
  • may occur once or more than once within the element (single vs. repeated);
  • may be a piece of text, an integer, a datetime, or a bytestream (simple);
  • may be a structure of other properties (structured).

Metadata, annotations, parts, and alternative representations are themselves optional and repeated properties of documents. They are the inner elements of the document.

Some properties are common to documents and inner elements:

  • identifier: identifies a document (at least) within its collection, or an inner element (at least) within its document (mandatory, simple, single, text);
  • name : free-form element descriptor (optional, simple, single, text);
  • type : free-form descriptor of element semantics (optional, simple, single, text);
  • creation time: records the time in which the element description was created (optional, simple, single, datetime);
  • last update time: records the time in which the element description was last changed (optional, simple, single, datetime);
  • bytestream: a digital representation of the element (optional, simple, single, bytestream);
  • mime type: the MIME type of the element’s bytestream (optional, simple, single, text in RFC 2046 syntax);
  • length : number of octets in the element’s bytestream (optional, simple, single, positive integer);
  • url : a URL to a digital representation of the element (optional, simple, single, text compliant with RFC 2396/RFC 2732 syntax);
  • language : the language of the element (optional, simple, single, text in ISO639 code list);
  • schema uri: a schema for the (embedded/referred) bytestream of the element, if applicable (optional, simple, single, text compliant with RFC 2396/RFC 2732 syntax);
  • schema name: a free-form descriptor (optional, simple, single, text compliant with RFC 2396/RFC 2732 syntax);
  • property: an arbitrary property of the element (optional, structured, repeated);
    • key: identifies the property within the element (mandatory, simple, single, text);
    • type: free-form descriptor of property semantics (optional, simple, single, text);
    • value: the value of the property (mandatory, simple, single, text).

A few properties are specific to individual elements.

  • documents
    • collection identifier: identifies the collection of the document (mandatory, simple, single, text);
  • annotations
    • previous: the identifier of another annotation of the document to which the annotation refers or is otherwise related (optional, simple, single, text);
  • parts
    • order: the order of a part within the document (optional, simple, single, positive integer).

Implementation

The gML is an object-based implementation of the gDM which is designed for distributed document management.
Clients can use its classes and interfaces to create, change, or inspect documents that have been or will be stored remotely within the system.

As one aspect of distribution, the gML defines an exchange format for documents. In that format, documents are constrained forms of the edge-labelled trees managed by the Content Manager service. This enables documents to be stored, updated, and retrieved through that service. While the gML itself does not implement CRUD operations for documents, it does prepare the ground for their implementation. One such implementation can be found in the gCube Document Library (gDL).

Element Classes

In the gML, all the elements of the gDM have the GCubeElement interface. The interface gives read-only access to common element properties and can be used for element-generic operations. It is then implemented by element-specific classes such as:

  • GCubeDocument
  • GCubeMetadata
  • GCubeAnnotation
  • GCubePart
  • GCubeAlternative.

Clients use these classes to construct, update, and inspect elements. By design, none of these element classes is thread-safe. Clients that access their instances from multiple threads are responsible for synchronisation, typically through some form of shared lock.

New Elements and Element Proxies

All element classes define two constructors. The default constructor creates elements that have no identifier, while a secondary constructor creates elements with identifiers. Since identifiers cannot be changed in the gML, elements that are created without an identifier may only receive one when they are stored in the system. Until then, these elements are new within the system, and the method isNew() points this out. E.g. for documents:

GCubeDocument doc = new GCubeDocument(); //a new document
assert(doc.isNew());

In contrast, elements that are created with identifiers represent local proxies of elements that have been previously stored in the system, when they were given precisely those identifiers:

GCubeDocument proxy = new GCubeDocument("someid");  // a document proxy
assert(proxy.isNew()==false);
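
The distinction between new elements and proxies can be pictured with a small stand-in class. This is only a sketch under stated assumptions: SketchElement and its assignId() method are hypothetical names used for illustration and are not part of the gML, and the real element classes may well represent identifiers differently.

```java
import java.util.Objects;

// Hypothetical stand-in for a gML element class; NOT the library's actual code.
class SketchElement {
    private String id; // stays null until the element is stored

    // New element: no identifier yet.
    SketchElement() { }

    // Proxy of an element that was previously stored under this identifier.
    SketchElement(String id) {
        this.id = Objects.requireNonNull(id, "identifier");
    }

    String id() { return id; }

    boolean isNew() { return id == null; }

    // Assumed hook for the storage layer: an identifier is assigned once, never changed.
    void assignId(String assigned) {
        if (id != null) throw new IllegalStateException("identifier already assigned");
        id = Objects.requireNonNull(assigned, "identifier");
    }
}

class NewVersusProxyDemo {
    public static void main(String[] args) {
        SketchElement fresh = new SketchElement();
        SketchElement proxy = new SketchElement("someid");
        System.out.println("fresh is new: " + fresh.isNew());
        System.out.println("proxy is new: " + proxy.isNew());
        fresh.assignId("assigned-on-storage");
        System.out.println("fresh is new after storage: " + fresh.isNew());
    }
}
```

In this reading, isNew() is nothing more than a check on whether an identifier is present, which is why an identifier that cannot be changed also fixes the status of the element.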

Most often, element proxies are created when documents are deserialised from their exchange format. The retrieval operations of the gDL, most noticeably, return proxies of remotely stored documents. Clients may also create element proxies manually, so as to express changes that they wish to be reflected in their remote counterparts. We discuss element updates later on and concentrate next on the creation of new elements.

Creating Elements

Since all elements share a set of common properties in the gDM, creating a new document is not much different from creating a new metadata element, a new annotation, a new alternative representation, or a new document part. We discuss here the common steps and deal later with the specifics of individual element types.

Like identifiers, the time of creation and last update cannot be set on new elements. Elements acquire the former when they are first stored in the system and the latter when they are updated in the system.

assert(doc.creationTime()==null);
assert(doc.lastUpdate()==null);

Other common properties may instead be assigned on new elements, e.g:

doc.setName("some document");
doc.setLanguage(Locale.ITALIAN);
doc.setBytestreamURI(new URI("http://......"));
doc.setSchemaName("someschema");
doc.setSchemaURI(new URI("http://..."));

Bytestreams are among these properties, and clients have a number of options for setting them:

byte[] bytes = ...
doc.setBytestream(bytes);
 
InputStream stream = ...
doc.setBytestream(stream);
 
Reader characterStream = ....
doc.setBytestream(characterStream);

Note: the stream-based methods are for the convenience of clients that read bytestreams from files, or that hold already stream-based references for them. They do not imply that arbitrarily large bytestreams should be inlined in an element. Large bytestreams ought to be made available at some public URL, and clients should set the URL on the element.

In all cases above, an attempt is made to guess the MIME type of the bytestream (if not already set by the client). Similarly, the length of the bytestream is automatically derived:

doc.setBytestream("<a>hello world</a>".getBytes());
assert (doc.mimeType()!=null);
assert (doc.length()!=0);

Clients may well override the guessed MIME types:

doc.setMimeType("text/xml");

but they do so knowing that both length and MIME type may be recomputed when elements are stored within the system, depending on the behaviour of the particular storage back-end.
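
How the gML guesses MIME types is not documented here. As an illustration of the general idea only, the sketch below relies on the JDK heuristic URLConnection.guessContentTypeFromStream, which inspects the leading bytes of a stream. The class SketchBytestream is hypothetical, and the actual library may use a different detection strategy.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical holder of an inlined bytestream; NOT the gML implementation.
class SketchBytestream {
    private byte[] bytes;
    private String mimeType; // may be preset by the client

    void setMimeType(String mimeType) { this.mimeType = mimeType; }

    void setBytestream(byte[] content) {
        bytes = content.clone();
        if (mimeType == null) { // guess only if the client has not set a type
            try {
                // JDK heuristic on the leading bytes; returns null when unsure.
                mimeType = URLConnection.guessContentTypeFromStream(new ByteArrayInputStream(bytes));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    String mimeType() { return mimeType; }

    // The length is derived from the bytestream rather than set by the client.
    long length() { return bytes == null ? 0 : bytes.length; }
}

class MimeGuessDemo {
    public static void main(String[] args) {
        SketchBytestream gif = new SketchBytestream();
        gif.setBytestream("GIF89a".getBytes(StandardCharsets.US_ASCII));
        System.out.println(gif.mimeType() + ", " + gif.length() + " octets");
    }
}
```

A guess of this kind is best-effort: the heuristic recognises only a handful of signatures, which is one reason why a storage back-end may later recompute the value.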

Finally, adding generic properties to elements is straightforward:

GCubeElementProperty prop = new GCubeElementProperty("somekey","sometype","someval");
doc.addProperty(prop);

Here GCubeElementProperty is simply the triple required to represent a generic element property. If a property with the same key was previously added to the element, then the old property is replaced by the new one and returned to the client.
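
The replace-and-return behaviour is the familiar contract of a keyed map. The following sketch shows it with hypothetical classes (SketchProperty, SketchPropertyBag) that only stand in for the gML ones; how the library stores properties internally is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical (key, type, value) triple; NOT the gML class.
class SketchProperty {
    final String key;
    final String type;
    final String value;

    SketchProperty(String key, String type, String value) {
        this.key = key;
        this.type = type;
        this.value = value;
    }
}

// Hypothetical container of generic properties, indexed by key.
class SketchPropertyBag {
    private final Map<String, SketchProperty> byKey = new LinkedHashMap<>();

    // Adds the property and returns the one it replaced, or null if the key was unused.
    SketchProperty addProperty(SketchProperty property) {
        return byKey.put(property.key, property);
    }

    SketchProperty property(String key) { return byKey.get(key); }

    int size() { return byKey.size(); }
}

class PropertyBagDemo {
    public static void main(String[] args) {
        SketchPropertyBag bag = new SketchPropertyBag();
        bag.addProperty(new SketchProperty("somekey", "sometype", "someval"));
        SketchProperty old = bag.addProperty(new SketchProperty("somekey", "sometype", "otherval"));
        System.out.println("replaced value: " + old.value);
        System.out.println("current value: " + bag.property("somekey").value);
    }
}
```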

Creating Documents

In addition to common properties, documents are associated with a collection and may have inner elements.


Clients can set the identifier of the document's collection only once:

doc.setCollectionID("somecollectionid");
doc.setCollectionID("somecollectionid");  //raises an error

As to inner elements, clients add them to type-specific collections within the document, e.g.:

GCubeMetadata m = ...
doc.metadata().add(m);
 
GCubeAnnotation a= ...
doc.annotations().add(a);

Note: a new document can only have new elements. An attempt to add an element proxy to a new document is nonsensical and will raise an error.

An inner element that is added to a document becomes bound to it:

assert(m.document().equals(doc));
assert(a.document().equals(doc));

Note: the assertions above show that documents can be compared for equivalence. The same is true of all elements in gML.

The binding between a document and one of its inner elements can be broken by removing the element from the document. However, it cannot be overridden by adding the element to another document:

doc.metadata().remove(m); //ok
 
GCubeDocument anotherdoc = ...
anotherdoc.metadata().add(m);  //raises an error

Creating Metadata

In the current version of the gDM, metadata elements may only have common properties.

Creating Annotations

In addition to common properties, annotations of the same document may be 'threaded' in the gDM. To model this, clients can set an annotation to follow another in a thread, e.g.:

GCubeAnnotation a1 = ...
GCubeAnnotation a2 = ...
a2.setPrevious(a1);

The operation succeeds only if:

  • a1 and a2 are linked to the same document;
  • both the document and a1 are element proxies.

The identifier of a1 is required in order to record the relationship in the system, and a1 can only have an identifier if its document has already been stored in the system. This aligns with the pragmatics of threading, whereby users observe existing annotations and then create new ones to 'respond' to those annotations.

Because of these constraints, annotation threads are defined when a document is updated, as we discuss later.
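
The two preconditions on threading can be made concrete with a sketch. All the classes below (SketchThreadDoc, SketchAnnotation) are hypothetical stand-ins, and the choice of IllegalStateException is an assumption: the text only says that the operation does not succeed.

```java
// Hypothetical stand-ins for documents and annotations; NOT the gML classes.
class SketchThreadDoc {
    private final String id; // null for a new document

    SketchThreadDoc() { this(null); }
    SketchThreadDoc(String id) { this.id = id; }

    boolean isNew() { return id == null; }
}

class SketchAnnotation {
    private final String id; // null for a new annotation
    private SketchThreadDoc document;
    private SketchAnnotation previous;

    SketchAnnotation() { this(null); }
    SketchAnnotation(String id) { this.id = id; }

    boolean isNew() { return id == null; }
    void bind(SketchThreadDoc doc) { document = doc; }
    SketchAnnotation previous() { return previous; }

    void setPrevious(SketchAnnotation other) {
        // Condition 1: both annotations are linked to the same document.
        if (document == null || document != other.document)
            throw new IllegalStateException("annotations must belong to the same document");
        // Condition 2: the document and the predecessor are proxies, i.e. have identifiers.
        if (document.isNew() || other.isNew())
            throw new IllegalStateException("document and predecessor must be proxies");
        previous = other;
    }
}

class ThreadingDemo {
    public static void main(String[] args) {
        SketchThreadDoc doc = new SketchThreadDoc("someid");
        SketchAnnotation a1 = new SketchAnnotation("1");
        SketchAnnotation a2 = new SketchAnnotation();
        a1.bind(doc);
        a2.bind(doc);
        a2.setPrevious(a1);
        System.out.println("a2 follows a1: " + (a2.previous() == a1));
    }
}
```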

Creating Alternative Representations

In the current version of the gDM, alternative representations may only have common properties.

Creating Parts

In addition to common properties, document parts may be ordered (e.g. document chapters). It is easy to set the order of a part:

GCubePart p = ...
p.setOrder(3);

but the operation succeeds only if:

  • p is already linked to a document;
  • the document has another part with order 2.

Thus clients may add ordered parts to new documents or document proxies, as long as they do so after they have added the parts to the document and in strict order, e.g.:

GCubePart part1 = new GCubePart();
...set part properties...
doc.parts().add(part1);
part1.setOrder(1);
 
GCubePart part2 = new GCubePart();
...set part properties...
doc.parts().add(part2);
part2.setOrder(2);
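
The rule behind strict ordering can be sketched as follows. The classes OrderedDoc and OrderedPart are hypothetical, and so is the exception type; the only assumption beyond the text is that order 1 needs no predecessor, as the example above suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for documents and parts; NOT the gML classes.
class OrderedDoc {
    private final List<OrderedPart> parts = new ArrayList<>();

    void add(OrderedPart part) {
        parts.add(part);
        part.bind(this);
    }

    boolean hasPartWithOrder(int order) {
        for (OrderedPart p : parts)
            if (p.order() != null && p.order() == order) return true;
        return false;
    }
}

class OrderedPart {
    private OrderedDoc document; // null until added to a document
    private Integer order;       // null while the part is unordered

    void bind(OrderedDoc doc) { document = doc; }
    Integer order() { return order; }

    void setOrder(int n) {
        if (n < 1) throw new IllegalArgumentException("order must be positive");
        if (document == null)
            throw new IllegalStateException("part is not linked to a document");
        // Every order above 1 requires a sibling with the preceding order.
        if (n > 1 && !document.hasPartWithOrder(n - 1))
            throw new IllegalStateException("no part with order " + (n - 1));
        order = n;
    }
}

class OrderedPartsDemo {
    public static void main(String[] args) {
        OrderedDoc doc = new OrderedDoc();
        OrderedPart part1 = new OrderedPart();
        OrderedPart part2 = new OrderedPart();
        doc.add(part1);
        part1.setOrder(1);
        doc.add(part2);
        part2.setOrder(2);
        System.out.println(part1.order() + " then " + part2.order());
    }
}
```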

Updating Elements

Clients may use element proxies to express changes they wish to apply to documents that have been previously stored within the system.
Given a document proxy that is synchronised with its remote counterpart, we may ask it to track all future changes to its properties:

doc.trackChanges();

In response, the proxy takes a snapshot of its current properties. Later, it can use this snapshot to tell how the properties have changed since. This delta information may then be fed to the system as a minimal and exact specification of the changes that need to be reflected onto the proxied document. The update operations of the gDL do precisely this: they extract delta information from document proxies and feed it to the system on behalf of clients.

Changes can only be tracked on a document proxy that is synchronised with its remote counterpart. Enabling change tracking on a new document is nonsensical and will generate an error. Again, clients can use the read operations of the gDL to obtain synchronised document proxies.
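
The snapshot-and-delta mechanism can be illustrated in isolation. The sketch below flattens properties into a string map, which is a deliberate simplification; TrackedProxy, its set() and delta() methods, and the convention of mapping deleted properties to null are all assumptions made for the example and do not describe the gML internals.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical proxy with flat string properties; NOT the gML implementation.
class TrackedProxy {
    private final Map<String, String> properties = new TreeMap<>();
    private Map<String, String> snapshot; // null while changes are not tracked

    // A null value deletes the property.
    void set(String name, String value) {
        if (value == null) properties.remove(name);
        else properties.put(name, value);
    }

    void trackChanges() { snapshot = new TreeMap<>(properties); }

    boolean isTracked() { return snapshot != null; }

    void resetChanges() { snapshot = null; }

    // Delta since trackChanges(): new value per changed or added property, null per deleted one.
    Map<String, String> delta() {
        if (snapshot == null) throw new IllegalStateException("changes are not tracked");
        Map<String, String> delta = new TreeMap<>();
        for (Map.Entry<String, String> e : properties.entrySet())
            if (!e.getValue().equals(snapshot.get(e.getKey())))
                delta.put(e.getKey(), e.getValue());
        for (String name : snapshot.keySet())
            if (!properties.containsKey(name))
                delta.put(name, null);
        return delta;
    }
}

class DeltaDemo {
    public static void main(String[] args) {
        TrackedProxy proxy = new TrackedProxy();
        proxy.set("name", "old name");
        proxy.set("language", "it");
        proxy.trackChanges();
        proxy.set("name", "new name"); // changed
        proxy.set("language", null);   // deleted
        proxy.set("type", "report");   // added
        System.out.println(proxy.delta());
    }
}
```

Only the three touched properties appear in the delta; anything left untouched after trackChanges() is omitted, which is what makes the specification minimal.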

Staged Updates

When changes do not depend on the state of documents within the system, the cost of synchronisation can be conveniently avoided. In these cases, clients can perform staged updates on document proxies. As introduced above, creating such proxies requires only knowledge of document identifiers:

GCubeDocument proxy = new GCubeDocument("someid");

Clients can then proceed as follows:

  • stage the proxy, i.e. set dummy values for properties they wish to change or delete;
  • enable change tracking on the proxy;
  • change or delete the staged properties, or else add new properties to the proxy.

In essence, staging gives enough information to understand what properties have changed and how, regardless of their current value in the system, e.g.:

proxy.setName("dummy");  //stage name of proxy
proxy.trackChanges();    //marks end of staging
proxy.setName("newname"); //applies changes to staged property

Here, the client stages the name of the proxy and changes it to its new value after enabling change tracking. Clients can add or delete properties in a similar manner, e.g:

import static java.util.Locale.*;
 
proxy.setBytestreamURI(new URI("dummy"));  //stage bytestream URI of proxy
 
proxy.trackChanges();
 
proxy.setBytestreamURI(null);  //delete staged bytestream URI
proxy.setLanguage(ENGLISH);    //add new language

Here, the client stages a bytestream URI and then erases that value after enabling change tracking. The client then adds a language to the proxy; since no language was staged, it is understood to be entirely new for the document.

Staging extends to generic document properties:

GCubeElementProperty prop1 = new GCubeElementProperty("somekey","somevalue");
GCubeElementProperty prop2 = new GCubeElementProperty("someotherkey");
 
proxy.addProperty(prop1);
proxy.addProperty(prop2);
 
proxy.trackChanges();
 
//change value of first property
prop1.setValue("newval");
 
//delete second property
proxy.removeProperty("someotherkey");
 
//add new property
GCubeElementProperty newprop = new GCubeElementProperty("newkey","newtype","newvalue");
proxy.addProperty(newprop);

Finally, clients can use the method isTracked() to check whether changes are being tracked on a document proxy, e.g.:

GCubeDocument proxy = new GCubeDocument("someid");
...
proxy.trackChanges();
...
assert(proxy.isTracked());

They can also use the method resetChanges() to delete the record of changes applied since they last enabled tracking:

proxy.resetChanges();
assert(proxy.isTracked()==false);

Resetting changes is rarely needed, however, as the update operations of the gDL do that automatically after committing the changes that have been tracked on a document proxy.

Updating Inner Elements

Updates to inner elements follow a similar pattern, e.g.:

import static java.util.Locale.*;
 
GCubeDocument proxy = new GCubeDocument("someid");
 
//stage first metadata
GCubeMetadata mdproxy1 = new GCubeMetadata("1");
mdproxy1.setLanguage(ENGLISH);
 
//stage second metadata
GCubeMetadata mdproxy2 = new GCubeMetadata("2");
 
proxy.metadata().add(mdproxy1);
proxy.metadata().add(mdproxy2);
 
//track change
proxy.trackChanges(); 
 
//effect change on first metadata
mdproxy1.setLanguage(ITALIAN);  //change metadata language
mdproxy1.setName("newname");  //add metadata name
 
//remove second metadata
proxy.metadata().remove(mdproxy2);
 
//add new metadata
GCubeMetadata newmd = new GCubeMetadata();  //new metadata
.... //build new metadata
proxy.metadata().add(newmd);

In this example, the client stages a set of changes to the metadata elements of a document. It stages one metadata proxy in order to change its language and add a name to it, and stages another metadata proxy that it wishes to delete from the proxied document. The client then enables change tracking and performs the required changes, including the removal of the second metadata element from the proxy. Finally, it adds a new metadata element to the document proxy; since this element was not staged, it is understood to be new for the document.

As a further example, consider staging the definition of an annotation thread:

GCubeDocument proxy = new GCubeDocument("someid");
 
//stage first annotation
GCubeAnnotation annotationProxy = new GCubeAnnotation("1");
proxy.annotations().add(annotationProxy);
 
proxy.trackChanges();
 
GCubeAnnotation newAnnotation = new GCubeAnnotation();
//build new annotation...
 
proxy.annotations().add(newAnnotation);
newAnnotation.setPrevious(annotationProxy);


And again, consider staging the addition of an ordered part:

GCubeDocument proxy = new GCubeDocument("someid");
 
//stage existing part
GCubePart partProxy = new GCubePart("1");
proxy.parts().add(partProxy);
partProxy.setOrder(5);
 
proxy.trackChanges();
 
GCubePart newPart = new GCubePart();
 
...  //build new part
 
proxy.parts().add(newPart);
 
newPart.setOrder(6);  //follows the staged part, which has order 5

Inspecting Elements

Consuming elements is straightforward and the code documentation for the element classes shows clearly what accessor methods are available on them, what output they may return, and under what conditions they may fail. We discuss below some cases of particular interest.

Inspecting Bytestreams

The bytestream of documents and their inner elements can be inspected with any of three methods:

  • byte[] bytestream()
  • URI bytestreamURI()
  • InputStream resolveBytestream()

The first two methods return bytestreams as they were stored, i.e. inlined or referenced by URI. The third method abstracts over the two cases, resolving URIs when bytestreams are referenced. All methods return null when elements have no bytestream.
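
The way the third method abstracts over the other two can be sketched as below. SketchContent is a hypothetical class, and opening the referenced URI with URL.openStream() is an assumption about how resolution might work, not a description of the gML.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;

// Hypothetical element content, either inlined or referenced; NOT the gML class.
class SketchContent {
    private final byte[] inlined; // may be null
    private final URI reference;  // may be null

    SketchContent(byte[] inlined, URI reference) {
        this.inlined = inlined;
        this.reference = reference;
    }

    byte[] bytestream() { return inlined; }

    URI bytestreamURI() { return reference; }

    // One access path for both storage styles; null when there is no bytestream.
    InputStream resolveBytestream() {
        if (inlined != null) return new ByteArrayInputStream(inlined);
        if (reference != null) {
            try {
                return reference.toURL().openStream(); // assumed resolution strategy
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return null;
    }
}

class ResolveDemo {
    public static void main(String[] args) throws IOException {
        SketchContent content = new SketchContent(new byte[] {104, 105}, null);
        try (InputStream in = content.resolveBytestream()) {
            System.out.println("first octet: " + in.read());
        }
        System.out.println("empty element: " + new SketchContent(null, null).resolveBytestream());
    }
}
```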

Inspecting Inner Elements

Navigating the inner elements of documents, in particular, relies on the same type-specific collections used for additions e.g.:

for (GCubeMetadata m : doc.metadata())
	...m...
 
Iterator<GCubeAlternative> it = doc.alternatives().iterator(); 
while (it.hasNext()) {
   .... GCubeAlternative a = it.next()......
}
 
 
if (doc.parts().contains("somepartid")) {
	GCubePart p = doc.parts().get("somepartid");  //assuming document and part are not new
	...p...
}
 
if (doc.annotations().size()>0)
	....

Type-specific collections are specialised, and behave like standard Java Lists only insofar as iteration and size inspection are concerned. Clients that wish to use other List-based methods on their elements need to do so on a clone of the type-specific collection that has the required List type. For example, to directly access a particular element of the collection, clients may do:

List<GCubeAnnotation> annotations = doc.annotations().toList();
...annotations.get(0)....

Note also that type-specific collections may have correspondingly type-specific methods. We introduce these methods below, when we discuss how clients may inspect elements of given types.

Inspecting Annotations

As discussed above, the annotations of a document may be related to each other so as to form a linear thread. Clients may navigate one such thread backwards starting from any annotation in a document proxy, e.g.:

GCubeAnnotation a = ....
GCubeAnnotation predecessor = a.previous();

There are two cases in which previous() returns null:

  • the annotation has genuinely no predecessor, e.g. it is at the beginning of a thread or is an isolated annotation of the document;
  • the annotation does have a predecessor but this element is not locally available in the document proxy.

A document proxy may be only partially synchronised with the document while it is being locally staged, or when it has been parsed from a partial serialisation of the document. The latter case, for example, is common when clients use the read operations of the gDL to retrieve no less and no more information about the document than they really require. In particular, they may retrieve some annotations of the document but not others.
After parsing a partial document serialisation, the proxy resolves into objects all the identifiers of previous annotations that can be resolved. The identifiers that cannot be resolved remain available to clients for inspection:

String predecessorID = a.previousID();

Under these constraints, clients can always obtain and inspect the threads of annotations that are available on a document proxy as follows:

GCubeDocument doc=...;
...
List<AnnotationThread> threads = doc.annotations().threads(); 
for (AnnotationThread thread : threads) {
    ...
   //start of thread
   GCubeAnnotation root = thread.annotation();    ...
   //all annotations, in depth-first order
   List<GCubeAnnotation> all = thread.annotations();   ...
   //threads rooted in answers to root (navigate the tree)
   List<AnnotationThread> answers = thread.answers();   ...
   //containment check
   GCubeAnnotation someAnnotation = ...;
   if (thread.contains(someAnnotation))    ...
}

Note: threads are always identified and computed starting from their root annotations. Thread fragments that do not include thread roots will not be returned from threads().
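
One way to picture how threads could be computed from 'previous' links is sketched below. The classes SketchNote and SketchThread are hypothetical, and the algorithm is our own illustration rather than the gML's: annotations are indexed by identifier, each one is attached to its predecessor when that is available, and only annotations without a predecessor start a thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical annotation: an identifier plus the identifier of its predecessor.
class SketchNote {
    final String id;
    final String previousId; // null for a thread root

    SketchNote(String id, String previousId) {
        this.id = id;
        this.previousId = previousId;
    }
}

// Hypothetical thread: a root annotation and the threads rooted in its answers.
class SketchThread {
    final SketchNote root;
    final List<SketchThread> answers = new ArrayList<>();

    SketchThread(SketchNote root) { this.root = root; }

    // All annotations of the thread, in depth-first order.
    List<SketchNote> annotations() {
        List<SketchNote> all = new ArrayList<>();
        collect(all);
        return all;
    }

    private void collect(List<SketchNote> into) {
        into.add(root);
        for (SketchThread answer : answers) answer.collect(into);
    }

    static List<SketchThread> threads(List<SketchNote> available) {
        Map<String, SketchThread> byId = new LinkedHashMap<>();
        for (SketchNote n : available) byId.put(n.id, new SketchThread(n));
        List<SketchThread> roots = new ArrayList<>();
        for (SketchThread t : byId.values()) {
            String prev = t.root.previousId;
            if (prev == null) roots.add(t);                                  // genuine root
            else if (byId.containsKey(prev)) byId.get(prev).answers.add(t);  // answer to prev
            // otherwise the predecessor is unavailable: the fragment is not reported
        }
        return roots;
    }
}

class ThreadsDemo {
    public static void main(String[] args) {
        List<SketchNote> notes = Arrays.asList(
            new SketchNote("a", null), new SketchNote("b", "a"),
            new SketchNote("c", "b"), new SketchNote("d", "a"),
            new SketchNote("e", "missing"));
        for (SketchThread thread : SketchThread.threads(notes))
            for (SketchNote n : thread.annotations()) System.out.println(n.id);
    }
}
```

In the usage above, annotation e refers to a predecessor that is not locally available, so the fragment that starts at e is left out, in line with the note on thread fragments.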

Inspecting Parts

We have seen above that the parts of a document may be ordered. While the method order() can be used to inspect the order of a part, the method previous() can be used to access the part that precedes the part in that ordering:

GCubePart p = ....
GCubePart predecessor = p.previous();
if (predecessor!=null) 
  assert(predecessor.order()==p.order()-1);

The reasons for the null check are analogous to those discussed above for annotation threads. In particular, previous() returns null when:

  • the part has genuinely no predecessor, e.g. it is the first or has no order within the document;
  • the part does have a predecessor but this element is not locally available in the document proxy.

See the discussion on annotation threads for an explanation of why and when this may occur.

Element URIs

All element classes can return URIs that are assigned to elements when these are stored with the Content Manager service. The content URI of an element is not formally part of the gDM, but element classes can compute it dynamically from element identifiers and collection identifiers, e.g.:

GCubeDocument doc = ....
 
if (!doc.isNew() && doc.collectionID()!=null) {
  URI uri = doc.uri();
  assert(uri.equals(new URI("cms://collID/docID")));
}

where docID and collID are, respectively, the identifier of the document and the identifier of the document's collection.

Note: as the example illustrates, URIs can only be computed on document proxies which are bound to a collection. An attempt to invoke uri() on a new document, or on a proxy which is not fully configured, raises an error. The document proxies that are returned by the read operations of the gDL satisfy these requirements.

Content URIs can also be computed on inner elements:

GCubeMetadata metadata = ....
 
if (!metadata.isNew() && metadata.document()!=null) {
  URI uri = metadata.uri();
  assert(uri.equals(new URI("cms://collID/docID/metadataID")));
}

where docID and collID are as above, and where metadataID is the identifier of the metadata element.

Note: as the example illustrates, URIs of inner elements can only be computed on element proxies. Furthermore, the proxy must already be bound to a document for which a URI can be computed, as discussed above. An attempt to invoke uri() on an inner element that does not meet these requirements raises an error. The inner elements contained in the document proxies that are returned by the read operations of the gDL satisfy these requirements.
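
The shape of content URIs can be reproduced with plain string composition. The helper class ContentUriSketch and its methods are hypothetical, the exception type is an assumption, and the sketch further assumes that identifiers contain only characters that are legal in a URI; how the gML escapes identifiers is not covered here.

```java
import java.net.URI;

// Hypothetical helper that mimics the shape of content URIs; NOT the gML code.
class ContentUriSketch {

    // cms://<collection identifier>/<document identifier>
    static URI documentUri(String collectionId, String documentId) {
        if (collectionId == null || documentId == null)
            throw new IllegalStateException("document must be a proxy bound to a collection");
        return URI.create("cms://" + collectionId + "/" + documentId);
    }

    // cms://<collection identifier>/<document identifier>/<inner element identifier>
    static URI innerElementUri(String collectionId, String documentId, String elementId) {
        if (elementId == null)
            throw new IllegalStateException("inner element must be a proxy");
        return URI.create(documentUri(collectionId, documentId) + "/" + elementId);
    }
}

class ContentUriDemo {
    public static void main(String[] args) {
        System.out.println(ContentUriSketch.documentUri("collID", "docID"));
        System.out.println(ContentUriSketch.innerElementUri("collID", "docID", "metadataID"));
    }
}
```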

Serialisation and Deserialisation

The gML includes a small set of facilities to convert GCubeDocument instances to and from their serialisations in the exchange format. The gDL makes heavy use of these facilities in its implementation of read and write operations over documents. All clients may nonetheless use them for a variety of serialisation purposes (result set creation, temporary file storage, testing, etc.).

The facilities are available as static methods of the Conversions class and their use can be illustrated as follows:

import static org.gcube.contentmanagement.gcubemodellibrary.elements.Conversions.*;
...
GCubeDocument doc = ...
 
String xml = toXML(doc);
 
...
 
StringReader reader = new StringReader(xml);
GCubeDocument doc2 = toDocument(reader);
 
assert(doc.equals(doc2));

The Conversions class also includes static methods for converting between documents and the object-based model of edge-labelled trees defined by the Content Manager Library (CML). If required, clients can then use the facilities defined by the CML to manipulate such trees:

import static org.gcube.contentmanagement.gcubemodellibrary.elements.Conversions.*;
...
GCubeDocument doc = ...
 
GDoc tree = toTree(doc);
 
...
 
GCubeDocument doc2 = toDocument(tree);
 
assert(doc.equals(doc2));