GCube Information Organisation Services (LEGACY)
The gCube Information Organisation Services are the family of subsystems that implement the management (storage, organisation, description and annotation) of information. These services implement the notion of Information Objects, i.e. logical units of information that may consist of, and be linked to, other Information Objects so as to form compound objects. The services are organised in three main functional areas: (i) the storage and organisation of Information Objects and their constituents (Content and Storage Management); (ii) the management of the metadata objects (themselves implemented as a kind of Information Object) that may equip each Information Object (Metadata Management); and (iii) the management of the annotation objects (also implemented as a kind of Information Object) that may enrich each Information Object (Annotation Management).
The gCube Content Model
While other infrastructures for the manipulation of content in Grid-based environments, like gLite, provide basic file-system-like functionality for content manipulation, the Information Organization services aim to provide higher-level functionality, built on top of gLite or other storage facilities. Content is stored and organized following a graph-based data model, the Information Object Model, which allows finer control of content than a file-based view by making it possible to annotate content with arbitrary properties and to relate different content units via arbitrary relationships. Building on this basic data model, other services in the Information Organization family provide other gCube services with more sophisticated data models to manage complex documents, document collections, metadata and annotations.
Information Object Model
The elementary constructs of the model are Information Objects (the nodes of the graph) and object references (the arcs). The ER Diagram in Figure 20 describes the model.
- An Information Object (IO) represents an elementary unit of information. It is uniquely identified by an Object Identifier (OID), is labelled with a name and a type, and is optionally annotated with a number of properties. These properties are simple key-type-value associations. Finally, it can be associated with raw content, which may be content of any kind. The model hides the actual storage details of the content of an object, which can, for instance, be stored as a file in gLite or as a BLOB field in a database, or be maintained in storage facilities not under the direct control of the Information Organization Services, e.g. as a file stored on a remote server and accessible through a protocol such as http, ftp or gridftp.
- An object reference “links” two Information Objects. Each object might (i) reference many other objects and (ii) be referenced by many objects (m-n relationship). A reference is directed. It is labelled with a type attribute, called the primary role; a secondary role, which may optionally further specify the function of the primary role; and a position attribute, which allows ordered graph structures to be built. It can also be associated with a number of other properties.
The Information Object model introduced above is exposed to higher-level Information Organization Services by a component called the Storage Management Service (cf. Section 6.3). The generality of this simple information model allows complex data structures to be built. The services within the Information Organization stack build on top of this model to offer an organized, high-level view of content. This is done by attaching specialised semantics to the labels used to annotate Information Objects and references.
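The two constructs described above can be sketched as plain data structures. This is only an illustrative, in-memory sketch: the class names, field names and the `has-part` role used in the usage note are assumptions made for the example and are not the Storage Management Service interface.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

# A property is a key-type-value association: key -> (type, value).
Properties = Dict[str, Tuple[str, Any]]


@dataclass
class InformationObject:
    oid: str                              # Object Identifier (unique)
    name: str
    type: str
    properties: Properties = field(default_factory=dict)
    raw_content: Optional[bytes] = None   # storage details are hidden by the model


@dataclass
class Reference:
    source: str                           # OID of the referencing object
    target: str                           # OID of the referenced object
    primary_role: str
    secondary_role: Optional[str] = None  # optional refinement of the primary role
    position: Optional[int] = None        # enables ordered structures
    properties: Properties = field(default_factory=dict)


class ObjectGraph:
    """Hypothetical in-memory graph of Information Objects and references."""

    def __init__(self) -> None:
        self.objects: Dict[str, InformationObject] = {}
        self.references: List[Reference] = []

    def add_object(self, obj: InformationObject) -> InformationObject:
        self.objects[obj.oid] = obj
        return obj

    def link(self, source: str, target: str, primary_role: str,
             secondary_role: Optional[str] = None,
             position: Optional[int] = None) -> Reference:
        # Both ends of a reference must be known objects.
        for oid in (source, target):
            if oid not in self.objects:
                raise KeyError(f"unknown object: {oid}")
        ref = Reference(source, target, primary_role, secondary_role, position)
        self.references.append(ref)
        return ref

    def targets(self, oid: str, primary_role: str) -> List[InformationObject]:
        """Objects referenced by `oid` under a role, ordered by position."""
        refs = [r for r in self.references
                if r.source == oid and r.primary_role == primary_role]
        # References without a position are listed last.
        refs.sort(key=lambda r: (r.position is None, r.position or 0))
        return [self.objects[r.target] for r in refs]
```

With this sketch, linking one object to two others under an assumed `has-part` role with positions 2 and 1 and then calling `targets` returns the referenced objects in position order, which is how the position attribute supports ordered structures.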
Document and Collection Models
Starting from this model, it is easy to build a document model in which complex documents, composed of various, possibly nested subparts, are represented as chains of Information Objects linked via appropriate relationships. For instance, an HTML document that includes a number of images may be modelled as a complex object that provides references to the Information Objects containing the images. The position attribute present in the Information Object model helps in representing an aggregate object made up of parts that have to be fitted together in a certain order. A dedicated component in the Information Organization family, the Content Management Service (cf. Section 6.4), exposes the document model to other services.
In a similar way, specific, complex metadata (like indexes or multimedia features) can be represented as separate Information Objects that are associated with the object they describe via appropriate relationships. For instance, a reference type may be “indexes” with a role name that gives additional information, like “full-text index”.
The same representation mechanisms are also used to instantiate a concept of collection. Collections are the basic data structure used to organize information inside the Information Organization Services. Each collection is characterized by a collection identifier, is labelled with a number of specific properties, and contains a number of documents. More specifically, a document can only exist as part of a given collection. Collections can in turn be nested, i.e. a collection can appear as a member of another collection. A collection can be static (or materialized), that is, contain a statically defined set of objects that are added to it or deleted from it explicitly, or be virtual.
The content of a virtual collection is not determined statically, but rather specified through declarative membership predicates that define which objects currently present in the gCube information-objects space are part of the collection. Its contents are thus determined dynamically at the moment when the collection is accessed by evaluating the membership predicates. For example, it is possible to define the collection of all objects having a certain MIME type (e.g. pdf).
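The difference between static and virtual collections can be illustrated with a small sketch. Everything here is assumed for the example: the class names, the `mime` property key and the dictionary used as the object space are hypothetical and do not reflect the gCube collection interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Obj:
    oid: str
    properties: Dict[str, str] = field(default_factory=dict)


# The "information-object space": every object currently known, by OID.
Space = Dict[str, Obj]


class StaticCollection:
    """Membership is an explicit set, changed only by add/remove."""

    def __init__(self, collection_id: str) -> None:
        self.collection_id = collection_id
        self._members: Set[str] = set()

    def add(self, oid: str) -> None:
        self._members.add(oid)

    def remove(self, oid: str) -> None:
        self._members.discard(oid)

    def members(self, space: Space) -> List[str]:
        return sorted(oid for oid in self._members if oid in space)


class VirtualCollection:
    """Membership is a predicate, evaluated each time the collection is accessed."""

    def __init__(self, collection_id: str,
                 predicate: Callable[[Obj], bool]) -> None:
        self.collection_id = collection_id
        self.predicate = predicate

    def members(self, space: Space) -> List[str]:
        return sorted(oid for oid, obj in space.items() if self.predicate(obj))


def has_mime(mime: str) -> Callable[[Obj], bool]:
    """Membership predicate: objects whose (assumed) 'mime' property matches."""
    return lambda obj: obj.properties.get("mime") == mime
```

A virtual collection built with `has_mime("application/pdf")` picks up a PDF object as soon as it is added to the space, without any explicit `add` call, whereas a static collection only changes when its members are added or removed explicitly.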
Metadata and Annotation Models
The metadata and annotation models are based upon a number of characterisations of the primitives which define the gCube storage model, namely Information Objects and directed, binary relationships between such objects. Based on such characterisations, we can define the metadata and annotation models so as to follow closely the intuition and yet preserve a degree of semi-formal rigour.
- The primary role of a relationship is a characterisation of its intended semantics. We assume that the secondary role of a relationship is an optional specialisation of the primary role. Whenever convenient, we say that a relationship with (primary or secondary) role R is an R-relationship and that its source and target are an R-source and an R-target. If R1 and R2 are, respectively, the primary and secondary role of a relationship, then an R2-relationship is also an R1-relationship.
- A relationship with role R is a dependency for its source (target) if the existence of the latter depends on the existence of at least one R-target (R-source). In this case, we also speak of a dependent source (dependent target). As a pragmatic corollary, a dependent R-source (R-target) is deleted when its last R-target (R-source) is deleted.
- We say that a relationship is exclusive for its source (target) if it cannot relate the latter to more than one target (source). Otherwise, we say that it is repeatable on its source (target).
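The characterisations above can be made concrete with a minimal sketch. It covers only the source side (dependent sources and relationships exclusive for their source); the target side is symmetric. The class names, the constructor parameters and the `describes` role used in the usage note are assumptions for illustration, not part of the gCube services.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Rel:
    source: str
    target: str
    primary: str
    secondary: Optional[str] = None

    def has_role(self, role: str) -> bool:
        # A relationship with primary role R1 and secondary role R2
        # is both an R1-relationship and an R2-relationship.
        return role in (self.primary, self.secondary)


class Store:
    """Hypothetical store enforcing dependency and exclusivity on sources."""

    def __init__(self, dependent_source_roles: Iterable[str] = (),
                 exclusive_source_roles: Iterable[str] = ()) -> None:
        self.objects: Set[str] = set()
        self.rels: List[Rel] = []
        self.dependent_source_roles = set(dependent_source_roles)
        self.exclusive_source_roles = set(exclusive_source_roles)

    def relate(self, source: str, target: str, primary: str,
               secondary: Optional[str] = None) -> Rel:
        rel = Rel(source, target, primary, secondary)
        # Exclusive for its source: at most one R-target per source.
        for role in self.exclusive_source_roles:
            if rel.has_role(role) and any(
                    r.source == source and r.has_role(role) for r in self.rels):
                raise ValueError(f"{role} is exclusive for source {source}")
        self.rels.append(rel)
        return rel

    def delete(self, oid: str) -> None:
        self.objects.discard(oid)
        removed = [r for r in self.rels if oid in (r.source, r.target)]
        self.rels = [r for r in self.rels if oid not in (r.source, r.target)]
        for r in removed:
            # Only sources that just lost `oid` as a target are candidates.
            if r.target != oid or r.source not in self.objects:
                continue
            for role in self.dependent_source_roles:
                lost_last_target = r.has_role(role) and not any(
                    x.source == r.source and x.has_role(role) for x in self.rels)
                if lost_last_target:
                    # A dependent R-source is deleted with its last R-target.
                    self.delete(r.source)
                    break
```

For example, if `describes` is declared a dependency for its source, a metadata object that describes two documents survives the deletion of the first document and is removed together with the second. If `describes` is instead declared exclusive for its source, the second `relate` call from the same metadata object raises an error.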