GCube Information Organisation Services (LEGACY)

From Gcube Wiki
Revision as of 13:03, 24 November 2008 by Ali.boloori (Talk | contribs) (Metadata and Annotation Models)

The gCube Information Organisation Services are the family of subsystems implementing the services that support the management (storage, organisation, description and annotation) of information. These services implement the notion of Information Objects, i.e. logical units of information potentially consisting of, and linked to, other Information Objects so as to form compound objects. The services are organised in three main functional areas: (i) the storage and organisation of such Information Objects and their constituents (Content and Storage Management); (ii) the management of the metadata objects (actually implemented as a kind of Information Object) potentially equipping each Information Object (Metadata Management); and (iii) the management of the annotation objects (actually implemented as a kind of Information Object) potentially enriching each Information Object (Annotation Management).

The gCube Content Model

While other infrastructures for the manipulation of content in Grid-based environments, like gLite, provide basic file-system-like functionality for content manipulation, the Information Organization services aim to provide higher-level functionality, built on top of gLite or other storage facilities. Content is stored and organized following a graph-based data model, the Information Object Model, which allows finer control of content than a file-based view by making it possible to annotate content with arbitrary properties and to relate different content units via arbitrary relationships. Building on this basic data model, other services in the Information Organization family offer other gCube services more sophisticated data models for managing complex documents, document collections, metadata and annotations.

Information Object Model

The elementary constructs of the model are Information Objects (the nodes of the graph) and object references (the arcs). The ER Diagram in Figure 20 describes the model.

  • An Information Object (IO) represents an elementary information unit. It is uniquely identified by an Object Identifier (OID), is labelled with a name and a type, and is optionally annotated with a number of properties. These properties are simple key-type-value associations. Finally, it can be associated with raw content, which can be content of any kind. The model hides the actual storage details of the content of an object, which can, for instance, be stored as a file in gLite or as a BLOB field in a database, or be maintained in storage facilities not under the direct control of the Information Organization Services, e.g. as a file stored on a remote server and accessible through some protocol like http, ftp or gridftp.

  • An object reference “links” two Information Objects. Each object might (i) reference many other objects and (ii) be referenced by many objects (m-n relationship). A reference is directed and is labelled with a type attribute, called the primary role; a secondary role, which may optionally further specify the function of the primary role; and a position attribute, which allows ordered graph structures to be built. It can also be associated with a number of other properties.

The Information Object model introduced above is exposed to higher-level Information Organization Services by a component called the Storage Management Service (cf. Section 6.3). The generality of this simple information model allows complex data structures to be built. The services within the Information Organization stack build on top of this model to offer an organized, high-level view of content. This is done by attaching specialised semantics to the labels used to annotate Information Objects and references.
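The model described above can be sketched as a small in-memory graph. This is only an illustration under stated assumptions: the class names, the `ObjectGraph` container, the use of MIME types as object types and the `has-part` role are hypothetical and are not taken from the Storage Management Service interface.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class InformationObject:
    oid: str    # Object Identifier, unique within the graph
    name: str
    type: str   # assumption: a MIME type is used as the object type
    # Properties are key-type-value associations: key -> (type, value).
    properties: Dict[str, Tuple[str, Any]] = field(default_factory=dict)
    # Raw content is opaque; the model hides where it is actually stored.
    raw_content: Optional[bytes] = None


@dataclass
class ObjectReference:
    source: str                           # OID of the referencing object
    target: str                           # OID of the referenced object
    primary_role: str                     # the type attribute of the reference
    secondary_role: Optional[str] = None  # optional refinement of the primary role
    position: Optional[int] = None        # allows ordered structures
    properties: Dict[str, Tuple[str, Any]] = field(default_factory=dict)


class ObjectGraph:
    """Hypothetical in-memory container for objects and references."""

    def __init__(self) -> None:
        self.objects: Dict[str, InformationObject] = {}
        self.references: List[ObjectReference] = []

    def add_object(self, obj: InformationObject) -> None:
        if obj.oid in self.objects:
            raise ValueError(f"duplicate OID: {obj.oid}")
        self.objects[obj.oid] = obj

    def link(self, ref: ObjectReference) -> None:
        # Both ends of a reference must be known objects.
        if ref.source not in self.objects or ref.target not in self.objects:
            raise KeyError("both ends of a reference must exist")
        self.references.append(ref)

    def outgoing(self, oid: str, primary_role: Optional[str] = None) -> List[ObjectReference]:
        refs = [
            r for r in self.references
            if r.source == oid and (primary_role is None or r.primary_role == primary_role)
        ]
        # Order by position; references without a position come last.
        return sorted(refs, key=lambda r: (r.position is None, r.position if r.position is not None else 0))


# An HTML page referencing two images in a fixed order.
graph = ObjectGraph()
graph.add_object(InformationObject("oid-1", "page.html", "text/html",
                                   properties={"language": ("string", "en")}))
graph.add_object(InformationObject("oid-2", "logo.png", "image/png"))
graph.add_object(InformationObject("oid-3", "photo.jpg", "image/jpeg"))
graph.link(ObjectReference("oid-1", "oid-3", "has-part", position=2))
graph.link(ObjectReference("oid-1", "oid-2", "has-part", position=1))

ordered_parts = [r.target for r in graph.outgoing("oid-1", "has-part")]
print(ordered_parts)
```

Keeping references as separate records, rather than as fields of the objects, is one way to allow the same object to be both the source and the target of many references.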

Document and Collection Models

Starting from this model, it is easy to build a document model in which complex documents, composed of various, possibly nested subparts, are represented as chains of Information Objects linked via appropriate relationships. For instance, an HTML document that includes a number of images may be modelled as a complex object that provides references to the Information Objects containing the images. The position attribute present in the Information Object model helps in representing an aggregate object made up of parts that have to be fitted together in a certain order. A dedicated component in the Information Organization family, the Content Management Service (cf. Section 6.4), exposes the document model to other services.

In a similar way, specific, complex metadata (like indexes or multimedia features) can be represented as separate Information Objects that are associated with the object they describe via appropriate relationships. For instance, a reference type may be “indexes” with a role name that gives additional information, like “full-text index”.

The same representation mechanisms are also used to instantiate a concept of collection. Collections are the basic data structure used to organize information inside the Information Organization Services. Each collection is characterized by a collection identifier, is labelled with a number of specific properties, and contains a number of documents. More specifically, a document can only exist as part of a given collection. Collections can in turn be nested, i.e. a collection can appear as a member of another collection. A collection can be static (or materialized), that is, contain a statically defined number of objects that are explicitly added to it or deleted from it, or it can be virtual.

The content of a virtual collection is not determined statically, but rather specified through declarative membership predicates that define which objects currently present in the gCube Information Object space are part of the collection. Its contents are thus determined dynamically, at the moment the collection is accessed, by evaluating the membership predicates. For example, it is possible to define the collection of all objects having a certain MIME type (e.g. pdf).
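The difference between static and virtual collections can be illustrated with a minimal sketch. It assumes that a membership predicate can be modelled as a plain Python function over objects and that the object space is a dictionary keyed by OID; the class and attribute names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class InformationObject:
    oid: str
    name: str
    type: str  # assumption: a MIME type is used as the object type


@dataclass
class StaticCollection:
    collection_id: str
    member_oids: Set[str] = field(default_factory=set)

    def add(self, oid: str) -> None:
        self.member_oids.add(oid)       # members are added explicitly

    def remove(self, oid: str) -> None:
        self.member_oids.discard(oid)   # and removed explicitly

    def members(self, space: Dict[str, InformationObject]) -> List[InformationObject]:
        return [space[oid] for oid in sorted(self.member_oids) if oid in space]


@dataclass
class VirtualCollection:
    collection_id: str
    predicate: Callable[[InformationObject], bool]  # membership predicate

    def members(self, space: Dict[str, InformationObject]) -> List[InformationObject]:
        # Membership is evaluated at access time against the current object space.
        return [obj for obj in space.values() if self.predicate(obj)]


space: Dict[str, InformationObject] = {
    "oid-1": InformationObject("oid-1", "report.pdf", "application/pdf"),
    "oid-2": InformationObject("oid-2", "logo.png", "image/png"),
}

all_pdfs = VirtualCollection("all-pdfs", lambda obj: obj.type == "application/pdf")
before = [obj.oid for obj in all_pdfs.members(space)]

# A new PDF enters the object space; the virtual collection picks it up
# on the next access without any explicit "add" operation.
space["oid-3"] = InformationObject("oid-3", "thesis.pdf", "application/pdf")
after = [obj.oid for obj in all_pdfs.members(space)]

print(before, after)
```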

Metadata and Annotation Models

The metadata and annotation models are based upon a number of characterisations of the primitives which define the gCube storage model, namely Information Objects and directed, binary relationships between such objects. Based on such characterisations, we can define the metadata and annotation models so as to follow closely the intuition and yet preserve a degree of semi-formal rigour.

  • The primary role of a relationship is a characterisation of its intended semantics. We assume that the secondary role of a relationship is an optional specialisation of the primary role. Whenever convenient, we say that a relationship with (primary or secondary) role R is an R-relationship and that its source and target are an R-source and an R-target, respectively. If R1 and R2 are, respectively, the primary and secondary role of a relationship, then an R2-relationship is also an R1-relationship.
  • A relationship R is a dependency for its source (target) if the existence of the latter depends on the existence of at least one R-target (source). In this case, we also speak of a dependent source (dependent target). As a pragmatic corollary, an R-source (R-target) is deleted when its last R-target (R-source) is deleted.
  • We say that a relationship is exclusive for its source (target) if it cannot relate the latter to more than one target (source). Otherwise, we say that it is repeatable on its source (target).
  • Relationships with primary role is-member-of (IMO) give the semantics of collections to their targets and that of collection members to their sources.
  • We assume that IMO-relationships are repeatable on members, so that an object can be a member of more than one collection. We also assume that IMO-relationships are dependencies for their members, so that a member is deleted when its last collection is deleted. Finally, we assume that collections cannot be members of other collections in turn.
  • We expect the members of a collection to share enough similarities to be homogeneously processed, such as content formats and/or relationships. In particular, we speak of an R-collection to characterise a collection whose members are all sources (targets) of R-relationships.
  • We say that a relationship R preserves membership if it relates members of some collection C to members of the same or a different collection C'. If C is an R-collection, then we say that R is an R-mapping from C to C'.
  • Since objects can be members of multiple collections, we assume that relationships on members may hold in the scope of some but not all of those collections. Scope is most usefully and tractably modelled when it is lifted to entire collections. In particular, we say that a collection C is in the scope of a collection C' if there is an R-mapping from C to C' and there is an R-relationship between C and C'.
  • We say that an object is a document if it models intellectual content. We then say that a collection is a document collection if all its members are documents.
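The assumptions on is-member-of relationships listed above (repeatable on members, a dependency for members, no collections as members) can be illustrated with a minimal sketch. The `MembershipStore` class and its method names are hypothetical and only model membership, not the full relationship graph.

```python
from typing import Dict, Set


class MembershipStore:
    """Hypothetical bookkeeping of is-member-of (IMO) relationships."""

    def __init__(self) -> None:
        self.collections: Set[str] = set()
        # member OID -> OIDs of the collections it belongs to
        self.memberships: Dict[str, Set[str]] = {}

    def create_collection(self, collection_oid: str) -> None:
        if collection_oid in self.memberships:
            raise ValueError("an existing member cannot become a collection")
        self.collections.add(collection_oid)

    def add_member(self, member_oid: str, collection_oid: str) -> None:
        if collection_oid not in self.collections:
            raise KeyError(f"unknown collection: {collection_oid}")
        if member_oid in self.collections:
            raise ValueError("collections cannot be members of other collections")
        # Repeatable on members: one object may belong to several collections.
        self.memberships.setdefault(member_oid, set()).add(collection_oid)

    def delete_collection(self, collection_oid: str) -> Set[str]:
        """Delete a collection and return the members deleted with it."""
        self.collections.remove(collection_oid)
        deleted: Set[str] = set()
        for member_oid in list(self.memberships):
            remaining = self.memberships[member_oid]
            remaining.discard(collection_oid)
            # Dependency: a member is deleted when its last collection is deleted.
            if not remaining:
                del self.memberships[member_oid]
                deleted.add(member_oid)
        return deleted


store = MembershipStore()
store.create_collection("col-A")
store.create_collection("col-B")
store.add_member("doc-1", "col-A")
store.add_member("doc-1", "col-B")
store.add_member("doc-2", "col-A")

first_deleted = store.delete_collection("col-A")   # doc-1 survives via col-B
second_deleted = store.delete_collection("col-B")  # doc-1 loses its last collection
print(first_deleted, second_deleted)
```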

The Metadata Model

We identify a type of relationships with primary role is-Described-by (IDB) which give to targets the semantics of metadata about the corresponding sources. In particular, we say that an IDB-source is a metadata object, or simply metadata, for the corresponding IDB-target. We expect IDB-targets to be documents, but do not require it. We constrain IDB-relationships to be exclusive for their targets but repeatable for their sources. Accordingly, a metadata object can describe one and only one object, but an object can have an arbitrary number of metadata objects. We also constrain IDB-relationships to preserve membership, so that a metadata object describes a member of some collection if and only if it is a member of some other collection in turn. We say that a collection M is a metadata collection of type R for a collection C if M is an R-collection in the scope of C for some secondary role R of IDB. Specifically, (i) all the members of M are metadata for members of C, and (ii) M is metadata for C.
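The cardinality constraint stated above (a metadata object describes exactly one object, while an object may have any number of metadata objects) can be sketched as follows. The `MetadataRegistry` class and its methods are hypothetical; the sketch only checks condition (i) of the metadata collection definition and does not model the collection-level relationship of condition (ii).

```python
from typing import Dict, List, Set


class MetadataRegistry:
    """Hypothetical bookkeeping of is-described-by relationships."""

    def __init__(self) -> None:
        # metadata OID -> OID of the single object it describes
        self._describes: Dict[str, str] = {}

    def attach(self, metadata_oid: str, described_oid: str) -> None:
        existing = self._describes.get(metadata_oid)
        # A metadata object can describe one and only one object.
        if existing is not None and existing != described_oid:
            raise ValueError(f"{metadata_oid} already describes {existing}")
        self._describes[metadata_oid] = described_oid

    def metadata_for(self, described_oid: str) -> List[str]:
        # An object can have an arbitrary number of metadata objects.
        return sorted(m for m, d in self._describes.items() if d == described_oid)

    def all_members_describe(self, m_members: Set[str], c_members: Set[str]) -> bool:
        # Condition (i): every member of M is metadata for some member of C.
        return all(self._describes.get(m) in c_members for m in m_members)


registry = MetadataRegistry()
registry.attach("md-1", "doc-1")
registry.attach("md-2", "doc-1")   # a second metadata object for the same document
registry.attach("md-3", "doc-2")

try:
    registry.attach("md-1", "doc-2")   # md-1 already describes doc-1
    rejected = False
except ValueError:
    rejected = True

print(registry.metadata_for("doc-1"), rejected)
```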