Data Access and Storage Facilities

Latest revision as of 10:22, 24 July 2013

Accessing data sources for retrieval or storage purposes is a fundamental requirement for a wide range of system processes, including indexing, transfer, transformation, and presentation. Equally, it is a main driver for clients that interface with the resources managed by the system or accessible through facilities available within the system.

A large number of system components are dedicated to meeting data access requirements, including services, service plugins, client-side libraries, server-side libraries, and front-end interfaces.

This document outlines the design rationale and high-level architecture of such components.

Overview

Collectively, data access and storage components provide three key facilities:

  • the ability to store data in computational and storage resources managed by the system;
  • the ability to retrieve data available in computational and storage resources managed by the system;
  • the ability to retrieve data available in computational and storage resources outside the management regime of the system.

The facilities are provided over data with heterogeneous structure, size, and semantics:

  • from unstructured data to structured data and semi-structured data;
  • from small data sets to large and very large data sets;
  • from document data to statistical, biodiversity, and semantic data;

and in compliance with the following non-functional requirements:

  • secure access;
  • scalable and efficient access;
  • standards-based access.

In summary, the data access and storage components provide secure, scalable, efficient, standards-based storage and retrieval of data, where the data may be maintained by the system or outside the system and may vary in structure, size, and semantics.

Key Features

uniform model and access API over structured data
  • dynamically pluggable architecture of model and API transformations to and from internal and external data sources;
  • plugins for document, biodiversity, statistical, and semantic data sources, including sources with custom APIs and standards-based APIs;

fine-grained access to structured data
  • horizontal and vertical filtering based on pattern matching;
  • URI-based resolution;
  • in-place remote updates;

scalable access to structured data
  • autonomic service replication with infrastructure-wide load balancing;

efficient and scalable storage of structured data
  • based on graph database technology;

rich tooling for client and plugin development
  • high-level Java APIs for service access;
  • DSLs for pattern construction and stream manipulations;

remote viewing mechanisms over structured data
  • “passive” views based on arbitrary access filters;
  • dynamically pluggable architecture of custom view management schemes;

uniform modelling and access API over document data
  • rich descriptions of document content, metadata, annotations, parts, and alternatives;
  • transformations from model and API of key document sources, including OAI (http://www.openarchives.org/) providers;
  • high-level client APIs for model construction and remote access;

uniform modelling and access API over semantic data
  • tree-views over RDF graph data;
  • transformations from model and API of key semantic data sources, including SPARQL (http://www.w3.org/TR/rdf-sparql-query/) endpoints;

uniform modelling and access over biodiversity data
  • access API tailored to biodiversity data sources;
  • dynamically pluggable architecture of transformations from external sources of biodiversity data;
  • plugins for key biodiversity data sources, including OBIS (http://www.iobis.org), GBIF (http://www.gbif.org), and Catalogue of Life (http://www.catalogueoflife.org);

efficient and scalable storage of files
  • unstructured storage back-end based on MongoDB (http://www.mongodb.org), providing replication and high availability, automatic balancing for changes in load and data distribution, no single points of failure, and data sharding;
  • no intrinsic upper bound on file size;

standards-based and structured storage of files
  • POSIX-like client API;
  • support for hierarchical folder structures;
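
As an illustration of what "horizontal and vertical filtering based on pattern matching" can mean for tree-structured data, the following is a minimal sketch in Python. It is not the gCube API (which is Java-based and not reproduced here); the names `prune` and `filter_trees`, the dictionary-based tree model, and the pattern syntax are all assumptions made for this example. A pattern is assumed to select which trees are returned (horizontal filtering) and which parts of each matching tree are kept (vertical filtering).

```python
from typing import Any, Dict, List, Optional

# Hypothetical tree model: nested dictionaries with plain values at the leaves.
Tree = Dict[str, Any]
# Hypothetical pattern model: each key maps to True (any value), a predicate
# on the value, or a nested pattern for a subtree.
Pattern = Dict[str, Any]


def prune(tree: Tree, pattern: Pattern) -> Optional[Tree]:
    """Return the part of `tree` named by `pattern`, or None if it does not match."""
    kept: Tree = {}
    for key, expected in pattern.items():
        if key not in tree:
            return None  # a required field is missing: the tree does not match
        value = tree[key]
        if isinstance(expected, dict):
            if not isinstance(value, dict):
                return None
            subtree = prune(value, expected)
            if subtree is None:
                return None
            kept[key] = subtree
        elif callable(expected):
            if not expected(value):
                return None  # predicate on the value failed
            kept[key] = value
        else:
            kept[key] = value  # wildcard: keep whatever value is there
    return kept


def filter_trees(trees: List[Tree], pattern: Pattern) -> List[Tree]:
    """Drop non-matching trees (horizontal) and prune matching ones (vertical)."""
    pruned = (prune(tree, pattern) for tree in trees)
    return [tree for tree in pruned if tree is not None]


documents = [
    {"title": "A survey", "year": 2010, "meta": {"lang": "en", "pages": 12}},
    {"title": "Una rassegna", "year": 2005, "meta": {"lang": "it", "pages": 9}},
]
english_titles = {"title": True, "meta": {"lang": lambda v: v == "en"}}

result = filter_trees(documents, english_titles)
print(result)  # [{'title': 'A survey', 'meta': {'lang': 'en'}}]
```

In this sketch a single pattern drives both kinds of filtering, so a client only describes the shape of the data it wants and receives nothing else.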

Subsystems

Data access and storage components cluster within the following subsystems, where each subsystem specialises along the structure or the semantics of the data:

the Tree-Based Access subsystem
  groups components that implement access and storage facilities for structured data of arbitrary semantics, origin, and size.
  The subsystem focuses on uniform access to structured data for cross-domain processes, including but not limited to system processes.
  A subset of the components builds on the generic interface and specialises it to data with domain-specific semantics, including document data and semantic data.

the Biodiversity Access subsystem
  groups components that implement access and storage facilities for structured data with biodiversity semantics and arbitrary origin and size.
  The subsystem focuses on uniform access to biodiversity data for domain-specific processes and feeds into the Tree-Based Access subsystem.

the File-Based Access subsystem
  groups components that implement access and storage facilities for unstructured data with arbitrary semantics and size.
  The subsystem focuses on storage and retrieval of bytestreams for arbitrary processes, including but not limited to system processes.
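
The idea of source-specific components feeding a generic, tree-based interface can be sketched as a small plugin registry. This is an illustrative Python sketch under stated assumptions, not a description of the actual gCube components: `SourcePlugin`, `InMemoryTaxonPlugin`, `PluginRegistry`, and the shape of the produced trees are all hypothetical names invented for this example.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator, List, Tuple

Tree = Dict[str, Any]  # hypothetical uniform model: nested dictionaries


class SourcePlugin(ABC):
    """Hypothetical contract: expose one data source as a stream of trees."""

    @abstractmethod
    def read(self) -> Iterator[Tree]:
        ...


class InMemoryTaxonPlugin(SourcePlugin):
    """Toy biodiversity source: turns (name, rank) rows into uniform trees."""

    def __init__(self, rows: List[Tuple[str, str]]) -> None:
        self._rows = rows

    def read(self) -> Iterator[Tree]:
        for name, rank in self._rows:
            yield {"taxon": {"name": name, "rank": rank}}


class PluginRegistry:
    """Plugins can be added or removed at runtime, keyed by a source identifier."""

    def __init__(self) -> None:
        self._plugins: Dict[str, SourcePlugin] = {}

    def register(self, source_id: str, plugin: SourcePlugin) -> None:
        self._plugins[source_id] = plugin

    def unregister(self, source_id: str) -> None:
        self._plugins.pop(source_id, None)

    def read(self, source_id: str) -> Iterator[Tree]:
        # Clients see the same tree model regardless of the underlying source.
        return self._plugins[source_id].read()


registry = PluginRegistry()
registry.register("demo-taxa", InMemoryTaxonPlugin([("Thunnus albacares", "species")]))
trees = list(registry.read("demo-taxa"))
print(trees)  # [{'taxon': {'name': 'Thunnus albacares', 'rank': 'species'}}]
```

The point of the sketch is the separation of concerns: each plugin knows one source's model and API, while clients depend only on the uniform tree model.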