Data Access and Storage Facilities

From Gcube Wiki

Revision as of 14:48, 22 February 2012

Accessing data sources for retrieval or storage purposes is a fundamental requirement for a wide range of system processes, including indexing, transfer, transformation, and presentation. Equally, it is a main driver for clients that interface with the resources managed by the system.

A large number of system components are dedicated to meeting data access requirements, including services, service plugins, client-side libraries, server-side libraries, and front-end interfaces.

This document outlines the rationale and high-level architecture of such components.

Overview

Collectively, data access components provide three key facilities:

  • the ability to store data in resources managed by the system;
  • the ability to access data that is stored in resources managed by the system;
  • the ability to access data that is stored in resources managed externally to the system;

The facilities are provided over data with heterogeneous structure, size, and semantics:

  • from unstructured data to structured data and semi-structured data;
  • from small data sets to large and very large data sets;
  • from document data to statistical, biodiversity, and semantic data;

and in compliance with the following non-functional requirements:

  • the requirement of secure access;
  • the requirement of scalable and efficient access;
  • the requirement of standards-based access;

In summary, the data access components provide secure, scalable, efficient, standards-based storage and retrieval of data, where the data may be maintained by the system or outside the system and may vary in structure, size, and semantics.

Key Features

uniform model and access API over structured data
  • dynamically pluggable architecture of transformations to and from internal and external data sources;
  • standards-based plugins for document, biodiversity, statistical, and semantic data sources;

fine-grained access to structured data
  • horizontal and vertical filtering based on pattern matching;
  • URI-based resolution;
  • in-place remote updates;

scalable access to structured data
  • autonomic service replication with infrastructure-wide load balancing;

efficient and scalable storage of structured data
  • based on graph database technology;

rich tooling for client and plugin development
  • high-level Java APIs for service access;
  • DSLs for pattern construction and stream manipulations;

TODO features for biodiversity data
  • Lucio, please add with respect to the Species Service

TODO features for file access and storage
  • Roberto, please add with respect to the File Storage API
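
As an illustration of the fine-grained access feature listed above, the following minimal Java sketch shows one way pattern-based filtering over labelled trees could look. It is not the gCube API: the names LabelledNode, TreeFilters, selectTrees, and prune are hypothetical, and the reading of "horizontal" filtering as selecting whole trees and "vertical" filtering as pruning nodes inside a tree is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical labelled-tree node; an assumption for illustration only.
final class LabelledNode {
    final String label;
    final String value;                 // null for inner nodes
    final List<LabelledNode> children;

    LabelledNode(String label, String value, List<LabelledNode> children) {
        this.label = label;
        this.value = value;
        this.children = children;
    }

    static LabelledNode leaf(String label, String value) {
        return new LabelledNode(label, value, List.of());
    }

    static LabelledNode inner(String label, LabelledNode... children) {
        return new LabelledNode(label, null, List.of(children));
    }
}

final class TreeFilters {
    // "Horizontal" filtering (assumed meaning): keep only the trees that
    // have at least one direct child matching the given pattern.
    static List<LabelledNode> selectTrees(List<LabelledNode> trees,
                                          Predicate<LabelledNode> childPattern) {
        return trees.stream()
                .filter(t -> t.children.stream().anyMatch(childPattern))
                .collect(Collectors.toList());
    }

    // "Vertical" filtering (assumed meaning): return a copy of the tree in
    // which only descendants with a retained label survive, recursively.
    static LabelledNode prune(LabelledNode tree, Predicate<String> keepLabel) {
        List<LabelledNode> kept = new ArrayList<>();
        for (LabelledNode child : tree.children) {
            if (keepLabel.test(child.label)) {
                kept.add(prune(child, keepLabel));
            }
        }
        return new LabelledNode(tree.label, tree.value, kept);
    }
}
```

With two document trees that differ in a "lang" leaf, selectTrees with a pattern on that leaf returns only the matching tree, while prune with a label predicate returns a reduced copy and leaves the original tree untouched.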

Subsystems

Data access components cluster within the following subsystems:

the Tree-Based Data Access subsystem
  • groups components that implement access and storage facilities for structured data of arbitrary semantics, origin, and size, based on a uniform API of CRUD operations and a uniform data model of labelled trees;

the Biodiversity Data Access subsystem
  • groups components that implement access and storage facilities for structured data with biodiversity semantics and arbitrary origin and size;

the File-Based Access subsystem
  • groups components that implement access and storage facilities for unstructured data with arbitrary semantics and size;
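
To make the idea of a uniform API of CRUD operations over labelled trees more concrete, here is a small, hedged Java sketch. The types Tree, TreeSource, and InMemoryTreeSource are hypothetical names chosen for this example and do not reflect the actual interfaces of the Tree-Based Data Access subsystem; the in-memory map merely stands in for whatever internal or external data source a real plugin would wrap.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

// Hypothetical labelled tree: a label, an optional value, and child trees.
final class Tree {
    final String label;
    final String value;          // null for inner nodes
    final List<Tree> children;

    Tree(String label, String value, Tree... children) {
        this.label = label;
        this.value = value;
        this.children = List.of(children);
    }
}

// Hypothetical uniform CRUD contract, independent of the backing source.
interface TreeSource {
    String create(Tree tree);                     // returns a new identifier
    Optional<Tree> read(String id);
    boolean update(String id, Tree replacement);  // false if id is unknown
    boolean delete(String id);                    // false if id is unknown
}

// In-memory stand-in for a real data source (assumption for illustration).
final class InMemoryTreeSource implements TreeSource {
    private final Map<String, Tree> trees = new HashMap<>();
    private int nextId = 0;

    @Override
    public String create(Tree tree) {
        Objects.requireNonNull(tree, "tree");
        String id = "tree-" + (nextId++);
        trees.put(id, tree);
        return id;
    }

    @Override
    public Optional<Tree> read(String id) {
        return Optional.ofNullable(trees.get(id));
    }

    @Override
    public boolean update(String id, Tree replacement) {
        Objects.requireNonNull(replacement, "replacement");
        // Map.replace only writes when the key is already mapped.
        return trees.replace(id, replacement) != null;
    }

    @Override
    public boolean delete(String id) {
        return trees.remove(id) != null;
    }
}
```

The point of the sketch is the design choice rather than the storage: clients program against one contract, and each data source supplies its own implementation behind it.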