Difference between revisions of "Data Access and Storage Facilities"
Lucio.lelii (Talk | contribs) m (→Key Features) |
Latest revision as of 09:22, 24 July 2013
Accessing data sources for retrieval or storage purposes is a fundamental requirement for a wide range of system processes, including indexing, transfer, transformation, and presentation. Equally, it is a main driver for clients that interface with the resources managed by the system or accessible through facilities available within the system.
A large number of system components are dedicated to meeting data access requirements, including services, service plugins, client-side libraries, server-side libraries, and front-end interfaces.
This document outlines the design rationale and high-level architecture of such components.
Overview
Collectively, data access and storage components provide three key facilities:
- the ability to store data in computational and storage resources managed by the system;
- the ability to retrieve data available in computational and storage resources managed by the system;
- the ability to retrieve data available in computational and storage resources outside the management regime of the system.
The facilities are provided over data with heterogeneous structure, size, and semantics:
- from unstructured data to structured data and semi-structured data;
- from small data sets to large and very large data sets;
- from document data to statistical, biodiversity, and semantic data;
and in compliance with the following non-functional requirements:
- the requirement of secure access;
- the requirement of scalable and efficient access;
- the requirement of standards-based access.
In summary, the data access and storage components provide secure, scalable, efficient, standards-based storage and retrieval of data, where the data may be maintained by the system or outside the system and may vary in structure, size, and semantics.
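The three facilities above can be pictured as two small capabilities: managed resources support both storage and retrieval, while resources outside the system's management support retrieval only. The following is a minimal sketch under that reading. All names (`Retrievable`, `Storable`, `ManagedSource`, `ExternalSource`, `copy_item`) are hypothetical and are not part of the actual component APIs.

```python
from typing import Callable, Dict, Protocol


class Retrievable(Protocol):
    """Anything data can be read from (hypothetical interface)."""

    def retrieve(self, key: str) -> bytes: ...


class Storable(Protocol):
    """Anything data can be written to (hypothetical interface)."""

    def store(self, key: str, data: bytes) -> None: ...


class ManagedSource:
    """Stand-in for a resource managed by the system: store and retrieve."""

    def __init__(self) -> None:
        self._items: Dict[str, bytes] = {}

    def store(self, key: str, data: bytes) -> None:
        self._items[key] = data

    def retrieve(self, key: str) -> bytes:
        return self._items[key]


class ExternalSource:
    """Stand-in for a resource outside the system's management: retrieve only."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        # `fetch` is an assumed callback that knows how to reach the source.
        self._fetch = fetch

    def retrieve(self, key: str) -> bytes:
        return self._fetch(key)


def copy_item(key: str, source: Retrievable, target: Storable) -> None:
    """Read one item from any retrievable source and store it in a managed one."""
    target.store(key, source.retrieve(key))
```

A client written against `Retrievable` works the same way whether the data is held by the system or outside it, which is the point of offering uniform access.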
Key Features
- uniform model and access API over structured data
  - dynamically pluggable architecture of model and API transformations to and from internal and external data sources;
  - plugins for document, biodiversity, statistical, and semantic data sources, including sources with custom APIs and standards-based APIs;
- fine-grained access to structured data
  - horizontal and vertical filtering based on pattern matching;
  - URI-based resolution;
  - in-place remote updates;
- scalable access to structured data
  - autonomic service replication with infrastructure-wide load balancing;
- efficient and scalable storage of structured data
  - based on graph database technology;
- rich tooling for client and plugin development
  - high-level Java APIs for service access;
  - DSLs for pattern construction and stream manipulations;
- remote viewing mechanisms over structured data
  - “passive” views based on arbitrary access filters;
  - dynamically pluggable architecture of custom view management schemes;
- uniform modelling and access API over document data
  - rich descriptions of document content, metadata, annotations, parts, and alternatives;
  - transformations from model and API of key document sources, including OAI (http://www.openarchives.org/) providers;
  - high-level client APIs for model construction and remote access;
- uniform modelling and access API over semantic data
  - tree-views over RDF graph data;
  - transformations from model and API of key semantic data sources, including SPARQL (http://www.w3.org/TR/rdf-sparql-query/) endpoints;
- uniform modelling and access over biodiversity data
  - access API tailored to biodiversity data sources;
  - dynamically pluggable architecture of transformations from external sources of biodiversity data;
  - plugins for key biodiversity data sources, including OBIS (http://www.iobis.org), GBIF (http://www.gbif.org), and Catalogue of Life (http://www.catalogueoflife.org);
- efficient and scalable storage of files
  - unstructured storage back-end based on MongoDB (http://www.mongodb.org), providing replication and high availability, automatic balancing in response to changes in load and data distribution, no single point of failure, and data sharding;
  - no intrinsic upper bound on file size;
- standards-based and structured storage of files
  - POSIX-like client API;
  - support for hierarchical folder structures;
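As an illustration of the pattern-based filtering listed above, here is a minimal sketch over trees modelled as nested dictionaries. It assumes that "horizontal" filtering drops whole trees that do not match a pattern and that "vertical" filtering prunes each matching tree down to the parts the pattern names; the source does not define these terms, so this reading is an assumption. The pattern format and the functions `prune` and `filter_trees` are hypothetical and do not reflect the actual pattern DSL.

```python
from typing import Any, Dict, Iterable, Iterator, Optional


def prune(tree: Dict[str, Any], pattern: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the part of `tree` selected by `pattern`, or None if it does not match.

    Pattern values (assumed convention for this sketch):
      - a dict     : the field must be a subtree that matches recursively
      - a callable : the field must be a leaf for which the predicate holds
      - None       : the field must exist; its value is kept unchanged
    """
    result: Dict[str, Any] = {}
    for key, sub in pattern.items():
        if key not in tree:
            return None  # a field required by the pattern is missing
        value = tree[key]
        if isinstance(sub, dict):
            if not isinstance(value, dict):
                return None
            pruned = prune(value, sub)
            if pruned is None:
                return None
            result[key] = pruned
        elif callable(sub):
            if isinstance(value, dict) or not sub(value):
                return None
            result[key] = value
        else:
            result[key] = value
    return result


def filter_trees(
    trees: Iterable[Dict[str, Any]], pattern: Dict[str, Any]
) -> Iterator[Dict[str, Any]]:
    """Horizontal step: skip non-matching trees. Vertical step: yield pruned trees."""
    for tree in trees:
        pruned = prune(tree, pattern)
        if pruned is not None:
            yield pruned
```

For example, the pattern `{"title": None, "year": lambda y: y >= 2010, "meta": {"lang": None}}` would keep only records from 2010 onwards and return just their title, year, and language fields.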
Subsystems
Data access and storage components cluster within the following subsystems, where each subsystem specialises along the structure or the semantics of the data:
- the Tree-Based Access subsystem
  - groups components that implement access and storage facilities for structured data of arbitrary semantics, origin, and size.
  - The subsystem focuses on uniform access to structured data for cross-domain processes, including but not limited to system processes.
  - A subset of the components builds on the generic interface and specialises it to data with domain-specific semantics, including document data and semantic data.
- the Biodiversity Access subsystem
  - groups components that implement access and storage facilities for structured data with biodiversity semantics and arbitrary origin and size.
  - The subsystem focuses on uniform access to biodiversity data for domain-specific processes and feeds into the Tree-Based Access subsystem.
- the File-Based Access subsystem
  - groups components that implement access and storage facilities for unstructured data with arbitrary semantics and size.
  - The subsystem focuses on storage and retrieval of bytestreams for arbitrary processes, including but not limited to system processes.
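To make the File-Based Access idea more concrete, the sketch below shows one way a folder hierarchy can be layered over a flat store of bytestreams. It is an in-memory stand-in only: the class `FolderStore` and its methods are hypothetical, and nothing here reflects the actual client API or the storage back-end.

```python
import posixpath
from typing import Dict, List, Set


class FolderStore:
    """Hypothetical in-memory file store with POSIX-style paths and folders."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}   # absolute path -> content
        self._folders: Set[str] = {"/"}      # known folder paths

    @staticmethod
    def _norm(path: str) -> str:
        # Make the path absolute and collapse ".", ".." and repeated slashes.
        return posixpath.normpath("/" + path.lstrip("/"))

    def mkdirs(self, path: str) -> None:
        """Create a folder and any missing ancestors."""
        path = self._norm(path)
        while path not in self._folders:
            self._folders.add(path)
            path = posixpath.dirname(path)

    def put(self, path: str, data: bytes) -> None:
        """Store a bytestream, creating parent folders as needed."""
        path = self._norm(path)
        self.mkdirs(posixpath.dirname(path))
        self._files[path] = bytes(data)

    def get(self, path: str) -> bytes:
        """Return the stored bytestream; raises KeyError if it is absent."""
        return self._files[self._norm(path)]

    def listdir(self, folder: str) -> List[str]:
        """List the names of files and folders directly inside `folder`."""
        folder = self._norm(folder)
        if folder not in self._folders:
            raise FileNotFoundError(folder)
        entries = self._folders | set(self._files)
        return sorted(
            posixpath.basename(p)
            for p in entries
            if p != folder and posixpath.dirname(p) == folder
        )
```

Keeping folder membership separate from file content means the content store itself can stay flat, which is one plausible way to combine hierarchical naming with a back-end that has no notion of directories.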