Difference between revisions of "Data Access and Storage Facilities"
Latest revision as of 09:22, 24 July 2013
Accessing data sources for retrieval or storage purposes is a fundamental requirement for a wide range of system processes, including indexing, transfer, transformation, and presentation. Equally, it is a main driver for clients that interface with the resources managed by the system or accessible through facilities available within the system.
A large number of system components are dedicated to meeting data access requirements, including services, service plugins, client-side libraries, server-side libraries, and front-end interfaces.
This document outlines the design rationale and high-level architecture of such components.
Overview
Collectively, data access and storage components provide three key facilities:
- the ability to store data in computational and storage resources managed by the system;
- the ability to retrieve data available in computational and storage resources managed by the system;
- the ability to retrieve data available in computational and storage resources outside the management regime of the system.
The facilities are provided over data with heterogeneous structure, size, and semantics:
- from unstructured data to semi-structured and structured data;
- from small data sets to large and very large data sets;
- from document data to statistical, biodiversity, and semantic data;
and in compliance with the following non-functional requirements:
- the requirement of secure access;
- the requirement of scalable and efficient access;
- the requirement of standards-based access.
In summary, the data access and storage components provide secure, scalable, efficient, standards-based storage and retrieval of data, where the data may be maintained by the system or outside the system and may vary in structure, size, and semantics.
Key Features
- uniform model and access API over structured data
  - dynamically pluggable architecture of model and API transformations to and from internal and external data sources;
  - plugins for document, biodiversity, statistical, and semantic data sources, including sources with custom APIs and standards-based APIs;
- fine-grained access to structured data
  - horizontal and vertical filtering based on pattern matching;
  - URI-based resolution;
  - in-place remote updates;
- scalable access to structured data
  - autonomic service replication with infrastructure-wide load balancing;
- efficient and scalable storage of structured data
  - based on graph database technology;
- rich tooling for client and plugin development
  - high-level Java APIs for service access;
  - DSLs for pattern construction and stream manipulations;
- remote viewing mechanisms over structured data
  - "passive" views based on arbitrary access filters;
  - dynamically pluggable architecture of custom view management schemes;
- uniform modelling and access API over document data
  - rich descriptions of document content, metadata, annotations, parts, and alternatives;
  - transformations from model and API of key document sources, including OAI (http://www.openarchives.org/) providers;
  - high-level client APIs for model construction and remote access;
- uniform modelling and access API over semantic data
  - tree-views over RDF graph data;
  - transformations from model and API of key semantic data sources, including SPARQL (http://www.w3.org/TR/rdf-sparql-query/) endpoints;
- uniform modelling and access over biodiversity data
  - access API tailored to biodiversity data sources;
  - dynamically pluggable architecture of transformations from external sources of biodiversity data;
  - plugins for key biodiversity data sources, including OBIS (http://www.iobis.org), GBIF (http://www.gbif.org), and Catalogue of Life (http://www.catalogueoflife.org);
- efficient and scalable storage of files
  - unstructured storage back-end based on MongoDB (http://www.mongodb.org), which provides replication and high availability, automatic balancing in response to changes in load and data distribution, data sharding, and no single point of failure;
  - no intrinsic upper bound on file size;
- standards-based and structured storage of files
  - POSIX-like client API;
  - support for hierarchical folder structures.
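The "horizontal and vertical filtering based on pattern matching" listed above can be illustrated with a small sketch. It assumes one plausible reading of the terms: a pattern mirrors the shape of a tree, trees that do not match are dropped (horizontal filtering), and matching trees are pruned to the edges the pattern names (vertical filtering). All types and method names below (`TreeFilterSketch`, `Node`, `Pattern`, `text`, `tree`, `filter`) are hypothetical and are not taken from the gCube libraries. The sketch needs Java 9 or later.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch: none of these types belong to the gCube libraries.
final class TreeFilterSketch {

    /** A node is either a leaf holding a text value or an inner node with labelled children. */
    static final class Node {
        final String value;               // non-null for leaves only
        final Map<String, Node> children; // empty for leaves

        private Node(String value, Map<String, Node> children) {
            this.value = value;
            this.children = children;
        }
        static Node leaf(String value) { return new Node(value, Map.of()); }
        static Node inner(Map<String, Node> children) {
            return new Node(null, new LinkedHashMap<>(children)); // defensive copy
        }
        boolean isLeaf() { return value != null; }
    }

    /** Returns the pruned node if the pattern matches, or an empty Optional if it does not. */
    interface Pattern {
        Optional<Node> prune(Node node);
    }

    /** Leaf pattern: matches a leaf whose value satisfies the condition. */
    static Pattern text(Predicate<String> condition) {
        return node -> node.isLeaf() && condition.test(node.value)
                ? Optional.of(node)
                : Optional.empty();
    }

    /** Tree pattern: every named edge must exist and match; unnamed edges are pruned away. */
    static Pattern tree(Map<String, Pattern> edges) {
        return node -> {
            if (node.isLeaf()) return Optional.empty();
            Map<String, Node> kept = new LinkedHashMap<>();
            for (Map.Entry<String, Pattern> edge : edges.entrySet()) {
                Node child = node.children.get(edge.getKey());
                if (child == null) return Optional.empty();
                Optional<Node> pruned = edge.getValue().prune(child);
                if (!pruned.isPresent()) return Optional.empty();
                kept.put(edge.getKey(), pruned.get());
            }
            return Optional.of(Node.inner(kept));
        };
    }

    /** Horizontal filtering drops non-matching trees; vertical filtering prunes the rest. */
    static List<Node> filter(List<Node> trees, Pattern pattern) {
        return trees.stream()
                .map(pattern::prune)
                .flatMap(Optional::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Node english = Node.inner(Map.of(
                "title", Node.leaf("Tuna catch"),
                "lang", Node.leaf("en"),
                "body", Node.leaf("long text")));
        Node french = Node.inner(Map.of(
                "title", Node.leaf("Prises de thon"),
                "lang", Node.leaf("fr")));
        Pattern englishTitles = tree(Map.of(
                "title", text(s -> true),
                "lang", text("en"::equals)));
        for (Node match : filter(List.of(english, french), englishTitles)) {
            System.out.println(match.children.keySet());
        }
    }
}
```

In this sketch the pattern selects only trees whose `lang` leaf equals `en` and returns them without the `body` edge, so a client receives fewer trees and smaller trees than the source holds. A real pattern language would also need optional edges, repeated edges, and cardinality constraints, which are left out here.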
Subsystems
Data access and storage components cluster within the following subsystems, where each subsystem specialises along the structure or the semantics of the data:
- the Tree-Based Access subsystem
  - groups components that implement access and storage facilities for structured data of arbitrary semantics, origin, and size.
  - The subsystem focuses on uniform access to structured data for cross-domain processes, including but not limited to system processes.
  - A subset of the components builds on the generic interface and specialises it for data with domain-specific semantics, including document data and semantic data.
- the Biodiversity Access subsystem
  - groups components that implement access and storage facilities for structured data with biodiversity semantics and arbitrary origin and size.
  - The subsystem focuses on uniform access to biodiversity data for domain-specific processes and feeds into the Tree-Based Access subsystem.
- the File-Based Access subsystem
  - groups components that implement access and storage facilities for unstructured data with arbitrary semantics and size.
  - The subsystem focuses on storage and retrieval of bytestreams for arbitrary processes, including but not limited to system processes.
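To make the File-Based Access description more concrete, the sketch below shows one way a POSIX-like client API with hierarchical folders could sit on top of a flat store of bytestreams. It is an in-memory stand-in written for illustration only: the class `FolderStoreSketch` and its methods `write`, `read`, and `list` are hypothetical names, and nothing here reflects the actual gCube storage client or its MongoDB back-end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: an in-memory stand-in, not the gCube storage client.
final class FolderStoreSketch {

    // Flat key space: absolute paths such as "/reports/2013/catch.csv" map to bytestreams.
    private final SortedMap<String, byte[]> files = new TreeMap<>();

    /** Stores a bytestream under an absolute file path; folders exist implicitly. */
    void write(String path, byte[] content) {
        String key = normalise(path);
        if (key.endsWith("/")) {
            throw new IllegalArgumentException("a file path must not end with '/': " + path);
        }
        files.put(key, content.clone());
    }

    /** Returns a copy of the bytestream stored under the path. */
    byte[] read(String path) {
        byte[] content = files.get(normalise(path));
        if (content == null) throw new NoSuchElementException("no such file: " + path);
        return content.clone();
    }

    /** Lists the direct children of a folder; sub-folders carry a trailing slash. */
    List<String> list(String folder) {
        String prefix = normalise(folder);
        if (!prefix.endsWith("/")) prefix += "/";
        List<String> names = new ArrayList<>();
        // Keys sharing the prefix are contiguous in the sorted map, starting at the prefix.
        for (String path : files.tailMap(prefix).keySet()) {
            if (!path.startsWith(prefix)) break;
            String rest = path.substring(prefix.length());
            int slash = rest.indexOf('/');
            String name = slash < 0 ? rest : rest.substring(0, slash + 1);
            if (!names.contains(name)) names.add(name);
        }
        return names;
    }

    /** Requires an absolute path and collapses repeated slashes. */
    private static String normalise(String path) {
        if (!path.startsWith("/")) {
            throw new IllegalArgumentException("path must be absolute: " + path);
        }
        return path.replaceAll("/{2,}", "/");
    }

    public static void main(String[] args) {
        FolderStoreSketch store = new FolderStoreSketch();
        store.write("/reports/2013/catch.csv", "species,tonnes".getBytes(StandardCharsets.UTF_8));
        store.write("/reports/readme.txt", "notes".getBytes(StandardCharsets.UTF_8));
        System.out.println(store.list("/reports"));
        System.out.println(new String(store.read("/reports/readme.txt"), StandardCharsets.UTF_8));
    }
}
```

The design choice illustrated is that the folder hierarchy is derived from path prefixes rather than stored as separate folder objects, so the underlying store only ever sees keys and bytestreams. Whether the real subsystem models folders this way is an assumption of the sketch.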