Data Access and Storage Facilities
Accessing data sources for retrieval or storage is a fundamental requirement for a wide range of system processes, including indexing, transfer, transformation, and presentation. Equally, it is a main driver for clients that interface with the resources managed by the system.
A large number of system components are dedicated to meeting data access requirements, including services, service plugins, client-side libraries, server-side libraries, and front-end interfaces.
This document outlines the rationale and high-level architecture of such components.
Overview
Collectively, data access components provide three key facilities:
- the ability to store data in resources managed by the system;
- the ability to access data that is stored in resources managed by the system;
- the ability to access data that is stored in resources managed externally to the system.
The facilities are provided over data with heterogeneous structure, size, and semantics:
- from unstructured data to semi-structured and structured data;
- from small data sets to large and very large data sets;
- from document data to statistical, biodiversity, and semantic data;
and in compliance with the following non-functional requirements:
- the requirement of secure access;
- the requirement of scalable and efficient access;
- the requirement of standards-based access.
In summary, the data access components provide secure, scalable, efficient, standards-based storage and retrieval of data, where the data may be maintained by the system or outside the system and may vary in structure, size, and semantics.
Key Features
- uniform model and access API over structured data
  - dynamically pluggable architecture of transformations to and from internal and external data sources;
  - standards-based plugins for document, biodiversity, statistical, and semantic data sources;
- fine-grained access to structured data
  - horizontal and vertical filtering based on pattern matching;
  - URI-based resolution;
  - in-place remote updates;
- scalable access to structured data
  - autonomic service replication with infrastructure-wide load balancing;
- efficient and scalable storage of structured data
  - based on graph database technology;
- rich tooling for client and plugin development
  - high-level Java APIs for service access;
  - DSLs for pattern construction and stream manipulation;
- TODO features for biodiversity data
  - Lucio, please add with respect to the Species Service
- TODO features for file access and storage
  - Roberto, please add with respect to the File Storage API
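As an illustration of the "horizontal and vertical filtering based on pattern matching" feature listed above, the following is a minimal sketch in Java. It is not the system's API: the class `TreeFilterSketch`, the `Node` type, and the methods `filterHorizontally`, `filterVertically`, and `hasChild` are all hypothetical names introduced here. The sketch also assumes one reading of the terms, by analogy with row selection and column projection in databases: horizontal filtering selects which trees are returned, and vertical filtering prunes the branches inside each returned tree.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch: none of these names come from the system's actual API.
final class TreeFilterSketch {

    // A labelled tree node: inner nodes carry children, leaves carry a value.
    static final class Node {
        final String label;
        final String value;          // null for inner nodes
        final List<Node> children;

        Node(String label, String value, List<Node> children) {
            this.label = label;
            this.value = value;
            this.children = children;
        }

        static Node leaf(String label, String value) {
            return new Node(label, value, Collections.<Node>emptyList());
        }

        static Node inner(String label, Node... children) {
            return new Node(label, null, Arrays.asList(children));
        }
    }

    // Horizontal filtering (assumed meaning): keep only the trees that match the pattern.
    static List<Node> filterHorizontally(List<Node> trees, Predicate<Node> pattern) {
        return trees.stream().filter(pattern).collect(Collectors.toList());
    }

    // Vertical filtering (assumed meaning): keep the root, prune child branches
    // whose label is not accepted.
    static Node filterVertically(Node tree, Predicate<String> keepLabel) {
        List<Node> kept = tree.children.stream()
                .filter(child -> keepLabel.test(child.label))
                .collect(Collectors.toList());
        return new Node(tree.label, tree.value, kept);
    }

    // Pattern helper: matches trees with a direct child leaf of this label and value.
    static Predicate<Node> hasChild(String label, String value) {
        return tree -> tree.children.stream()
                .anyMatch(c -> c.label.equals(label) && value.equals(c.value));
    }

    public static void main(String[] args) {
        List<Node> trees = Arrays.asList(
                Node.inner("record",
                        Node.leaf("type", "species"),
                        Node.leaf("name", "Thunnus albacares"),
                        Node.leaf("note", "draft")),
                Node.inner("record",
                        Node.leaf("type", "document"),
                        Node.leaf("name", "Annual report")));

        // Select the species records, then keep only their "name" branch.
        for (Node tree : filterHorizontally(trees, hasChild("type", "species"))) {
            Node pruned = filterVertically(tree, label -> label.equals("name"));
            System.out.println(pruned.children.get(0).value); // prints "Thunnus albacares"
        }
    }
}
```

Expressing patterns as composable predicates is only one possible design. A pattern DSL such as the one mentioned among the tooling features would presumably offer a richer, declarative way to build the same kind of filters.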
Subsystems
Data access components cluster within the following subsystems:
- the Tree-Based Data Access subsystem
  - groups components that implement access and storage facilities for structured data of arbitrary semantics, origin, and size, based on a uniform API of CRUD operations and a uniform data model of labelled trees;
- the Biodiversity Data Access subsystem
  - groups components that implement access and storage facilities for structured data with biodiversity semantics and arbitrary origin and size;
- the File-Based Access subsystem
  - groups components that implement access and storage facilities for unstructured data with arbitrary semantics and size.
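To make the idea of "a uniform API of CRUD operations and a uniform data model of labelled trees" more concrete, here is a minimal in-memory sketch in Java. Everything in it is an assumption made for illustration: the names `TreeStoreSketch`, `Node`, and `TreeStore`, the method signatures, and the identifier scheme are hypothetical and do not reflect the actual interfaces of the Tree-Based Data Access subsystem.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

// Hypothetical sketch: names and signatures are illustrative, not the system's API.
final class TreeStoreSketch {

    // A labelled tree node: inner nodes carry children, leaves carry a value.
    static final class Node {
        final String label;
        final String value;          // null for inner nodes
        final List<Node> children;

        Node(String label, String value, List<Node> children) {
            this.label = label;
            this.value = value;
            this.children = children;
        }

        static Node leaf(String label, String value) {
            return new Node(label, value, Collections.<Node>emptyList());
        }

        static Node inner(String label, Node... children) {
            return new Node(label, null, Arrays.asList(children));
        }
    }

    // In-memory stand-in for a data source exposing the four CRUD operations.
    static final class TreeStore {
        private final Map<String, Node> trees = new LinkedHashMap<>();
        private int nextId = 1;

        // Create: store a tree and return its generated identifier.
        String create(Node tree) {
            String id = "tree-" + nextId++;
            trees.put(id, Objects.requireNonNull(tree));
            return id;
        }

        // Read: look a tree up by identifier.
        Optional<Node> read(String id) {
            return Optional.ofNullable(trees.get(id));
        }

        // Update: replace an existing tree; returns false if the identifier is unknown.
        boolean update(String id, Node replacement) {
            return trees.replace(id, Objects.requireNonNull(replacement)) != null;
        }

        // Delete: remove a tree; returns false if the identifier is unknown.
        boolean delete(String id) {
            return trees.remove(id) != null;
        }
    }

    public static void main(String[] args) {
        TreeStore store = new TreeStore();
        String id = store.create(Node.inner("document", Node.leaf("title", "Draft")));
        store.update(id, Node.inner("document", Node.leaf("title", "Final")));
        System.out.println(store.read(id).get().children.get(0).value); // prints "Final"
        System.out.println(store.delete(id));                           // prints "true"
    }
}
```

Because every record is a tree of labelled nodes, the same four operations can in principle serve document, statistical, or biodiversity data alike. In this sketch, only the labels and values differ between data of different semantics.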