Data Management Software Consolidated Specifications


Overview

This page gives an overview of the components and facilities provided by the gCube Data Management Software, with links to the software specifications and to the Developers' guides. Its main aim is to summarise the supported software at different levels of granularity. The facilities cover the gCube components that deal with several aspects of Data Management, in particular: Access, Storage, Transfer, Assessment, Harmonization and Certification.

Key Features

The gCube Data Management facilities provide the following key features:

Data Access and Storage

Uniform model and access API over structured data
dynamically pluggable architecture of model and API transformations to and from internal and external data sources
plugins for document, biodiversity, statistical, and semantic data sources, including sources with custom APIs and standards-based APIs
Fine-grained access to structured data
horizontal and vertical filtering based on tree-pattern matching
URI-based resolution
in-place remote updates
Scalable access to structured data
autonomic service replication with infrastructure-wide load balancing
Efficient and scalable storage of structured data
based on graph database technology
Rich tooling for client and plugin development
high-level Java APIs for service access
DSLs for pattern construction and stream manipulations
Remote viewing mechanisms over structured data
“passive” views based on arbitrary access filters
dynamically pluggable architecture of custom view management schemes
Uniform modelling and access API over document data
rich descriptions of document content, metadata, annotations, parts, and alternatives
transformations from model and API of key document sources, including OAI providers
high-level client APIs for model construction and remote access
Uniform modelling and access API over semantic data
tree-views over RDF graph data
transformations from model and API of key semantic data sources, including SPARQL endpoints
Uniform modelling and access over biodiversity data
access API tailored to biodiversity data sources
dynamically pluggable architecture of transformations from external sources of biodiversity data
plugins for key biodiversity data sources, including OBIS, GBIF and Catalogue of Life
Efficient and scalable storage of files
unstructured storage back-end based on MongoDB, providing replication and high availability, automatic balancing in response to changes in load and data distribution, data sharding, and no single point of failure
no intrinsic upper bound on file size
Standards-based and structured storage of files
POSIX-like client API
support for hierarchical folder structures
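
The horizontal and vertical filtering listed above can be illustrated with a small, language-neutral sketch. It is not the gCube API (which is Java-based and not reproduced here): the nested-dictionary tree model, the `prune` and `filter_trees` functions, and the `any_value` predicate are all hypothetical names introduced for illustration. The sketch assumes that "horizontal" filtering means discarding whole trees that do not match a pattern, and "vertical" filtering means keeping only the branches that the pattern names.

```python
# Hypothetical model (assumption): a tree is a dict mapping edge labels to
# either a leaf value or a nested dict; a pattern has the same shape, with
# predicates (callables) in place of leaf values.

def any_value(value):
    """Predicate that accepts every leaf value."""
    return True

def prune(tree, pattern):
    """Return the part of `tree` selected by `pattern`, or None on mismatch."""
    result = {}
    for label, constraint in pattern.items():
        if label not in tree:
            return None  # an edge required by the pattern is missing
        value = tree[label]
        if isinstance(constraint, dict):
            # Nested pattern: the value must be a subtree that matches too.
            if not isinstance(value, dict):
                return None
            subtree = prune(value, constraint)
            if subtree is None:
                return None
            result[label] = subtree
        else:
            # Leaf predicate: the value must be a leaf that satisfies it.
            if isinstance(value, dict) or not constraint(value):
                return None
            result[label] = value
    return result  # branches not named by the pattern are dropped (vertical)

def filter_trees(trees, pattern):
    """Yield pruned copies of the matching trees only (horizontal)."""
    for tree in trees:
        pruned = prune(tree, pattern)
        if pruned is not None:
            yield pruned
```

With a pattern such as `{"title": any_value, "meta": {"lang": lambda v: v == "en"}}`, only trees whose `meta/lang` leaf equals `"en"` are returned, and each returned tree carries just its `title` and `meta/lang` branches.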

Data Transfer

Point to Point transfer
one writer-one reader as core functionality
Produce only what is requested
a producer-consumer model that blocks when needed and reduces unnecessary data transfers
Intuitive stream and iterator based interface
simplified usage with reasonable default behavior for common use cases and a variety of features for increased usability and flexibility
Multiple protocols support
data transfer currently supports the following protocols: TCP and HTTP
HTTP Broker Servlet
transfer results are exposed as an HTTP endpoint
Reliable data transfer between Infrastructure Data Sources and Data Storages
by exploiting the uniform access interfaces provided by gCube
Structured and unstructured Data Transfer
both tree-based and file-based transfer, covering structured and unstructured use cases
Transfers to local nodes for data staging
data staging for particular use cases can be enabled on each node of the infrastructure
Advanced transfer scheduling and transfer optimization
a dedicated gCube service responsible for data transfer scheduling and transfer optimization
Transfer statistics availability
transfers are logged by the system and made available to interested consumers
Data enrichment support
species occurrence data enrichment with environmental data dynamically acquired by data providers
data provenance recording
Standards-based data presentation
OGC standards-based geospatial data presentation
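
The "one writer-one reader" and "produce only what is requested" features listed above describe a blocking producer-consumer model behind an iterator interface. A minimal sketch of that idea follows, using a bounded in-memory queue in place of a network transport. The function `open_result_set` and its `capacity` parameter are hypothetical and do not correspond to the actual gCube Result Set API.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the stream

def open_result_set(produce, capacity=2):
    """Run `produce()` in a background writer and return a reader iterator.

    `produce` is a zero-argument callable returning an iterable of records.
    The writer blocks as soon as `capacity` records are waiting unread, so
    production is driven by how fast the reader consumes.
    """
    buffer = queue.Queue(maxsize=capacity)

    def writer():
        for record in produce():
            buffer.put(record)  # blocks while the buffer is full
        buffer.put(_END)

    threading.Thread(target=writer, daemon=True).start()

    def reader():
        while True:
            record = buffer.get()  # blocks while the buffer is empty
            if record is _END:
                return
            yield record

    return reader()
```

In this sketch the writer can never be more than `capacity + 1` records ahead of the reader: `capacity` records in the buffer plus one record already produced and waiting to be enqueued. A reader that stops early leaves the writer blocked, so a fuller implementation would also need a way to cancel the writer.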

Data Assessment, Harmonisation and Certification

Workflow-oriented tabular data manipulation
definition and execution of user-defined workflows of data manipulation steps
rich array of data manipulation facilities offered 'as-a-Service'
rich array of data visualisation facilities offered 'as-a-Service'
Reference-data management support
uniform model for reference-data representation including versioning and provenance
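
The workflow-oriented manipulation listed above can be pictured as an ordered list of steps, each taking a table and returning a new one. The sketch below assumes a table is a list of row dictionaries. The step constructors `rename_column` and `drop_missing` and the runner `run_workflow` are hypothetical illustrations, not operations of the Tabular Data Service.

```python
def rename_column(old, new):
    """Build a step that renames column `old` to `new` in every row."""
    def apply(table):
        return [{(new if key == old else key): value
                 for key, value in row.items()}
                for row in table]
    return (f"rename {old}->{new}", apply)

def drop_missing(column):
    """Build a step that drops rows whose `column` is absent or empty."""
    def apply(table):
        return [row for row in table if row.get(column) not in (None, "")]
    return (f"drop rows missing {column}", apply)

def run_workflow(table, steps):
    """Apply `steps` in order; return the final table and a per-step log.

    Each log entry records the step name and the row count after the step.
    The input table is never modified: every step builds a new list.
    """
    log = []
    for name, apply in steps:
        table = apply(table)
        log.append((name, len(table)))
    return table, log
```

Keeping each step as a pure function leaves the input table untouched, and the log gives a simple trace of what each step did to the data.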

Components

Data Access and Storage

Tree-Based Access
groups components that implement access and storage facilities for structured data of arbitrary semantics, origin, and size
The subsystem focuses on uniform access to structured data for cross-domain processes, including but not limited to system processes
A subset of the components build on the generic interface and specialise it to data with domain-specific semantics, including document data and semantic data
Biodiversity Access
groups components that implement access and storage facilities for structured data with biodiversity semantics and arbitrary origin and size
The subsystem focuses on uniform access to biodiversity data for domain-specific processes and feeds into the Tree-Based Access subsystem
File-Based Access
groups components that implement access and storage facilities for unstructured data with arbitrary semantics and size
The subsystem focuses on storage and retrieval of bytestreams for arbitrary processes, including but not limited to system processes

Data Transfer

Result Set components
this family of components provides a common data transfer mechanism for high-throughput, point-to-point, on-demand communication. It is designed as core functionality of the overall system and also serves as the building block for the Data Transfer Scheduler & Agent components
Data Transfer Scheduler & Agent components
this family of components gives VO/VRE Administrators the means to transfer data among Data Sources and Data Storages. It can also be used by any client or web service to implement data movement between infrastructure nodes

Data Assessment, Harmonisation and Certification

Tabular Data Service
a service supporting tabular data flow management
Time Series
a service for performing assessment and harmonization on time series
Codelist Manager
a library for performing import, harmonization and curation on code lists
Occurrence Data Reconciliation
a service for performing assessment and harmonization on occurrence points of species
Occurrence Data Enrichment Service
a service for enriching the information associated with occurrence points of species
Taxon Names Reconciliation Service
a service for performing assessment and harmonization on taxa

Specifications

The specifications require preparatory information to be properly understood. In particular:

How to Develop a gCube Component
a basic guide to building a gCube Component
Building Components using the gCube Featherweight Stack
a guide to developing libraries or clients for the gCube Services
Developer's Guide
the overall gCube Developer's Guide

Task oriented specifications can be found in the following:

Data Access and Storage Specifications
the specifications for the Data Access and Storage components
Data Transfer Components Specifications
the specifications for the Data Transfer components
Data Assessment, Harmonization and Certification Specifications
the specifications for the Data Assessment, Harmonization and Certification components
Integrating Legacy Application into the WPS-Hadoop Framework
an additional guide on how to integrate external scripts into the WPS-Hadoop facilities developed in the Data Transfer Task