Difference between revisions of "Tabular Data Service"

From Gcube Wiki
m
 
(15 intermediate revisions by 2 users not shown)
Line 1: Line 1:
= Overview =
+
{| align="right"
Tabular Data is a system that allows to manage the lifecycle of statistical data. It's main characteristics are:
+
||__TOC__
 +
|}
 +
 
 +
Part of the [[Data Access and Storage Facilities]], this service focuses on the management of tabular data, i.e. any dataset that can be represented in a table format.
 +
 
 +
This document outlines the service design rationale, key features, and high-level architecture as well as the options for their deployment.
 +
 
 +
== Overview ==
 +
Tabular Data service is a system that allows to manage the entire lifecycle of tabular data. It's main characteristics are:
 
* Exposes its operation to the end user through a gCube Portlet;
 
* Exposes its operation to the end user through a gCube Portlet;
 
* Its operations can be invoked programmatically through a gCube Web Service (smart gears);
 
* Its operations can be invoked programmatically through a gCube Web Service (smart gears);
* Provides several operations for importing/exporting from/to serveral sources/sinks;
+
* Provides several operations for importing/exporting from/to several sources/sinks;
 
* Provides data validations and transformations capabilities;
 
* Provides data validations and transformations capabilities;
 
* Provides means for the management of data management processes automation.
 
* Provides means for the management of data management processes automation.
  
= Architecture =
+
=== Key Features ===
 +
The subsystem provides for:
  
[[File:TabularDataArchitecture.png|center|800px||Tabular Data system architecture]]
+
;comprehensive facilities for tabular data management
 +
:The subsystem offers a comprehensive set of data management facilities oriented to tabular data. Facilities include data import, data editing, and data filtering;
  
Tabular data system is made of three main subsystems:
+
;user-friendly interface
* '''Tabular Data Portlet''': human centric web application hosted on a gCube web portal. Allows management of the Tabular Data Service system on a per-user basis, allowing invocation of tabular data service methods and additional functionalities with external apps.
+
:The subsystem offers a graphical user interface where users can visualize the data and perform operations in a very user-friendly environment. Moreover, the environment allow to actually perform a workflow by applying a template to an imported table;
* '''Tabular Data Service''': main component of the tabular data architecture. It exposes several remote interfaces covering different areas of functionalities.
+
* '''Data/Metadata backend''': This is where raw tables data and metadata are stored and where the service keeps its management data.
+
  
The diagram shows also some of the main components that build up the service, which mainly relates to the operation management area:
+
;declarative approach
* '''Operation orchestrator''': The operation orchestrator is a component that receives incoming call requests from the service interface and unwinds them into a sequence of operation call. The orchestrator may enforce policies and command automatic operations/validations according to its configuration.
+
:Instead of providing the user with means to describe the workflow as a set of transformation steps the user provides a table template as a set of properties a target table should comply with.
* '''Operation modules''': Operation modules are classes that brings functionalities to the tabular data service. These functionalities can be reached directly with invocations on the Operation Interface. Operation modules can work directly with data on the Data backend or leverage the cube manager in order to create/clone tables or modify table metadata.
+
* '''Cube manager''': The cube manager is the lowest level component of the service stack. Its main responsibilites are managing the creation/modification of tables (and their metadata) and acting as a registry for all the created tables, allowing retrieval of tables metadata.
+
  
= Model =
+
;flexible and open workflow definition mechanism
Tables in the Tabular Data system are entities made of two separate elements:
+
:The set of workflow steps can be enriched providing wider capabilities for template descriptiveness;
* Raw data: This can be imagined as the data contained in user provided CSV files
+
* Metadata
+
** Data structure: this metadata describes how data is structured (e.g.: columns number or column data type) and how raw data can be reached
+
** Enriching Metadata: This metadata adds information on top of raw data and provides some context or additional information on top of it.
+
  
Raw data is managed directly by leveraging relational database services (PostgreSQL with Postgis extension).
+
;re-usability orientation
Metadata is managed and represented through a metadata model library called [http://maven.research-infrastructures.eu/nexus/index.html#nexus-search;gav~~tabular-model tabular-model].
+
:The subsystem is conceived to promote the reuse of its facilities in application dealing with tabular data; Moreover, it is conceived to be open so that additional functions can be easily added to serve domain specific cases;
Tabular Model provides
+
* a description for tables entities covering the minimum table structure description requrements
+
* elements that helps in enriching tables with additional metadata (column labels, descriptions, table version, etc.)
+
  
Tabular Model is [http://maven.research-infrastructures.eu/nexus/index.html#nexus-search;gav~~tabular-model-gwt GWT friendly], which means that it can be used in GWT Web application on client side, since it's java beans are translatable into javascript code.
+
== Design ==
  
= Service =
+
=== Philosophy ===
  
 +
[[Tabular Data Service Model and Operations]]
  
==Operations==
+
=== Architecture ===
Operation modules are a group of Java classes that provide, each one, a single functionality to the system. Functionalities provided by operation modules may fall under one of these categories:
+
* Import
+
* Export
+
* Transformation
+
* Validation
+
  
Each Operation takes an input, which is a set of parameters. These parameters may include a tabular data table or a column of a tabular data table or none of them (like in the import case). Along with additional parameters, each operation must belong to one of these categories:
+
[[File:TabularDataArchitecture.png|center|800px||Tabular Data system architecture]]
* Void scoped: does not require a table to compute
+
* Table scoped: requires a target table to compute
+
* Column scope: requires a target table column to compute
+
Each operation produce, as a result of its computation, a table and zero or more collateral tables. The create tables are always a new table probably created by first cloning the input table, if any is provided.
+
  
Operation modules leverages cube manager capabilities in order to create new tables, clone existing ones or modify the structure or additional metadata of tables.
+
Tabular data system is made of three main subsystems:
Operation modules can work with raw data directly on the DB, therefore data experts can rely on their SQL knowledge.
+
* '''Tabular Data Portlet''': human centric web application hosted on a gCube web portal. Allows management of the Tabular Data Service system on a per-user basis, allowing invocation of tabular data service methods and additional functionality with external apps.
 
+
* '''Tabular Data Service''': main component of the tabular data architecture. It exposes several remote interfaces covering different areas of functionality.
 
+
** '''Operation orchestrator''': The operation orchestrator is a component that receives incoming call requests from the service interface and unwinds them into a sequence of operation call. The orchestrator may enforce policies and command automatic operations/validations according to its configuration.
==Orchestrator==
+
** '''Operation modules''': Operation modules are classes that brings functionality to the tabular data service. These functionality can be reached directly with invocations on the Operation Interface. Operation modules can work directly with data on the Data back-end or leverage the cube manager in order to create/clone tables or modify table metadata.
 +
** '''[[Cube Manager]]''': The cube manager is the lowest level component of the service stack. Its main responsibilities are managing the creation/modification of tables (and their metadata) and acting as a registry for all the created tables, allowing retrieval of tables metadata.
 +
* '''Data/Metadata back-end''': This is where raw tables data and metadata are stored and where the service keeps its management data.
  
==Cube Manager==
+
== Deployment ==
  
Main
+
== Use Cases ==
[[Cube Manager]] link
+

Latest revision as of 20:26, 16 December 2013

Part of the Data Access and Storage Facilities, this service focuses on the management of tabular data, i.e. any dataset that can be represented in a table format.

This document outlines the service's design rationale, key features, and high-level architecture, as well as the options for its deployment.

Overview

The Tabular Data service is a system for managing the entire lifecycle of tabular data. Its main characteristics are:

  • Exposes its operations to the end user through a gCube Portlet;
  • Its operations can be invoked programmatically through a gCube Web Service (smart gears);
  • Provides several operations for importing/exporting from/to several sources/sinks;
  • Provides data validation and transformation capabilities;
  • Provides means for automating data management processes.

Key Features

The subsystem provides for:

comprehensive facilities for tabular data management
The subsystem offers a comprehensive set of data management facilities oriented to tabular data. Facilities include data import, data editing, and data filtering;
user-friendly interface
The subsystem offers a graphical user interface where users can visualize the data and perform operations in a user-friendly environment. Moreover, the environment allows users to run a workflow by applying a template to an imported table;
declarative approach
Instead of describing the workflow as a sequence of transformation steps, the user provides a table template, i.e. a set of properties that the target table should comply with;
flexible and open workflow definition mechanism
The set of workflow steps can be extended, which widens what a template is able to describe;
re-usability orientation
The subsystem is conceived to promote the reuse of its facilities in applications dealing with tabular data. Moreover, it is conceived to be open, so that additional functions can easily be added to serve domain-specific cases.
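
The declarative approach above can be illustrated with a minimal sketch. It is written in Python for brevity and does not use the service API: `ColumnRule`, `TableTemplate`, and `check` are hypothetical names, and the in-memory dictionary stands in for a real table. The template only states which properties the target table should have; checking a table against it yields the list of violations that a workflow would then need to resolve.

```python
from dataclasses import dataclass

# Hypothetical names throughout: ColumnRule, TableTemplate and check()
# are illustrative only and are not part of the Tabular Data service API.


@dataclass(frozen=True)
class ColumnRule:
    name: str               # column the target table must contain
    dtype: type             # Python type every non-null value must have
    nullable: bool = False  # whether None is an acceptable value


@dataclass(frozen=True)
class TableTemplate:
    rules: tuple

    def check(self, table):
        """Return a list of violations; an empty list means the table complies."""
        violations = []
        for rule in self.rules:
            if rule.name not in table:
                violations.append(f"missing column: {rule.name}")
                continue
            for row, value in enumerate(table[rule.name]):
                if value is None:
                    if not rule.nullable:
                        violations.append(f"{rule.name}[{row}]: null not allowed")
                elif not isinstance(value, rule.dtype):
                    violations.append(
                        f"{rule.name}[{row}]: expected {rule.dtype.__name__}"
                    )
        return violations


# The template describes the target table, not the steps to reach it.
template = TableTemplate(rules=(
    ColumnRule("species", str),
    ColumnRule("catch_tonnes", float, nullable=True),
))

# An imported table, modelled here as column name -> list of values.
table = {
    "species": ["tuna", "cod", None],
    "catch_tonnes": [12.5, None, "n/a"],
}

violations = template.check(table)
for line in violations:
    print(line)
```

Keeping the template as plain data means the same description can be applied to any imported table, which matches the idea of applying a template rather than scripting individual steps.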

Design

Philosophy

Tabular Data Service Model and Operations

Architecture

[Figure: Tabular Data system architecture]

The Tabular Data system is made of three main subsystems:

  • Tabular Data Portlet: human-centric web application hosted on a gCube web portal. It allows management of the Tabular Data Service system on a per-user basis, including invocation of tabular data service methods and additional functionality provided with external apps.
  • Tabular Data Service: main component of the tabular data architecture. It exposes several remote interfaces covering different areas of functionality.
    • Operation orchestrator: The operation orchestrator is a component that receives incoming call requests from the service interface and unwinds them into a sequence of operation calls. The orchestrator may enforce policies and trigger automatic operations/validations according to its configuration.
    • Operation modules: Operation modules are classes that bring functionality to the tabular data service. This functionality can be reached directly through invocations on the Operation Interface. Operation modules can work directly with data on the Data back-end or leverage the cube manager in order to create/clone tables or modify table metadata.
    • Cube Manager: The cube manager is the lowest-level component of the service stack. Its main responsibilities are managing the creation/modification of tables (and their metadata) and acting as a registry for all the created tables, allowing retrieval of table metadata.
  • Data/Metadata back-end: This is where raw table data and metadata are stored and where the service keeps its management data.
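
The interplay between the three service components can be sketched as follows. This is a simplified Python model under stated assumptions, not the service's actual (Java) implementation: the class and function names (`CubeManager`, `Orchestrator`, `import_rows`, `drop_nulls`), the request format, and the in-memory registry are all hypothetical. It also assumes that a transformation works on a clone of its input table, so the input is left untouched.

```python
from dataclasses import dataclass, field
from itertools import count

# All names below are hypothetical; they only mirror the roles described above.


@dataclass
class Table:
    table_id: int
    columns: dict                       # column name -> list of values
    metadata: dict = field(default_factory=dict)


class CubeManager:
    """Lowest-level component: creates/clones tables and registers them."""

    def __init__(self):
        self._tables = {}
        self._ids = count(1)

    def create(self, columns, metadata=None):
        table = Table(next(self._ids), columns, dict(metadata or {}))
        self._tables[table.table_id] = table
        return table

    def clone(self, table_id):
        source = self._tables[table_id]
        copied = {name: list(values) for name, values in source.columns.items()}
        return self.create(copied, source.metadata)

    def get(self, table_id):
        return self._tables[table_id]


def import_rows(cubes, table, params):
    """Operation module: needs no input table and creates a new one."""
    return cubes.create(params["columns"], {"source": params["source"]})


def drop_nulls(cubes, table, params):
    """Operation module: removes rows with a null in the given column (on a clone)."""
    result = cubes.clone(table.table_id)
    column = params["column"]
    keep = [i for i, value in enumerate(result.columns[column]) if value is not None]
    result.columns = {
        name: [values[i] for i in keep] for name, values in result.columns.items()
    }
    return result


class Orchestrator:
    """Unwinds one incoming request into a sequence of operation calls."""

    def __init__(self, cubes, operations):
        self._cubes = cubes
        self._operations = operations

    def execute(self, request):
        table = None
        for step in request:
            operation = self._operations[step["op"]]
            table = operation(self._cubes, table, step.get("params", {}))
        return table


cubes = CubeManager()
orchestrator = Orchestrator(cubes, {"import": import_rows, "drop_nulls": drop_nulls})
result = orchestrator.execute([
    {"op": "import", "params": {
        "source": "catch.csv",
        "columns": {"species": ["tuna", None], "tonnes": [12.5, 3.0]},
    }},
    {"op": "drop_nulls", "params": {"column": "species"}},
])
print(result.table_id, result.columns)
```

In this sketch the orchestrator knows nothing about what an operation does; it only sequences the calls and hands each module the cube manager. Policy enforcement or automatic validations could be added as extra steps inside `execute`.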

Deployment

Use Cases