Tabular Data Flow Manager

From Gcube Wiki
Jump to: navigation, search

The goal of this facility is to realise an integrated environment supporting the definition and management of workflows of tabular data. Each workflow consists of a number of tabular data processing steps where each step is realized by an existing service component conceptually offered by a gCube based infrastructure.

In the following, the design rationale, key features, high-level architecture, as well as the deployment scenarios are described.

Overview

The goal of this service is to offer a facilities for tabular data workflow management, execution and monitoring. The workflow can involve a number of data manipulation steps each performed by potentially different service components to produce the desired output.

Key features

The subsystem provides for:

declarative approach
Instead of providing the user with means to describe the workflow as a set of transformation steps the user provides a table template as a set of properties a target table should comply with.
flexible and open workflow definition mechanism
The set of workflow steps can be enriched providing wider capabilities for template descriptiveness;
user-friendly interface
The subsystem offers a graphical user interface where users can define table templates. Moreover, the environment allow to actually perform a workflow by applying a template to an imported table;

Design

Philosophy

Tabular Data Flow Manager offers a service for tabular data workflow creation, management and monitoring. The underlying idea is to provide means to the service client to command multiple operations by providing a table template. A table template can be defined in terms of a set of properties the workflow resulting table should camply with. Table templates can be created by the end user with the UI and saved for later reuse. Applying a template to a target tabular data table results in the materialization of a set of workflow steps on the service, which can be monitored remotely. Each step is managed by a single software component which can also be invoked singularly. This approach aims at maximizing the exploitation and reuse of components offering data manipulation facilities.

Architecture

The subsystem comprises the following components:

  • Flow Service: A subset of Tabular Data Service functionalities that allows workflow creation, management, execution and monitoring;
  • Flow UI: the user interface of this functional area. It provides users with the web based user interface for creating, executing and monitoring the workflow(s);
  • Workflow Orchestrator: A service components that unpacks a table template into a sequence of operations to be performed on a target table;
  • Operation modules: A set of software modules, each one managing a specific operation (transformation,validation,import,export).

A diagram of the relationships between these components is reported in the following figure:

Tabular Data Flow Manager, internal Architecture

Deployment

The Service should be deployed in a single node along with the operation modules. The User Interface can be deployed in the infrastructure portal along with the needed client library.

Use Cases

Well suited Use Cases

This component well fit all the cases where it is necessary to manage a defined flow of data manipulation steps. An example is the data flow that allows a user to curate a set of uncurated data, provided periodically by a data provider, apply a set of default transformation and validation procedures and merge all the curated data chunks into a single table at the end of the process.