Statistical Manager
Latest revision as of 17:23, 8 July 2016

A cross-usage service aiming to provide users and services with tools for performing data mining operations. This document outlines the design rationale, key features, and high-level architecture, as well as the deployment context.

== Overview ==

The goal of this service is to offer a single access point for performing data mining or statistical operations on heterogeneous data. The data can reside on the client side in the form of CSV files, be remotely hosted as SDMX documents, or be stored in a database.

The Service is able to take such inputs and execute the operation requested by a client interface, invoking the most suited computational infrastructure among a set of available possibilities: executions can run on multi-core machines or on different computational infrastructures, such as [http://www.d4science.eu/vre d4Science], [http://www.windowsazure.com Windows Azure], [http://www.prace-ri.eu/BSC-Barcelona-Supercomputing?lang=en CompSs] and other options.
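As a purely illustrative sketch of this idea (the class and names below are hypothetical, not the actual Statistical Manager API), a client request could bundle a heterogeneous input descriptor with the algorithm to run, leaving the choice of execution back end to the service:

```java
import java.util.Map;

// Hypothetical sketch of a client-side request: the input may be a local
// CSV file, a remote SDMX document or a database table; the service, not
// the client, later decides which infrastructure executes it.
public class ComputationRequest {
    public enum InputKind { CSV_FILE, SDMX_DOCUMENT, DATABASE_TABLE }

    public final InputKind kind;
    public final String locator;      // file path, URL or table name
    public final String algorithm;    // name of the algorithm plug-in
    public final Map<String, String> parameters;

    public ComputationRequest(InputKind kind, String locator,
                              String algorithm, Map<String, String> parameters) {
        this.kind = kind;
        this.locator = locator;
        this.algorithm = algorithm;
        this.parameters = parameters;
    }

    // Human-readable summary of the submitted request, e.g. for logging.
    public String describe() {
        return algorithm + " on " + kind + " '" + locator + "'";
    }
}
```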

Algorithms are implemented as plug-ins, which makes it easy to deploy and inject new functionality.
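A minimal sketch of the plug-in idea (the interface and registry below are illustrative assumptions, not the actual library types): each algorithm implements a common contract and is looked up by name, so adding a functionality amounts to deploying one more implementation without touching the service core.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative plug-in mechanism: algorithms are self-contained
// implementations registered under a name; the service only knows the
// common interface, so new algorithms can be injected at deploy time.
public class AlgorithmRegistry {
    public interface Algorithm {
        double[] run(double[] input);
    }

    private final Map<String, Algorithm> plugins = new HashMap<>();

    public void register(String name, Algorithm algorithm) {
        plugins.put(name, algorithm);
    }

    public double[] execute(String name, double[] input) {
        Algorithm a = plugins.get(name);
        if (a == null)
            throw new IllegalArgumentException("no such algorithm: " + name);
        return a.run(input);
    }
}
```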


== Design ==

=== Philosophy ===

This represents a unique endpoint for clients or services which want to perform complex operations without investigating the details of the implementation. Currently the set of operations which can be performed is divided into:

*Generators
*Modelers
*Transducers
*Evaluators

Further details are available on the Ecological Modeling wiki page, where some experiments are shown along with explanations of the algorithms.

=== Architecture ===

The subsystem comprises the following components:

* '''Ecological Engine Library''': a container for several data mining algorithms as well as evaluation procedures for the quality assessment of the modeling procedures. Algorithms follow a plug-in implementation and deployment model;
* '''Computational Infrastructure Occupancy Tree''': an internal process which monitors the occupancy of the resources to choose among when launching an algorithm;
* '''Algorithms Thread''': an internal process which connects the algorithm to execute with the least loaded infrastructure able to execute it. Infrastructures are also weighted according to computational speed; the internal logic will choose the fastest available;
* '''WS Resource''': an internal gCube process which takes care of all the computations requested by a single user/service. The WS Resource communicates with the other components by means of gCube events;
* '''Object Factory''': a broker for WS Resources and a link between the users' computations and the Occupancy Tree process.
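The interplay between the Occupancy Tree and the Algorithms Thread can be pictured with a simplified model (the scoring below is an assumption for illustration, not the actual internal logic): among the infrastructures able to run a given algorithm, prefer the one whose nominal speed, scaled by its free capacity, is highest.

```java
import java.util.List;

// Simplified sketch of infrastructure selection: each candidate has an
// occupancy level (0 = idle, 1 = saturated) and a relative speed weight.
// The effective speed, speed * (1 - occupancy), decides, so a fast but
// busy back end can lose to a slower idle one.
public class InfrastructureSelector {
    public static class Infrastructure {
        public final String name;
        public final double occupancy;
        public final double speed;
        public final List<String> algorithms;

        public Infrastructure(String name, double occupancy,
                              double speed, List<String> algorithms) {
            this.name = name;
            this.occupancy = occupancy;
            this.speed = speed;
            this.algorithms = algorithms;
        }
    }

    public static String select(List<Infrastructure> candidates, String algorithm) {
        Infrastructure best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Infrastructure infra : candidates) {
            if (!infra.algorithms.contains(algorithm)) continue;
            double score = infra.speed * (1.0 - infra.occupancy);
            if (score > bestScore) {
                bestScore = score;
                best = infra;
            }
        }
        if (best == null)
            throw new IllegalStateException("no infrastructure supports " + algorithm);
        return best.name;
    }
}
```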

A diagram of the relationships between these components is reported in the following figure:


[[Image:statistical-manager-architecture.png|frame|center|Statistical Manager Internal Architecture]]

== Deployment ==

All the components of the service must be deployed together on a single node. The subsystem can be replicated on multiple hosts and scopes, but this does not guarantee a performance improvement, because performance is directly associated with the combination of the algorithms and the computational infrastructures. There are no temporal constraints on the co-deployment of services and plug-ins; every plug-in must be deployed on every instance of the service. The subsystem requires at least 2 GB of memory to run properly; the presence of multiple cores on the machine is preferred.

The Service will automatically discover the available data sources and infrastructures by querying the d4Science Information System for the scope it is running in.
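Scope-based discovery can be pictured as a lookup in a registry keyed by scope (a toy model with hypothetical names; the real service queries the d4Science Information System): resources are published per scope, and a service instance only sees those in its own scope.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of scope-based resource discovery: data sources and
// computational infrastructures are registered per scope, and a
// service instance discovers only those published in its scope.
public class ScopeRegistry {
    private final Map<String, List<String>> resourcesByScope = new HashMap<>();

    public void publish(String scope, String resource) {
        resourcesByScope.computeIfAbsent(scope, s -> new ArrayList<>()).add(resource);
    }

    public List<String> discover(String scope) {
        return resourcesByScope.getOrDefault(scope, Collections.emptyList());
    }
}
```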

=== Small deployment ===

[[Image:statisticalManagerAndInfra.png|frame|center|Deployment Schema with inputs, outputs and connections to the infrastructure]]

== Use Cases ==

=== Well suited Use Cases ===

The subsystem is particularly suited to supporting abstraction over statistical and data mining processes. Every data mining algorithm or evaluation procedure on data can be easily integrated in this subsystem by developing a plug-in.

The development of any plug-in for the Statistical Manager immediately extends the ability of the system to process new kinds of data.

== List of Algorithms ==

[https://wiki.gcube-system.org/gcube/Statistical_Manager_Algorithms The list of currently available algorithms can be found here]