Part of the [[Data Access and Storage Facilities]], a cluster of components within the system focused on standards-based and structured access to, and storage of, files of arbitrary size.

This document outlines their design rationale, key features, and high-level architecture, as well as the options for their deployment.
  
  
 
== Overview ==
Access and storage of unstructured bytestreams, or files, can be provided through a standards-based, POSIX-like API which supports the organisation and operations normally associated with local file systems whilst offering scalable and fault-tolerant remote storage.

API and remote storage are provided by a set of components, most notably a client library and a service based on a range of site-local back-ends, including MongoDB and Terrastore.

The library acts as a facade to the service and allows clients to download, upload, remove, add, and list files; a usage sketch is given below.

Files have owners and owners may define access rights to files, allowing private, public, or group-based access.

Through the use of metadata, the library allows hierarchical organisation of the data against the flat storage provided by the service's back-ends.
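
As a rough illustration of such a facade, the following Java sketch declares a hypothetical interface with the operations listed above. The type and method names, the remote paths, and the demo flow are all assumptions made for illustration; they do not reproduce the library's actual API.

<source lang="java">
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// A hypothetical client-side facade mirroring the POSIX-like operations
// described above. All names here are illustrative, not the library's real API.
public interface RemoteFileSystem {

    void upload(InputStream content, String remotePath) throws IOException; // add a new remote file
    InputStream download(String remotePath) throws IOException;             // fetch a remote file
    void remove(String remotePath) throws IOException;                      // delete a remote file
    List<String> list(String remoteDir) throws IOException;                 // list a remote directory

    // Typical client interaction, expressed against the hypothetical interface.
    static void demo(RemoteFileSystem fs, InputStream localFile) throws IOException {
        fs.upload(localFile, "/projects/demo/report.pdf"); // POSIX-like remote path
        for (String entry : fs.list("/projects/demo")) {   // directory listing
            System.out.println(entry);
        }
        fs.remove("/projects/demo/report.pdf");
    }
}
</source>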
  
 
=== Key features ===
  
The subsystem has the following features:

;structured file storage
:clients can create folder hierarchies, where folders are encoded as file metadata and do not require direct support in the storage back-end (see the sketch after this list);
;secure file storage
:file access is authenticated against access rights set by file owners: private files are readable and writable only by their owner, files shared with a group are readable and writable by the group's members, and public files are accessible to all users;
;scalable file storage
:files are stored in chunks and chunks are distributed across clusters of servers based on the workload of individual servers;
;fault-tolerant file storage
:files are asynchronously replicated across the servers of a cluster for data recovery and redundancy.
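
The first two features can be sketched on a flat key/value back-end as follows. The metadata layout, the class names, and the exact permission semantics are illustrative assumptions, not the library's actual implementation.

<source lang="java">
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Sketch of folders-as-metadata and owner-based access on a flat store.
// All names and the metadata layout are illustrative assumptions.
public class FlatStoreSketch {

    enum AccessMode { PRIVATE, SHARED, PUBLIC }

    // Per-object metadata of the kind the library attaches to each entry.
    record Meta(String path, String owner, String group, AccessMode mode) {}

    // "Directory listing" on a flat store: select the entries whose parent
    // is the requested directory. No directory object exists in the store.
    static List<String> list(Map<String, Meta> flatStore, String dir) {
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        return flatStore.keySet().stream()
                .filter(p -> p.startsWith(prefix) && p.indexOf('/', prefix.length()) < 0)
                .sorted()
                .toList();
    }

    // Read permission under the three access modes described above.
    static boolean canRead(Meta meta, String userId, Set<String> groups) {
        return switch (meta.mode()) {
            case PUBLIC  -> true;
            case SHARED  -> groups.contains(meta.group()) || Objects.equals(userId, meta.owner());
            case PRIVATE -> Objects.equals(userId, meta.owner());
        };
    }
}
</source>

Because the directory structure lives entirely in metadata, the same listing logic works over any flat back-end, which is what lets the library stay back-end agnostic.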
== Design ==
  
 
=== Philosophy ===

Navigating through folders on a remote storage system, downloading and uploading files, a familiar POSIX-like interface, and a scalable and fault-tolerant storage back-end: these are all key design goals for the subsystem.

The library has been designed to preserve a unified interface that encapsulates clients from the variety of File Storage Service back-ends.

Its two layers, a core library and a wrapper library, permit the use of the library in standalone mode or within the gCube framework.
 
=== Architecture ===

The library is divided into two layers: a core library and a wrapper library. The core library is for general-purpose use, external to the gCube framework. The wrapper library is intended for use within the gCube framework: it interacts with the IS to discover the server resources that the core library will use. The core library interacts with a File Storage Service back-end, which has the responsibility of storing the data. At this time, two kinds of File Storage Service are supported: Terrastore and MongoDB.

File-based access is provided by the following components:

;Core library:
:implements a high-level facade to the remote APIs of the File Storage Service. The core dialogues directly with a File Storage Service that is responsible for storing data. This level has the responsibility to split files into chunks if their size exceeds a certain threshold and to build the metadata, such as the owner, the type of object (file or directory), and the access permissions. It also has the task of issuing to the File Storage System the commands that build the tree of folders out of metadata, where needed (a sketch of the chunking logic follows the diagram below).
;Wrapper library:
:the wrapper library for the gCube framework; it has the task of harvesting the configuration resources made available in the gCube framework and passing them to the core library.
;File Storage Service:
:the service responsible for remote data storage; it is invoked by the core library and can be based on different technologies, such as MongoDB or Terrastore.


The following diagram illustrates the dependencies between the components of the subsystem:

[[Image:FileAccessGraph_2.jpeg|frame|center|File Access Architecture]]
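
The chunking and metadata duties attributed to the core library can be sketched as follows. The 4 MB chunk size, the class names, and the least-loaded placement policy are assumptions made for illustration.

<source lang="java">
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch of the core-library behaviour described above: files larger than a
// threshold are split into chunks, each object carries metadata (owner, type,
// permissions), and chunks are placed on the least-loaded storage server.
// The threshold, the names, and the placement policy are illustrative.
public class ChunkingSketch {

    static final int CHUNK_SIZE = 4 * 1024 * 1024; // assumed 4 MB chunk size

    record StorageServer(String host, long currentLoad) {}

    // Split a stream into fixed-size chunks; the last chunk may be shorter.
    static List<byte[]> split(InputStream in) throws IOException {
        List<byte[]> chunks = new ArrayList<>();
        byte[] chunk = in.readNBytes(CHUNK_SIZE);
        while (chunk.length > 0) {
            chunks.add(chunk);
            chunk = in.readNBytes(CHUNK_SIZE);
        }
        return chunks;
    }

    // Metadata of the kind the core library is said to build for each object.
    static Map<String, String> buildMetadata(String owner, boolean directory, String access) {
        return Map.of(
                "owner", owner,
                "type", directory ? "directory" : "file",
                "access", access);
    }

    // Load-based placement: pick the server with the lowest current load.
    static StorageServer pickTarget(List<StorageServer> servers) {
        return servers.stream()
                .min(Comparator.comparingLong(StorageServer::currentLoad))
                .orElseThrow();
    }
}
</source>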
  
 
== Deployment ==

The deployment of this library focuses on installing the File Storage System. The File Storage System is installed statically, with no dynamic scaling based on the request load; it is therefore very important to choose the right installation for the needs at hand. As the number of servers dedicated to data storage grows, not only does the storage capacity increase, but the balancing of the data, and therefore the response time, also improves. On the other hand, if the storage requirements are modest and the number of servers is large, resources will be wasted on servers that are little used.

=== Large Deployment ===
  
A large deployment consists of an installation of a cluster of servers dedicated to storage. Our current implementation uses a MongoDB File Storage Service, currently at version 2.0.1. The servers are organized into a MongoDB replica set cluster: replica sets are a form of asynchronous master/slave replication, adding automatic failover and automatic recovery of the cluster's member nodes.

In a production situation, the replica set ensures high availability, automated failover, data redundancy, and disaster recovery.


[[Image:largeDeploymentArch4.jpeg|frame|center|Large Deployment Architecture]]
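
For illustration, the sketch below shows how a client might connect to such a replica set with the legacy MongoDB Java driver: given several seed members, the driver discovers the primary and fails over automatically when a member goes down. The hostnames are placeholders, and the driver API shown is from a later generation than the server version mentioned above.

<source lang="java">
import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ReadPreference;
import com.mongodb.ServerAddress;
import org.bson.Document;
import java.util.Arrays;

// Connecting to a MongoDB replica set with the (legacy) MongoDB Java driver.
// Seeded with several members, the driver locates the primary on its own and
// retargets operations after a failover. Hostnames below are placeholders.
public class ReplicaSetClient {
    public static void main(String[] args) {
        MongoClientOptions options = MongoClientOptions.builder()
                .readPreference(ReadPreference.secondaryPreferred()) // spread reads over secondaries
                .build();
        MongoClient client = new MongoClient(Arrays.asList(
                new ServerAddress("storage1.example.org", 27017),
                new ServerAddress("storage2.example.org", 27017),
                new ServerAddress("storage3.example.org", 27017)), options);
        // Simple liveness check against whichever member the driver selected.
        Document pong = client.getDatabase("admin").runCommand(new Document("ping", 1));
        System.out.println("ping: " + pong.toJson());
        client.close();
    }
}
</source>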
  
 
=== Small Deployment ===

A small deployment consists of an installation of a single server dedicated to storage.

In this case the installation is very simple, but failover, data replication, and horizontal scalability are not guaranteed.

Given all this, we do not think that a single-server deployment is the best way to get true durability. We think the right path to durability is replication across many nodes; that is why our current deployment is of the "large deployment" kind.


[[Image:smallDeploymentArch4.jpeg|frame|center|Small Deployment Architecture]]
== Use Cases ==

=== Well suited use cases ===

The subsystem is particularly suited to support the sharing and storing of a large number of files.

The library core can operate in standalone mode, without the wrapper library. This makes it possible to adopt the library in environments other than gCube.

In theory, files could even be shared across different environments, provided the clients shared login credentials and resources.

The library is able to handle files of large size without loss of performance, thanks to the management of files in chunks, as the sketch below illustrates.
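
A minimal sketch of why chunking keeps large files cheap to handle: the file is streamed one chunk at a time, so memory use stays constant regardless of file size. The ChunkStore interface and its method names are hypothetical.

<source lang="java">
import java.io.IOException;
import java.io.OutputStream;

// Sketch of chunked reading: a large remote file is streamed chunk by chunk,
// so the whole file never has to fit in memory. The ChunkStore interface and
// its methods are hypothetical, not part of the library's real API.
public class ChunkedDownload {

    interface ChunkStore {
        int chunkCount(String fileId) throws IOException;              // how many chunks make up the file
        byte[] readChunk(String fileId, int index) throws IOException; // fetch one chunk
    }

    // Reassemble a file by streaming its chunks, in order, to the caller.
    static void download(ChunkStore store, String fileId, OutputStream out) throws IOException {
        int n = store.chunkCount(fileId);
        for (int i = 0; i < n; i++) {
            out.write(store.readChunk(fileId, i)); // constant memory per chunk
        }
        out.flush();
    }
}
</source>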
=== Less well suited use cases ===

Scalability is achieved only through static deployments of larger clusters: if more resources are needed than those provided at deployment time, they can only be added statically.

The current installation, based on the MongoDB back-end, works well on both small and large files, but it loses some performance on very small files, smaller than 2 KB. There is also a little more space overhead in MongoDB than on a plain file system.