Workflow Engine Specification

From Gcube Wiki

Overview

The Workflow Engine is in charge of processing workflows, i.e. high-level plans that bind together conceptual operations for the implementation of a task. Because a workflow is a higher-level sketch of the activity to be performed, potentially without any reference to executables or resource usage, the Workflow Engine operates on top of the ExecutionEngine. Its purpose is to abstract over the low-level details required by the ExecutionEngine and by the Execution Plan it is provided with.

When the Workflow Engine acts in the context of the gCube platform, it provides a gCube-compliant Web Service interface. This Web Service is not only the front end to the Workflow definition facilities but also the "face" of the component with respect to the gCube platform. The Running Instance profile of the service is the placeholder where the Execution Environment Providers of the underlying Workflow Engine instance push information that needs to be made available to other engine instances. Configuration properties used throughout the Workflow Engine instance are retrieved through the appropriate technology-specific constructs and are used to initialize services and providers once the service starts.

Key features

Orchestration for external computational and storage infrastructures.
Allows users and systems to exploit computational resources that reside on multiple infrastructures in a single complex process workflow.
Native computational infrastructure provider and manager.
Allows full exploitation of the computational capacity of the D4Science Infrastructure by executing tasks directly on the latter.
Handling of data staging among different storage providers.
All Workflow Engine Adaptors handle data staging in a transparent way, without requiring any kind of external intervention.
Unbound extensibility via providers for integration with different environments.
The system is designed in an extensible manner, allowing transparent integration with a variety of providers, such as storage environments, resource registries and reporting systems, depending on the environment in which the engine is operating.
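
As a rough illustration of the data staging feature listed above, the following minimal sketch wraps a job with stage-in and stage-out steps so that the caller only names the inputs and outputs. The step names, the `with_staging` helper and the storage locations are assumptions made for this example; they are not part of the Workflow Engine API.

```python
from typing import List, Tuple

# A plan step is modelled as (kind, argument); both are hypothetical.
Step = Tuple[str, str]


def with_staging(command: str, inputs: List[str], outputs: List[str]) -> List[Step]:
    """Surround a command with the staging steps it needs."""
    # Fetch every input from its storage provider before the job runs.
    plan: List[Step] = [("stage_in", src) for src in inputs]
    plan.append(("execute", command))
    # Push every output to its storage provider after the job has finished.
    plan.extend(("stage_out", dst) for dst in outputs)
    return plan


plan = with_staging(
    "ocr scan.tiff",
    inputs=["store-a:/scans/scan.tiff"],   # made-up location
    outputs=["store-b:/text/scan.txt"],    # made-up location
)
for step in plan:
    print(step)
```

The point of the sketch is only that staging steps are generated by the adaptor rather than written by the user.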

Design

Philosophy

The Workflow Engine is designed to allow specific third-party languages to be parsed and translated into internally used constructs. This task is performed by the Adaptors, which understand these kinds of languages. In this way, the Workflow Engine broadens its usability, since existing workflows already defined in third-party languages can be easily incorporated. The learning curve for anyone wishing to use the execution and workflow capabilities of the system is also greatly reduced, because one can simply focus on whichever supported language best matches the job at hand. Furthermore, more than one adaptor can be implemented for the same language, each offering a different type of functionality, either by modifying the semantics of the produced Execution Plan or by incorporating external components into the evaluation.

As a constituent part of PE2ng, the Workflow Engine is designed with a layered architecture that decouples the business domain, the infrastructure-specific logic and the core execution functionality. This allows the core to be reused in a multitude of use cases and avoids the sub-optimal compromises of strictly agnostic solutions.
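
The role of Adaptors described in this section can be sketched as a registry that maps a workflow language, and optionally a variant, to a translation function. Everything in the sketch below is hypothetical: the `ExecutionPlan` class, the `register` and `translate` helpers and the "toy" language are illustrative stand-ins and do not reproduce the actual Workflow Engine classes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ExecutionPlan:
    """Hypothetical stand-in for the construct handed to the ExecutionEngine."""
    steps: List[str] = field(default_factory=list)


# An adaptor is modelled as a function from a workflow definition to a plan.
Adaptor = Callable[[str], ExecutionPlan]

# (language, variant) -> adaptor; several variants may exist per language.
_registry: Dict[Tuple[str, str], Adaptor] = {}


def register(language: str, variant: str = "default"):
    """Decorator that records an adaptor under a language and variant."""
    def decorator(fn: Adaptor) -> Adaptor:
        _registry[(language, variant)] = fn
        return fn
    return decorator


@register("toy")
def sequential_adaptor(definition: str) -> ExecutionPlan:
    # One plan step per non-empty line of the definition, in order.
    lines = [line.strip() for line in definition.splitlines() if line.strip()]
    return ExecutionPlan([f"run {line}" for line in lines])


@register("toy", variant="traced")
def traced_adaptor(definition: str) -> ExecutionPlan:
    # Same language, different semantics: add a reporting step after each job.
    steps: List[str] = []
    for step in sequential_adaptor(definition).steps:
        steps += [step, f"report {step}"]
    return ExecutionPlan(steps)


def translate(language: str, definition: str, variant: str = "default") -> ExecutionPlan:
    """Look up the adaptor for a language and variant and apply it."""
    return _registry[(language, variant)](definition)


print(translate("toy", "extract\nindex").steps)
print(translate("toy", "extract", variant="traced").steps)
```

Keeping the variant as part of the lookup key mirrors the idea that one language can be served by several adaptors that produce different Execution Plans.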

Architecture

The Workflow Engine comprises a number of adaptors. The following list of adaptors is currently provided:

WorkflowJDLAdaptor
This adaptor parses a Job Description Language (JDL) definition block and translates the described job or DAG of jobs into an Execution Plan which can be submitted to the ExecutionEngine for execution. The plan is executed directly on the working nodes of the D4Science Infrastructure, in this way exploiting the computational resources of the native infrastructure.
WorkflowGridAdaptor
This adaptor constructs an Execution Plan that can contact a gLite UI node, submit, monitor and retrieve the output of a grid job.
WorkflowCondorAdaptor
This adaptor constructs an Execution Plan that can contact a Condor gateway node, submit, monitor and retrieve the output of a Condor job.
WorkflowHadoopAdaptor
This adaptor constructs an Execution Plan that can contact a Hadoop UI node, submit, monitor and retrieve the output of a MapReduce job.
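
The Grid, Condor and Hadoop adaptors listed above all produce plans that follow the same submit, monitor and retrieve pattern. The sketch below shows that pattern in isolation. It is a simplified illustration under stated assumptions: the `gateway_plan` function, the state names `DONE` and `FAILED` and the in-memory `FakeGateway` are invented for the example and do not correspond to any real gateway API.

```python
import time
from typing import Callable, Iterable


def gateway_plan(submit: Callable[[], str],
                 status: Callable[[str], str],
                 retrieve: Callable[[str], bytes],
                 poll_seconds: float = 0.0,
                 max_polls: int = 100) -> bytes:
    """Submit a job, poll until it finishes, then fetch its output."""
    job_id = submit()
    for _ in range(max_polls):
        state = status(job_id)
        if state == "DONE":
            return retrieve(job_id)
        if state == "FAILED":
            raise RuntimeError(f"job {job_id} failed")
        time.sleep(poll_seconds)  # wait before asking the gateway again
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


class FakeGateway:
    """In-memory stand-in for a remote gateway node (hypothetical)."""

    def __init__(self, states: Iterable[str]):
        self._states = iter(states)

    def submit(self) -> str:
        return "job-1"

    def status(self, job_id: str) -> str:
        return next(self._states)

    def retrieve(self, job_id: str) -> bytes:
        return b"output of " + job_id.encode()


gw = FakeGateway(["QUEUED", "RUNNING", "DONE"])
result = gateway_plan(gw.submit, gw.status, gw.retrieve)
print(result)
```

Passing the three operations as callables keeps the polling loop independent of the target infrastructure, which is the property the three adaptors share.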

All the aforementioned adaptors are exploited by the WorkflowLayer to provide a high-level interface that enables the description and handling of the execution of workflows of jobs over heterogeneous processing infrastructures. An extension of JDL, namely gJDL, is used for the description of workflows.
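
One way to picture a layer that spans heterogeneous infrastructures is a routing table from target infrastructure to adaptor. The sketch below is purely illustrative: the job dictionaries, the `target` field, the `plan_workflow` helper and the string "plans" are assumptions and do not reflect the gJDL syntax or the actual WorkflowLayer interface.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins: each adaptor turns a job name into a plan description.
ADAPTORS: Dict[str, Callable[[str], str]] = {
    "native": lambda job: f"WorkflowJDLAdaptor plan for {job}",
    "grid": lambda job: f"WorkflowGridAdaptor plan for {job}",
    "condor": lambda job: f"WorkflowCondorAdaptor plan for {job}",
    "hadoop": lambda job: f"WorkflowHadoopAdaptor plan for {job}",
}


def plan_workflow(jobs: List[dict]) -> List[str]:
    """Route every job of a workflow to the adaptor of its target infrastructure."""
    plans = []
    for job in jobs:
        target = job["target"]
        if target not in ADAPTORS:
            raise ValueError(f"no adaptor for infrastructure {target!r}")
        plans.append(ADAPTORS[target](job["name"]))
    return plans


workflow = [
    {"name": "text-extraction", "target": "native"},
    {"name": "indexing", "target": "hadoop"},
]
for plan in plan_workflow(workflow):
    print(plan)
```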

In addition to the above Adaptors, which are all contactable through the Web Service interface of the Workflow Engine, further adaptors have been implemented to satisfy the special-purpose requirements of certain applications. These adaptors are contactable only through the client service that employs them, and are the following:

WorkflowSearchAdaptor
This adaptor is used by the planner of the gCube Search System in order to translate abstract search plans to concrete Execution Plans, which materialize search tasks executed on the native infrastructure.
WorkflowDTSAdaptor
This adaptor is defined and used by the Data Transformation Service and translates abstract data transformation paths to concrete Execution Plans which materialize transformation tasks and are executed on the native infrastructure.
WorkflowHiveQLAdaptor
This adaptor parses a Hive Query Language (HiveQL) query and translates the described MapReduce job into possibly multiple Execution Plans which can be submitted to the ExecutionEngine for execution, in this way exploiting the computational resources of the native infrastructure.

Deployment

The deployment of the Workflow Engine is tightly coupled with that of the Execution Engine. Specifically, the Execution Engine should be deployed on every node on which the Workflow Engine is deployed, regardless of whether the execution takes place locally or on one or more remote nodes. An important point is that, for the Workflow Engine to be discoverable in the infrastructure, it should be deployed as a gCube Web Service. Special-purpose adaptors can either be packaged in their own libraries or become part of the core Workflow Engine library (as of now, the former is the preferred method). In either case, only the packaging unit containing the adaptor should be deployed on the nodes where the adaptor is used, unless processing of external workflows is also desirable or the application itself mandates service deployment.

Large deployment

When the rate of workflow processing tasks to be serviced is expected to be high, more than one Workflow Engine instance, each with its Web Service interface, should be deployed in the infrastructure.

Workflow Engine large deployment
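
How clients spread their requests over several Workflow Engine instances is not specified here. One simple possibility, assumed purely for illustration, is round-robin selection over the endpoints discovered in the infrastructure. The `round_robin` helper and the endpoint URLs below are made up for the example.

```python
from itertools import cycle
from typing import Iterable, Iterator


def round_robin(endpoints: Iterable[str]) -> Iterator[str]:
    """Return an endless iterator that visits the endpoints in turn."""
    endpoints = list(endpoints)
    if not endpoints:
        raise ValueError("at least one Workflow Engine endpoint is required")
    return cycle(endpoints)


# Hypothetical endpoints of two deployed Workflow Engine instances.
picker = round_robin([
    "http://node-a.example.org/workflow",
    "http://node-b.example.org/workflow",
])
first_three = [next(picker) for _ in range(3)]
print(first_three)
```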

Small deployment

In infrastructures with low requirements for workflow processing, a single Workflow Engine instance is sufficient. One can even avoid deploying Workflow Engine instances entirely if the needs for workflow processing are limited to special-purpose applications that employ their own locally deployed libraries.

Workflow Engine small deployment

Use Cases

Well suited Use Cases

  • Execution of JDL based workflows in the D4Science and gLite Infrastructures.
    • The Workflow Engine has been successfully used to execute a number of such jobs, such as OCRing, Reference Extraction and Text Extraction.
  • Execution of arbitrary workflows in the D4Science Infrastructure.
    • Through the corresponding adaptors, the Workflow Engine has been successfully used by the gCube Search System and the gCube Data Transformation Service as the means of translating their workflows to parallel distributed Execution Plans.
  • Execution of workflows targeting the Hadoop infrastructure.
    • An example of such an application is Bibliometric Indexing.

Less well suited Use Cases

The Workflow Engine does not currently offer advanced features for automatic task parallelization, so this logic has to be integrated directly into the implementations of Adaptors. This makes the implementation of such Adaptors a non-trivial task and also means that the logic has to be implemented separately for each case. Although this has not posed a serious problem up to now, due to the nature of the tasks for which Adaptors have been employed, this kind of functionality will be included in the next major version of the Engine, becoming one of its core parts alongside the set of Adaptors.
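
To give an idea of the parallelization logic an Adaptor currently has to supply itself, the sketch below groups the jobs of a dependency graph into stages whose members can run in parallel. It is a generic technique built on the Python standard library `graphlib` module (Python 3.9 or later), not the Engine's own mechanism; the `parallel_stages` helper and the job names are assumptions for the example.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def parallel_stages(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group jobs into stages; jobs in one stage do not depend on each other.

    `dependencies` maps each job to the set of jobs it must wait for.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages: List[List[str]] = []
    while sorter.is_active():
        # Every job whose prerequisites are all done can start now.
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = parallel_stages({
    "ocr_page_1": set(),
    "ocr_page_2": set(),
    "merge_text": {"ocr_page_1", "ocr_page_2"},
    "extract_references": {"merge_text"},
})
for number, stage in enumerate(stages, start=1):
    print(number, stage)
```

In this example the two OCR jobs end up in the same stage, while the merge and the reference extraction each wait for the stage before them.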