1. Introduction to Flow 2.0

1.1 The Origin of Flow 2.0

Azkaban currently supports both Flow 1.0 and Flow 2.0, but the official documentation recommends Flow 2.0, since Flow 1.0 will be removed in a future release. The main design idea of Flow 2.0 is to provide a flow-level definition that was not available in 1.0: all of the job and properties files belonging to a given flow can be merged into a single flow definition file whose contents are written in YAML. You can also define a flow inside another flow, called an embedded flow or subflow.

1.2 Basic Structure

The project ZIP contains one or more flow YAML files, a project YAML file, and optionally libraries and source code. The basic structure of a flow YAML file is as follows:

  • Each flow is defined in a single YAML file;
  • The flow file is named after the flow, e.g. my-flow-name.flow;
  • It contains all nodes in the DAG;
  • Each node can be either a job or a flow;
  • Each node can have a name, type, config, dependsOn and nodes section;
  • Node dependencies are specified by listing the parent nodes in the dependsOn list;
  • It contains other flow-level configuration;
  • All of the flow's common properties from the current properties files are migrated into the config section of the corresponding flow YAML file.

A complete configuration example is provided as follows:

config:
  user.to.proxy: azktest
  param.hadoopOutData: /tmp/wordcounthadoopout
  param.inData: /tmp/wordcountpigin
  param.outData: /tmp/wordcountpigout

# This section defines the list of jobs
# A node can be a job or a flow
# In this example, all nodes are jobs
nodes:
 # Job definition
 # The job definition is like a YAMLified version of properties file
 # with one major difference. All custom properties are now clubbed together
 # in a config section in the definition.
 # The first line describes the name of the job
 - name: AZTest
   type: noop
   # The dependsOn section contains the list of parent nodes the current
   # node depends on
   dependsOn:
     - hadoopWC1
     - NoOpTest1
     - hive2
     - java1
     - jobCommand2

 - name: pigWordCount1
   type: pig
   # The config section contains custom arguments or parameters which are
   # required by the job
   config:
     pig.script: src/main/pig/wordCountText.pig

 - name: hadoopWC1
   type: hadoopJava
   dependsOn:
     - pigWordCount1
   config:
      classpath: ./*
     force.output.overwrite: true
     input.path: ${param.inData}
     job.class: com.linkedin.wordcount.WordCount
     main.args: ${param.inData} ${param.hadoopOutData}
     output.path: ${param.hadoopOutData}

 - name: hive1
   type: hive
   config:
     hive.script: src/main/hive/showdb.q

 - name: NoOpTest1
   type: noop

 - name: hive2
   type: hive
   dependsOn:
     - hive1
   config:
     hive.script: src/main/hive/showTables.sql

 - name: java1
   type: javaprocess
   config:
     Xms: 96M
     java.class: com.linkedin.foo.HelloJavaProcessJob

 - name: jobCommand1
   type: command
   config:
     command: echo "hello world from job_command_1"

 - name: jobCommand2
   type: command
   dependsOn:
     - jobCommand1
   config:
     command: echo "hello world from job_command_2"

2. YAML Syntax

To configure workflows with Flow 2.0, you first need to understand YAML. YAML is a concise, non-markup language with strict formatting requirements; if the file is not formatted correctly, Azkaban will throw a parsing exception when you upload it.

2.1 Basic Rules

  1. YAML is case sensitive;
  2. Indentation indicates hierarchy;
  3. The amount of indentation does not matter, as long as elements at the same level are aligned;
  4. Comments start with #;
  5. Strings do not need quotation marks by default, but both single and double quotes may be used; escape sequences are interpreted inside double quotes and kept as literal text inside single quotes;
  6. YAML supports several scalar types, including integers, floating-point numbers, strings, null, dates, booleans and times (see the sketch below).
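A minimal sketch illustrating these rules; the keys and values are made up purely for illustration:

# Comments start with #; indentation defines the hierarchy
job:
  retries: 3                # integer
  timeout: 1.5              # floating-point number
  owner: "azktest"          # string (quotes are optional)
  enabled: true             # boolean
  start.date: 2018-01-01    # date
  notes: null               # null value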

2.2 Objects

# There must be a space between the colon and the value
key: value

2.3 Maps

# Way 1: all key-value pairs with the same indentation belong to the same map
key:
    key1: value1
    key2: value2

# Way 2: inline flow style
key: {key1: value1, key2: value2}

2.4 Arrays

# Way 1: a dash followed by a space marks each array item
- a
- b
- c

# Way 2: inline flow style
[a, b, c]

2.5 Single and Double Quotation Marks

Both single and double quotation marks are supported. Inside single quotes, escape sequences such as \n are kept as literal text; inside double quotes they are interpreted as special characters:

s1: 'Content \n string'
s2: "Content \n string"

After parsing (shown as a JavaScript object):
{ s1: 'Content \\n string', s2: 'Content \n string' }

2.6 Special Symbols

A single YAML file can contain multiple documents, separated by a line containing ---, as in the sketch below.
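A hypothetical example, just to show the separator:

# First document
name: document-one
---
# Second document
name: document-two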

2.7 Configuring References

Flow 2.0 recommends defining common parameters in the flow-level config section and referencing them with ${}.
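For example, in the complete configuration shown in section 1.2, param.inData is declared once in the flow-level config section and then referenced from a job via ${param.inData}:

config:
  param.inData: /tmp/wordcountpigin

nodes:
  - name: hadoopWC1
    type: hadoopJava
    config:
      # ${param.inData} is resolved to the flow-level value at run time
      input.path: ${param.inData}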

3. Simple Task Scheduling

3.1 Task Configuration

Create a flow configuration file:

nodes:
  - name: jobA
    type: command
    config:
      command: echo "Hello Azkaban Flow 2.0."

The current version of Azkaban supports both Flow 1.0 and Flow 2.0. To run a project as Flow 2.0, you also need to create a project file that declares the flow version you are using:

azkaban-flow-version: 2.0

3.2 Packaging and Uploading
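Package the .flow file and the .project file together into a ZIP archive and upload it to Azkaban through the Web UI, exactly as in Flow 1.0.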

3.3 Execution Results

Since the Web UI was already covered for version 1.0, it is not repeated here. Only the configuration differs between 1.0 and 2.0; packaging, uploading and execution all work the same way.

4. Multi-Task Scheduling

As in the example given for 1.0, assume we have five tasks (jobA to jobE): jobD must run after jobA, jobB and jobC have finished, and jobE must run after jobD. The configuration file is shown below. Note that in 1.0 we had to define five configuration files, whereas in 2.0 we need only one.

nodes:
  - name: jobE
    type: command
    config:
      command: echo "This is job E"
    # jobE depends on jobD
    dependsOn: 
      - jobD
    
  - name: jobD
    type: command
    config:
      command: echo "This is job D"
    # jobD depends on jobA, jobB, jobC
    dependsOn:
      - jobA
      - jobB
      - jobC

  - name: jobA
    type: command
    config:
      command: echo "This is job A"

  - name: jobB
    type: command
    config:
      command: echo "This is job B"

  - name: jobC
    type: command
    config:
      command: echo "This is job C"

5. Embedded Flows

Flow 2.0 allows you to define a flow within a flow, called an embedded flow or subflow. Below is an example of an embedded flow, with the following configuration:

nodes:
  - name: jobC
    type: command
    config:
      command: echo "This is job C"
    dependsOn:
      - embedded_flow

  - name: embedded_flow
    type: flow
    config:
      prop: value
    nodes:
      - name: jobB
        type: command
        config:
          command: echo "This is job B"
        dependsOn:
          - jobA

      - name: jobA
        type: command
        config:
          command: echo "This is job A"

In the resulting DAG, jobA and jobB are grouped inside the embedded_flow node, and jobC runs only after the entire subflow has completed.


For more articles in this series, see the GitHub open-source project: Getting Started with Big Data.