• Airflow: A Workflow Management Platform
  • Maxime Beauchemin


Airbnb is a fast-growing, data-informed company. Our data teams and data volumes are growing quickly, and so is the complexity of the challenges we face. Our growing team of data engineers, data scientists, and analysts uses Airflow, a platform we built in-house that lets us move fast and keep our momentum as we author, monitor, and retrofit data pipelines.

Today we are proud to announce that we are open sourcing and sharing our workflow management platform: Airflow.

Github.com/airbnb/airf…


Directed acyclic graphs (DAGs) are blossoming

As the people who work with data begin to automate their processes, writing batch jobs becomes inevitable. These jobs have to run on a schedule, they typically rely on existing sets of data, and other jobs depend on them in turn. Even with only a handful of data workers collaborating for a short period of time, the graph of computation in a batch job quickly grows complex. Now, if you have a mid-sized data team moving fast for a few years against a constantly improving data infrastructure, you end up with a very complex network of computations on your hands. This complexity can become a significant burden for the data team to manage, or even just to understand.

These job networks are usually directed acyclic graphs (DAGs), which have the following properties:

  • Scheduled: Each job should run at scheduled intervals
  • Mission critical: If some jobs aren’t running, we have a problem
  • Evolving: as the company and the data team mature, so does the data processing
  • Heterogeneity: The modern analytics stack is changing rapidly, and most companies run several systems that need to be glued together

Every company has one (or more)

Workflow management has become such a common need that most companies have multiple ways of creating and scheduling jobs internally. There is always the good old cron scheduler to get started with, and many vendor packages ship with scheduling capabilities. The next step is writing scripts that call other scripts, which works for a little while. Eventually, simple frameworks emerge to solve problems such as storing job state and dependencies.

Typically, these solutions grow reactively in response to expanding job-scheduling needs, often because the existing iteration of the system cannot be extended easily. Note also that the people writing data pipelines are usually not software engineers: their mission and competencies revolve around processing and analyzing data, not around building workflow management systems.

Given that internally grown workflow management systems always run at least a generation behind the company's needs, the friction around authoring, scheduling, and troubleshooting jobs creates large amounts of inefficiency and frustration that derail data workers on their path to higher productivity.

Airflow

After reviewing open source solutions and listening to Airbnb employees’ insights into the systems they used in the past, we came to the conclusion that there was nothing on the market that could meet our current and future needs. We decided to build a completely new system to solve this problem correctly. As the project progressed, we realized we had a great opportunity to give back to the open source community on which we also depend so heavily. Therefore, we decided to open source the project under the Apache license.

Here are some of the processes driven by Airflow at Airbnb:

  • Data warehousing: cleansing, organizing, data-quality checking, and publishing data into our growing data warehouse
  • Growth analytics: computing metrics around guest and host engagement as well as growth accounting
  • Experimentation: computing the logic and aggregates for our A/B testing experimentation framework
  • Email targeting: applying rules to target and engage users through email campaigns
  • Sessionization: computing clickstream and time-spent datasets
  • Search: computing search-ranking-related metrics
  • Data infrastructure maintenance: database scrapes, folder cleanup, applying data retention policies, …

Architecture

Just as English is the language of business, Python has firmly established itself as the language of data work. Airflow has been written in Python from the start. The code base is extensible, documented, consistent, linted, and has strong unit test coverage.

Pipelines are also written in Python, which means that dynamic pipeline generation from configuration files or any other source of metadata comes naturally. "Configuration as code" is the principle we stick to for this purpose. While a YAML- or JSON-based job configuration would have allowed us to generate Airflow pipelines from any language, we felt that some fluidity gets lost in the translation. Being able to introspect code (IPython!, IDEs), subclass, metaprogram, and pull in imported libraries to help write pipelines adds tremendous value to Airflow. Note that you can still author jobs in any programming or markup language, as long as you write the Python code that interprets that configuration.
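To make the "configuration as code" idea concrete, here is a minimal sketch of dynamic pipeline generation, assuming the Airflow 2.x import paths (older releases use different module names); the DAG id, table list, and commands are purely illustrative:

```python
# Minimal sketch: one task per entry in a plain Python list.
# The list could just as easily come from a config file, a database,
# or any imported module -- that is the point of configuration as code.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

TABLES = ["bookings", "listings", "reviews"]  # hypothetical table names

with DAG(
    dag_id="example_dynamic_pipeline",
    start_date=datetime(2015, 6, 1),
    schedule_interval="@daily",
) as dag:
    for table in TABLES:
        # Each iteration creates a node in the DAG programmatically.
        BashOperator(
            task_id=f"load_{table}",
            bash_command=f"echo loading {table}",
        )
```

Generating tasks in a loop like this is exactly the kind of metaprogramming that a static YAML or JSON configuration would make awkward.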

While you can get Airflow up and running with just a few commands, the complete architecture has the following components:

  • Job definitions, included in source control.
  • A rich command-line interface (CLI) to test, run, backfill, describe, and clear parts of your directed acyclic graphs.
  • A web application to browse the definitions, dependencies, progress, metadata, and logs of your directed acyclic graphs. The web server is packaged with Airflow and built on the Python web framework Flask.
  • A metadata repository, typically a MySQL or Postgres database, that Airflow uses to record task status and other persistent information.
  • A set of worker nodes that run the tasks of your jobs in a distributed fashion.
  • A scheduler that fires up task instances that are ready to run.

Extensibility

Airflow ships with ways to interact with common systems such as Hive, Presto, MySQL, HDFS, Postgres, and S3, and lets you trigger arbitrary scripts. The base modules are designed to be extended very easily.

Hooks are defined as abstractions over external systems and share a homogeneous interface. Hooks use a centralized vault database to abstract host/port/login/password information, and they expose methods to interact with these systems.
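As a rough illustration of what calling a hook from pipeline code looks like, here is a sketch using the Postgres hook from the Airflow 2.x provider layout (the connection id "my_postgres" and the query are hypothetical; the connection would be registered in Airflow's connection vault beforehand):

```python
# Sketch: a hook hides host/port/credentials behind a named connection.
# Requires the apache-airflow-providers-postgres package on Airflow 2.x.
from airflow.providers.postgres.hooks.postgres import PostgresHook

hook = PostgresHook(postgres_conn_id="my_postgres")  # hypothetical connection id
# get_records() runs a query and returns the rows, without the pipeline
# author ever touching connection details directly.
rows = hook.get_records("SELECT COUNT(*) FROM bookings")
```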

Operators use hooks to generate certain tasks that become nodes in the workflow when they are instantiated. All operators derive from BaseOperator and inherit a rich set of attributes and methods; a minimal sketch of such a subclass follows this list. There are three main kinds of operators:

  • Operators that perform an action, or tell another system to perform an action
  • Transfer operators that move data from one system to another
  • Sensors, a special kind of operator that keeps running until a certain condition is met
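For a concrete sense of the BaseOperator contract (subclass it and implement execute()), here is a minimal, hypothetical action-style operator; the class and field names are illustrative, not part of Airflow:

```python
# Sketch: the smallest useful custom operator.
from airflow.models import BaseOperator


class PrintMessageOperator(BaseOperator):
    """Hypothetical operator that logs a message when its task runs."""

    def __init__(self, message: str, **kwargs):
        super().__init__(**kwargs)
        self.message = message

    def execute(self, context):
        # execute() is what a worker calls when the task instance runs.
        self.log.info("Message: %s", self.message)
```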

Executors implement an interface that allows Airflow components (the command-line interface, the scheduler, and the web server) to run jobs remotely. Currently, Airflow ships with a SequentialExecutor (used for testing), a multithreaded LocalExecutor, and a CeleryExecutor that leverages Celery, an excellent asynchronous task queue based on distributed message passing. We also plan to open source a YarnExecutor in the near future.

A gorgeous user interface

While Airflow exposes a rich command-line interface, the best way to monitor and interact with workflows is through the web user interface. You can easily visualize your pipelines' dependencies, see their progress, access their logs, view the related code, trigger tasks, fix false positives/negatives, analyze where time is spent in a task, and get a full view of when each task typically finishes for the day. The UI also provides administrative functions: managing connections, pools, and pausing progress on specific directed acyclic graphs.

The icing on the cake is the Data Profiling section of the UI, which lets users run SQL queries against the registered connections, browse result sets, and create and share simple charts. The charting application is a mashup of Highcharts, the Flask Admin CRUD interface, and Airflow's hooks and macros library. URL parameters are passed through to the SQL backing your chart, and Airflow macros are available via Jinja templating. With these features and the query capability, Airflow users can easily create and share result sets and charts.
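The same Jinja templating and macros are available to pipeline authors in task definitions. As a hedged sketch (Airflow 2.x imports, illustrative names, not the charting UI itself), a templated field looks like this:

```python
# Sketch: Airflow renders Jinja macros such as {{ ds }} (the logical date)
# in templated fields before the task runs.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_templating",
    start_date=datetime(2015, 6, 1),
    schedule_interval="@daily",
) as dag:
    daily_report = BashOperator(
        task_id="daily_report",
        bash_command="echo generating report for {{ ds }}",
    )
```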

A catalyst

Airbnb employees' productivity and enthusiasm around data work has multiplied since we started using Airflow. Pipelines get authored faster, and the time spent monitoring and troubleshooting errors has dropped significantly. More importantly, the platform lets people work at a higher level of abstraction and create reusable building blocks, computation frameworks, and services.

Enough said!

We've made trying out Airflow extremely easy with an instructive tutorial. The first example results are only a few shell commands away. Take a look at the Quick Start and Tutorial sections of the Airflow documentation, and you should have your own Airflow web application, loaded with interactive examples, up and running within minutes!

Github.com/airbnb/airf…

Check out all of our open source projects at airbnb.io and follow us on Twitter: @AirbnbEng + @AirbnbData
