• Why Robinhood uses Airflow
  • Vineet Goel
  • Translated by the Nuggets Translation Project (translator: cf020031308, proofreader: Yqian1991)

Robinhood relies on scheduled batch jobs for a lot of tasks. These jobs range from data analytics and metrics aggregation to brokerage operations such as dividend payments. We initially used cron to schedule them, but as they grew in number and complexity, this became increasingly challenging:

  • Dependency management is hard. With cron, we schedule downstream jobs using the worst-case expected duration of their upstream jobs. This gets harder and harder as the jobs grow in complexity and their dependency graphs grow in size.
  • Failure handling and alerting have to be managed by each job. With dependencies involved, an engineer gets paged whenever a job cannot handle retries and upstream failures on its own.
  • Retrospection is hard. We have to dig through logs or alerts to figure out how a job performed on past days.

To meet our scheduling needs, we decided to scrap cron and replace it with something that solved the problems above. We explored a number of open-source alternatives, including Pinball, Azkaban, and Luigi, before settling on Airflow.

Pinball

Developed at Pinterest, Pinball has many of the qualities we wanted in a distributed, horizontally scalable workflow management and scheduling system. It addresses many of the issues mentioned above, but its documentation is minimal and its community relatively small.

Azkaban

Developed at LinkedIn, Azkaban is probably the oldest of the alternatives we considered. It uses property files to define workflows, whereas most newer alternatives use code. This makes it harder to define complex workflows.

Luigi

Developed at Spotify, Luigi has an active community and was probably the closest to Airflow in our survey. It uses Python to define workflows and comes with a simple UI. However, Luigi does not have a scheduler, so users still have to rely on cron to schedule jobs.

Hello, Airflow!

Created by Airbnb and with a growing community, Airflow seemed the best fit for our purposes. It is a horizontally scalable distributed workflow management system that allows us to specify complex workflows using Python code.

Dependency management

Airflow uses operators as its basic unit of abstraction for defining tasks, and workflows are defined as DAGs (directed acyclic graphs) of operators. Operators are extensible, which makes it easy to customize workflows. Operators fall into three categories:

  • Action operators perform some action, such as executing a Python function or submitting a Spark job.
  • Transfer operators move data between systems, such as from Hive to MySQL or from S3 to Hive.
  • Sensor operators trigger downstream tasks in the dependency graph once a certain condition is met, such as a file becoming available on S3 before it is used downstream. Sensors are a powerful feature of Airflow, allowing us to create complex workflows and easily manage their preconditions.

Here is an example of how the different types of operators can be used in a typical ETL (extract, transform, load) workflow. The example uses a sensor operator to wait for data to become available and a transfer operator to move the data to the desired location. An action operator then performs the transform phase, and another transfer operator loads the results. Finally, a sensor operator verifies that the results have been stored correctly.

sensor -> transfer (extract) -> action (transform) -> transfer (load) -> sensor (verify)

An ETL workflow using the different types of Airflow operators
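
Below is a minimal sketch of what such a DAG could look like in Airflow 1.x. It is illustrative only: the bucket, keys, and callables are assumptions, and for brevity the extract/transform/load steps are plain PythonOperators rather than dedicated transfer operators.

```python
# Illustrative sketch of the ETL pattern above (not Robinhood's actual DAG).
# Assumes an S3 connection is configured; bucket, keys and callables are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.sensors import S3KeySensor
from airflow.operators.python_operator import PythonOperator

dag = DAG('etl_example',
          start_date=datetime(2017, 1, 1),
          schedule_interval='@daily')

# Sensor: wait until the day's raw file has landed on S3.
wait_for_raw = S3KeySensor(
    task_id='wait_for_raw_data',
    bucket_name='example-bucket',            # hypothetical bucket
    bucket_key='raw/{{ ds }}/events.csv',
    dag=dag)

def extract(**context):    # transfer step: pull raw data into a staging store
    pass

def transform(**context):  # action step: aggregate / clean the staged data
    pass

def load(**context):       # transfer step: push the results to the warehouse
    pass

extract_task = PythonOperator(task_id='extract', python_callable=extract,
                              provide_context=True, dag=dag)
transform_task = PythonOperator(task_id='transform', python_callable=transform,
                                provide_context=True, dag=dag)
load_task = PythonOperator(task_id='load', python_callable=load,
                           provide_context=True, dag=dag)

# Sensor: verify the results actually arrived before marking the run successful.
verify = S3KeySensor(
    task_id='verify_results',
    bucket_name='example-bucket',
    bucket_key='results/{{ ds }}/_SUCCESS',
    dag=dag)

wait_for_raw.set_downstream(extract_task)
extract_task.set_downstream(transform_task)
transform_task.set_downstream(load_task)
load_task.set_downstream(verify)
```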

Troubleshooting and monitoring

Airflow lets us configure retry policies on individual tasks and set up alerts for failures, retries, and tasks that run longer than expected. Airflow has an intuitive UI with some powerful tools for monitoring and managing jobs. It provides a historical view of jobs and tooling to control their state, for example, killing a running job or manually re-running one. A unique feature of Airflow is the ability to create charts from job data. This lets us build custom visualizations to monitor jobs closely and serves as a great debugging aid when troubleshooting job and scheduling issues.
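
As a rough sketch, retries and alerting can be configured through a DAG's default_args and per-task parameters. The addresses, durations, and task below are placeholders, not Robinhood's actual configuration.

```python
# Sketch: retry and alerting knobs via default_args and task-level overrides.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'data-eng',
    'start_date': datetime(2017, 1, 1),
    'retries': 3,                          # retry failed tasks up to 3 times
    'retry_delay': timedelta(minutes=5),   # wait 5 minutes between retries
    'email': ['oncall@example.com'],       # hypothetical alert address
    'email_on_failure': True,
    'email_on_retry': False,
    'sla': timedelta(hours=1),             # flag tasks that run longer than expected
}

dag = DAG('alerting_example', default_args=default_args, schedule_interval='@daily')

report = PythonOperator(
    task_id='daily_report',
    python_callable=lambda: None,          # stand-in for the real work
    retries=5,                             # task-level override of the DAG-wide default
    dag=dag)
```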

Extensibility

Airflow operators are defined as Python classes. This makes it easy to define custom, reusable operators by extending existing ones. We have built a large set of custom operators in-house; some notable examples are OpsGenieOperator, DjangoCommandOperator, and KafkaLagSensor.
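
As an illustration of that extensibility, here is a sketch of a custom operator. It is not Robinhood's OpsGenieOperator; the paging call is a hypothetical placeholder, and the import paths follow Airflow 1.x conventions.

```python
# Sketch of a custom operator built by subclassing BaseOperator.
# Not Robinhood's actual OpsGenieOperator; the paging call is a placeholder.
import logging

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class PagingAlertOperator(BaseOperator):
    """Pages the on-call rotation with a message when executed."""

    @apply_defaults
    def __init__(self, message, team, *args, **kwargs):
        super(PagingAlertOperator, self).__init__(*args, **kwargs)
        self.message = message
        self.team = team

    def execute(self, context):
        # A real operator would call the paging provider's API here,
        # typically through an Airflow hook.
        logging.info('Paging %s: %s', self.team, self.message)
```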

Smarter Cron

Airflow DAGs are defined in Python code, which lets us build schedules far more complex than cron expressions. For example, some of our DAGs run only on trading days. With plain cron, we had to schedule them for every weekday and then handle market holidays inside the application itself.
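
One Airflow-native way to approximate a trading-days-only schedule (not necessarily the approach we use) is to schedule the DAG for weekdays and short-circuit on market holidays; `is_trading_day` below is a hypothetical helper.

```python
# Sketch: run on weekdays via the schedule, skip market holidays via a short circuit.
# is_trading_day() is a hypothetical helper backed by a market holiday calendar.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import ShortCircuitOperator

def is_trading_day(ds, **kwargs):
    # Placeholder: look the execution date up in a market holiday calendar.
    return ds not in {'2017-01-02', '2017-07-04'}

dag = DAG('trading_days_only',
          start_date=datetime(2017, 1, 1),
          schedule_interval='0 22 * * 1-5')  # weekdays; holidays handled below

check_trading_day = ShortCircuitOperator(
    task_id='check_trading_day',
    python_callable=is_trading_day,
    provide_context=True,
    dag=dag)

# Downstream tasks hang off check_trading_day and are skipped on holidays.
```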

We also use Airflow sensors to start jobs immediately after market close, even on days when the market is only open for half a day. For workflows that require this kind of complex scheduling, we use custom operators that dynamically adjust based on the market hours of a given date, as in the example below.

A workflow that schedules dynamically according to market time on a given date
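
Here is a rough sketch of what such a market-hours sensor might look like; `MarketCloseSensor` and `get_market_close()` are hypothetical names, not Robinhood's actual code.

```python
# Sketch of a sensor that waits until the market has closed for the run's date.
# MarketCloseSensor and get_market_close() are hypothetical placeholders.
from datetime import datetime

from airflow.operators.sensors import BaseSensorOperator


def get_market_close(date):
    # Placeholder: return the close time for `date`, accounting for half days.
    return date.replace(hour=21, minute=0)


class MarketCloseSensor(BaseSensorOperator):
    """Succeeds once the market has closed for the task's execution date."""

    def poke(self, context):
        close_time = get_market_close(context['execution_date'])
        return datetime.utcnow() >= close_time
```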

Backfill

We use Airflow for metrics aggregation and batch processing of data. As requirements evolve, we sometimes need to go back and change how certain metrics are aggregated, or add new ones. This requires the ability to backfill data over arbitrary periods in the past. Airflow provides a command-line tool that lets us backfill across any time period with a single command, and backfills can also be triggered from the UI. We use Celery (created by our own Ask Solem) to distribute these tasks across worker boxes. Celery's distribution capabilities let us add more worker boxes when running a backfill, making backfills quick and painless.
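
For reference, a backfill over a historical window can be kicked off with a single CLI command along these lines (the DAG id and dates are placeholders):

```
airflow backfill -s 2017-01-01 -e 2017-01-31 metrics_aggregation
```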

Common pitfalls and weaknesses

We are currently using Airflow 1.7.1.3, which has worked well in production but has its own weaknesses and pitfalls.

  • Time zone issues — Airflow relies on the system time zone (instead of UTC) for scheduling. This requires the entire Airflow setup to run in the same time zone.
  • The scheduler runs scheduled jobs and backfill jobs separately. This can lead to odd results, such as backfills not honoring a DAG's max_active_runs configuration.
  • Airflow is built primarily for batch data processing, so its designers decided to always wait for a schedule interval to complete before starting the run. For a job scheduled to run daily at midnight, the execution date passed in the context is “2016-12-31 00:00:00”, but the run actually starts at “2017-01-01 00:00:00”. This can be confusing, especially for jobs that run on irregular schedules (see the sketch after this list).
  • Unintended backfills — By default, Airflow attempts to backfill missed runs when a DAG is un-paused, or when a new DAG is added with a start_date in the past. While this behavior is expected, there is no way around it, and it can cause problems when a job should not be backfilled. Airflow 1.8 introduces the LatestOnlyOperator to address this issue.
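
To make the execution-date behavior concrete, here is a small sketch (the DAG and task are hypothetical): the run labeled “2016-12-31” only starts once that interval has passed.

```python
# Sketch: a daily DAG whose first run is labeled 2016-12-31 but starts around 2017-01-01.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG('execution_date_example',
          start_date=datetime(2016, 12, 31),
          schedule_interval='@daily')

def show_dates(execution_date, **kwargs):
    # execution_date marks the start of the scheduled interval,
    # while the task itself runs only after that interval has ended.
    print('interval start: %s, actual run time: %s'
          % (execution_date, datetime.utcnow()))

show = PythonOperator(task_id='show_dates',
                      python_callable=show_dates,
                      provide_context=True,
                      dag=dag)
```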

Conclusion

Airflow has quickly grown into an important part of Robinhood's infrastructure. The ability to define DAGs in Python code and the extensible API make Airflow a configurable and powerful tool. Hopefully this post is useful to anyone exploring scheduling and workflow management tools for their own needs. We are happy to answer any questions. And if this kind of work sounds interesting to you, we're hiring!

Thanks to Arpan Shah, Aravind Gottipati, and Jack Randall.
