Abstract: This article is shared by Zhenzhong Xu, a senior software engineer at Netflix. It walks through interesting real-world cases, the challenges posed by the underlying distributed systems and their solutions, the lessons learned in development and operations, and a new vision for an open, self-service real-time data platform and the real-time ETL infrastructure behind it. The content is divided into three parts:

  1. Product background
  2. Product features
  3. Challenges & Solutions

Netflix is committed to bringing joy to its members. We are relentlessly focused on improving the product experience and on producing high-quality content. In recent years we have invested heavily in a technology-driven Studio and in content production, and in the process we have found many unique and interesting challenges in the real-time data platform space. For example, in a microservice architecture, domain objects are spread across different applications and their stateful stores, which makes low-latency, highly consistent real-time reporting and entity search particularly challenging.

Product background

Netflix’s long-term vision is to bring joy and smiles to the world by sourcing high-quality, diverse content from around the globe and sharing it with more than 100 million members. Netflix’s efforts to create that enjoyable experience run in two directions:

  1. On the one hand, insights gained through data integration are fed back into the product to improve the member experience.
  2. On the other hand, a technology-driven Studio helps produce content of higher quality.

As the data platform team, we need to think about how to help the company’s developers and data analysts realize their value, and in doing so contribute to both of these goals.

A brief introduction to the Netflix data platform team and its product, Keystone. Its main job is to instrument all of the company’s microservices: deploy agents, publish events, collect those events, and store them in different data warehouses such as Hive or Elasticsearch, so that users can run computation and analytics on the data in real time as it lands.

  • From the user’s perspective, Keystone is a fully self-service, multi-tenant platform. Users can easily declare and create their own pipelines through the provided UI.
  • From the platform’s perspective, Keystone solves the hard underlying distributed-systems problems, such as container orchestration and workflow management, and keeps them invisible to users.
  • From the product’s perspective, there are two main functions: helping users move data from edge devices into the data warehouse, and providing real-time computation on that data.
  • In terms of numbers, Keystone is essentially indispensable inside Netflix: almost every developer who works with data uses it, so it has thousands of users across the company, and roughly a hundred Kafka clusters support about 10 petabytes of data per day.

The Keystone architecture has two layers. The bottom layer uses Kafka and Flink as the underlying engines and abstracts away all of the hard distributed-systems problems, invisible to users; the whole application is built on top of it. The service layer provides abstracted services, and the UI presented to users is deliberately simple, so they do not need to worry about the underlying implementation.
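
To make the two layers concrete, here is a minimal sketch of the kind of routing job the engine layer runs: a Flink job that consumes events from a Kafka topic and forwards them to a sink. The topic name, broker address, and the print() stand-in for a warehouse sink are illustrative assumptions, not Keystone’s actual generated code.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class RoutingJobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Consumer properties; broker address and group id are placeholders.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka-broker:9092");
        props.setProperty("group.id", "keystone-routing-sketch");

        // Read raw events published by the instrumented microservices.
        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer<>("playback-events", new SimpleStringSchema(), props));

        // In the real platform the sink would be Hive, Iceberg, or Elasticsearch;
        // printing stands in for a warehouse sink in this sketch.
        events.print();

        env.execute("keystone-routing-sketch");
    }
}
```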

The following is a brief look at how the Keystone product has evolved over the past four or five years. The initial motivation was to collect data from all devices and store it in the data warehouse. Kafka was chosen because data movement was, at that point, essentially a straightforward high-concurrency problem.

Later, users developed new requirements for simple processing of data as it moves, such as filters and a very common operation called projection, so Keystone introduced new features for these.
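
Filters and projections of this kind are straightforward to express in the Flink DataStream API that the platform builds on. A minimal sketch follows; the PlayEvent type and its field names are made up for illustration.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;

public class FilterProjectSketch {

    // Simplified event shape; the real events carry many more fields.
    public static class PlayEvent {
        public String userId;
        public String videoId;
        public String eventType;   // e.g. "play", "pause", "seek"
        public long timestampMs;
    }

    // Filter: keep only "play" events. Projection: keep only the fields downstream needs.
    public static DataStream<Tuple2<String, String>> filterAndProject(DataStream<PlayEvent> events) {
        return events
                .filter(e -> "play".equals(e.eventType))
                .map(e -> Tuple2.of(e.userId, e.videoId))
                .returns(Types.TUPLE(Types.STRING, Types.STRING));
    }
}
```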

After some time, users said they wanted to do more complex ETL, such as streaming joins, so the product decided to expose the lower-level APIs to users while still abstracting away the underlying distributed-systems problems, letting users focus on the logic at the top.

Product features

The product features will be introduced through Netflix’s two “superheroes”, Elliot and Charlie. Elliot is a data scientist in the Data Science Engineering organization. He needs to find meaningful patterns in very large datasets to help improve the user experience. Charlie is an application developer in the Studio organization whose goal is to help the developers around him produce higher-quality work by building a series of applications.

The work of these two people is very important for the product. Elliot’s data analysis results can help make better recommendations and personalized customization, and ultimately improve user experience. Charlie’s work helps developers around him become more efficient.

Recommendation & Personalization

As a data scientist, Elliot needs a simple, easy-to-use real-time ETL platform. He does not want to write very complicated code, and he needs the entire pipeline to have low latency. His work and related needs are as follows:

  • Recommendations and personalization. Here the same video can be presented to different users in different forms based on their individual characteristics. Videos are laid out in multiple rows, each row grouped into a category, and the rows shown can change according to personal preference. In addition, each title has artwork, and users in different countries and regions may prefer different artwork, so algorithms are also used to pick and customize the artwork shown to each user.

  • A/B Testing. Netflix gives non-members the chance to watch for free for 28 days, on the assumption that users are more likely to subscribe if they see videos that suit them. However, a full A/B test takes 28 days to complete, and a test can be set up incorrectly. What Elliot cares about is detecting problems early instead of waiting for the 28 days to end.

When someone watches Netflix on a device, the device interacts with a gateway through requests, and the gateway distributes those requests to back-end services. For example, play, pause, fast-forward, and rewind operations on the device are handled by different microservices, so the corresponding data needs to be collected for further processing.

For the Keystone platform team, data generated in different microservices needs to be collected and stored. Elliot needs to integrate different data to address his concerns.

There are four main reasons for using stream processing: real-time reporting, real-time alerting, faster training of machine learning models, and resource efficiency. The latter two matter more to Elliot’s work, and resource efficiency deserves particular emphasis: with the 28-day A/B test above, the current practice is to run a batch job every day over the previous 27 days of data, which involves a great deal of repeated processing. Stream processing can help improve overall efficiency.
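
To illustrate why streaming is more resource-efficient than re-scanning 27 days of data every day, the sketch below keeps a running count per A/B test cell in Flink keyed state, so each event is processed exactly once and the aggregate is always up to date. The String event payload and the cell-id keying are simplifying assumptions.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keeps a running count per A/B test cell instead of re-scanning 27 days of data.
public class RunningCellCount extends KeyedProcessFunction<String, String, Tuple2<String, Long>> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(new ValueStateDescriptor<>("cell-count", Long.class));
    }

    @Override
    public void processElement(String event, Context ctx, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value();
        long updated = (current == null ? 0L : current) + 1;
        count.update(updated);
        // Emit the incrementally updated count; downstream can alert on anomalies early.
        out.collect(Tuple2.of(ctx.getCurrentKey(), updated));
    }
}
```

It would be applied to a stream already keyed by cell id, for example `allocations.keyBy(...).process(new RunningCellCount())`.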

Keystone provides a command-line tool for users, who run commands to perform operations. The tool starts by asking simple questions, such as which repository to use, after which users can start developing. The product also provides a set of simple SDKs, currently supporting Hive, Iceberg, Kafka, and Elasticsearch, among others.

It is worth emphasizing Iceberg, a table format led by Netflix that is planned to replace Hive in the future and provides many features that help users optimize. Keystone offers a simple API that lets users generate sources and sinks directly.
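
The source/sink API itself is internal, so the following is only a hypothetical sketch of what such a self-service facade could look like; the interface and method names are invented for illustration and are not Keystone’s real SDK.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical shape of a self-service source/sink facade; everything here is illustrative only.
public interface PipelineSdkSketch {

    // Resolve a named, pre-registered source (Kafka, Hive, Iceberg, ...) into a DataStream.
    <T> DataStream<T> source(StreamExecutionEnvironment env, String registeredSourceName, Class<T> type);

    // Attach a named, pre-registered sink (Elasticsearch, Iceberg, ...) to a stream.
    <T> void sink(DataStream<T> stream, String registeredSinkName);
}
```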

After Elliot finishes his work, he can commit his code to the repository, and a CI/CD pipeline starts automatically in the background, packaging the source code and artifacts into a Docker image to keep all versions consistent. Elliot only needs to select which version he wants to deploy in the UI and click the Deploy button to push the JAR into production.

In the background the product takes care of the hard underlying distributed-systems problems, such as container orchestration. Orchestration is currently resource-based, with plans to move toward Kubernetes in the future. When a job is deployed, a dedicated JobManager cluster and TaskManager cluster are created for it, so from the user’s point of view every job is completely isolated.

The product provides default configuration options and lets users modify and override them in the platform UI; changes take effect at deployment without rewriting code. For example, during stream processing Elliot may need to read data from different topics, and if something goes wrong he may need to replay from Kafka or from the data warehouse. The requirement here is to switch between different sources without changing code, and the UI the platform provides makes this easy. The platform also helps users choose, at deployment time, how many resources their job needs to run.
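
A minimal sketch of how deploy-time overrides can reach a Flink job without code changes, using Flink’s ParameterTool; the parameter names and defaults are assumptions.

```java
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ConfigurableSourceSketch {
    public static void main(String[] args) throws Exception {
        // Deploy-time overrides arrive as arguments (or a properties file) rather than code changes.
        ParameterTool params = ParameterTool.fromArgs(args);
        String sourceName = params.get("source", "playback-events");  // e.g. a Kafka topic or a warehouse table
        int parallelism = params.getInt("parallelism", 4);

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(parallelism);
        // Make the parameters visible in the Flink UI and to user functions.
        env.getConfig().setGlobalJobParameters(params);

        // A real job would build its source from `sourceName`; printing keeps the sketch runnable.
        env.fromElements("source=" + sourceName).print();

        env.execute("configurable-source-sketch");
    }
}
```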

During the transition from batch processing to stream processing, many users already have artifacts they need, such as schemas, so the platform helps them integrate these artifacts easily.

The platform has many users who need to write ETL projects on top of it, and as the number of users grows, the scalability of the platform becomes more and more important. To address this, the platform adopts a series of patterns; three are mainly in use: the Extractor pattern, the Join pattern, and the Enrichment pattern.
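
As an illustration of the Join/Enrichment patterns, the sketch below enriches one keyed stream with the latest metadata from another using Flink keyed state; the streams, field names, and the videoId key are assumptions made for illustration.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Enrichment pattern sketch: impressions (videoId) are enriched with the latest
// title metadata seen on a second stream, both keyed by videoId.
public class EnrichWithTitle
        extends KeyedCoProcessFunction<String, String, Tuple2<String, String>, Tuple2<String, String>> {

    private transient ValueState<String> latestTitle;

    @Override
    public void open(Configuration parameters) {
        latestTitle = getRuntimeContext().getState(new ValueStateDescriptor<>("title", String.class));
    }

    // Element from the fact stream: a videoId impression.
    @Override
    public void processElement1(String videoId, Context ctx, Collector<Tuple2<String, String>> out) throws Exception {
        String title = latestTitle.value();
        out.collect(Tuple2.of(videoId, title == null ? "UNKNOWN" : title));
    }

    // Element from the dimension stream: (videoId, title) metadata updates.
    @Override
    public void processElement2(Tuple2<String, String> meta, Context ctx, Collector<Tuple2<String, String>> out) throws Exception {
        latestTitle.update(meta.f1);
    }
}
```

In use, the two streams would be connected and keyed by videoId before applying this function.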

Content Production

First, a brief introduction to content production. It covers forecasting the cost of video production, program planning, making deals, producing videos, post-production, publishing the videos, and financial reporting.

Charlie is in the Studio department and is responsible for developing a series of applications to help support Content Production. Each application is developed and deployed based on a microservice architecture, and each microservice application has its own responsibilities. To take the simplest example, there will be a microservice app for managing movie titles, a microservice app for deals and contracts, and so on.

Faced with so many microservice applications, Charlie’s challenge is that real-time search requires joining data from different places, for example finding the actors in a certain movie. In addition, with data growing every day, it is hard to keep data that is updated in real time consistent. This is essentially a consequence of a distributed microservice architecture: different microservices may use different databases, which adds complexity to guaranteeing data consistency. There are three common solutions to this problem:

  • Dual writes: the developer, knowing the data must land in the primary database, simply writes it a second time to the other database. However, this is not fault-tolerant, and any failure can leave the two stores inconsistent.
  • Change data table: this requires the database to support transactions. Whatever operation is applied to the database, a corresponding change record is written within the same transaction to a separate change table; another process can then query that table and propagate the changes to the other data stores (see the sketch after this list).
  • Distributed transactions: transactions that span multiple data stores, which are complex to implement in a multi-database environment.
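
A minimal sketch of the change data table idea, assuming a relational store accessed over JDBC: the entity update and its change record are committed in one local transaction, so a reader of the change table never misses an update. Table and column names are invented for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

// Change-data-table sketch: the entity update and its change record are written
// in the same local transaction, so a downstream reader of movie_changes never
// misses an update or sees a phantom one.
public class MovieWriteWithChangeLog {

    public void updateTitle(Connection conn, long movieId, String newTitle) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement updateMovie = conn.prepareStatement(
                     "UPDATE movie SET title = ? WHERE id = ?");
             PreparedStatement recordChange = conn.prepareStatement(
                     "INSERT INTO movie_changes (movie_id, op, changed_at) VALUES (?, 'UPDATE', now())")) {

            updateMovie.setString(1, newTitle);
            updateMovie.setLong(2, movieId);
            updateMovie.executeUpdate();

            recordChange.setLong(1, movieId);
            recordChange.executeUpdate();

            conn.commit();   // both rows become visible atomically
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }
}
```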

One of Charlie’s requirements was to copy all movies from the Movie Datastore into a movie search index backed by Elasticsearch. A polling system was used to pull and copy the data, with consistency guaranteed by the change data table approach described above.

The drawback of this scheme is that it only supports pulling data periodically. Moreover, the polling system is tightly coupled to the data source: once the schema of the underlying datastore changes, the polling system has to be modified. The architecture was therefore later improved by introducing an event-driven mechanism: CDC (Change Data Capture) sources read all committed transactions from the database and hand them to downstream jobs for stream processing. To generalize the solution, CDC support was implemented on the source side for the databases commonly used at Netflix, including MySQL, PostgreSQL, and Cassandra, with the events processed by the Keystone pipeline.
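
A sketch of the kind of normalized change event such CDC sources could emit, so that downstream jobs stay independent of the specific database; the field names are assumptions, not the actual Netflix CDC schema.

```java
// A normalized change event that per-database CDC sources could emit, keeping the
// downstream Keystone/Flink job independent of MySQL, PostgreSQL, or Cassandra specifics.
public class ChangeEvent {
    public enum Op { CREATE, UPDATE, DELETE }

    public String table;            // e.g. "movie"
    public String key;              // primary key of the changed row
    public Op op;                   // type of change
    public long commitTimestampMs;  // commit time, used later for ordering
    public String payloadJson;      // row image after the change (null for DELETE)
}
```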

Challenges and solutions

Here are some of the challenges and solutions:

  • Ordering Semantics

For example, if events include CREATE, UPDATE, and DELETE operations, they must be delivered to the consumer in strictly the same order. One solution is to enforce ordering through Kafka; another is to make sure that the captured events come out in the same order as the actual changes read from the database in a distributed setting. With the latter, the captured change events may contain duplicates and arrive out of order, so Flink is used to deduplicate and reorder them.
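
A minimal Flink sketch of the deduplicate-and-reorder step, reusing the ChangeEvent shape sketched earlier: events are buffered per key by commit timestamp and flushed in order once the event-time watermark passes, with duplicates collapsing in the buffer. Timing details such as allowed lateness and state cleanup are omitted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Deduplicate and reorder change events per key using event time.
public class DedupAndReorder extends KeyedProcessFunction<String, ChangeEvent, ChangeEvent> {

    private transient MapState<Long, ChangeEvent> buffer;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("buffer", Long.class, ChangeEvent.class));
    }

    @Override
    public void processElement(ChangeEvent event, Context ctx, Collector<ChangeEvent> out) throws Exception {
        buffer.put(event.commitTimestampMs, event);            // duplicate timestamps collapse here
        ctx.timerService().registerEventTimeTimer(event.commitTimestampMs);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<ChangeEvent> out) throws Exception {
        // Emit everything up to the watermark in commit order.
        List<Long> ready = new ArrayList<>();
        for (Map.Entry<Long, ChangeEvent> entry : buffer.entries()) {
            if (entry.getKey() <= timestamp) {
                ready.add(entry.getKey());
            }
        }
        Collections.sort(ready);
        for (Long ts : ready) {
            out.collect(buffer.get(ts));
            buffer.remove(ts);
        }
    }
}
```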

  • Processing Contracts

In many cases, when writing a stream processing job, you do not know the specific schema in advance. So you need to define a contract on the message, including the wire format, and define schema-related information at different levels, such as the infrastructure level and the platform level. The purpose of the processing contract is to help users combine metadata from different processors and minimize duplicated code.

To take a concrete case: Charlie wants to be notified promptly whenever there is a new deal. By combining different related components, such as a DB connector and a filter, through user-defined contracts, the platform helps realize an open, composable streaming data platform for him.
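
Keystone’s actual contract interfaces are not public, so the following is only a hypothetical sketch of what a processing contract might declare; the names are invented for illustration.

```java
import java.util.Map;

// Hypothetical sketch of a processing contract: each processor declares the schema-level
// metadata it needs and produces, so independently written processors (a DB connector,
// a filter, a notifier, ...) can be composed without each one re-implementing
// wire-format handling. Not Keystone's actual interfaces.
public interface ProcessorContract {

    // Wire format this processor accepts, e.g. "avro" or "json".
    String inputWireFormat();

    // Schema fields the processor requires from upstream (field name -> type).
    Map<String, String> requiredFields();

    // Schema fields the processor adds or passes through for downstream consumers.
    Map<String, String> providedFields();
}
```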

ETL engineering has in the past been aimed mostly at data engineers and data scientists, but experience suggests the whole ETL process of extract, transform, and load has the potential to be used much more widely. The original Keystone was simple and easy to use but offered little flexibility; later the flexibility increased, and the complexity increased with it. Going forward, the team plans to build on the current foundation and deliver an open, collaborative, composable, and configurable ETL platform that helps users solve problems in a very short time.

About the author:

Zhenzhong Xu is a software engineer at Netflix, working on the infrastructure of Netflix’s highly scalable and flexible streaming data platform. He is keen on researching and sharing anything interesting related to the fundamentals of real-time data systems and distributed systems.