Facebook’s infrastructure comprises many geographically distributed data centers that host millions of servers. These servers run a range of systems, from front-end web servers to News Feed aggregators to our messaging and live video applications. In addition to our regular code pushes, we push thousands of configuration changes to our servers every day, so it is quite common for our servers to perform trillions of configuration checks. One example of a configuration check is a web server personalizing the user experience, such as displaying text in English for one person and Portuguese for another.

In this article, we introduce Location-Aware Distribution (LAD), a new peer-to-peer system for distributing configuration changes to millions of servers. We found that LAD is significantly better at delivering large updates than the previous system: it supports files of up to 100 MB, versus 5 MB before, and around 40,000 subscribers per distributor, versus roughly 2,500 before.

As we described in our 2015 SOSP paper, Facebook’s configuration management system (called Configerator) uses the open source distributed synchronization service ZooKeeper to distribute configuration updates.

All data in ZooKeeper is stored in a consolidated data store. Each ZooKeeper node (ZNode) can contain data or other child ZNodes, and every ZNode is addressed by a hierarchical path (such as /root/znode/my-node). Clients reading data from ZooKeeper can set watches on a ZNode; when that ZNode is updated, ZooKeeper notifies those clients so they can download the update.
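
As a minimal illustration of this data model and watch mechanism, the sketch below uses the open source kazoo Python client (shown purely for illustration, not as the client Configerator actually uses); the ensemble address and ZNode path are placeholders.

```python
from kazoo.client import KazooClient

# Placeholder ensemble address; a real deployment would list its ZooKeeper hosts.
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

CONFIG_PATH = "/root/znode/my-node"  # hierarchical ZNode path, as in the text

def on_update(event):
    # ZooKeeper watches are one-shot, so we re-register the watch while
    # downloading the new data.
    data, stat = zk.get(event.path, watch=on_update)
    print(f"{event.path} updated to version {stat.version}: {data!r}")

# The initial read sets the watch; ZooKeeper notifies us when the ZNode changes.
data, stat = zk.get(CONFIG_PATH, watch=on_update)
```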

ZooKeeper’s strong data consistency and strict ordering guarantees are key to our ability to scale and run the system reliably. However, as our infrastructure grew to millions of machines, we found that ZooKeeper became a bottleneck:

  • Strict ordering guarantees: A key feature of ZooKeeper is that it provides strict ordering guarantees, meaning that writes are always processed in order by a single thread. To ensure that reads are not starved, ZooKeeper interleaves reads between the writes. We found that an update to a particular ZNode could trigger a large number of watches, which in turn could trigger a large number of reads from clients. A few such writes would stall subsequent updates.

  • Thundering herds: When a data ZNode is updated, ZooKeeper notifies all interested clients. This can trigger a thundering herd problem, where the flood of incoming client requests overwhelms ZooKeeper.

  • Large updates can saturate NICs: Some of our configuration files are up to 5 MB in size. Given a 10 Gb/s NIC, we found that a box could only serve about 2,000 clients. When such files were updated frequently, it could take 10 seconds for an updated file to propagate to all the clients that needed it (see the back-of-the-envelope check after this list).
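
A quick back-of-the-envelope check, using the numbers quoted above, shows why a single ZooKeeper box tops out around this point:

```python
# Rough sanity check of the NIC saturation numbers above.
nic_bits_per_s = 10e9              # 10 Gb/s NIC
file_bytes = 5e6                   # 5 MB configuration file
clients = 2000                     # clients served by one box

bytes_per_s = nic_bits_per_s / 8   # ~1.25 GB/s at line rate
total_bytes = file_bytes * clients # ~10 GB pushed for one update
seconds = total_bytes / bytes_per_s
print(f"~{seconds:.0f} s to push one 5 MB update to 2,000 clients")  # ~8 s
```

Even at full line rate, pushing one such update costs roughly 8 seconds of NIC time, which is consistent with the roughly 10 seconds observed in practice.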

When we set out to design the new distribution system, we settled on several requirements, including:

  • Support for large updates: increase the supported file size from 5 MB to 100 MB.

  • Standalone data store: ZooKeeper couples its data store with its distribution framework. We wanted to separate the two so that we could size and scale each component independently.

  • Distribution capacity: support millions of clients.

  • Latency: bound the distribution latency to 5 seconds.

  • Configuration files: support 10x the number of configuration files of our previous ZooKeeper-based system.

  • Resilience: cope with thundering herds and surges in the update rate.

LAD consists of two main components that communicate with each other via Thrift RPC:

  • Agent: A daemon that runs on every machine and serves configuration files to any application that needs them.

  • Distributor: This component plays two roles. First, it polls the data store for new updates. Second, it builds a distribution tree for the set of agents interested in each update. (A sketch of these two roles as RPC interfaces follows this list.)
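
The article does not spell out LAD’s actual Thrift interfaces, but a minimal sketch of the two roles and their RPC surface, with hypothetical type and method names, might look like this:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class MetadataUpdate:
    path: str      # configuration path that changed, e.g. "/configs/web/feature_x"
    version: int   # monotonically increasing version for that path
    checksum: str  # lets an agent decide whether it already has the content


class AgentService(ABC):
    """RPCs exposed by every agent (hypothetical names)."""

    @abstractmethod
    def add_peer(self, child: "AgentService") -> None:
        """Control plane: the distributor asks this agent to adopt a new child."""

    @abstractmethod
    def push_metadata(self, update: MetadataUpdate) -> None:
        """Data plane: a parent forwards a metadata update down the tree."""

    @abstractmethod
    def fetch_content(self, path: str, version: int) -> bytes:
        """Data plane: a child asks its parent for the content of an update."""


class DistributorService(ABC):
    """Control-plane RPCs exposed by the distributor (hypothetical names)."""

    @abstractmethod
    def subscribe(self, agent: AgentService, path: str) -> None:
        """An agent subscribes to a configuration path on behalf of local apps."""
```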

The figure above shows how LAD organizes agents into a distribution tree, which is essentially a well-organized peer-to-peer network. As shown in step 1, an agent sends a “subscribe” request to the distributor on behalf of the applications running on its box. The distributor adds the agent to the distribution tree by sending an “add peer” request (step 2). Once the agent has been added to the tree, it starts receiving metadata updates (step 3). These are filtered, and the agent responds with a content request for the updates it cares about (step 4). If the parent already has the content, it sends it to its child immediately; otherwise, it first downloads the content from its own parent (step 5).
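
Building on the hypothetical interfaces sketched above, a toy agent that follows steps 2 through 5 might look like the following; caching and error handling are simplified, and the root agent’s parent is treated like any other peer.

```python
class ToyAgent(AgentService):
    """Simplified agent illustrating steps 2 through 5; not production code."""

    def __init__(self, agent_id: str, parent: "AgentService", interesting: set):
        self.agent_id = agent_id
        self.parent = parent            # node we pull content from
        self.interesting = interesting  # configuration paths local applications care about
        self.children = []              # child agents assigned by the distributor
        self.cache = {}                 # (path, version) -> content bytes

    def add_peer(self, child: "AgentService") -> None:
        # Step 2: the distributor attaches a new child under this agent.
        self.children.append(child)

    def push_metadata(self, update: MetadataUpdate) -> None:
        # Step 3: metadata flows down the tree to every agent in the shard.
        for child in self.children:
            child.push_metadata(update)
        # Step 4: request content only for paths this box actually cares about.
        if update.path in self.interesting:
            self.fetch_content(update.path, update.version)

    def fetch_content(self, path: str, version: int) -> bytes:
        # Step 5: serve from the local cache if the content is already here;
        # otherwise download it from our own parent first.
        key = (path, version)
        if key not in self.cache:
            self.cache[key] = self.parent.fetch_content(path, version)
        return self.cache[key]
```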

By using a tree, LAD ensures that updates are pushed only to interested agents rather than to every machine in the fleet. In addition, parent machines send updates directly to their children, which ensures that no single machine near the root of the tree is overwhelmed.

One of the key lessons we learned from ZooKeeper was to separate metadata updates and publishing from content distribution. The LAD architecture relies on agents continuously receiving metadata updates; if every agent received every metadata update, however, the volume of requests would be too high. We solved this problem with sharding: instead of serving the entire data tree with a single distribution tree, we split it across smaller distribution trees, each serving a portion of the data tree.

In our shard design, each shard is rooted at a specific point in the directory structure and recursively includes all objects below it. Sharding provides a happy medium for subscriptions: it limits the total number of distribution trees while keeping the amount of metadata each agent receives in check.
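
As a rough illustration of this prefix-based sharding (the shard roots and paths below are made up, not Facebook’s real configuration layout):

```python
# Each shard is rooted at a point in the configuration directory structure and
# covers everything beneath it; a subscription maps to the shard whose root is
# the longest matching prefix of the subscribed path.
SHARD_ROOTS = ["/configs/web", "/configs/messaging", "/configs/video"]

def shard_for(path: str) -> str:
    matches = [root for root in SHARD_ROOTS
               if path == root or path.startswith(root + "/")]
    if not matches:
        raise KeyError(f"no shard covers {path}")
    return max(matches, key=len)  # longest matching root wins

# An agent subscribing to "/configs/web/feature_x" joins only the distribution
# tree for the "/configs/web" shard, so it never receives metadata for the
# messaging or video subtrees.
print(shard_for("/configs/web/feature_x"))  # -> /configs/web
```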

The control plane (above left) is a lightweight Thrift RPC channel between the distributor and each subscriber. The distributor uses it to send tree commands, which inform a parent node of its newly assigned children, and to check subscriber liveness. A subscriber also uses the control plane to send subscription requests to the distributor; the distributor maps each request to the shard the subscription belongs to and adds the subscriber to that shard’s distribution tree.

The data flow plane (on the right) consists of the heavier-weight Thrift RPCs between peers in the distribution tree. A parent node uses it to send the metadata updates it receives from the distributor down to its children, and to receive content requests from those children.
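
Tying the two planes back to the hypothetical interfaces sketched earlier, the distributor’s control-plane handling of a subscription might look roughly like the following; the parent-selection policy here is a placeholder (the article does not describe LAD’s actual, location-aware placement), and how the root of each tree pulls content from the distributor is omitted.

```python
class ToyDistributor(DistributorService):
    """Simplified control-plane logic; not LAD's real implementation."""

    def __init__(self, shard_roots):
        # One distribution tree per shard, kept here as a flat member list.
        self.trees = {root: [] for root in shard_roots}

    def subscribe(self, agent: AgentService, path: str) -> None:
        root = shard_for(path)     # prefix lookup from the sharding sketch above
        members = self.trees[root]
        if members:
            # Control plane: ask an existing member to adopt the newcomer as its
            # child (the "add peer" command), then point the newcomer at it.
            parent = members[-1]   # placeholder parent choice, not location-aware
            parent.add_peer(agent)
            agent.parent = parent
        members.append(agent)
```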

With separate control and data flow planes, each distributor can handle approximately 40,000 subscribers; ZooKeeper can only handle about 2,500 subscribers.

We learned some useful lessons while building and deploying LAD. Here are some highlights:

  • Instrumentation and monitoring are critical for production deployments. In our experience, P2P-based systems can be challenging to operate and debug because it is not obvious which nodes lie on the send or receive path of a given request.

  • Failure injection and disaster-readiness testing are critical at scale. In addition to well-documented procedures, we found that working through test cases and mitigation strategies in advance helped us respond effectively when problems arose. Specifically, we ran a series of tests that introduced various types of application, host, network, and cluster-level failures to verify LAD’s resilience without affecting any clients. It was a valuable process, because it exposed bugs and gaps in our programs and tools.

  • Continuous and periodic testing is critical to the long-term reliability of the system. It is not enough to run the tests described above just once, because things change quickly at Facebook and assumptions about a system or tool may not continue to hold. We are automating the testing process so that we can find and address problems in a timely manner.

LAD has now been deployed to production as the data distribution framework for our configuration management system. We are also evaluating other promising large-scale content distribution applications for it.

Reference links:

https://research.fb.com/publications/holistic-configuration-management-at-facebook/

https://code.fb.com/data-infrastructure/location-aware-distribution-configuring-servers-at-scale/

