1 Introduction

In recent years, social networking sites have become enormously popular, which poses a huge challenge to website infrastructure. With hundreds of millions of users using these web services every day, the demands on computing, network, and I/O resources overwhelm traditional web architectures. The infrastructure of a social networking site needs to meet the following requirements: (1) real-time aggregation of content from multiple sources, (2) access to and updating of very popular shared content, and (3) processing millions of user requests per second.

We describe how we improved the open source version of memcached [14] and used it as a building block to construct a distributed key-value store for the world's largest social network. We discuss our evolution from a single cluster of servers to a geographically distributed multi-cluster deployment. To the best of our knowledge, this is the largest memcached installation in the world, handling billions of requests per second and storing trillions of items.

This paper is the latest in a long line of work [1, 2, 5, 6, 12, 14, 34, 36] on the flexibility and utility of distributed key-value stores. It focuses on memcached, an open source implementation of an in-memory hash table that provides low-latency access to shared storage at low cost. These features enable us to build data-intensive features that would otherwise be impractical. For example, a feature whose single page request issues hundreds of database queries would never make it past the prototype stage because it would be too slow and expensive. In our application, however, web pages routinely fetch thousands of key-value pairs from memcached servers.

One of our goals is to highlight important themes that emerge at different deployment scales. While qualities such as performance, efficiency, fault tolerance, and consistency matter at every scale, our experience shows that at particular sizes some qualities require much more effort to achieve than others. For example, maintaining data consistency is easier at small scale, where there is little replication, than at large scale, where replication is often necessary. In addition, the importance of finding an optimal communication schedule grows as the number of servers increases and the network becomes a bottleneck.

This paper makes four main contributions: (1) we describe the evolution of Facebook's memcached-based architecture; (2) we identify memcached improvements that raise performance and increase memory efficiency; (3) we describe mechanisms that improve our ability to operate the system at scale; and (4) we characterize our production workloads.

2 Overview

Two characteristics greatly influenced our design. First, users read an order of magnitude more content than they create; this read-dominated behavior produces a workload for which caching offers a significant advantage. Second, we read data from multiple sources, such as MySQL databases, HDFS installations, and backend services. This diversity requires a flexible caching strategy able to store data from disparate sources.

Memcached provides a simple set of operations (SET, GET, and DELETE) that makes it an attractive basic building block for a large-scale distributed system. In this paper we start with the open source version, which provides a single-machine in-memory hash table, and discuss how we took this basic building block, made it more efficient, and used it to build a distributed key-value store capable of handling billions of requests per second. In the rest of the paper, we use "memcached" to refer to the source code or a running binary and "memcache" to refer to the distributed system built from those instances.


Figure 1: Memcache as a demand-filled look-aside cache. The left half illustrates the read path for a web server on a cache miss; the right half illustrates the write path.

Query cache: We rely on memcache to lighten the read load on our databases. Specifically, we use memcache as a demand-filled look-aside cache, as shown in Figure 1. When a web server needs data, it first requests the value from memcache by providing a string key. If the value is not cached, the web server retrieves it from the database or another backend service and stores the key-value pair back into memcache. For write requests, the web server issues an SQL statement to the database and then sends a delete request to memcache to invalidate the stale cached data. We choose to delete cached data rather than update it because deletes are idempotent.
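The look-aside pattern just described can be captured in a short sketch. This is a minimal illustration under our own naming (the `cache` and `db` objects and their methods are placeholders), not Facebook's actual client code:

```python
# Minimal sketch of the demand-filled look-aside pattern described above.
# `cache` and `db` are placeholder objects, not the real client API.

def read(cache, db, key):
    value = cache.get(key)            # 1. try memcache first
    if value is None:                 # 2. cache miss
        value = db.query(key)         #    fall back to the database / backend service
        cache.set(key, value)         # 3. demand-fill the cache for later readers
    return value

def write(cache, db, key, new_value):
    db.update(key, new_value)         # 1. send the update (SQL) to the database
    cache.delete(key)                 # 2. invalidate rather than update the cache;
                                      #    delete is idempotent, so retries are safe
```

Deleting instead of updating keeps the cache path simple: a repeated or reordered delete cannot leave newer data overwritten by older data.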

Among the many ways to handle heavy read traffic against our MySQL databases, we chose memcache; it was the best option given our limited engineering resources and time. In addition, separating the caching layer from the persistence layer allows us to adjust each layer quickly as workloads change.

Generic cache: We also make memcache available as a more general key-value store. For example, engineers use memcache to store the pre-computed intermediate results of complex machine learning algorithms, which many other applications can then read. It takes very little effort for new services to leverage the existing, heavily used infrastructure instead of tuning, optimizing, provisioning, and maintaining a large server fleet of their own.

As provided, memcached offers no server-to-server coordination; it is simply an in-memory hash table running on a single machine. In the following sections we describe how we built a distributed key-value store on top of memcached capable of operating under Facebook's workload.

Figure 2: Overall architecture

The structure of the paper follows the problems that emerge at three different deployment scales. When we had our first cluster of servers, the read-heavy workload and wide fan-out of requests were our biggest concerns. When it became necessary to scale to multiple front-end clusters, we addressed data replication between those clusters. Finally, we describe mechanisms that allow us to spread clusters around the world while providing a consistent user experience. Fault tolerance and operational complexity are important at every scale. We present key data that guided our final design decisions; for a more detailed analysis, see the work of Atikoglu et al. [8]. Figure 2 provides an overview of the final architecture, in which we organize co-located clusters into a region and designate a master region that provides a data stream to keep non-master regions up to date. Throughout the system's evolution, we placed two major design goals first: (1) Any change must address a user-facing or operational problem; optimizations with limited scope are rarely considered. (2) We treat the probability of reading transient stale data as a parameter to be tuned, similar to responsiveness; we are willing to expose slightly stale data in exchange for insulating the backend storage from excessive load.

3 In a Cluster: Latency and Load

We now consider the challenges posed by thousands of servers within a single cluster. At this scale, we focus on reducing the latency of fetching cached data and on reducing the load placed on the databases when there is a cache miss.

3.1 Reducing latency

Memcache response time is a critical factor in the overall response time of a page, whether or not the data is cached. A single web page request typically results in hundreds of memcache GET requests; one popular page, for example, fetches on average 521 distinct items from memcache.

To reduce the load on databases and other services, we provision clusters of caches, each consisting of hundreds of memcached servers across which items are distributed by hashing their keys. Therefore, a web server must typically contact many memcached servers to satisfy a single user request, and every web server ends up communicating with every memcached server within a short period of time. This all-to-all communication pattern can cause incast congestion or allow a single server to become the bottleneck for many web servers. Data replication can alleviate these problems, but usually at the cost of significant memory inefficiency.

Our efforts to reduce latency focus on the memcache client, which runs on every web server. The client provides a range of functions including serialization, compression, request routing, error handling, and request batching. It maintains a map of all available servers, which is updated through an auxiliary configuration system.

Parallel requests and batching: We structure our web application code to minimize the number of network round trips needed to respond to a page request. We construct a directed acyclic graph (DAG) representing the dependencies between data items; a web server uses this DAG to maximize the number of items that can be fetched concurrently. On average, these batched requests contain 24 keys each.

Client-server communication: Memcached servers do not communicate with each other. Where appropriate, we embed the complexity of the system in a stateless client rather than in the memcached servers. This greatly simplifies memcached and lets us focus on making it highly performant for a more limited set of use cases. Keeping the client stateless enables rapid iteration and simplifies deployment. Client logic is provided as two components: a library that can be embedded into applications, or a standalone proxy named McRouter. The proxy presents a memcached server interface and routes requests and replies to and from other servers.
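As a rough illustration of the batching just described, the sketch below groups the keys that are ready to be fetched at a given DAG level by destination server, so that each server receives one multiget. The hashing and the `multiget` callback are simplified placeholders, not the real client library:

```python
# Sketch: batch keys that can be fetched concurrently, grouped by destination server.
# The hash-based server selection is a stand-in for the real key-to-server mapping.
from collections import defaultdict
from hashlib import md5

SERVERS = ["mc1", "mc2", "mc3"]          # illustrative server names

def server_for(key: str) -> str:
    return SERVERS[int(md5(key.encode()).hexdigest(), 16) % len(SERVERS)]

def batched_fetch(keys, multiget):
    """Group `keys` by server and issue one multiget request per server."""
    by_server = defaultdict(list)
    for k in keys:
        by_server[server_for(k)].append(k)
    results = {}
    for server, ks in by_server.items():
        results.update(multiget(server, ks))   # ideally issued in parallel
    return results
```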

Clients communicate with memcached servers over both UDP and TCP. We rely on UDP to reduce the latency and overhead of GET requests. Because UDP is connectionless, each thread in the web server can communicate directly with memcached servers, bypassing McRouter and avoiding the overhead of establishing and maintaining connections. The UDP implementation detects packets that are dropped or received out of order (using sequence numbers) and treats them as errors on the client side; it provides no mechanism to recover from them. In our infrastructure, we have found this decision to be practical. Under peak load, memcache clients observe that 0.25% of GET requests are discarded. About 80% of these drops are due to late or lost packets, and the rest are due to out-of-order delivery. Clients treat such errors as cache misses, but web servers skip inserting the values into memcached after querying the database, to avoid putting additional load on a possibly overloaded network or server.
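A sketch of that client-side policy follows: dropped or reordered UDP replies are detected via sequence numbers and simply treated as misses, with no retry. The transport call and reply fields are illustrative names, not the real client interface:

```python
# Sketch: treat UDP-level errors (dropped or out-of-order replies) as cache misses.
def udp_get(send_request, key, expected_seq, timeout=0.05):
    """Return the cached value, or None (a miss) on any UDP-level error."""
    try:
        reply = send_request(key, timeout=timeout)   # placeholder transport call
    except TimeoutError:
        return None                                  # dropped / late packet -> miss
    if reply.seq != expected_seq:                    # out-of-order delivery detected
        return None                                  # also a miss; no recovery attempt
    return reply.value

# On such a miss the web server reads from the database but deliberately skips the
# follow-up SET, to avoid loading an already stressed network or server further.
```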

Figure 3: Get latency over UDP and over TCP via McRouter

For reliability, clients perform SET and DELETE operations over TCP through a McRouter instance running on the same machine as the web server. For operations where we need to confirm a state change (updates and deletes), TCP eliminates the need to add a retry mechanism to our UDP implementation.

Web servers rely on a high degree of parallelism and over-subscription to achieve high throughput. Without some form of connection coalescing via McRouter, the memory demands of open TCP connections make it prohibitively expensive to maintain a connection between every web thread and every memcached server. Coalescing these connections improves server efficiency by reducing the network, CPU, and memory resources they require. Figure 3 shows the average, median, and 95th-percentile latencies of web servers in production fetching keys over UDP and over TCP via McRouter. In all cases, the standard deviation of these averages is less than 1%. As the data shows, relying on UDP yields a 20% reduction in latency when serving requests.

Footnotes: The 95th percentile of fetches for that popular page is 1,740 items. The 95th percentile of batched requests is 95 keys per request.

Incast congestion: The memcache client implements flow-control mechanisms to limit incast congestion. When a client requests a large number of keys, the replies can overwhelm components such as rack and cluster switches if they all arrive at once. Clients therefore use a sliding-window mechanism [11] to control the number of outstanding requests. When the client receives a reply, the next request can be sent. Similar to TCP's congestion control, the size of the sliding window grows slowly upon successful requests and shrinks when a request goes unanswered. The window applies to all memcache requests regardless of destination, whereas a TCP window applies only to a single stream.
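The sliding-window flow control can be sketched as follows. The grow/shrink rules mirror the additive-increase, multiplicative-decrease behavior the text compares to TCP; the exact constants below are illustrative assumptions, since the paper does not state them:

```python
# Sketch of client-side sliding-window flow control over outstanding memcache requests.
class SlidingWindow:
    def __init__(self, initial=2, maximum=300):
        self.size = initial          # maximum number of outstanding requests
        self.maximum = maximum
        self.outstanding = 0

    def can_send(self) -> bool:
        return self.outstanding < self.size

    def on_send(self):
        self.outstanding += 1

    def on_reply(self):
        self.outstanding -= 1
        self.size = min(self.maximum, self.size + 1)   # grow slowly on success

    def on_unanswered(self):
        self.outstanding -= 1
        self.size = max(1, self.size // 2)             # shrink when a request goes unanswered
```

The window is shared across all destinations, unlike a per-flow TCP window, which is what lets it bound the aggregate reply burst arriving at the web server's switch ports.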

Figure 4: Average time web requests wait to be scheduled, as a function of window size

Figure 4 shows the impact of window size on the total time that user requests in a running web server spend waiting to be scheduled. The data were collected from multiple racks in one front-end cluster. User requests arrive at each web server according to a Poisson process. By Little's law [26], L = λW: assuming the input request rate is constant (as it was in our experiment), the number of requests queued at the server (L) is proportional to the average time (W) it takes to process a request, so the wait time for web requests is a direct indicator of the number of web requests in the system. When the window is small, the application has to dispatch more groups of memcache requests serially, which increases the duration of web requests. When the window is too large, the number of simultaneously outstanding memcache requests causes incast congestion; the resulting memcache errors force the application to fall back to persistent storage, which slows web request processing. There is a balance between these extremes at which unnecessary latency is avoided and incast congestion is minimized.

3.2 Reducing load

We use memcache to reduce the frequency with which data is read along more expensive paths, such as database queries. When the desired data is not cached, web servers fall back to these more expensive paths. The following subsections describe three techniques for reducing load.

3.2.1 Leases

We introduce a new mechanism we call leases to address two problems: stale sets and thundering herds. A stale set occurs when a web server writes a value into memcache that does not reflect the latest version, which can happen when concurrent memcache updates are reordered. A thundering herd occurs when a specific key undergoes heavy read and write activity: because write operations repeatedly invalidate the recently set values, many reads fall back to the more costly path. Our lease mechanism solves both problems.

The leases described here are not the same as those used by Cary G. Gray.

Intuitively, a memcached instance gives a lease to a client to set data back into the cache when that client experiences a cache miss. The lease is a 64-bit token bound to the specific key the client originally requested. The client provides the lease token when setting the value in the cache; with it, memcached can verify whether the data should be stored and thus arbitrate concurrent writes. Verification fails if memcached has invalidated the lease token because it received a delete request for that item. Leases prevent stale sets in a manner similar to how load-link/store-conditional operates [20]. A slight modification to leases also mitigates thundering herds: each memcached server regulates the rate at which it returns tokens. By default, we configure servers to return a token for a given key only once every 10 seconds; requests for the key's value within 10 seconds of a token being issued receive a special notification telling the client to wait. Typically, the client holding the lease will have set the data within a few milliseconds, so when a waiting client retries, the data is usually already in the cache. To illustrate this, we collected data for one week on cache misses for a set of keys particularly prone to thundering herds. Without leases, all of those cache misses resulted in a peak database query rate of 17K queries per second; with leases, the peak database query rate was 1.3K queries per second. Since we provision databases for peak load, the lease mechanism provides a significant efficiency gain.
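A server-side sketch of this behavior follows: a token is issued on a miss at most once every 10 seconds per key, any delete invalidates outstanding tokens, and a SET is accepted only if it carries the current token. The 10-second rate and the data-structure outline come from the text; the class, token generation, and return values are illustrative:

```python
# Sketch of lease handling on a memcached server, as described above.
import time, secrets

class LeaseServer:
    RATE = 10.0                                   # issue at most one token per key per 10 s

    def __init__(self):
        self.store = {}                           # key -> value
        self.leases = {}                          # key -> (token, issue_time)

    def get(self, key):
        if key in self.store:
            return ("HIT", self.store[key])
        token, issued = self.leases.get(key, (None, 0.0))
        if token is None or time.time() - issued >= self.RATE:
            token = secrets.token_hex(8)          # 64-bit lease token bound to this key
            self.leases[key] = (token, time.time())
            return ("MISS_WITH_LEASE", token)     # this caller may fill the cache
        return ("WAIT", None)                     # another client holds a recent lease

    def set(self, key, value, token):
        current = self.leases.get(key)
        if current is None or current[0] != token:
            return False                          # lease was invalidated (e.g. by a delete)
        self.store[key] = value
        del self.leases[key]
        return True

    def delete(self, key):
        self.store.pop(key, None)
        self.leases.pop(key, None)                # invalidates any outstanding lease token
```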

Stale values: With leases, we can minimize the application's wait time in certain use cases. We can further reduce this time by identifying situations in which returning slightly out-of-date data is acceptable. When a key is deleted, its value is transferred to a data structure that holds recently deleted items, where it lives for a short time before being flushed. A GET request may return a lease token or data that is marked as stale. Applications that can continue to make forward progress with stale data do not need to wait for the latest value to be fetched from the database. Our experience has shown that because cached values tend to be monotonically increasing snapshots of the database, most applications can use stale values without any changes.
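The stale-value behavior can be sketched by holding recently deleted items in a short-lived side structure; the retention time and the two-dictionary layout below are illustrative, not the actual implementation:

```python
# Sketch: keep recently deleted values around briefly and let GETs return them
# marked as stale, so tolerant applications can proceed without waiting.
import time

class StaleAwareCache:
    HOLD_SECONDS = 10.0                   # illustrative retention for deleted items

    def __init__(self):
        self.live = {}                    # key -> value
        self.recently_deleted = {}        # key -> (stale_value, deleted_at)

    def delete(self, key):
        if key in self.live:
            self.recently_deleted[key] = (self.live.pop(key), time.time())

    def get(self, key):
        if key in self.live:
            return self.live[key], False                  # fresh value
        entry = self.recently_deleted.get(key)
        if entry and time.time() - entry[1] < self.HOLD_SECONDS:
            return entry[0], True                         # stale value, marked as such
        return None, False                                # miss
```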

Figure 5: Daily and weekly working sets of a high-churn key family and a low-churn key family

3.2.2 Memcache Pools

Using memcache as a general-purpose caching layer requires workloads to share infrastructure despite having different access patterns, memory footprints, and quality-of-service requirements. Different applications' workloads can interfere negatively, leading to lower hit rates. To accommodate these differences, we partition a cluster's memcached servers into separate pools. We designate one pool (named wildcard) as the default and provision separate pools for keys whose presence in wildcard is problematic. For example, we may allocate a small pool for keys that are accessed frequently but for which a cache miss is inexpensive, and a large pool for infrequently accessed keys for which cache misses are prohibitively expensive.

Figure 5 shows the working sets of two different sets of items, one with low churn and one with high churn. The working set is approximated by sampling all operations on one out of every one million items. For each of these items we collect the minimum, average, and maximum item size; these sizes are summed and multiplied by one million to approximate the working set. The difference between the daily and weekly working sets indicates the amount of churn. Items with different churn characteristics interact in an unfortunate way: low-churn keys that are still valuable are evicted before high-churn keys that are no longer being accessed. Placing these keys in different pools prevents this negative interference and allows us to size a high-churn pool according to its cache-miss cost. Section 7 provides further analysis.

A working set is defined as the amount of memory required by a process at a given time.
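A minimal sketch of the pool partitioning described in this subsection: keys are mapped to named pools, defaulting to wildcard. The prefix-based rule and pool names are purely illustrative; real assignments are made per application and key family:

```python
# Sketch: route keys to separate memcached pools, defaulting to "wildcard".
POOL_RULES = {
    "counter:": "high_churn_small",    # frequently accessed, cheap to miss (illustrative)
    "profile:": "low_churn_large",     # infrequently accessed, expensive to miss (illustrative)
}

def pool_for(key: str) -> str:
    for prefix, pool in POOL_RULES.items():
        if key.startswith(prefix):
            return pool
    return "wildcard"
```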

3.2.3 Intra-Pool Replication 



Within some pools, we use replication to improve latency and memcached server efficiency. We choose to replicate a category of keys within a pool when (1) the application routinely fetches many keys simultaneously, (2) the entire data set fits in one or two memcached servers, and (3) the request rate is much higher than a single server can manage.



Instead of further dividing the key space, we prefer to replicate within a pool. Consider a memcached server holding 100 items and capable of responding to 500K requests per second, where each request asks for 100 keys. The difference in memcached overhead between retrieving 100 keys per request and retrieving 1 key per request is small. To scale the system to process 1M requests per second, suppose we add a second server and split the key space equally between the two. Clients now need to split each request for 100 keys into two parallel requests for roughly 50 keys each, and consequently both servers still have to process 1M requests per second. If, however, we replicate all 100 keys to both servers, a client's request for 100 keys can be sent to either replica, reducing the load per server to 500K requests per second. Each client chooses a replica based on its own IP address. This approach requires delivering invalidations to all replicas to maintain consistency.
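The replica-selection and invalidation logic can be sketched as below. Choosing a replica from the client's IP and fanning deletes out to every replica follows the text; the hashing details, names, and `send` callback are illustrative:

```python
# Sketch: replication within a pool. GETs go to one replica chosen from the
# client's IP; DELETEs (invalidations) must be sent to every replica.
import zlib

REPLICAS = ["mc-a", "mc-b"]            # replicas holding identical key sets (illustrative)

def replica_for(client_ip: str) -> str:
    return REPLICAS[zlib.crc32(client_ip.encode()) % len(REPLICAS)]

def multiget(client_ip, keys, send):
    # The whole 100-key request goes to a single replica, halving per-server load.
    return send(replica_for(client_ip), "get", keys)

def invalidate(key, send):
    for replica in REPLICAS:           # consistency requires deleting on all replicas
        send(replica, "delete", [key])
```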



3.3 Handling Failures

The inability to fetch data from memcache results in a surge of load on backend services, which can cause further cascading failures. There are two scales of failure we must address: (1) a small number of hosts become inaccessible due to a network or server failure, and (2) a widespread outage affects a significant percentage of the servers within a cluster. If an entire cluster has to be taken offline, we divert user web requests to other clusters, which effectively removes all of the memcache load from that cluster.



For small outages we rely on an automated remediation system [3]. These actions are not instantaneous and can take a few minutes, which is long enough to cause the aforementioned cascading failures. We therefore introduce a mechanism to further insulate backend services from failures: we dedicate a small set of machines, named Gutter, to take over the responsibilities of a few failed servers. Gutter accounts for roughly 1% of the memcached servers in a cluster. When a memcached client receives no response to its GET request, the client assumes the server has failed and issues the request again to a special Gutter pool. If this second request misses, the client inserts the appropriate key-value pair into the Gutter machine after querying the database. Entries in Gutter expire quickly to make Gutter invalidation unnecessary. Gutter limits the load on backend services at the cost of serving slightly stale data.
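A sketch of that fallback path follows; the short expiration value and the exception used to model a non-responding server are illustrative:

```python
# Sketch of the Gutter fallback: if the normal memcached server does not answer,
# try the gutter pool; on a second miss, fill the gutter entry with a short TTL.
GUTTER_TTL = 10            # seconds; a short expiry, value is illustrative

def get_with_gutter(primary, gutter, db, key):
    try:
        return primary.get(key)
    except (ConnectionError, TimeoutError):    # no response: assume the server failed
        value = gutter.get(key)
        if value is not None:
            return value                       # possibly slightly stale, but shields the backend
        value = db.query(key)
        gutter.set(key, value, expire=GUTTER_TTL)
        return value
```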



Note that this design differs from an approach in which a client redistributes keys among the remaining memcached servers. Such an approach risks cascading failures due to the non-uniform frequency of key access: a single key can account for 20% of a server's requests, and the server that takes over this hot key could itself become overloaded. By shunting load to idle servers instead, we limit that risk.



Ordinarily, each failed request results in an access to backing storage, potentially overloading it. By using Gutter to store these results, a substantial fraction of these failures are converted into hits in the Gutter pool, thereby reducing load on the backing store. In practice, this system reduces the rate of client-visible failures by 99% and converts 10% to 25% of failures into hits each day. If a memcached server fails entirely, hit rates in the Gutter pool typically rise to 35% on average within 4 minutes and often approach 50%. Thus when a few memcached servers become unreachable due to failure or a minor network incident, Gutter protects the backing store from a surge of traffic.

4 Intra-Region: Replication 



As demand grows, it is tempting to buy more web servers and memcached servers to scale a single cluster. However, naively scaling the system does not eliminate every problem. Highly requested items only become more popular as more web servers are added to cope with increased user traffic, and incast congestion worsens as the number of memcached servers grows. We therefore split our web and memcached servers into multiple front-end clusters. Together with a storage cluster containing the databases, these clusters collectively form a region. The region architecture also yields smaller failure domains and more tractable network configurations. We trade replication of data for more independent failure domains, tractable network configuration, and a reduction of incast congestion.



This section analyzes the impact of multiple front-end clusters sharing the same storage cluster. In particular, we examine the consequences of allowing data to be replicated across these clusters and the potential memory efficiencies of disallowing such replication.

4.1 Regional Invalidations



Within a region, the storage cluster holds the authoritative versions of the data, while user demand may replicate that data into the front-end clusters. The storage cluster is responsible for invalidating cached data to keep the front-end clusters consistent with the authoritative versions. As an optimization, a web server that modifies data also sends invalidations to its own cluster, providing read-after-write semantics for a single user request and reducing the amount of time its local cache holds stale data.



 

Figure 6: Invalidation pipeline showing keys that need to be deleted via the daemon (McSqueal)



SQL statements that modify authoritative state are amended to include the memcache keys that need to be invalidated once the transaction commits [7]. We deploy invalidation daemons (named McSqueal) on every database. Each daemon inspects the SQL statements its database commits, extracts any deletes, and broadcasts them to the memcache deployment in every front-end cluster in the region. Figure 6 illustrates this approach. We found that the majority of invalidations issued do not delete any data; in fact, only 4% of all deletes issued result in the invalidation of actually cached data.

Reducing packet rates: If McSqueal contacted memcached servers directly, the rate of packets sent from the backend cluster to the front-end clusters would be unacceptably high. The packet-rate problem stems from many databases and many memcached servers communicating across cluster boundaries. The invalidation daemons therefore batch deletes into fewer packets and send them to a set of dedicated servers running McRouter in each front-end cluster. These McRouters unpack the individual deletes from each batch and route the invalidations to the correct memcached server within their cluster. Batching results in an 18-fold improvement in the median number of deletes per packet.
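The McSqueal pipeline described above can be sketched as: tail committed transactions, extract the embedded memcache deletes, batch them, and hand the batches to McRouters in each front-end cluster, which unpack and route them. The comment format used to embed keys, the batch size, and the `send` callback below are all illustrative assumptions:

```python
# Sketch of the McSqueal invalidation pipeline. Details (marker format, batch
# size, transport) are illustrative, not Facebook's actual implementation.
import re

DELETE_MARKER = re.compile(r"/\*\s*mc-delete:(?P<key>\S+)\s*\*/")  # assumed comment format

def extract_deletes(committed_sql_statements):
    for stmt in committed_sql_statements:
        for match in DELETE_MARKER.finditer(stmt):
            yield match.group("key")

def flush_batches(keys, frontend_mcrouters, send, batch_size=64):
    batch = []
    def flush():
        if batch:
            for router in frontend_mcrouters:   # one packet per cluster carrying many deletes
                send(router, list(batch))
            batch.clear()
    for key in keys:
        batch.append(key)
        if len(batch) >= batch_size:
            flush()
    flush()
```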



Invalidation via web servers: It is simpler for web servers to broadcast invalidations to all front-end clusters directly. Unfortunately, this approach suffers from two problems. First, it incurs more packet overhead, because web servers are less effective at batching invalidations than McSqueal. Second, it provides little recourse when a systemic invalidation problem arises, such as misrouting of deletes due to a configuration error. In the past, such problems often required a rolling restart of the entire memcache infrastructure, a slow and disruptive process we want to avoid. In contrast, embedding invalidations in SQL statements, which the database commits and stores in reliable logs, allows McSqueal to simply replay invalidations that may have been lost or misrouted.



 

Table 1: Deciding factors for replicating items across clusters versus placing them in a regional pool


4.2 Regional Pools

Each cluster independently caches data depending on the mix of user requests sent to it. If user requests are randomly routed to all available front-end clusters, the data cached by the front-end clusters will be roughly the same. This allows us to take a cluster offline for maintenance without a drop in cache hit rates. However, over-replicating the data can be memory inefficient, especially for large, rarely accessed items. We can reduce the number of replicas by having multiple front-end clusters share the same set of memcached servers. We call this a regional pool.

Crossing cluster boundaries incurs more latency. In addition, the bandwidth available between clusters is 40% less than the bandwidth available within a cluster. Replication trades more memcached servers for less inter-cluster bandwidth, lower latency, and better fault tolerance. For some data, it is more cost-effective to forgo the advantages of replication and keep a single copy per region. One of the main challenges of scaling memcache within a region is deciding whether a key needs to be replicated across all front-end clusters or kept as a single replica per region. Gutter is also used when servers in a regional pool fail.

Table 1 summarizes two kinds of items in our application that have large values. We moved data of type B into a regional pool while leaving type A unchanged. Note that clients access type B items an order of magnitude less frequently than type A items. The low access rate of type B makes it a prime candidate for a regional pool, since it does not adversely affect inter-cluster bandwidth. Type B items would also occupy 25% of each cluster's wildcard pool, so regionalization provides significant storage efficiencies. Type A items, however, are twice as large and accessed much more frequently, so from a regional perspective they are kept out of the regional pool. Migrating data into regional pools is currently driven by a manual heuristic based on access rates, data set size, and the number of users accessing particular items.

4.3 Cold Cluster Warmup

When we bring a new cluster online, or when an existing cluster fails or undergoes scheduled maintenance, caches start out empty, which weakens their ability to insulate backend services; cache hit rates are initially very low. A system called Cold Cluster Warmup mitigates this by allowing clients in a "cold cluster" (i.e., a front-end cluster with an empty cache) to retrieve data from a "warm cluster" (one with normal hit rates) rather than from persistent storage. This takes advantage of the aforementioned replication of data across front-end clusters. With this system, cold clusters can be brought back to full capacity in a few hours instead of a few days.

Care must be taken to avoid inconsistencies arising from race conditions. For example, if a client in the cold cluster updates the database and another client retrieves the stale value from the warm cluster before the warm cluster has received the invalidation, that item will become inconsistent in the cold cluster. memcached's delete command supports a nonzero hold-off time: it rejects add operations for the specified hold-off period. By default, all deletes in a cold cluster are issued with a two-second hold-off. When a miss occurs in the cold cluster, the client re-requests the key from the warm cluster and adds the result into the cold cluster. A failed add indicates that newer data is available in the database, so the client re-reads the value from the database. Deletes being delayed for more than two seconds is theoretically possible, but it is not true for the vast majority of cases.
The operational benefits of Cold Cluster Warmup far outweigh the cost of a few cache inconsistencies. We turn the system off once the cold cluster's hit rate stabilizes and the benefits diminish.
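The race-avoidance mechanism for cold cluster warmup can be sketched as follows: deletes in the cold cluster carry a two-second hold-off during which adds are rejected, so a client refilled from the warm cluster with stale data will see its add fail and go to the database instead. The two-second value follows the text; the class and method names are illustrative:

```python
# Sketch of Cold Cluster Warmup with the two-second delete hold-off described above.
import time

class ColdCache:
    HOLD_OFF = 2.0                              # deletes reject adds for 2 seconds

    def __init__(self):
        self.store = {}
        self.hold_until = {}                    # key -> time before which add() is rejected

    def delete(self, key):
        self.store.pop(key, None)
        self.hold_until[key] = time.time() + self.HOLD_OFF

    def add(self, key, value):                  # add fails if the key is present or held off
        if key in self.store or time.time() < self.hold_until.get(key, 0.0):
            return False
        self.store[key] = value
        return True

def warm_get(cold, warm, db, key):
    value = cold.store.get(key)
    if value is not None:
        return value
    value = warm.get(key)                       # fetch from a warm cluster, not the database
    if value is not None and cold.add(key, value):
        return value
    return db.query(key)                        # add failed -> fresher data exists in the DB
```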

5 Cross-Region: Consistency

There are many advantages to deploying data centers across a wide geographic area. First, placing web servers closer to end users can significantly reduce latency. Second, geographic diversity mitigates the effects of events such as natural disasters and large-scale power failures. Third, new locations can offer cheaper power and other economic incentives. We obtain these advantages by deploying multiple regions. Each region consists of a storage cluster and several front-end clusters. We designate one region to hold the master databases and the other regions to contain read-only replicas; we rely on MySQL's replication mechanism to keep the replica databases up to date with their masters. In this design, web servers experience low latency when accessing either the local memcached servers or the local database replicas. When scaling to multiple regions, maintaining consistency between the data in memcache and the persistent storage becomes the primary technical challenge. These challenges stem from a single problem: replica databases may lag behind the master database.

Our system represents only one point in the wide spectrum of consistency and performance trade-offs. The consistency model has evolved over the years to suit the scaling of the site, and it can be built in practice without sacrificing our high-performance requirements. The large volume of data the system manages implies that any small change that increases network or storage requirements carries a non-trivial cost; most ideas that provide stricter semantics rarely leave the design phase because they are simply too expensive. Unlike many systems that are tailored to an existing use case, memcache and Facebook were developed together. This allowed application and systems engineers to work together to design a model that is easy for application engineers to understand, efficient, and simple enough to implement and scale reliably. We offer best-effort eventual consistency but place an emphasis on performance and availability. In practice the system works very well, and we have found an acceptable balance.

Writes from the master region: We mentioned earlier that invalidations are issued by daemons attached to the storage cluster, a design that has important consequences for the multi-region architecture. In particular, it avoids a race in which an invalidation arrives before the data has been replicated from the master region. Consider a web server in the master region that has finished modifying a database and seeks to invalidate now-stale data. Sending invalidations within the master region is safe; however, having a web server in a replica region send invalidations may be premature, since the changes to the master databases may not yet have propagated to the replica databases. Subsequent queries for the data from the replica region would then race with the database replication, increasing the probability of setting stale data into memcache. Historically, we implemented McSqueal only after scaling to multiple regions.

Writes from a non-master region: Now consider a user who updates data from a non-master region while replication lag is excessively large. The user's next request could be confusing if their latest change is missing. A cache refill from a replica's database should only be allowed after the replication stream has caught up; without this guarantee, subsequent requests could read the replica's stale data and cache it.

We employ a remote marker mechanism to minimize the probability of reading stale data. The presence of a marker indicates that the data in the local replica database may be stale and that the query should be redirected to the master region. When a web server wishes to update data whose key is k, the server (1) sets a remote marker r_k in the region, (2) performs the write to the master database, embedding k and r_k to be invalidated in the SQL statement, and (3) deletes k in the local cluster. On a subsequent request for k, a web server will be unable to find the cached data, so it checks whether r_k exists and directs the query to the master region if it does, and to the local region otherwise. Here we trade additional latency on a cache miss for a decreased probability of reading stale data. We implement remote markers by using a regional pool. Note that under concurrent modifications to the same key, one operation may delete a remote marker that should remain set for another in-flight operation; when this happens, our mechanism may return stale information.
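The remote-marker protocol just described can be sketched as below; the marker naming scheme and the interfaces of the regional pool, databases, and caches are illustrative placeholders:

```python
# Sketch of the remote-marker mechanism for writes from a non-master region.
def remote_marker_key(key: str) -> str:
    return "remote:" + key                      # illustrative naming for the marker r_k

def write_from_replica_region(regional_pool, master_db, local_cache, key, value):
    regional_pool.set(remote_marker_key(key), 1)               # (1) set remote marker r_k
    master_db.write(key, value,                                 # (2) write to the master DB,
                    invalidate=[key, remote_marker_key(key)])   #     embedding both invalidations
    local_cache.delete(key)                                     # (3) delete k in the local cluster

def read(local_cache, regional_pool, master_region, local_region, key):
    value = local_cache.get(key)
    if value is not None:
        return value
    if regional_pool.get(remote_marker_key(key)) is not None:
        return master_region.query(key)          # replica may still be stale -> go to master
    return local_region.query(key)               # no marker -> local replica is acceptable
```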
It is worth highlighting that our use of memcache for remote markers subtly violates its role as a cache. As a cache, deleting or evicting keys is always safe: it may induce more load on the databases, but it does not compromise consistency. In contrast, the presence of a remote marker helps distinguish whether a non-master database holds stale data or not. In practice, we find that both evictions of remote markers and concurrent modifications are rare.

Operational considerations: Inter-region communication is expensive because the data must travel large geographical distances (e.g., across the continental United States). By sharing the same channel of communication for the delete stream as for database replication, we gain network efficiency on these lower-bandwidth connections. The system for managing deletes described in Section 4.1 is also deployed alongside the replica databases to broadcast deletes to the memcached servers in the replica regions. Databases and McRouters buffer deletes when downstream components become unresponsive. A failure or delay in any of these components increases the probability of reading stale data; the buffered deletes are replayed once the components become available again. The alternatives, taking clusters offline or over-invalidating data in front-end clusters when a problem is detected, cause more disruption than benefit given our workloads.

6 Single Server Improvements

The many-to-many communication pattern implies that a single server can become a bottleneck for a cluster. This section describes performance optimizations and memory-efficiency improvements in memcached that enable better scaling within clusters. Improving the performance of a single-server cache is an active research area [9, 10, 25, 28].

6.1 Performance Tuning

We began with a single-threaded memcached that used a fixed-size hash table. The first major optimizations were to (1) allow automatic expansion of the hash table to avoid look-up times drifting to O(n), (2) make the server multi-threaded, using a global lock to protect multiple data structures, and (3) give each thread its own UDP port to reduce contention when sending replies and, later, to spread interrupt-processing overhead. The first two optimizations were contributed back to the open source community. The rest of this section explores further optimizations that are not yet present in the open source version.

Our experimental host has an Intel Xeon X5650 CPU at 2.67 GHz (12 cores, 12 hyperthreads), an Intel 82574L gigabit Ethernet controller, and 12 GB of memory. Production servers have additional memory; further details have been published previously [4]. The performance testbed consists of fifteen clients generating memcache traffic and one memcached server with 24 threads. The clients and server are co-located on the same rack and connected by gigabit Ethernet. These tests measure the latency of memcached responses over two minutes of sustained load.

Get performance: We first examine the benefit of replacing the original multi-threaded, single-lock implementation with fine-grained locking. We pre-populate the cache with 32-byte values before issuing memcached requests of 10 keys each, and then measure hit performance. Figure 7 shows the maximum request rates that can be sustained with sub-millisecond average response times for different versions of memcached. The first set of bars is memcached before fine-grained locking, the second set is our current memcached, and the final set is open source version 1.4.10, which independently implements a coarser-grained version of our locking strategy.

Using fine-grained locking triples the peak get rate for hits, from 600K to 1.8M items per second. Performance for misses also increases from 2.7M to 4.5M items per second. Hits are more expensive because the return value must be constructed and transmitted, while a miss requires only a single static response (END) for the entire multiget indicating that all keys missed. We also examined the performance impact of using UDP instead of TCP. Figure 8 shows the peak request rates that keep average latencies under one millisecond for single-key gets and for multigets of 10 keys. We found that our UDP implementation outperforms our TCP implementation by 13% for single-key gets and by 8% for 10-key multigets.


Figure 7: Multiget hit and miss performance comparison for different memcached versions

Figure 8: Get hit performance comparison for single-key and 10-key requests over TCP and UDP

Because multigets pack far more data into each request than single gets, they accomplish the same work with fewer packets. Figure 8 shows that 10-key multigets deliver nearly a four-fold improvement in performance over single-key gets.

6.2 Adaptive Slab Allocator

Memcached employs a slab allocator to manage memory. The allocator organizes memory into slab classes, each of which contains pre-allocated, uniformly sized chunks of memory. Memcached stores items in the smallest possible slab class that can fit the item's metadata, key, and value. Slab classes start at 64 bytes and grow exponentially by a factor of 1.07 up to 1 MB, aligned on 4-byte boundaries. Each slab class maintains a free list of available chunks and requests more memory in 1 MB slabs when its free list is empty. Once a memcached server can no longer allocate free memory, storage for new items is obtained by evicting the least recently used (LRU) item within that slab class. As workloads change, the memory originally allocated to each slab class may no longer be sufficient, resulting in poor hit rates.

Footnote: This scaling factor ensures that we have both 64-byte and 128-byte classes, which make better use of hardware cache lines.
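The slab-class sizes implied by these parameters (start at 64 bytes, grow by a factor of 1.07, round up to 4-byte alignment, cap at 1 MB) can be enumerated with a few lines of arithmetic. This is an illustration of the growth rule only; memcached's actual class boundaries may differ:

```python
# Enumerate slab class sizes: start at 64 bytes, grow by a factor of 1.07,
# round up to 4-byte alignment, stop at 1 MB. Pure arithmetic illustration.
def slab_class_sizes(start=64, factor=1.07, align=4, max_size=1024 * 1024):
    sizes, size = [], float(start)
    while True:
        aligned = -(-int(size) // align) * align   # round up to a multiple of `align`
        if aligned > max_size:
            break
        if not sizes or aligned > sizes[-1]:
            sizes.append(aligned)
        size *= factor
    if sizes and sizes[-1] != max_size:
        sizes.append(max_size)                     # final 1 MB class
    return sizes

# e.g. slab_class_sizes()[:5] == [64, 68, 76, 80, 84] for this sketch;
# the sequence also passes through a 128-byte class, matching the footnote above.
```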

We implemented an adaptive allocator that periodically rebalances slab assignments to match the current workload. A slab class is identified as needing more memory if it is currently evicting items and if the next item to be evicted was used at least 20% more recently than the average of the least recently used items in the other slab classes. When such a class is found, the slab holding the overall least recently used item is freed and transferred to the needy class. Note that the open source community has independently implemented a similar allocator that balances eviction rates across slab classes, whereas our algorithm focuses on balancing the age of the oldest item among classes. Balancing age provides a better approximation of a single global least-recently-used (LRU) eviction policy for the entire server than balancing eviction rates, which can be heavily influenced by access patterns.
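The rebalancing decision can be sketched as follows. The 20% threshold and the "move a slab from the class holding the oldest item" rule follow the text; the bookkeeping structure is illustrative:

```python
# Sketch of the adaptive slab rebalancing decision described above.
def needy_class(classes):
    """`classes`: dict class_id -> {'evicting': bool, 'lru_age': age (s) of its LRU item}."""
    for cid, info in classes.items():
        if not info["evicting"]:
            continue
        other = [c["lru_age"] for k, c in classes.items() if k != cid]
        if other and info["lru_age"] <= 0.8 * (sum(other) / len(other)):
            return cid                      # this class's eviction victim is "too young"
    return None

def donor_class(classes, needy):
    # Free a 1 MB slab from the class holding the overall least recently used item.
    candidates = {k: v for k, v in classes.items() if k != needy}
    return max(candidates, key=lambda k: candidates[k]["lru_age"])
```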

6.3 The Transient Item Cache

Because memcached supports expiration times, items can remain in memory long after they have expired. Memcached lazily evicts such items by checking their expiration time when they are requested or when they reach the end of the LRU. Although efficient in the common case, this scheme allows short-lived keys that see only a brief burst of activity to occupy memory until they reach the end of the LRU.

We therefore introduce a hybrid scheme that relies on lazy eviction for most keys and proactively evicts short-lived keys as soon as they expire. We place short-lived items into a circular buffer of linked lists (indexed by seconds until expiration) that we call the transient item cache. Every second, all of the items in the bucket at the head of the buffer are evicted and the head advances by one. When we added a short expiration time to a heavily used set of keys whose items have short useful lifespans, the proportion of the memcache pool used by this key family dropped from 6% to 0.3% without affecting the hit rate.
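A sketch of that structure: a circular buffer of buckets indexed by seconds until expiration, where the head bucket is drained once per second. The bucket count, the use of plain lists instead of linked lists, and the eviction callback are illustrative:

```python
# Sketch of the transient item cache: a circular buffer of per-second buckets.
class TransientItemCache:
    def __init__(self, horizon_seconds=60):
        self.buckets = [[] for _ in range(horizon_seconds)]
        self.head = 0

    def add(self, key, ttl_seconds):
        if 0 < ttl_seconds < len(self.buckets):            # only short-lived items go here
            slot = (self.head + ttl_seconds) % len(self.buckets)
            self.buckets[slot].append(key)

    def tick(self, evict):
        """Called once per second; evicts everything whose TTL has elapsed."""
        for key in self.buckets[self.head]:
            evict(key)
        self.buckets[self.head] = []
        self.head = (self.head + 1) % len(self.buckets)
```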

6.4 Upgrading Software

Upgrades, bug fixes, ad hoc diagnostics, and performance testing all require frequent changes to the memcached software. A memcached server can take hours to reach 90% of its peak hit rate, so upgrading a set of memcached servers can take up to 12 hours and requires careful management of database load. We modified memcached to store its cached values and main data structures in System V shared memory regions so that the data can remain live across a software upgrade, minimizing disruption.

Figure 9: Cumulative distribution of the number of distinct memcached servers accessed per request

7 Memcache workload

We now characterize the memcache workload using data from servers running in production.

7.1 Measurements at the Web Server

We collected all memcache operations for a small percentage of user requests and now discuss the fan-out, response-size, and latency characteristics of our workload.

Fanout: Figure 9 shows the distribution of the number of distinct memcached servers a web server must contact when responding to a page request. As the figure shows, 56% of page requests contact fewer than 20 memcached servers; by volume, user requests tend to ask for small amounts of cached data. There is, however, a long tail to this distribution. The figure also shows the distribution for requests to one of our more popular pages, which better exhibits the all-to-all communication pattern: most requests of this type contact more than 100 distinct servers, and contacting several hundred memcached servers is not rare.

Response size: Figure 10 shows the response sizes of memcache requests. The difference between the median (135 bytes) and the mean (954 bytes) implies a very large variation in the sizes of cached items. In addition, there appear to be three distinct peaks, two of them at approximately 200 bytes and 600 bytes. Larger items tend to store lists of data, while smaller items tend to store single pieces of content.

Latency: We measure the round-trip latency of requesting data from memcache, which includes the cost of routing the request and receiving the reply, the network transfer time, and the cost of deserialization and decompression. Over 7 days the median request latency is 333 microseconds, with 475 microseconds and 1.135 milliseconds at the 75th and 95th percentiles. Our median end-to-end latency from an unloaded web server is 178 microseconds, with 219 and 374 microseconds at the 75th and 95th percentiles. The wide variance at the 95th percentile is caused by handling large responses and by waiting for a runnable thread to be scheduled, as discussed in Section 3.1.

Figure 10: Cumulative distribution of read data sizes

7.2 Pool Statistics

We now discuss key metrics for four memcache pools: wildcard (the default pool), app (a pool devoted to a specific application), a replicated pool for frequently accessed data, and a regional pool for rarely accessed data. In each pool, we collect average statistics every 4 minutes; Table 2 shows the highest of these averages over a one-month collection period. These data approximate the peak loads seen by those pools. The table shows widely different rates of GET, SET, and DELETE operations for the different pools, and Table 3 shows the distribution of response sizes for each pool. These different characteristics motivate our desire to segregate workloads from one another.

As discussed in Section 3.2.3, we replicate data within a pool and take advantage of batching to handle high request rates. We observe that the replicated pool has the highest GET rate (about 2.7 times that of the next highest pool) and the highest ratio of bytes to packets, despite having the smallest item sizes. These observations are consistent with our design, which leverages replication and batching for better performance. In the app pool, higher churn of data naturally results in a higher miss rate; this pool tends to keep content for a few hours before newer content evicts it. Data in the regional pool tends to be large and infrequently accessed, as shown by the request rates and the size distributions in the tables.


7.3 Invalidation Latency

We recognize that the timeliness of invalidations is a critical factor in determining the probability of exposing stale data. To monitor this health metric, we sample one out of every million delete operations and record when each delete command was issued. We then periodically query the contents of memcache across all front-end clusters for the sampled keys and log an error if an item remains cached despite a delete that should have invalidated it.

Figure 11: Latency of the delete (invalidation) pipeline

In Figure 11, we use this monitoring mechanism to report invalidation latencies over a 30-day span. We break the data into two groups: (1) deletes that originate from a web server in the master region and are destined for a memcached server in the master region, and (2) deletes that originate in a replica region and are destined for another replica region. As the statistics show, when both the origin and the destination of the delete are in the master region, the success rate is much higher: four nines of reliability within one second and five nines within one hour. When deletes originate and are destined outside the master region, reliability drops to three nines within one second and four nines within ten minutes. In our experience, when an invalidation is still missing after only a few seconds, the most likely cause is that the first attempt failed, and subsequent retries will resolve the problem.

8 Related work

Several other large websites have recognized the utility of key-value stores. DeCandia et al. [12] built a highly available key-value store (Dynamo) that is widely used by application services at Amazon. In contrast to Dynamo, which focuses on optimizing writes under heavy load, our system's workload is dominated by reads. Similarly, LinkedIn uses Voldemort [5], a system derived from Dynamo. Other widely deployed key-value stores include Redis [6], used by GitHub, Digg, and Blizzard, and memcached, used by Twitter [33] and Zynga. Lakshman et al. [1] developed Cassandra, a schema-based distributed key-value store. We preferred to deploy and extend memcached mainly because of its simple design.

Our work extends memcached to operate as part of a distributed data architecture. Gribble et al. [19] built an early version of a key-value storage system for Internet-scale services. Ousterhout et al. [29] also make the case for a large-scale in-memory key-value storage system. Unlike these proposals, memcache does not guarantee persistence; we rely on other systems to provide persistent data storage.

Table 2: Traffic per server for each memcache pool, averaged over 7 days

Table 3: Distribution of response sizes for each memcache pool

Ports et al. [31] provide a library to manage the cached results of queries to a transactional database; our needs call for a more flexible caching strategy. Our use of leases and stale reads draws on prior research into cache consistency and read operations in high-performance systems. Ghandeharizadeh and Yap also propose an approach that addresses the stale-set problem using time markers rather than explicit version numbers.

Although software routers are easier to customize and program, they are often less performant than their hardware counterparts. Dobrescu et al. [13] address these issues on general-purpose servers by exploiting multiple cores, multiple memory controllers, multi-queue network interfaces, and batch processing. Applying these techniques to McRouter remains future work. Twitter has also independently developed a memcache proxy similar to McRouter [32].

In Coda [35], Satyanarayanan et al. demonstrate how to restore consistency among data sets that diverge due to disconnected operation. Glendenning et al. [17] leverage Paxos [24] and quorums [16] to build Scatter, a distributed hash table with linearizable semantics that is resilient to churn.

TAO [37] is another Facebook system that relies heavily on caching and whose clients require low-latency access to a large, changing graph. TAO and memcache differ in two important ways: (1) TAO implements a graph data model in which each node is identified by a fixed-length persistent identifier (a 64-bit integer); (2) TAO encodes a specific mapping of its graph model onto persistent storage and takes responsibility for the persistence layer. Many components, such as our client libraries and McRouter, are shared by both systems.

9 Summary

In this paper we have shown how a memcached-based architecture can be scaled to meet Facebook's growing demand. Many of the trade-offs discussed are not fundamental, but arise from the reality of optimizing the performance of an online system that continues to evolve as new products are deployed. In building, maintaining, and scaling our system we learned the following lessons: (1) separating the cache from the persistent storage system allows us to scale each independently; (2) features that improve monitoring, error reporting, and operational efficiency are just as important as performance; (3) managing stateful components is operationally far more complex than managing stateless ones, so keeping logic in a stateless client helps us iterate on features and minimizes disruption; (4) the system must support the gradual rollout and rollback of new features, even if this leads to temporary heterogeneity of the feature set; and (5) simplicity is vital.