This article was first published in The Mesozoic Technology Community and reproduced with authorization.

Weibo has more than 160 million daily active users and handles tens of billions of visits per day. In the face of such massive access from a large user base, a well-structured and continuously improved cache system plays a vitally important supporting role.

On April 21, at the Mesozoic Technology Community's on-site technical exchange at Box Technology, Sina Weibo technical expert Chen Bo explained the design and practice of Weibo's cache architecture.

Scrolling through Weibo right now? Read on to hear how those big numbers are actually served.

Chen Bo: Hello, everyone. Today's talk covers the following: first, the data challenges in operating Weibo; then the Feed system architecture; then an analysis of the cache architecture and its evolution; and finally a summary and outlook.

Data challenges

Feed platform system architecture

The platform is divided into five layers. The top layer is the end layer: the Web side, the clients (iOS and Android apps), open platforms, and interfaces for third-party access. Below that is the platform access layer; its main job is to route different pools of resources to the important core interfaces, so that in a traffic burst they have better elasticity and higher service stability. Next is the platform services layer, mainly the Feed algorithm, relationships, and so on. Then comes the middle tier, which provides services through various intermediaries. The bottom layer is the storage layer. That, roughly, is the platform architecture.

1. Feed timeline

How is the feed constructed when you refresh Weibo on the main site or in a client and receive the 10 to 15 most recent posts? On refresh, the system first fetches the user's follow relationships; if she follows 1,000 people, it takes those 1,000 UIDs and pulls each followed user's list of published weibo. At the same time it fetches the user's Inbox, which holds special messages pushed to her, such as group weibo. All of these lists are aggregated and sorted to yield the set of weibo IDs actually needed, and the content for each ID is then fetched. If a post is a repost, the original weibo is fetched as well, along with the original author's user information. The candidate list is then filtered against the user's blocked words, dropping the weibo she does not want to see. For the weibo that remain, the system checks whether she has favorited or liked them and sets the corresponding flags, and finally assembles the counts for each post: reposts, comments, likes. Only then are those dozens of weibo returned to the various end users. So although a user obtains only dozens of records per request, the back-end servers may have to assemble hundreds or even thousands of records in real time, all depending on the strength of the cache system. Cache architecture design therefore directly affects the performance of the Weibo system.
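To make the pull-model assembly above concrete, here is a minimal Python sketch. All helper names (get_following, get_outbox, get_inbox, and so on) are hypothetical stand-ins for the real cache calls, not Weibo's actual API.

```python
def build_feed(uid, page_size=15):
    """Sketch of pull-model feed assembly (helper names are hypothetical)."""
    following = get_following(uid)                  # e.g. 1,000 followed UIDs
    id_lists = [get_outbox(f) for f in following]   # each followed user's weibo ids
    id_lists.append(get_inbox(uid))                 # pushed ids: group weibo etc.

    # Merge, sort (ids roughly time-ordered), keep only what this page needs.
    ids = sorted(set().union(*id_lists), reverse=True)[:page_size]

    posts = [get_content(i) for i in ids]
    for post in posts:
        if post.get("retweet_of"):                  # reposts also need the original
            post["origin"] = get_content(post["retweet_of"])
            post["origin_user"] = get_user(post["origin"]["uid"])

    posts = [p for p in posts if not hits_block_words(uid, p)]
    for post in posts:
        post["liked"] = get_like_flag(uid, post["id"])   # existence checks
        post["counts"] = get_counts(post["id"])          # repost/comment/like counts
    return posts
```

Even for a 15-post page, this touches the follow list once, roughly a thousand outboxes, plus the content, existence, and counter caches, which is why a single refresh fans out into hundreds or thousands of cache reads.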

2. Feed Cache architecture

Now let's look at the Feed cache architecture, which is divided into six layers. The first is the Inbox, which mainly holds grouped weibo and posts sent directly by group owners; the Inbox is relatively small and is populated in push mode. The second is the Outbox: every regular weibo a user publishes goes into her Outbox as a list of IDs, actually split across multiple caches, commonly around 200 IDs, or about 2,000 in the extended version. The third layer is relationships: follows, fans, user information. The fourth is content: the body of each weibo lives here. The fifth is existence judgments, for example whether a weibo has been liked. Some celebrities have complained, "I never liked that weibo, so why does it show that I did?", which made minor news; this layer holds the record, and in fact she had simply forgotten liking it at some point. And then there is the big one at the bottom: counting. Repost, comment, and like counts for each weibo, and follower and following counts for each user.
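To make the six layers concrete, here is one hypothetical way the key namespaces could be laid out; the prefixes are purely illustrative and are not Weibo's actual key scheme.

```python
# Hypothetical key patterns for the six cache layers (illustrative only).
CACHE_LAYERS = {
    "inbox":     "ib:{uid}",             # pushed ids: group weibo, group-owner posts
    "outbox":    "ob:{uid}",             # the user's own published weibo ids (~200/2000)
    "relations": "fo:{uid}",             # follow / fan lists
    "content":   "wb:{weibo_id}",        # weibo bodies
    "existence": "ex:{uid}:{weibo_id}",  # liked / read flags
    "counters":  "ct:{weibo_id}",        # repost / comment / like counts
}
```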

Cache architecture and evolution

1. Simple KV data type

Next we focus on how the Weibo cache architecture evolved. When Weibo first went online, we stored everything as simple KV data types, sharding by hash into an MC (Memcached) storage pool. A few months after launch we found a problem: when a node machine went down, or dropped out for any other reason, a large number of requests penetrated the cache layer and reached the DB, slowing requests down and even killing the DB. So we quickly added an HA layer: even if some Main-layer nodes hang or go down, the requests penetrate to HA rather than to the DB, which ensures the hit rate does not drop under any circumstances and the system stays stable. This dual-layer approach is now widely used in the industry. Many people say, "I'll just use plain hashing", but there are pitfalls there too. Suppose node 3 goes down: Main removes it, and node 3's requests are redistributed to the other nodes. If the traffic is not too large, the misses go through to the DB and the DB can absorb them. When node 3 comes back and is re-added, its traffic returns to it. But if node 3 later goes down again, for network reasons or problems with the machine itself, its requests are once more scattered to the other nodes, and now there is a problem: the data written back to those other nodes during the first outage was never updated. If those stale entries are not removed, readers will get mixed dirty data.
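A minimal sketch of the Main + HA read path described here; the client objects (main_nodes, ha_pool, db_load) are hypothetical stand-ins.

```python
# Sketch of the two-layer read path: Main (hash-sharded MC) backed by HA.
def cache_get(key):
    node = main_nodes[hash(key) % len(main_nodes)]   # hash-sharded Main layer
    value = node.get(key)
    if value is None:
        value = ha_pool.get(key)     # HA absorbs misses from a dead Main node
        if value is None:
            value = db_load(key)     # last resort: the database
            ha_pool.set(key, value)
        node.set(key, value)         # write back so later reads hit Main
    return value
```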

There is also a big difference between Weibo and WeChat: Weibo is essentially a public-square business. In an emergency, for example when a celebrity is found to have a new girlfriend, instantaneous traffic can surge by 30%. At such moments the whole MC pool becomes a bottleneck and slows the entire system down. For this we introduced an L1 layer. Each L1 is also a Main-style pool, with roughly 1/N of the Main layer's memory, typically 1/6, 1/8, or 1/10, and we add 4 to 8 L1 groups according to request volume. Every incoming request first visits an L1; if L1 hits, it is served directly, and if L1 misses, it goes on to the Main-HA layers. During traffic bursts, the L1s can absorb most of the hot requests. And since on Weibo the newest data is the hottest, a small increase in memory lets the system hold a very large share of the load.
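Extending the earlier sketch with L1 groups; l1_groups is again a hypothetical list of small cache pools.

```python
import random

def cache_get_with_l1(key):
    l1 = random.choice(l1_groups)   # 4-8 small L1 pools, each ~1/6-1/10 of Main
    value = l1.get(key)
    if value is None:
        value = cache_get(key)      # fall through to Main -> HA -> DB as above
        l1.set(key, value)          # hot keys quickly populate every L1 group
    return value
```

Because each request picks an L1 group at random, a hot key is soon resident in all of them, so burst reads for that key rarely reach the Main layer.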

To sum up briefly: for simple KV data types we use MC as the main store, hash nodes within a layer do not drift, and a Miss penetrates to the next layer to read. Multiple L1 groups improve read performance and withstand peak and burst traffic, while greatly reducing cost. As for the read/write policy: writes are multi-write (written to every layer), while reads penetrate layer by layer and write the data back upward on a miss. For serialization we initially used JSON/XML; since 2012 we use Protocol Buffers directly, and larger values are compressed with QuickLZ.
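A sketch of the multi-write path under the same assumptions as the earlier read sketches. The talk names Protocol Buffers and QuickLZ; since QuickLZ has no standard Python binding, json and zlib stand in for them here.

```python
import json, zlib

COMPRESS_THRESHOLD = 1024   # assumed cutoff; the talk only says "larger" values

def cache_set(key, obj):
    # Production uses Protocol Buffers + QuickLZ; json + zlib are stand-ins.
    data = json.dumps(obj).encode()
    if len(data) > COMPRESS_THRESHOLD:
        data = zlib.compress(data)
    for l1 in l1_groups:             # multi-write: every L1 group gets the value
        l1.set(key, data)
    main_nodes[hash(key) % len(main_nodes)].set(key, data)
    ha_pool.set(key, data)           # ...and so do Main and HA
```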

2. Collection-type data

Having covered simple KV data, how do we handle complex collection data? For example, I follow 2,000 people and then follow one more, which requires modifying the collection. One way is to pull all 2,000 IDs down, modify them, and write them back, but that costs a lot of bandwidth and machine pressure. There is also paging: out of 2,000 entries I may only need a few pages, say page two, items ten through twenty; can I avoid fetching all the data back? And there is linked computation over relationships, such as working out which of the people I follow, say A, B, and C, also follow user D. All of this involves modifying, retrieving, and computing over the data, and MC is simply not good at it. So all the relationship data is stored in Redis, distributed by hash, with each shard stored as a group of multiple instances to separate reads and writes. Redis now holds about 30 terabytes of memory and serves 2 to 3 trillion requests per day.
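A sketch of how these operations might look with redis-py, assuming follow lists are kept as sorted sets scored by follow time; the key names and local connection are illustrative.

```python
import redis

r = redis.Redis()   # assumes a local Redis; keys below are illustrative

# Follow list as a sorted set scored by follow time: following one more
# person is a single ZADD instead of rewriting all 2,000 ids.
r.zadd("following:1001", {"2002": 1618000000})

# Paging: items 10-19 (page two) without pulling the whole list down.
page = r.zrevrange("following:1001", 10, 19)

# Linked computation: which of my followings also follow user D.
r.zinterstore("tmp:common", ["following:1001", "followers:4004"])
common = r.zrange("tmp:common", 0, -1)
```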

In the course of using Redis we ran into other problems. Take follow relationships: I follow 2,000 UIDs. One approach is full storage, but Weibo has a huge user base, some users log in rarely while others are very active, so keeping everything in memory is quite costly. We therefore changed our Redis usage to cache semantics: only active users are stored, and if you have been inactive for a while we evict you from Redis and add you back on your next visit. But this creates a problem: Redis works in single-threaded mode, and re-adding a user who follows 2,000 people may expand to 20,000 UIDs to insert; pushing 20,000 UIDs in one by one essentially stalls the server. So we extended Redis with a new data structure: the 20,000 UIDs are laid out directly and written into Redis in one shot, and overall read/write efficiency is very high. Its implementation is a fixed-length open array addressed by double hashing.
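A small self-contained sketch of such a fixed-length open array with double hashing; the capacity is assumed prime so the second hash visits every slot, and 0 is assumed never to be a valid UID.

```python
class LongSet:
    """Sketch of a fixed-size open-addressed array of uids, double hashing."""

    EMPTY = 0   # assumes 0 is never a real uid

    def __init__(self, capacity=40009):   # prime, so the probe walks all slots
        self.capacity = capacity
        self.slots = [self.EMPTY] * capacity

    def _probe(self, uid):
        h1 = uid % self.capacity
        h2 = 1 + (uid >> 16) % (self.capacity - 1)   # second, independent hash
        for i in range(self.capacity):
            yield (h1 + i * h2) % self.capacity

    def add(self, uid):
        for idx in self._probe(uid):
            if self.slots[idx] in (self.EMPTY, uid):
                self.slots[idx] = uid
                return True
        return False   # table full

    def contains(self, uid):
        for idx in self._probe(uid):
            if self.slots[idx] == uid:
                return True
            if self.slots[idx] == self.EMPTY:
                return False
        return False
```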

We made some other extensions to Redis as well; you can find earlier shares about them online. One keeps the data in shared variables across an upgrade: in our tests, loading 1 GB from disk took about ten minutes and 10 GB took far longer, whereas now an upgrade completes at millisecond level. For AOF we use rolling AOF: each AOF segment carries an ID, and once it reaches a certain size we roll over to the next segment. When an RDB lands on disk, we record the AOF file and position at the moment the RDB was built, and this extended RDB+AOF scheme gives us full plus incremental replication.
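A minimal sketch of the rolling-AOF plus recorded-RDB-position idea, under an assumed file layout (segments named "id.aof", a JSON snapshot, tab-separated log lines); Weibo's real implementation lives inside their Redis fork.

```python
import json, os

AOF_ROLL_BYTES = 64 * 1024 * 1024   # assumed roll threshold

def append_aof(aof_dir, aof_id, key, value):
    # Rolling AOF: each segment is "<id>.aof"; past the threshold, roll to id+1.
    path = os.path.join(aof_dir, f"{aof_id}.aof")
    with open(path, "a") as f:
        f.write(f"{key}\t{value}\n")
    return aof_id + 1 if os.path.getsize(path) >= AOF_ROLL_BYTES else aof_id

def save_rdb(path, state, aof_id, aof_offset):
    # Record which AOF segment/offset the snapshot covers: the RDB is the full
    # copy, and AOF entries after (aof_id, aof_offset) are the increment.
    with open(path, "w") as f:
        json.dump({"aof_id": aof_id, "aof_offset": aof_offset, "state": state}, f)

def restore(path, aof_dir):
    with open(path) as f:
        dump = json.load(f)
    state = dump["state"]
    ids = sorted(int(n[:-4]) for n in os.listdir(aof_dir) if n.endswith(".aof"))
    for aof_id in ids:
        if aof_id < dump["aof_id"]:
            continue                                   # already in the snapshot
        start = dump["aof_offset"] if aof_id == dump["aof_id"] else 0
        with open(os.path.join(aof_dir, f"{aof_id}.aof")) as f:
            f.seek(start)
            for line in f:
                key, value = line.rstrip("\n").split("\t", 1)
                state[key] = value                     # replay the logged write
    return state
```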

3. Other data types – count

Then there are other data types, such as counts. Every Internet company has counting needs, and for small and medium businesses MC and Redis are enough. But Weibo's counts have some special characteristics. A single key carries multiple counts: one weibo has repost, comment, and like counts, and one user has follower and following counts, among others. Because they are counts, the values are small, about 2 to 8 bytes depending on the business scenario, with 4 bytes covering most cases. On top of that, around a billion new count records are added every day, the cumulative total is far larger, and a single request may need to fetch hundreds of counts.

4. Counting – Counter Service

Initially we could use Memcached, but there is a problem: if the counts exceed its capacity, some counts get evicted, and after a downtime or restart they are lost. And many counts are zero; should they be stored at all? Storing them takes a great deal of memory, since Weibo adds billions of counts daily; not storing them means requests penetrate to the DB, which hurts service availability. After 2010 we switched to Redis, but as data volume grew we found Redis's memory efficiency still fairly low. A single KV needs at least 65 bytes, yet a count only needs 8 bytes of key and 4 bytes of value, 12 effective bytes, so more than 40 bytes are wasted. And that is just a single KV; with multiple counts per key it gets worse. For four counts, 8 bytes of key plus four 4-byte counts is about 24 bytes in total, but stored as Redis KVs it takes roughly 200 bytes. So we developed our own Counter Service, which cut memory to between 1/5 and 1/15 of Redis. It separates hot and cold data: hot data stays in memory, cold data is dropped to SSD, and if cold data becomes hot again it is loaded back into an LRU area. RDB and AOF were implemented to support full plus incremental replication. With this scheme, a single machine can hold both the hot counts in memory and a far larger volume of cold counts on SSD.
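The core of the memory saving is packing several counts behind one key into a fixed-width record instead of separate KVs. A sketch with Python's struct module; the field names are assumptions for illustration.

```python
import struct

# One packed record: 8-byte weibo id plus four 4-byte counts = 24 bytes,
# versus ~65+ bytes of per-KV overhead if each count were its own Redis key.
RECORD = struct.Struct("<qIIII")   # id, reposts, comments, likes, reads (assumed)

table = bytearray(RECORD.size * 1_000_000)   # one pre-allocated in-memory table

def store(slot, weibo_id, reposts, comments, likes, reads):
    RECORD.pack_into(table, slot * RECORD.size, weibo_id,
                     reposts, comments, likes, reads)

def load(slot):
    return RECORD.unpack_from(table, slot * RECORD.size)
```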

In memory, the space is pre-divided into N tables, each covering a range of IDs in sequence. When an ID arrives we locate its table and increment or decrement in place. When memory runs short as new counts arrive, a small table is dumped to SSD, and the freed space at the top is kept for new IDs. Some people ask: counts are stored in 4 bytes, but what if a weibo is so hot that its count exceeds 4 bytes? Counts that exceed the limit are promoted into an Aux dict, tables that have fallen to SSD are reached through a dedicated IndAux index, and replication is done in RDB mode.
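A toy sketch of the overflow idea: a dict stands in for the 4-byte in-table slots, and a sentinel marks entries promoted to the aux dict. Both the sentinel convention and the names are assumptions.

```python
CAP = (1 << 32) - 1     # a 4-byte slot tops out here
SENTINEL = CAP          # assumed in-table marker meaning "look in aux"

aux = {}                # overflow counts for the rare ultra-hot ids

def incr(counts, weibo_id, delta=1):
    if weibo_id in aux:
        aux[weibo_id] += delta
        return
    new = counts.get(weibo_id, 0) + delta
    if new >= CAP:                     # overflow: promote to the aux dict
        aux[weibo_id] = new
        counts[weibo_id] = SENTINEL
    else:
        counts[weibo_id] = new

def read(counts, weibo_id):
    v = counts.get(weibo_id, 0)
    return aux[weibo_id] if v == SENTINEL else v
```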

5. Other data types – Existence judgment

Besides counts, Weibo has businesses that need existence judgments: for example, whether a weibo has been shown to a user, recommended, liked, or read; if the user has already read a weibo, don't show it to him again. The defining characteristic is that each check is tiny, a 1-bit value would do, but the total volume of data is huge: roughly 100 million weibo are published every day, and the existence data to be judged runs to tens or even hundreds of billions of records. How to store this is a big problem, and most of the values are zero, which raises the earlier question again: should the zeros be stored? Storing them means writing hundreds of billions of records a day; not storing them means the misses eventually penetrate the cache layer to the DB, and no DB can withstand that volume of traffic.

We evaluated the options. The first thought was to use Redis directly: a single KV costs 65 bytes, against an 8-byte key and a value of just 1 bit, so the daily memory growth would be enormous. The second option was our newly developed Counter Service: a single KV is the 8-byte key plus 1 byte of value, 9 bytes in total, which works out to about 900 GB of new memory per day. Even keeping only the most recent days, three days is almost 3 TB, still a lot of pressure, but far better than Redis.
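The back-of-envelope arithmetic behind those figures, using the numbers from the talk:

```python
# Counter Service option: 8-byte key + 1-byte value per existence record.
entries_per_day = 100e9             # ~a hundred billion new flags per day
bytes_per_entry = 8 + 1
per_day_gb   = entries_per_day * bytes_per_entry / 1e9   # = 900.0 GB/day
three_day_tb = per_day_gb * 3 / 1e3                      # = 2.7 TB for 3 days
```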

Our final solution was to develop Phantom ourselves. It allocates shared memory segments, and in the end uses only about 120 GB of memory. The algorithm is very simple: hash each key N times, and set the bit at each hashed position to 1. With three hashes, for example, inserting X2 sets three bit positions. Then, to check whether X1 exists, hash it three times: if all three bits are 1, it is considered to exist; if, as with X3, any of its bits is 0, it is 100% certain not to exist.
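This is the classic Bloom filter scheme. A minimal self-contained sketch of the idea (not Phantom's actual code):

```python
import hashlib

class BloomSketch:
    """Bloom filter: k hash bits per key; any clear bit proves absence."""

    def __init__(self, size_bits, k=3):
        self.size = size_bits
        self.k = k
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # all k bits set -> probably present; any bit clear -> definitely absent
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(key))
```

The price is a small false-positive rate (a key can look present when it is not), which is acceptable for shown/read flags, while the "definitely absent" answer is exact.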

Its implementation architecture is fairly simple: the shared memory is pre-split into different tables, addressed with the open-addressing scheme described above, and reads and writes are persisted through the same AOF+RDB mechanism. Because everything lives in shared memory, data is not lost when the process is upgraded or restarted. For external access it speaks the Redis protocol, extended with new commands, so clients can reach the service directly.

6. Summary

To summarize: so far the focus has been on the cache cluster's high availability, its scalability, and its performance, and especially on storage cost. There are aspects we have not yet covered, such as operability: Weibo now has thousands of servers, and operating them is a challenge of its own.

7. Further optimization

8. Servitization

The first step is to bring the entire cache and its configuration under service-based management, which avoids frequent restarts; when a configuration changes, it can be applied directly with a script.

Servitization also introduces a Cluster Manager for external management: operations go through a single interface, and service validation can be performed there. In terms of service governance, capacity can be expanded and shrunk, and SLAs are well guaranteed. In addition, cache resources are now shielded from developers, who no longer need to care about the underlying cache deployment.

Summary and outlook

Finally, a brief summary: the Weibo cache architecture is optimized and strengthened along several dimensions, including data architecture, performance, storage cost, and servitization.

-END-