About the author

Chen Bo is a technical expert at Sina Weibo and author of "Deep Into Distributed Caching".

Note: This article is reprinted with authorization from freshmanTechnology.

Weibo has more than 160 million daily active users and tens of billions of visits per day. Serving this massive traffic from such a large user base depends on a sound architecture and a constantly improving cache system. In this article, Sina Weibo technical expert Chen Bo explains in detail how that huge volume of data is served.

Outline of this article

1. Data challenges in the operation of Weibo

2. System architecture of Feed platform

3. Cache architecture and evolution

4. Summary and outlook

Data challenges

Feed platform system architecture

The system architecture of the Feed platform is divided into five layers. The top layer is the endpoint layer: the web site, the clients (the iOS and Android apps most people use), the open platform, and the interfaces exposed to third parties. Below that is the platform access layer, whose main job is to route the important core interfaces into dedicated resource pools, so that when traffic bursts arrive there is enough elasticity to keep the service stable. Next is the platform services layer, which mainly covers the Feed algorithms, relationships, and so on. Under that is the middle tier, which exposes these capabilities through various intermediary services. The lowest layer is the storage layer.

1
Feed Timeline

When you refresh your Weibo feed, for example on the main site or in a client, you get 10 to 15 posts back. How is that page assembled?

After the refresh, the user's follow relationships are fetched first. For example, if he follows a thousand people, we get those thousand UIDs and then, based on those UIDs, fetch the posts each of those users has published. At the same time, we fetch the user's Inbox, which holds the special messages pushed to him, such as grouped posts, group posts, and post lists from special follow relationships.

After getting this series of post ID lists, we collect and sort them to obtain the IDs we actually need, and then fetch the content of each post for those IDs. If a post is a repost, it has an original post, which we fetch as well. From the posts we obtain the user information, and then filter the posts against the user's filter words, removing the posts the user does not want to see.

For the posts that survive the steps above, we then check whether the user has favorited or liked them and set the corresponding flags, assemble the various counts for each post (reposts, comments, likes), and only then return those ten-odd posts to the user's various endpoints.
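To make the flow concrete, here is a minimal sketch of the assembly steps described above. It is purely illustrative: the cache layers are passed in as plain Python dicts and the field names are assumptions, not Weibo's actual data model.

```python
def build_timeline(uid, following, inbox, outbox, content, users, likes, counts,
                   filters=(), page_size=15):
    """Assemble one page of the feed from pre-fetched cache layers.
    All arguments are plain dicts/sets standing in for the real cache pools."""
    candidate_ids = set(inbox.get(uid, []))            # pushed posts (groups, etc.)
    for followee in following.get(uid, []):
        candidate_ids.update(outbox.get(followee, [])) # pulled post IDs

    # Collect, sort, and keep only the page the client asked for.
    page_ids = sorted(candidate_ids, reverse=True)[:page_size]

    page = []
    for post_id in page_ids:
        post = dict(content[post_id])                  # post body from the content cache
        if post.get("retweeted_id"):
            post["origin"] = content.get(post["retweeted_id"])
        post["user"] = users.get(post["uid"])
        if any(word in post.get("text", "") for word in filters):
            continue                                   # drop posts hitting filter words
        post["liked"] = (uid, post_id) in likes        # existence judgment
        post["counts"] = counts.get(post_id, {})       # reposts / comments / likes
        page.append(post)
    return page
```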

In this way, when a user requests a dozen or so records, the back-end servers actually assemble hundreds or even thousands of pieces of data in real time before returning the result. The whole process leans heavily on the Cache system, so the design of the Cache architecture directly determines the performance of the Weibo system.

2
Feed Cache architecture

Next, let's look at the Cache architecture, which has six main layers. The first layer is the Inbox, which mainly holds grouped posts and posts sent directly by group owners. The Inbox is relatively small, and it is populated mainly by push.

The second layer is the Outbox: every post a user publishes normally goes into his Outbox. Depending on how many post IDs are stored, the Cache is actually split into multiple sizes: the common Outbox cache holds about 200 IDs, and the long Outbox cache holds about 2,000.

The third layer is relationships: follows, fans, and user data.

The fourth layer is content: the body of each post lives here.

The fifth layer is existence judgments, for example whether I have liked a certain post. There was some news a while back where a celebrity insisted she had never liked a particular post and asked why it showed up as liked; the record was there, she had in fact liked it at some point and probably just forgot.

At the bottom is a larger layer, counting: the comment and repost counts of every post, as well as each user's follower and followee counts.
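Taken together, these six layers can be thought of as different key spaces over the same post and user IDs. The key prefixes below are purely hypothetical, just to make the layering concrete; they are not Weibo's real key scheme.

```python
# Hypothetical key layout for the six cache layers; names are illustrative only.
def cache_keys(uid, post_id):
    return {
        "inbox":     f"ib:{uid}",            # pushed post IDs (groups, etc.)
        "outbox":    f"ob:{uid}",            # the user's own post IDs (~200 / ~2000)
        "relation":  f"fo:{uid}",            # follows / fans
        "content":   f"ct:{post_id}",        # post body
        "existence": f"lk:{uid}:{post_id}",  # liked / read flags
        "count":     f"cn:{post_id}",        # reposts / comments / likes
    }
```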

Cache architecture and evolution

1
Simple KV data type

Next, let's focus on how the Weibo Cache architecture has evolved. When Weibo was first launched, we stored data as simple KV pairs, mainly using hash-based sharding across an MC (Memcached) pool. A few months later we ran into problems: when some node machines went down for one reason or another, a large number of requests would penetrate the Cache layer and hit the DB, slowing the whole request path down and sometimes even freezing the DB.

So we quickly changed it and added an HA layer: even if some nodes in the Main layer go down or die, requests fall through to the HA layer instead of to the DB. This guarantees that the overall hit rate does not drop in any of these cases, and the stability of the service improves considerably.
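A rough sketch of this Main-plus-HA read path is shown below, with plain dicts standing in for Memcached nodes and a callable standing in for the DB; the function names are illustrative, not Weibo's actual implementation.

```python
import hashlib

# Minimal sketch of the Main -> HA -> DB read path described above.
def shard(key, nodes):
    """Pick a node from the pool by hashing the key."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

def cache_get(key, main_nodes, ha_nodes, db):
    value = shard(key, main_nodes).get(key)   # 1. hash to a Main node
    if value is not None:
        return value
    value = shard(key, ha_nodes).get(key)     # 2. miss (or dead node): try HA
    if value is not None:
        shard(key, main_nodes)[key] = value   #    write back into Main
        return value
    value = db(key)                           # 3. only a full miss touches the DB
    shard(key, ha_nodes)[key] = value
    shard(key, main_nodes)[key] = value
    return value
```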

This Main/HA pattern is now used a lot in the industry, and many people assume they can just hash and be done with it, but there are pitfalls. Say node 3 breaks down: it is removed from Main, and the QPS that used to go to node 3 is spread across the other nodes. If the load is not too large, the misses penetrate to the DB and the DB can still cope. But suppose node 3 recovers and is added back, so traffic for its keys returns to it, and then node 3 fails again because of the network or the machine, and its requests are spread across the other nodes once more. At that point there is a problem: the data that was written back to the other nodes during the first failure is no longer being updated, and if it is not explicitly removed, stale and fresh data get mixed together.

Weibo is, in essence, a public-square kind of business: when there is breaking news, say a celebrity announcing a new girlfriend, traffic can jump 30% in an instant. After such an event, a huge number of requests concentrate on a few nodes and make them extremely hot; even MC cannot absorb that volume of requests, so it becomes a bottleneck and slows the whole system down.

For this reason we introduced L1, which is also a Main-style pool. Each L1 pool has roughly 1/N of the memory of Main: 1/6, 1/8, or 1/10. Depending on request volume, we add 4 to 8 L1 pools, so that every incoming request hits L1 first. If L1 has the data, it is returned directly; if not, the request goes on to the Main-HA layers. Under bursty traffic, L1 absorbs most of the hot requests, and since Weibo's newest data is also its hottest, a small amount of extra memory buys a large amount of capacity.

To sum up briefly: for simple KV data we use MC as the main store; the hash nodes within a layer do not drift, and a miss simply penetrates to the next layer. Multiple L1 pools improve read performance and let us withstand peaks and bursty traffic at a greatly reduced cost. The read/write strategy is multi-write on update and layer-by-layer read-through, writing back on a miss. For the payload we initially used JSON/XML; after 2012 we switched to the Protocol Buffers format and compressed larger values with QuickLZ.
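That read/write strategy can be sketched as follows; again this is an illustrative toy, with dicts standing in for the L1, Main, and HA pools rather than the real system.

```python
import random

# Sketch of multi-write on update plus layer-by-layer read-through with write-back.
class LayeredCache:
    def __init__(self, l1_pools, main, ha):
        self.l1_pools = l1_pools        # several small hot pools
        self.main = main
        self.ha = ha

    def get(self, key):
        l1 = random.choice(self.l1_pools)     # any L1 replica may serve the read
        for layer in (l1, self.main, self.ha):
            value = layer.get(key)
            if value is not None:
                l1[key] = value               # write back into L1 on a lower-layer hit
                return value
        return None                           # full miss: caller falls back to the DB

    def set(self, key, value):
        for layer in (*self.l1_pools, self.main, self.ha):
            layer[key] = value                # multi-write keeps every layer fresh
```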

2
Set-like data

That covers simple KV data, but what about more complex collection data?

For example, if I follow 2,000 people and then follow one more, that is a partial modification. One approach is to read all 2,000 IDs out, modify the list, and write it back, but that costs a lot of bandwidth and machine load. There is also paging: out of the 2,000 I may only need a couple of pages, say page two, items ten to twenty; can I avoid pulling back all of the data? And there are linked computations over these resources, such as working out that among the people I follow, users A, B, and C also follow user D. All of this involves partial modification and retrieval of data, plus computation, which MC is simply not good at.

So all the follow relationships are stored in Redis, distributed by hash, with each shard stored as a group of multiple instances to separate reads from writes. Redis now holds about 30 terabytes of memory and serves 2 to 3 trillion requests per day.

Working with Redis, we hit some other issues. Take the follow relationship again: I follow 2,000 UIDs. One approach is to store everything, but Weibo has a huge number of users, some of whom rarely log in while others are very active, so keeping every UID list in memory is expensive. So we store only active users: if you have not been active for a while, you are evicted from Redis and loaded back into the Cache the next time you are accessed.

But there is a problem here, because Redis is single-threaded: when an evicted user comes back and his 2,000 follows have to be reloaded, which can expand to 20,000 UIDs for heavy users, stuffing 20,000 UIDs back in one go basically blocks Redis, and it can no longer serve anything else. So we extended Redis with a new data structure: the 20,000 UIDs are written directly into an open-addressed structure, which makes both reads and writes very efficient. The implementation is an open array of longs, addressed by double hashing.
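As an illustration of that data structure, here is a minimal open-addressed array of longs with double hashing, shown in Python for readability; the real extension lives inside Redis and is not this code.

```python
# Minimal sketch: open-addressed array of longs, addressed by double hashing.
class LongSet:
    EMPTY = -1

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [self.EMPTY] * capacity

    def _probe(self, uid):
        h1 = uid % self.capacity
        h2 = 1 + (uid // self.capacity) % (self.capacity - 1)  # second hash, never 0
        for i in range(self.capacity):
            yield (h1 + i * h2) % self.capacity

    def add(self, uid):
        for slot in self._probe(uid):
            if self.slots[slot] in (self.EMPTY, uid):
                self.slots[slot] = uid
                return
        raise RuntimeError("LongSet is full")

    def contains(self, uid):
        for slot in self._probe(uid):
            if self.slots[slot] == self.EMPTY:
                return False
            if self.slots[slot] == uid:
                return True
        return False
```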

We have made some other extensions to Redis as well, which you may have seen in some of our earlier posts online, such as keeping the data in shared global variables. For the full upgrade-and-reload path, our tests showed that loading 1 GB took 10 minutes, and 10 GB took well over ten minutes.

For AOF, we use a rolling AOF: each AOF file has an ID, and once it reaches a certain size we roll over to the next one. When an RDB snapshot is written, we record the AOF file and the position it corresponds to, so that full plus incremental replication can be implemented through this extended RDB-plus-AOF scheme.
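The idea can be sketched roughly as follows; the class below is an illustrative toy, not Weibo's Redis patch.

```python
import os

# Toy sketch of rolling AOF: segments have IDs, roll over at a size limit, and an
# RDB snapshot records (aof_id, offset) so a replica can take the snapshot plus
# the AOF tail for full + incremental copy.
class RollingAOF:
    def __init__(self, directory, max_bytes=64 * 1024 * 1024):
        self.directory = directory
        self.max_bytes = max_bytes
        self.aof_id = 0
        self.offset = 0
        os.makedirs(directory, exist_ok=True)

    def _path(self, aof_id):
        return os.path.join(self.directory, f"appendonly.{aof_id}.aof")

    def append(self, command: bytes):
        if self.offset + len(command) > self.max_bytes:
            self.aof_id += 1            # roll to the next segment
            self.offset = 0
        with open(self._path(self.aof_id), "ab") as f:
            f.write(command)
        self.offset += len(command)

    def snapshot_position(self):
        # Stored alongside the RDB file; a replica loads the RDB, then replays
        # every segment and offset after this position.
        return self.aof_id, self.offset
```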

3
Other data types – count

Then there are other data types, such as counts. Pretty much every Internet company has counters, and for small and medium-sized businesses MC and Redis are enough, but Weibo's counters have some distinctive features. A single key carries more than one count: a post has a repost count, a comment count, and a like count; a user has follower counts, followee counts, and all sorts of other numbers. Because they are counts, the values are small, roughly 2 to 8 bytes depending on the business scenario, with 4 bytes being the most common. Around a billion new post records are added every day, so the cumulative total is far larger still, and a single request may need to return hundreds of counts.

4
Counter Service

Initially we could use Memcached, but it has problems: if the counts exceed its memory capacity, some of them get evicted, and after a failure or restart they are lost. There is also the question of the huge number of counts whose value is zero: do we store them or not? Weibo handles counts billions of times a day, so even storing zeros consumes a large amount of memory, yet not storing them means requests penetrate to the DB, which hurts service availability.

After 2010 we moved the counters to Redis. As the data volume grew, we found that Redis's effective memory payload was still fairly low: a single KV needs at least 65 bytes, while a count only needs 8 bytes for the key and 4 bytes for the value, an effective payload of just 12 bytes, so more than 40 bytes are wasted. And that is a single KV; if one key carries several counts, the waste is even bigger. For example, four counts with an 8-byte key and 4 bytes per count is 16 bytes of counters; about 26 bytes would be enough in total, but in Redis it takes around 200 bytes.
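For illustration, a fixed-width record of this kind can be packed far more tightly than separate Redis KVs. The hypothetical sketch below shows the roughly 26-byte layout mentioned above (an 8-byte ID, four 4-byte counts, and two spare bytes for flags); the field names are made up.

```python
import struct

# Hypothetical packed layout: 8-byte post ID + four 4-byte counts + 2 spare
# bytes for flags = 26 bytes per record, versus ~200 bytes as separate Redis KVs.
RECORD = struct.Struct("<q4iH")   # little-endian: int64, 4 x int32, uint16

def pack_counts(post_id, reposts, comments, likes, reads, flags=0):
    return RECORD.pack(post_id, reposts, comments, likes, reads, flags)

def unpack_counts(blob):
    post_id, reposts, comments, likes, reads, flags = RECORD.unpack(blob)
    return {"id": post_id, "reposts": reposts, "comments": comments,
            "likes": likes, "reads": reads, "flags": flags}

record = pack_counts(1234567890123, 10, 2, 300, 5000)
assert RECORD.size == 26          # 8 + 4*4 + 2
print(unpack_counts(record))
```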

Later we developed our own Counter Service, which brought memory use down to between 1/5 and less than 1/15 of Redis, and separated hot and cold data: hot data stays in memory, and if cold data becomes hot again it is brought back into the LRU-managed memory. RDB and AOF are implemented to support full plus incremental replication. In this way, a single machine can serve the hot data from memory while still holding the much larger cold data set.

The entire storage architecture looks like the figure above, with memory on top and SSD below. Memory is pre-divided into N Tables, each covering a contiguous range of IDs. As new IDs keep arriving and new counts need to be added, an older, smaller Table is dumped to SSD, and the new IDs go into the Table at the top.

Some people ask: what if an ID in a given range was originally allocated a 4-byte count, but the post goes viral and the count grows beyond 4 bytes? Counts that exceed the limit are stored in an auxiliary dict (Aux dict). For the Tables that have been dumped to SSD we have a dedicated IndAux to access them, and they are replicated via RDB.
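Conceptually the hot path looks something like the sketch below: a pre-allocated table of fixed 4-byte slots, with anything that overflows spilling into an auxiliary dict. This is only an illustration of the idea, not the real Counter Service code.

```python
import array

MAX_U32 = 2**32 - 1   # a 4-byte slot can hold counts up to this value

class CounterTable:
    """Counts for a contiguous ID range, stored in fixed 4-byte slots."""
    def __init__(self, base_id, size):
        self.base_id = base_id
        self.slots = array.array("I", [0] * size)   # pre-allocated uint32 slots
        self.aux = {}                               # overflow: id -> big count

    def incr(self, post_id, delta=1):
        idx = post_id - self.base_id
        if post_id in self.aux:
            self.aux[post_id] += delta
        elif self.slots[idx] + delta > MAX_U32:     # spill into the aux dict
            self.aux[post_id] = self.slots[idx] + delta
            self.slots[idx] = 0
        else:
            self.slots[idx] += delta

    def get(self, post_id):
        return self.aux.get(post_id, self.slots[post_id - self.base_id])
```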

5
Other data types – Existence judgment

Besides counting, Weibo has businesses built on existence judgments, for example whether a post has already been liked, read, or recommended; if the user has already read a post, we should not show it to him again. The big characteristic here is that each record is tiny, a 1-bit value is enough, but the total volume is enormous: roughly 100 million new posts a day, and read records in the tens or hundreds of billions. How to store this is a big problem, and a large fraction of the records are zero. Again, do we store the zeros or not? If we do, hundreds of billions of records are written every day; if we do not, a huge number of requests will punch through the Cache layer to the DB, and no DB can handle that kind of traffic.

We evaluated a few options. First, could we just use Redis? A single KV there costs 65 bytes, while the key is 8 bytes and the value only 1 bit, so the effective use of the memory added every day is very low. Second, our newly developed Counter Service: with a 1-bit value stored in 1 byte, a KV needs only about 9 bytes in total. But even then, adding roughly 900 GB of memory per day means we can only keep the most recent few days; keeping three days is already close to 3 TB. The pressure is still considerable, although it is much better than Redis.

Our final solution was to develop Phantom ourselves. It starts from segmented shared memory and in the end uses only about 120 GB of memory. The algorithm is essentially a Bloom-filter-style bitmap and is very simple: each key is hashed N times, say three, and the corresponding bits are set to 1. To check whether a key exists, we hash it three times; if all three bits are 1, we consider it to exist, and if any of the bits is 0, we are 100 percent sure it does not exist.
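As an illustration of that idea (not the real Phantom implementation), a minimal Bloom-filter-style existence check with three hashes could look like this:

```python
import hashlib

# Minimal Bloom-filter-style sketch: hash each key three times, set those bits
# on write, and report "exists" only if all three bits are set on read.
# A zero bit means the key definitely does not exist.
class ExistenceFilter:
    def __init__(self, num_bits, hashes=3):
        self.num_bits = num_bits
        self.hashes = hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def set(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def exists(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

f = ExistenceFilter(num_bits=8 * 1024 * 1024)
f.set("uid:42:read:10086")
print(f.exists("uid:42:read:10086"))   # True
print(f.exists("uid:42:read:10087"))   # almost certainly False
```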

Phantom's implementation architecture is fairly simple: shared memory is split into Tables in advance, open addressing is used within them for computation, reads, and writes, and persistence is handled with AOF plus RDB. Because everything is kept in shared memory, data is not lost when the process is upgraded or restarted. For external access we built on the Redis protocol, extending it with new commands so that clients can reach our service directly.

6
Summary

To summarize: so far we have focused on high availability, scalability, and high per-component performance in the Cache clusters, and especially on storage cost. There are other things we have not covered here, such as operations and maintenance, and we now run thousands of servers.

7
Further optimization

8
As a service

The first step is to put the entire Cache and its configuration under service-style management, which avoids frequent restarts; if the configuration changes, it can be updated directly with a script.

Servitization also introduces a Cluster Manager for external management: operations go through an interface, and service verification can be performed there. On the service-governance side it supports capacity expansion and reduction and makes the SLA much easier to guarantee. And from the developers' point of view, the underlying Cache resources are now shielded from them.
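As a purely hypothetical illustration of what configuration-driven management might look like, a pool definition could be expressed as data that the Cluster Manager changes and pushes without restarting anything; none of the names or addresses below are Weibo's actual configuration format.

```python
# Hypothetical, illustrative pool configuration; not Weibo's real format.
FEED_CONTENT_POOL = {
    "name": "feed_content",
    "layers": {
        "l1":   ["10.0.1.1:11211", "10.0.1.2:11211"],   # several small hot pools
        "main": ["10.0.2.1:11211", "10.0.2.2:11211"],
        "ha":   ["10.0.3.1:11211"],
    },
    "serializer": "protobuf",
    "compression": "quicklz",
    "expire_seconds": 86400,
}
```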

Summary and outlook

In conclusion, we have optimized and enhanced the Weibo Cache architecture along several dimensions: data architecture, performance, storage cost, and servitization. Readers who work on or have questions about these topics are welcome to leave comments and discuss them with us.
