This article was published in the Cloud+ Community column by the Tencent Cloud database team, TencentDB.

Zou Peng is a senior engineer at Tencent working on the Tencent Cloud Redis database, with years of research and development experience in databases and network security. He has done in-depth work and built products across networking, computing, storage, and security, and has extensive hands-on experience with high availability, high reliability, and middleware for Redis, MySQL, and other databases.

This talk covers the Redis 4.0 cluster edition that Tencent Cloud officially launched last month: what we considered while building it, how we designed the overall system architecture, and what we ended up with. The first part is about the mission of Redis: what Redis is and why it is so popular. The second part covers the thinking behind the Tencent Cloud Redis 4.0 cluster design and the architecture we finally arrived at.

Redis is a star database of the mobile Internet era. Memcached was born when MySQL could no longer meet the demand for high concurrency and low latency. However, Memcached fell short in ease of use, and the business scenarios it supported were too simple. Hence the birth of Redis, a Swiss Army knife offering high performance, low latency, and support for complex data structures.

Now let's look at where Redis stands today. This data was updated just this month. Redis is used worldwide, but 65% of the traffic to the official Redis website now comes from mainland China, so Chinese programmers use it the most. Service businesses are currently the hottest category, and the first requirement for service quality is speed. Express delivery, ride hailing, and food delivery are all judged first on how fast they are, and speed is exactly where Redis is strong.

Performance means high concurrency and low latency. Let's see how much concurrency Redis can handle. Redis can serve 100,000 requests per second on a single processor, and at 50,000 requests per second 99% of requests return within 1 millisecond. As an in-memory cache, Redis does not require you to create tables, which I think is a real gift to us programmers. So Redis meets the needs of this era, appeals to developers, and has become the star of this era. Redis has about 10 years of history, but over the past two years its growth on the cloud has been sustained and rapid. The main use of Redis is caching. From our current data, if we do not exclude gaming scenarios, 80% of use cases are caching, so it is still primarily a cache database. Redis carries many labels, but we sum it up as a very fast, very simple, easy-to-use in-memory database. That is a quick portrait of Redis.

Now to today's topic: the Tencent Cloud Redis 4.0 cluster edition, which we have spent nearly half a year building. It is a distributed cache database built from the community 4.0 version plus a self-developed Proxy. First, what kind of database is the official Cluster? Compared with the master-replica edition, the official Cluster has a data plane and a management plane. On the data plane, the Cluster contains logic that shards the data across different shards, spreading it out. It also supports smooth migration: the cluster version adds two new redirection commands, so if the data is not on the shard you asked, that shard tells you which other shard has it. Together with a smart client, access does not fail even while data is being moved, because there is always somewhere to find it. The management plane is a fully autonomous management system based on the Gossip protocol. It is a decentralized design that needs no third-party coordinator: node management relies entirely on the nodes talking among themselves to decide whether a node is alive. The other piece is high availability, with a full set of failure detection logic and logic for voting a node dead. These are the two major pieces of the cluster edition in the official open source version.
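The talk does not spell out how the data plane decides which shard owns a key. In open source Redis Cluster the key space is divided into 16384 hash slots, and a key's slot is the CRC16 of the key modulo 16384, with an optional `{hash tag}` to force related keys into the same slot. The following is a minimal sketch of that rule; the function names are chosen here for illustration and are not taken from the Tencent Cloud Proxy.

```python
SLOT_COUNT = 16384  # number of hash slots in open source Redis Cluster


def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Map a key to its hash slot, honoring a non-empty {hash tag} if present."""
    raw = key.encode("utf-8")
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty tag between the first "{" and the next "}" counts.
        if end != -1 and end > start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % SLOT_COUNT


if __name__ == "__main__":
    for name in ("user:1000", "{user1000}.following", "{user1000}.followers"):
        print(name, key_slot(name))
```

Keys that share a hash tag, such as the last two above, always land in the same slot, which is what makes multi-key operations possible on a sharded cluster.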

We believe Redis Cluster must have a Proxy. First, the native Cluster requires smart client support. As just mentioned, the cluster version has several new commands: when the data you access is not on the shard you asked, it tells you to fetch it elsewhere. A client written for the standalone version never had to handle such replies, so after migrating to the cluster edition it is stumped when it meets one and cannot continue. This is where a smart client is needed. The other issue is that the client has to be aware of the backend architecture: all topology information is synchronized to the client, and the client does the sharding. That is simpler for operations but extremely unfriendly to developers. On the cloud, IP resources are precious. We have an e-commerce customer running a 128-shard cluster with one master and two replicas per shard, so the node count is 128 × 3, roughly 400 IPs. A single class C network is not enough, and exposing all of that to the client is far too unfriendly. A Proxy is also necessary to enrich functionality. The cluster's own monitoring is insufficient, for example for data skew, because the decentralized design has no global view. We need traffic isolation, hot key monitoring, and access monitoring, which means either changing the Redis server code or implementing them in middleware. On the cloud there are many customers with many requirements; if every feature meant changing Redis code, the code would become hard to maintain. The simplest approach is a smart Proxy, which acts as a smart client. We pushed the sharding logic down into this middleware.
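To make the smart client requirement concrete: in open source Redis Cluster, the replies that tell a client to look elsewhere are the `MOVED` and `ASK` redirection errors, of the form `MOVED <slot> <host>:<port>`. The sketch below shows the minimum a smart client or proxy has to do with them. It assumes these are the replies the talk refers to, and the `Redirect` type and both function names are hypothetical.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Redirect(NamedTuple):
    kind: str   # "MOVED" (slot has a new permanent owner) or "ASK" (one-off retry)
    slot: int
    host: str
    port: int


def parse_redirect(error_line: str) -> Optional[Redirect]:
    """Parse a redirection error such as '-MOVED 3999 127.0.0.1:6381'."""
    parts = error_line.lstrip("-").split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None
    try:
        slot = int(parts[1])
        host, port_text = parts[2].rsplit(":", 1)
        port = int(port_text)
    except ValueError:
        return None  # malformed slot, address, or port
    return Redirect(parts[0], slot, host, port)


def apply_redirect(slot_table: Dict[int, Tuple[str, int]], redirect: Redirect) -> Tuple[str, int]:
    """Return the node to retry against; only MOVED updates the cached slot table."""
    target = (redirect.host, redirect.port)
    if redirect.kind == "MOVED":
        slot_table[redirect.slot] = target
    return target


if __name__ == "__main__":
    table: Dict[int, Tuple[str, int]] = {}
    reply = parse_redirect("-MOVED 3999 127.0.0.1:6381")
    if reply is not None:
        print(apply_redirect(table, reply), table)
```

A client built for the standalone version has no such handling, which is why pushing this logic into a Proxy lets existing clients keep working unchanged.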

So if we want to pick a Proxy, what are the alternatives? These should be familiar. Twemproxy is an antique. Its biggest weakness as a proxy component is that it cannot support scaling out or in, and you cannot afford to move all the data again every time the business grows. The other is Codis, developed by spinlock, a well-known engineer in China. Codis is a complete solution, but it is a very large and complex system. It is less elegant than the official design, it modifies the Redis server code, and it is not part of the official lineage. These are the common mainstream solutions, but we cannot move them onto the cloud directly, because they cannot cover the needs of the thousands of users on the cloud.

Now look at the solution built by Tencent Cloud. At the back is the official open source Cluster, the fully autonomous version, to which we made a small number of optimizations. In front of it is the Proxy, our smart client, which does request forwarding, a lot of custom monitoring, and data sharding. At the very front is the LB, mainly to provide a VIP, so developers see a single IP and use it just like the standalone version. This is a more elegant solution: everything behind it is hidden, and you only need to read and write. This is our final solution.

At the data operation level, the Redis cluster edition itself is simple and stable. In building the cluster edition we put most of our effort into two areas. The first is data migration. In what scenarios is data migration needed?

Audience: Hello, teacher. I am a junior engineer, and our company also uses a Redis cluster. If we move to Tencent Cloud, does that take care of the proxy you just described? Is all of that managed by you? Until now we searched on Baidu ourselves and used the official cluster solution.

Zou Peng: Where are your data now?

Audience: On our own premises. We are interested in buying Tencent Cloud Redis.

Zou Peng: You already have data, and it needs to come along when you move to the cloud service. We have the DTS platform. As long as the network is connected, our tools can connect to your Redis and transfer the data.

Audience: Thank you, teacher.

Zou Peng: The advantage of cloud computing is that you get what you want immediately. The SaaS and PaaS layers of the cloud are already very mature in China. If you start your own business someday, just leave this hard work to us.

Back to data migration. When it comes to the stability of the cluster edition, the biggest challenge is data migration. In what scenarios does it happen? One is scaling. You can see our scenario has three dimensions: horizontally, the number of shards, up to 128; vertically, the shard size, adjustable from 4 GB to 32 GB; and the number of replicas, up to five, for 100,000 writes and 500,000 reads. In this setup both scale-out and scale-in happen. You may have bought too little at the start and can then expand horizontally or vertically. We invested heavily in this. The other scenario is that a cluster edition inevitably produces data skew: if your key design is unreasonable, most of your data lands on a single shard, and fixing that skew involves data migration.

The difficult part is making the migration smooth. In the extreme case where a key is accessed at the moment it is being moved, the request waits for a few cycles; we can discuss the detailed principle offline. As for the current state: originally, moving data would break connections. The cluster edition now supports migration, and combined with our Proxy the redirections are shielded from the client. You do not need to stop service to scale out or in while your business is running, but we still recommend doing it during off-peak hours. You can specify a time for the change, for example three o'clock in the morning. Redis has two pain points: the first is big keys and the second is hot keys. If we have a big key problem, for example, should data migration move the big key or move other keys?

Analyzing big keys normally means RDB analysis, which is very slow; on the cloud we do backups every day. Here we built an asynchronous lazy scan for big keys: before migration the keys are scanned one by one, and combined with a data algorithm we know where the big keys are, so we avoid relocating them. At the very least, your Redis will no longer get stuck on big keys during migration.
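The talk does not describe the selection algorithm, so the following is only a sketch of the idea of steering migration away from big keys. It assumes the asynchronous scan has already produced per-slot statistics; the `SlotStats` type, the `pick_slots_to_move` function, and the threshold parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SlotStats:
    slot: int
    total_bytes: int        # memory used by all keys in the slot
    largest_key_bytes: int  # size of the biggest key found by the scan


def pick_slots_to_move(stats: Iterable[SlotStats],
                       bytes_to_move: int,
                       big_key_threshold: int) -> List[int]:
    """Greedily pick slots to migrate, skipping any slot that holds a big key."""
    chosen: List[int] = []
    moved = 0
    # Largest slots first, so fewer slots need to move.
    for entry in sorted(stats, key=lambda s: s.total_bytes, reverse=True):
        if moved >= bytes_to_move:
            break
        if entry.largest_key_bytes >= big_key_threshold:
            continue  # moving this slot would mean moving a big key
        chosen.append(entry.slot)
        moved += entry.total_bytes
    return chosen


if __name__ == "__main__":
    scanned = [
        SlotStats(slot=1, total_bytes=500, largest_key_bytes=400),
        SlotStats(slot=2, total_bytes=300, largest_key_bytes=10),
        SlotStats(slot=3, total_bytes=200, largest_key_bytes=20),
        SlotStats(slot=4, total_bytes=100, largest_key_bytes=5),
    ]
    print(pick_slots_to_move(scanned, bytes_to_move=450, big_key_threshold=100))
```

In this toy input, slot 1 is the largest but contains a big key, so the sketch leaves it in place and moves smaller slots instead.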

Audience: Does your migration affect the figures shown earlier?

Zou Peng: Migration is designed with business impact in mind. It does not require stopping service, and we want to achieve the best possible availability.

We need to do global monitoring in the Proxy. How do we get the most value out of the Proxy? 1. Access monitoring; 2. Key analysis; 3. Metric monitoring; 4. Slow queries; 5. Alarm configuration; 6. Traffic isolation.

We analyze the keys of an instance and tell you what keys are stored in Redis and what their prefixes are, as well as the big keys. Accurate big key detection is done through RDB analysis. The big key case mentioned above, which needs a real-time view during data migration, uses the asynchronous scan. Sometimes you want to see what data developers have written into the instance, and this data tells you about your keys. There is also monitoring of common metrics such as traffic and hit rate. Hit rate is very important for a cache: if it drops to 10% or 5% at some point, something is wrong. It is a key metric that helps us spot anomalies in time. We monitor capacity, traffic, hits and misses, and queries.
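Two of the metrics above are simple enough to sketch. Hit rate is hits divided by hits plus misses (open source Redis exposes the two counters as `keyspace_hits` and `keyspace_misses` in `INFO stats`), and prefix analysis amounts to counting keys by their leading segment. The helper names and the `:` separator are assumptions for illustration, not the Proxy's actual implementation.

```python
from collections import Counter
from typing import Iterable, Optional


def hit_rate(hits: int, misses: int) -> Optional[float]:
    """Fraction of lookups that found a key; None when there were no lookups."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total


def prefix_histogram(keys: Iterable[str], separator: str = ":") -> Counter:
    """Count keys by the part before the first separator (the whole key if absent)."""
    return Counter(key.split(separator, 1)[0] for key in keys)


if __name__ == "__main__":
    print(hit_rate(hits=90, misses=10))
    print(prefix_histogram(["user:1", "user:2", "order:9", "plain"]).most_common())
```

An alarm on a hit rate falling below a chosen threshold is then a one-line comparison on the value returned by `hit_rate`.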

Slow queries are not especially common, but they do occur. Tencent Cloud as a whole has a complete monitoring system, and all metrics are connected to cloud monitoring: configure a metric, and an alarm is sent when it reaches the threshold. Big keys arise easily and can affect other instances. In that case we must isolate traffic, and we do so in the Proxy to guarantee service availability.
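The talk does not say which mechanism the Proxy uses for traffic isolation. One common way to keep a single noisy instance from starving others is a per-instance token bucket, sketched below as an illustration of the general technique rather than the Tencent Cloud implementation. The class name and parameters are hypothetical, and time is passed in explicitly so the behavior is easy to test.

```python
class TokenBucket:
    """Allow up to `rate_per_sec` units per second, with bursts up to `burst`."""

    def __init__(self, rate_per_sec: float, burst: float, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill according to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    import time

    bucket = TokenBucket(rate_per_sec=2, burst=2, now=time.monotonic())
    for _ in range(4):
        print(bucket.allow(time.monotonic()))
```

With `cost` set to the number of bytes in a request or reply, the same bucket limits bandwidth instead of request count, which is closer to the big key case described above.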

This is the final layout. The servers correspond to the 4.0 cluster at the bottom; this is the three-layer structure. On the side are the supporting systems, such as monitoring, resource management, and backup. The open source version already has distributed autonomy; we also work at a finer granularity, such as the host layer, to maximize availability.

Next is high availability across availability zones. Everyone fears the excavator driver who digs up a cable, so we will provide a cross-zone high availability solution. For example, if you buy a 4.0 cluster in the Guangzhou region, we replicate the cluster to other availability zones. Each availability zone has a different IP and accepts both reads and writes, but writes are sent back to the primary availability zone. When a failure makes an entire zone inaccessible, your business keeps running after switching to the second zone. In terms of availability, this architecture guarantees availability at the zone level.

In addition, I would like to introduce the CKV engine, which is compatible with the Redis protocol and was launched on Tencent Cloud at the beginning of 2018. The name CKV is plain and simple; it was chosen by the R&D staff and is not a grand name like Altair or Vega. Here is the overall picture. We started the project in 2009, and the main driver was Qzone. It grew from there, and by 2013 it was serving 1 billion page views. Last year we focused on compatibility with the Redis protocol. We officially launched it at the beginning of this year, and you can find it on our official website.

Why discuss this design separately? I said that a cluster without a Proxy is not an elegant cluster edition, yet CKV has no Proxy and is certainly elegant. A Proxy has many advantages, but one problem: it costs money, and a lot of it. CKV uses another approach, its earliest design, with no Proxy. A request lands on an arbitrary shard; every shard holds the global slot information, and if a shard finds that it cannot handle the request, it forwards the request to the destination node for processing. Every node can act as a Proxy. The benefits are lower cost and lower latency. Here is the logical diagram, from CVM to LB to data node. If your request reaches a replica node, the replica passes the request to the master node, and the master returns the data to the replica when done. This is CKV's different approach, which is also a distributed design. In addition, Redis spends its resources on key operations and network operations. At 50,000 to 100,000 QPS, for example, the network accounts for a large share. We made network sending and receiving multi-threaded, which keeps data consistent while improving performance. Workloads that need transaction support are hard to run on the cluster edition; in that case you can consider this mode, which can both break through 100,000 QPS and shard the data. For more cutting-edge database technology, follow our official account: Tencent Cloud Database CDB.
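The proxy-less forwarding that CKV uses can be pictured with a small in-process model: every node holds the same global slot map, serves keys it owns, and forwards everything else to the owner. This is a toy sketch only. The `Node` class, the eight-slot map, and the use of `zlib.crc32` as the hash are assumptions made for illustration, and CKV's real slot count, hashing, and wire protocol are not described in the talk.

```python
import zlib
from typing import Dict, List, Optional

SLOT_COUNT = 8  # deliberately tiny, for illustration only


def slot_of(key: str) -> int:
    # Stand-in hash; the engine's real hash function is not given in the talk.
    return zlib.crc32(key.encode("utf-8")) % SLOT_COUNT


class Node:
    def __init__(self, name: str):
        self.name = name
        self.data: Dict[str, str] = {}
        self.slot_owner: Dict[int, "Node"] = {}  # global slot map, same on every node
        self.forwarded = 0

    def handle(self, command: str, key: str, value: Optional[str] = None):
        owner = self.slot_owner[slot_of(key)]
        if owner is not self:
            # Not ours: act as a proxy and forward to the owning node.
            self.forwarded += 1
            return owner.handle(command, key, value)
        if command == "SET":
            self.data[key] = value if value is not None else ""
            return "OK"
        if command == "GET":
            return self.data.get(key)
        raise ValueError(f"unsupported command: {command}")


def build_cluster(names: List[str]) -> List[Node]:
    nodes = [Node(name) for name in names]
    owner = {slot: nodes[slot % len(nodes)] for slot in range(SLOT_COUNT)}
    for node in nodes:
        node.slot_owner = owner
    return nodes


if __name__ == "__main__":
    cluster = build_cluster(["a", "b", "c"])
    cluster[0].handle("SET", "k1", "v1")
    print([node.handle("GET", "k1") for node in cluster])
```

Whichever node the load balancer picks, the client gets the same answer, at the cost of at most one extra hop inside the cluster instead of a separate Proxy tier.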

Q & A

Q: Hi, what is the usage ratio between Redis and MySQL?

A: I’m curious, what’s the context of your question?

Q: MySQL should account for a much larger share than Redis.

A: If you look at this chart, the actual ratio is roughly 10:1.

Q: Have you considered how Redis could achieve high packet throughput on a single node? Could DPDK be considered?

A: We tried that line of thinking too, but the return on investment is not especially high. A popular idea in technical circles right now is to bypass the OS, the file system, and the protocol stack, but the cost really is very high. TCP in particular is slow, old, and conservative, yet it has run for so many years; building a new stack would be very costly with a very low return. For the Redis cluster edition, we can instead add shards to improve write performance and add replicas to improve read performance. That is some of the thinking we went through.

This article has been authorized by the author for publication on the Tencent Cloud+ Community.
