
Covers principles plus BAT case studies; about a five-minute read.

Preview the content of this article:

  1. What and why: 1.1 What is a message queue; 1.2 Why use a message queue; 1.3 What problems does introducing a message queue bring

  2. How it is done in practice: 2.1 How RocketMQ supported seven years of zero-failure Double 11; 2.2 Kuaishou's smooth expansion of trillion-scale Kafka clusters; 2.3 Kuaishou/Meituan optimizations for Kafka cache pollution; 2.4 CMQ in the WeChat red packet payment scenario

Part1 What Is It? Why Use It?

1 What is a message queue

Java developers should be familiar with queues: they manage data first-in, first-out (or in and out at both ends, in the case of a deque), and their blocking variants naturally balance load between producers and consumers.

Message queues take their name from these queues because they are similar in function and operation. We can therefore think of a message queue simply as an intermediate service that handles data transmission, management, and consumption between services in a distributed system.
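The analogy with Java's native queues can be made concrete with `java.util.concurrent`. A minimal in-process sketch (class and method names are illustrative, not any real MQ client API):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Producer and consumer never reference each other; the queue in between
// manages data first-in, first-out, and put()/take() block when the queue
// is full or empty, naturally pacing both sides.
public class MiniQueueDemo {
    public static String[] roundTrip(String first, String second) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        queue.put(first);        // producer side: publish
        queue.put(second);
        // consumer side: messages come back in FIFO order
        return new String[] { queue.take(), queue.take() };
    }
}
```

A real message queue adds persistence, replication, and network transport on top of this same produce/consume contract.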

2 Why use message queues

Q: Why did you introduce a message queue into your system?

To answer this, we need to understand both the value a message queue provides and the actual pain points in our own business scenario; together these justify introducing one into the system.

Decoupling between systems

A few days ago I discussed this with an expert who follows my official account, using the update of advertising material data as an example:

When ad creative or material information changes, the retrieval system needs to update its index. But the retrieval system has no real need for strong, interface-level coupling to the material, asset, and other upstream systems, and direct interface calls are unfriendly in terms of both maintenance cost and system load. This is where a message queue matters: each system publishes its own messages, and whoever needs them subscribes, achieving the goal without adding extra call pressure between systems. (Note: an interface call is still used to fetch the latest full data, which can be optimized with compression and the like.)

Therefore, when systems do not need real-time interaction but still need each other's business data, a message queue can decouple them. As long as the publisher defines the message format, any consumer can operate entirely independently of the publisher, reducing unnecessary cross-team coordination and release conflicts.
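The decoupling described above can be sketched as a tiny publish/subscribe topic (a simplification, not a real broker: the `Topic` class and handler wiring are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// The publisher only knows the topic, never the subscribers; each downstream
// system (retrieval, reporting, ...) registers its own handler independently.
public class Topic {
    private final List<Consumer<String>> subscribers = new ArrayList<>();

    public void subscribe(Consumer<String> handler) {
        subscribers.add(handler);
    }

    public void publish(String message) {
        for (Consumer<String> s : subscribers) {
            s.accept(message);   // fan out to every registered subscriber
        }
    }
}
```

Adding a new downstream system is just one more `subscribe` call; the material system's publishing code never changes.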

Service asynchronization

The most typical example is the result notification function in the payment scenario.

We know that app pushes and SMS notifications are generally time-consuming operations, and there is no reason for these non-core functions to slow down the core payment flow. As long as we publish the payment result to the topic designated by the SMS center once payment completes, the SMS center will receive the message and ensure the user is notified.

(Image source: a Zhihu answer)

Therefore, using message queues to make non-core operations asynchronous effectively improves the efficiency and stability of the entire business link.
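The asynchronous notification path can be sketched as follows (the topic name `smsTopic` and `pay` method are illustrative): the payment thread only enqueues the result and returns, while a separate worker playing the role of the SMS center drains the topic at its own pace.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// The core payment path never waits on push/SMS delivery.
public class AsyncNotify {
    static final BlockingQueue<String> smsTopic = new LinkedBlockingQueue<>();

    // Core path: finish payment, publish the result, return immediately.
    public static String pay(String orderId) {
        smsTopic.offer("paid:" + orderId);  // fire-and-forget publish
        return "OK";                        // user sees success without waiting for SMS
    }
}
```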

Peak shaving

This capability is the focus of this article: in the face of pulse-like traffic at massive scale, such as flash sales and Spring Festival Gala red packets, a message queue is an effective means of protecting our services from collapse.

Through the message middleware's high-performance storage and processing capability, traffic exceeding the system's capacity is temporarily stored and then released gradually, at a rate the system can handle, achieving the peak-shaving effect.

Take our advertising billing system as an example. Facing tens of thousands of concurrent searches and thousands of concurrent click operations, a real-time interface approach is bound to fail. After all, advertising behavior differs from payment behavior: a user can retry a failed payment, but a user's click on an ad cannot be replayed; once the traffic is gone, it is gone. Therefore, to keep the billing system stable, a message queue should be used to buffer billing requests.
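The buffering idea can be sketched with a bounded queue (capacity and method names are illustrative): bursts are absorbed up to the buffer's capacity, `offer()` fails fast beyond that instead of blocking the click path, and the billing worker drains at its own sustainable rate.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A bounded buffer between the click path and the billing system.
public class ChargeBuffer {
    private final BlockingQueue<String> buffer;

    public ChargeBuffer(int capacity) {
        buffer = new ArrayBlockingQueue<>(capacity);
    }

    // Called on every billable click; never blocks the click path.
    public boolean submit(String clickEvent) {
        return buffer.offer(clickEvent);   // false = load shed beyond capacity
    }

    // Billing worker drains at its own pace, in arrival (FIFO) order.
    public String pollNext() {
        return buffer.poll();
    }
}
```

In a real deployment the buffer is the MQ cluster itself, so capacity is disk-backed and far larger than any in-memory queue.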

Other features

Message queues are also commonly used for features such as broadcast, transactional messages, and eventual consistency.

3 What problems does a message queue bring

Increased response latency on services

As mentioned above, message queues make non-core business flows asynchronous, which improves the responsiveness and smoothness of the overall operation and thus the user experience. But precisely because data is queued, consumption is inevitably delayed, and some business changes will not take effect immediately.

For example, in a product recommendation scenario I worked on, the recommendation list must exclude flash-sale items, so that special goods do not distort the recommendation results. Besides flash-sale status, we also needed to track whether an item was off the shelf, blacklisted, out of stock, and so on, so we used Redis bit offsets to maintain multiple states per item. When a message from the promotion team arrives, it updates the item's state in the recommendation cache cluster; because of message delay, the state may not change in time. As long as that trade-off is acceptable to both business and engineering, it is fine.
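The multi-state bitmap idea looks like this in miniature (bit positions are illustrative, not the original system's layout; in production the bits live in Redis via SETBIT/GETBIT on a per-item key, while here a plain `long` stands in for the bitmap):

```java
// One bit per item state; an item is recommendable only when no flag is set.
public class ItemState {
    static final int OFF_SHELF = 0, BLACKLISTED = 1, OUT_OF_STOCK = 2, FLASH_SALE = 3;

    private long bits;

    public void set(int flag)    { bits |= (1L << flag); }   // SETBIT key flag 1
    public void clear(int flag)  { bits &= ~(1L << flag); }  // SETBIT key flag 0
    public boolean has(int flag) { return (bits & (1L << flag)) != 0; }  // GETBIT

    public boolean recommendable() { return bits == 0; }
}
```

A consumed promotion message simply calls `set`/`clear` for the relevant bit; the delay discussed above is the gap between the event and that call landing.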

Introduce architectural instability

Introducing a message queue is equivalent to adding a new system to the existing distributed service chain, and system complexity rises accordingly. At the same time, the role the queue plays demands high performance and high availability from it.

Therefore, middleware teams and business systems need to work together on how to deploy a highly available, stable cluster, how to retry failed message deliveries, how to set broker data-synchronization policies, how to make consumers idempotent against message redelivery caused by broker exceptions, and how to retry failed consumption.

Part2 How Is It Done?

4 RocketMQ: seven years of zero-failure Double 11

The 2020 Double 11 peak reached 583,000 transactions per second. RocketMQ has many deep customizations for Alibaba's transaction ecosystem; here we cover only the high-availability optimizations.

My personal understanding is that push-mode consumption suits only scenarios where consumption speed is well above production speed; in high-traffic, high-concurrency scenarios, pull-mode consumption dominates.

In the traditional pull model, if a client hangs, no rebalance happens until the hang is detected, roughly 20 seconds later. During that window, messages on the queues assigned to that client via the broker cannot be consumed in time, creating a backlog. The solution: POP, a new consumption mode.


POP consumption needs no rebalance. Instead, a client can request messages from all brokers, and each broker internally distributes messages from its queues to the waiting POP clients according to an algorithm. Even if PopClient2 hangs, the messages on its queues will still be consumed by PopClient1 and PopClient3, avoiding a backlog. [1]

5 Kuaishou's smooth expansion of trillion-scale Kafka clusters [2]

For expansion to be smooth, partition migration must be imperceptible to producers.

The general principle: keep synchronizing data from the partition to be migrated into the new partition for a period of time, until all consumers have caught up past the synchronization start point; then switch the route and delete the original partition, completing the migration.
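The route-switch condition above can be stated as a small predicate (a drastic simplification of the broker's real logic; names are illustrative): the flip is safe only once every consumer's offset has passed the point where mirroring began, otherwise unread messages would be stranded on the old partition.

```java
import java.util.List;

// Decides whether the migration's second phase (route switch) may begin.
public class PartitionMigration {
    public static boolean canSwitchRoute(long syncStartOffset, List<Long> consumerOffsets) {
        for (long off : consumerOffsets) {
            if (off < syncStartOffset) return false;  // a consumer still lags behind
        }
        return true;  // everyone caught up: switch route, delete old partition
    }
}
```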


The same data-synchronization idea applies to the disaster-recovery (DR) scheme of Facebook's distributed queue.

6 Kuaishou/Meituan optimization of Kafka cache pollution [3]

Kafka's high performance comes from sequential file reads and writes and the OS-level PageCache. In single-partition, single-consumer scenarios Kafka performs very well. However, when many partitions share the same machine, and especially when real-time and delayed consumption patterns are mixed, PageCache resources are contended, causing cache pollution and hurting the broker's serving efficiency.

Meituan deals with real-time/delayed consumption cache contamination

Data is distributed across devices along the time dimension, with near-real-time data cached on SSD. When PageCache contention occurs, real-time consumption jobs read from SSD, so they are not affected by delayed-consumption jobs. When a consumption request reaches the broker, the broker uses its maintained mapping from message offset to device to retrieve and return data from the corresponding device directly. Read requests do not flush data from HDD back to SSD, preventing cache pollution; and because the access path is deterministic, there is no extra overhead from cache misses.
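The offset-to-device routing reduces to a watermark comparison (a sketch; the watermark variable and device labels are illustrative): anything newer than the SSD's low watermark is served from SSD, anything older from HDD, and reads never promote HDD data into the SSD tier.

```java
// Route a read to a storage tier purely by message offset.
public class TieredRead {
    public static String deviceFor(long msgOffset, long ssdLowWatermark) {
        // Near-real-time data (high offsets) lives on SSD; historical data on HDD.
        return msgOffset >= ssdLowWatermark ? "SSD" : "HDD";
    }
}
```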

Kuaishou deals with cache contamination caused by follower data synchronization

Two components are introduced into the broker: a block cache and a flush queue.

A producer write request is appended to the flush queue as a message on the broker side and also written to a block in the block cache. Data in the flush queue is asynchronously written to disk (through the page cache) by other threads, ensuring the write path is not affected by follower synchronization.

The consumer first looks up data in the block cache and, on a hit, returns it directly; otherwise data is read from disk. Crucially, a consumer's cache-miss read does not populate the block cache, which is what avoids pollution.
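The two read paths can be sketched as follows (class and field names are illustrative, and the maps stand in for the real block cache and the flush-queue-to-disk path): producer writes populate the cache, while cache-miss reads go straight to disk and deliberately do not write back into the cache.

```java
import java.util.HashMap;
import java.util.Map;

// Cold (lagging/follower) reads cannot evict hot data from the block cache.
public class BlockCacheBroker {
    private final Map<Long, String> blockCache = new HashMap<>();
    private final Map<Long, String> disk = new HashMap<>();
    int diskReads = 0;

    public void produce(long offset, String msg) {
        blockCache.put(offset, msg);  // hot write path into the block cache
        disk.put(offset, msg);        // stand-in for the async flush-queue write
    }

    public void evict(long offset) {  // simulates normal cache turnover over time
        blockCache.remove(offset);
    }

    public String consume(long offset) {
        String hit = blockCache.get(offset);
        if (hit != null) return hit;  // cache hit: real-time consumer path
        diskReads++;
        return disk.get(offset);      // miss: read disk, NO blockCache.put here
    }
}
```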

Conclusion

As we can see, the basic starting point for solving cache pollution is to separate tasks with different consumption rates or different data sources; this divide-and-conquer approach keeps their caches from interfering with one another.

7 CMQ in the WeChat red packet payment scenario [4]

The process behind opening a red packet, simplified: read account A's balance, subtract the amount, and write the result back to account A; then, for the opened packet, add the amount to account B and write that result back to account B.

Because the billing system's capacity is limited (usually constrained by the accounting system, for reasons such as locks and transaction-processing efficiency), the ledger write can fail. Under purely real-time business logic, a failed red-packet opening would require an immediate rollback (adding the amount back to account A). After CMQ was introduced, the link changed: failed requests are written to CMQ, and CMQ's high availability guarantees the data is preserved and replayed until the accounting system finally books it successfully. This spares the accounting system the extra rollback operations that booking failures under load would otherwise cause.
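The failure path can be sketched like this (class, method, and queue names are illustrative, and the `ledgerAvailable` flag simulates whether the accounting system accepts the write): on failure the request is parked for replay instead of rolling the deduction back out of account A.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Eventual consistency via a durable retry queue instead of real-time rollback.
public class RedPacketTransfer {
    final BlockingQueue<String> retryQueue = new LinkedBlockingQueue<>();

    public String openPacket(String txId, boolean ledgerAvailable) {
        if (ledgerAvailable) {
            return "BOOKED";        // normal path: credit account B directly
        }
        retryQueue.offer(txId);     // failure path: enqueue, replay until booked
        return "QUEUED";
    }
}
```

The safety of this pattern rests entirely on the queue's durability guarantees, which is why the article stresses CMQ's high availability.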

Part3 Summary

Starting from the roles a message queue plays, this article has walked through the cases of Alibaba's Double 11, Kuaishou, Meituan, and WeChat red packets, covering both optimizations of message queues themselves and efficient business use of them, to illustrate the value of message queues in high-concurrency scenarios. If you have questions, please leave a comment so we can discuss and learn from each other.

Recommended reading

1.1 Architecture optimization: cluster deployment and load balancing
1.2 Realizing load balancing under trillion-level traffic
1.3 Architecture optimization: clever use of message middleware
1.4 Architecture optimization: storage degradation with message queues
1.5 Storage optimization: MySQL index optimization explained
1.6 Storage optimization: sharding (sub-database, sub-table) explained
1.8 Supplement: Ali database middleware source-code analysis
1.9 Storage optimization: cache is king among many strategies

References

[1] Original article by CSDN blogger "Alibaba Cloud Native", under the CC BY-SA 4.0 license: blog.csdn.net/alisystemso…

[2] Kuaishou trillion-level Kafka cluster application practice and technology evolution: www.infoq.cn/article/Q0o…

[3] Meituan's practice of Kafka application-layer caching: www.infoq.cn/article/k6d…

[4] Spring Festival Gala WeChat red packet case: cloud.tencent.com/document/pr…

Copyright notice: This article is an original article by Coder’s Technical Path. Reprint without permission is prohibited.