Preface

My last company ran a catering system, and the concurrency in the system was huge during the lunch and dinner rush every day. To be safe, the company required all departments to be on duty at mealtimes in case of production problems.

I was on the kitchen display system team, which sits downstream of the order system. After a user places an order, the order system sends a Kafka message to our system. After reading the message, our system runs its business logic, persists the order and dish data, and then shows them on the kitchen display client. This way the chef knows which dishes to cook for which order; when a dish is ready, it can be marked as ready through the system, and the system automatically notifies the waiter to serve it. When the waiter finishes serving, he or she updates the serving status, so the user knows which dishes have been served and which have not. This system greatly improves efficiency from the kitchen all the way to the user.

The key to all of this, it turns out, is the message middleware, Kafka. If it has a problem, the kitchen display system is directly affected.


Next, I’d like to talk about the pits I stepped into with Kafka over those two years.

Message order problems

1. Why ensure the order of messages?

At the beginning there were few merchants in our system, and to deliver the feature quickly we kept the design simple. Since communication went through the Kafka message middleware, the order system put the full order details in the message body when sending a message. Our display system only had to subscribe to the topic, read the message data, and then process its own business.

But there was a key element to the scheme: ensuring the order of the messages.

Why is that?

An order has many states, such as: placed, paid, completed, cancelled, and so on. If the "place order" message has not been read yet but the "pay" or "cancel" message is read first, wouldn't the data become inconsistent?

Well, it seems that it is necessary to ensure the order of messages.

2. How to ensure message order?

We all know that a Kafka topic as a whole is unordered, but a topic contains multiple partitions, and each partition is ordered internally. The thinking then becomes clear: make sure the producer writes messages to the same partition according to some rule, and have different consumers read different partitions, and the order of production and consumption is preserved.

That is what we did at the beginning: messages with the same merchant code were written to the same partition. The topic was created with 4 partitions, and we deployed 4 consumer nodes forming one consumer group, with each partition assigned to one consumer node. In theory, this scheme guarantees message order.
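To make the routing concrete, here is a minimal sketch of a producer that pins one merchant's messages to a single partition by using the merchant code as the record key. The class and topic name are illustrative, not our actual code; with Kafka's default partitioner, records with the same key always land on the same partition.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderMessageProducer {

    private final KafkaProducer<String, String> producer;

    public OrderMessageProducer(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
    }

    // Same merchant code -> same key -> same partition, so one merchant's
    // messages keep their relative order.
    public void send(String merchantCode, String orderJson) {
        producer.send(new ProducerRecord<>("order_topic", merchantCode, orderJson));
    }
}

Switching the routing rule later (for example to the order number, as described further down) only changes what is passed as the record key.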

Everything seemed to work out perfectly, and we went live.

3. Accidents happen

The feature had been online for a while, and at first everything was normal.

However, the good times did not last long. I soon received complaints from users that some orders and dishes could not be seen on the kitchen display client.

I located the cause: during that period the company's network was often unstable, business interfaces timed out from time to time, and requests intermittently failed to connect to the database.

This can be devastating to sequential messages.

Why do you say so?

Suppose the order system sends three messages in sequence: "place order", "pay", and "complete". If our system fails to process the "place order" message due to a network issue, the data from the following two messages cannot be written to the database, because only the "place order" message carries the complete order data; the other message types only update the status.

This was compounded by the fact that we had no failure-retry mechanism. So the problem became: once the "place order" message fails to be written to the database, the user never sees that order or its dishes.

So how can this urgent problem be solved?

4. Resolution process

The initial idea was: if processing fails while the consumer is handling a message, retry 3 to 5 times immediately. But what if some requests don't succeed until the sixth attempt? We can't retry forever, and this kind of synchronous retry blocks the reading of order messages from other merchants.

Clearly, a synchronous retry mechanism like this would seriously hurt the consumer's consumption speed and throughput whenever an exception occurred.

So we had to use an asynchronous retry mechanism.

With asynchronous retry, messages that fail processing are saved to a retry table.

But a new problem immediately arose: if only the failed message is saved, how do we still guarantee order?

Saving just one message does not guarantee order. For example, if the "place order" message fails and has not yet been retried, and the "pay" message is then consumed, it certainly cannot be processed normally.

In that case the "pay" message would have to wait, checking every so often whether the message ahead of it had been consumed.

If this were to happen, two problems would arise:

  1. The "pay" message is only preceded by the "place order" message, which is relatively simple. But if a certain type of message is preceded by N other messages, how many checks would be needed? This kind of check couples us tightly to the order system; it effectively moves part of their logic into our system.
  2. It slows down the consumer's consumption speed.

At this point a simpler solution emerged: when processing a message, the consumer first checks whether the retry table already contains data for that order number. If it does, the current message is saved directly to the retry table. If not, the business logic is processed; if an exception occurs, the message is saved to the retry table.

We then set up a failure-retry mechanism with elastic-job: if a message still fails after seven attempts, its status is marked as failed and the developers are notified by email.
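Putting the retry table and the scheduled retry together, a minimal sketch might look like the following. The RetryTableDao, OrderService and AlertMailer interfaces are hypothetical stand-ins for our real data-access, business and alerting code; only the overall flow (check the retry table first, park failures, give up and email after seven attempts) reflects what we actually did.

// Hypothetical types standing in for our real message, persistence and alert layers.
record OrderMessage(String orderNo, String body, int retryCount) {}

interface RetryTableDao {
    boolean existsByOrderNo(String orderNo);
    void save(OrderMessage message);
    java.util.List<OrderMessage> listPending();
    void markSuccess(OrderMessage message);
    void markFailed(OrderMessage message);
    void incrementRetryCount(OrderMessage message);
}

interface OrderService {
    void process(OrderMessage message) throws Exception;
}

interface AlertMailer {
    void notifyDevelopers(OrderMessage message);
}

class OrderMessageHandler {
    static final int MAX_RETRIES = 7;

    private final RetryTableDao retryDao;
    private final OrderService orderService;
    private final AlertMailer alertMailer;

    OrderMessageHandler(RetryTableDao retryDao, OrderService orderService, AlertMailer alertMailer) {
        this.retryDao = retryDao;
        this.orderService = orderService;
        this.alertMailer = alertMailer;
    }

    // Called for every message read from Kafka.
    void handle(OrderMessage message) {
        // If an earlier message for this order already failed and is waiting in the
        // retry table, park this one behind it so the per-order sequence is preserved.
        if (retryDao.existsByOrderNo(message.orderNo())) {
            retryDao.save(message);
            return;
        }
        try {
            orderService.process(message);
        } catch (Exception e) {
            retryDao.save(message);   // the scheduled job below retries it later
        }
    }

    // Invoked periodically by the scheduler (elastic-job in our setup).
    void retryPending() {
        for (OrderMessage message : retryDao.listPending()) {
            try {
                orderService.process(message);
                retryDao.markSuccess(message);
            } catch (Exception e) {
                if (message.retryCount() + 1 >= MAX_RETRIES) {
                    retryDao.markFailed(message);          // give up after 7 attempts
                    alertMailer.notifyDevelopers(message); // manual follow-up by email
                } else {
                    retryDao.incrementRetryCount(message);
                }
            }
        }
    }
}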

With this in place, the problem of orders and dishes not appearing on the client due to network instability was solved. Now, at worst, a merchant occasionally sees an order or dish a little late, which is far better than not seeing it at all.

Message backlog

With the sales team's marketing push, the number of merchants in our system kept growing, and so did the message volume. Consumers couldn't keep up, and backlogs became frequent. The impact on merchants was very direct: orders and dishes might not appear on the kitchen display client until half an hour later. A delay of a minute or two is tolerable; a half-hour delay is not, and the more hot-tempered merchants complained immediately. We were getting a lot of complaints about late orders and dishes.

Adding server nodes would have solved the problem, but the company's usual practice is to save money and optimize the system first, so we set out to resolve the backlog ourselves.

1. The message body is too large

Although Kafka claims to support millions of TPS, sending a message from the producer to the broker requires one network IO, the broker writing the data to disk takes one disk IO (write), and the consumer fetching the message from the broker goes through one disk IO (read) and then another network IO.

So a single message, from production to consumption, needs two network I/Os and two disk I/Os. If the message body is too large, the I/O time increases, which slows down Kafka's production and consumption. When consumers are too slow, messages pile up.

Besides that, an oversized message body also wastes disk space on the servers; if you are not careful, you can run out of disk space.

At this point, optimizing the message body size had become necessary.

How do you optimize it?

We re-examined the business and found there was no need to know the intermediate states of an order, only the final state.

Great, so we could redesign it like this (a rough sketch follows the list):

  1. The message body sent by the order system contains only key information such as the order ID and status.
  2. After consuming the message, the kitchen display system calls the order system's order-detail query interface with the ID to fetch the data.
  3. The kitchen display system then checks whether the order already exists in its own database: if not, it inserts the data; if it does, it updates it.
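A rough sketch of the adjusted consumer flow, with hypothetical types for the slim message and for the order system's query client (the real field names and interfaces differ):

// Hypothetical slim message: only the key fields travel over Kafka now.
record OrderEvent(Long orderId, Integer status) {}

record OrderDetail(Long orderId, Integer status, String dishesJson) {}

interface OrderQueryClient {           // wraps the order system's detail-query interface
    OrderDetail queryById(Long orderId);
}

interface OrderDetailDao {
    boolean exists(Long orderId);
    void insert(OrderDetail detail);
    void update(OrderDetail detail);
}

class KitchenDisplayConsumer {
    private final OrderQueryClient orderQueryClient;
    private final OrderDetailDao orderDetailDao;

    KitchenDisplayConsumer(OrderQueryClient client, OrderDetailDao dao) {
        this.orderQueryClient = client;
        this.orderDetailDao = dao;
    }

    void onMessage(OrderEvent event) {
        // Pull the full order details back from the order system by ID...
        OrderDetail detail = orderQueryClient.queryById(event.orderId());
        // ...then insert or update locally depending on whether we have seen it before.
        if (orderDetailDao.exists(event.orderId())) {
            orderDetailDao.update(detail);
        } else {
            orderDetailDao.insert(detail);
        }
    }
}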

As expected, after such adjustment, the message backlog problem did not appear again for a long time.

2. The routing rules are incorrect

But don't celebrate too early: one day at noon, merchants again complained that orders and dishes were delayed. When we checked Kafka's topic, there was a backlog again.

But this was weird: not all partitions had a backlog of messages, only one did.

At first I thought something was wrong with the node consuming that partition, but the investigation found nothing abnormal.

That’s weird. What’s the problem?

Later, I checked the logs and database and found that several merchants had a particularly large order volume and happened to be assigned to the same partition, so that partition's message volume was much higher than the others'.

Only then did we realize that routing messages to partitions by merchant number was unreasonable: some partitions could receive more messages than their consumer could handle, while other partitions received so few that their consumers sat idle.

To avoid such uneven distribution, we need to adjust the routing rules for sending messages.

After some thought, routing by order number distributes messages much more evenly, and a single order does not generate a particularly large number of messages. Unless someone keeps adding dishes, which costs money, the same order produces only a handful of messages.

After the adjustment, messages are routed to partitions by order number, so messages with the same order number always go to the same partition.

After this change, the message backlog problem did not appear again for a long time. During this period, the number of merchants kept growing rapidly.

3. Chain reaction caused by batch operation

In a high-concurrency scenario, the message backlog problem never really goes away; it cannot be fundamentally eliminated. It looks solved on the surface, but you never know when it will come back. For example, this time:

One afternoon, the product manager came over and said: several merchants have complained that their dishes are delayed, please look into it.

This time the problem appeared in a rather strange way.

Why do you say so?

First of all, the timing was odd: problems usually happen during the lunch or dinner rush, so why mid-afternoon?

Based on past experience, I went straight to the Kafka topic data. Sure enough, there was a backlog, but this time every partition had more than a hundred thousand unconsumed messages, hundreds of times larger than any previous backlog. This backlog was unusual.

I checked the service monitoring to see whether the consumer had crashed; fortunately, it hadn't. I checked the service logs again and found no exceptions. By now I was a bit puzzled, so I asked the order team whether anything had happened that afternoon. They said there had been a promotion in the afternoon, and they had run a JOB to batch-update the order data of some merchants.

It suddenly dawned on me that the problem was caused by the batch of messages generated by their JOB. Why weren't we notified in advance? That was really frustrating.

Knowing the cause is one thing, but how were we going to deal with the hundreds of thousands of messages already backed up?

Simply adding partitions wouldn't help: the historical messages were already stored in the existing four partitions, and only new messages would go to any new partitions. What needed handling were the existing partitions.

Adding consumer nodes wouldn't help either: Kafka allows one consumer in a group to consume multiple partitions, but does not allow one partition to be consumed by multiple consumers in the same group.

I guess we’ll just have to multithread.

In an emergency fix, I switched message processing to a thread pool, with both the core and maximum thread counts set to 50.
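The emergency change was roughly equivalent to the sketch below; the 50/50 sizes are what we used, while the queue size and rejection policy shown here are illustrative assumptions:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class MessageWorkerPool {

    // Pool sizes mirror what we used (50/50); the queue size and rejection
    // policy here are illustrative choices, not the exact production config.
    static final ThreadPoolExecutor POOL = new ThreadPoolExecutor(
            50, 50,                               // core == max threads
            60, TimeUnit.SECONDS,                 // keep-alive (irrelevant when core == max)
            new ArrayBlockingQueue<>(1000),       // bounded queue to cap memory use
            new ThreadPoolExecutor.CallerRunsPolicy()); // back-pressure when the queue is full

    // Called from the Kafka poll loop for each record.
    static void submit(Runnable messageTask) {
        POOL.submit(messageTask);
    }
}

Note that handing records to a pool like this gives up strict in-partition ordering, which is why it was only acceptable as an emergency measure here; a variant that preserves per-order ordering is sketched at the end of this section.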

After the adjustment, sure enough, the number of message backlogs kept decreasing.

But now there was a more serious problem: I received an alert email saying two nodes of the order system had gone down.

Soon a colleague from the order team came to me: the concurrency of our calls to their order-query interface had surged to several times what they expected, bringing down two service nodes. The query function was deployed as a single service with 6 nodes; 2 had already failed, and if nothing was done the remaining 4 would fail too. The order service is arguably the company's most core service, and losing it would cost the company dearly. The situation was extremely urgent.

To solve this problem, the only immediate option was to reduce the number of threads.

Fortunately, the thread counts could be adjusted dynamically through ZooKeeper, so I changed the core thread count to 8 and the maximum thread count to 10.
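The dynamic adjustment boils down to calling setCorePoolSize and setMaximumPoolSize on the running pool from a ZooKeeper watch callback. The sketch below shows only that last step (the watch wiring is omitted and the class name is illustrative); the ordering guard matters because ThreadPoolExecutor rejects a maximum size smaller than the current core size:

import java.util.concurrent.ThreadPoolExecutor;

class PoolSizeTuner {

    private final ThreadPoolExecutor pool;

    PoolSizeTuner(ThreadPoolExecutor pool) {
        this.pool = pool;
    }

    // Invoked from a ZooKeeper watch callback when the config node changes.
    // Assumes maxSize >= coreSize in the new configuration.
    void apply(int coreSize, int maxSize) {
        if (coreSize <= pool.getMaximumPoolSize()) {
            // Shrinking (e.g. 50/50 -> 8/10): lower the core size first.
            pool.setCorePoolSize(coreSize);
            pool.setMaximumPoolSize(maxSize);
        } else {
            // Growing (e.g. 8/10 -> 50/50): raise the maximum first.
            pool.setMaximumPoolSize(maxSize);
            pool.setCorePoolSize(coreSize);
        }
    }
}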

Later, ops restarted the two failed order-service nodes and everything returned to normal; two more nodes were added just in case. To make sure the order service stayed healthy, we kept the reduced consumption speed, and the kitchen display system's message backlog cleared about an hour later.

Later, we had a review meeting and came to the conclusion that:

  1. Batch operations in the order system must be communicated to the downstream system teams in advance.
  2. Downstream teams must load-test any multithreaded calls to the order query interface.
  3. This was a wake-up call for the order query service: as the company's core service, it was not well enough prepared for high-concurrency scenarios and needs optimization.
  4. Add monitoring and alerting for message backlogs.

By the way, for scenarios that require strict message order, you can change the thread pool to multiple queues, each processed by a single thread.
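A minimal sketch of that idea: hash the order number onto a fixed set of single-threaded executors, so messages for the same order are still processed one after another while different orders run in parallel (the class and method names are illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class OrderedDispatcher {

    private final List<ExecutorService> workers = new ArrayList<>();

    OrderedDispatcher(int workerCount) {
        for (int i = 0; i < workerCount; i++) {
            workers.add(Executors.newSingleThreadExecutor());
        }
    }

    // Same order number -> same worker -> messages for that order stay in sequence.
    void dispatch(String orderNo, Runnable task) {
        int index = Math.floorMod(orderNo.hashCode(), workers.size());
        workers.get(index).submit(task);
    }
}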

4. The table is too big

To prevent the backlog problem from recurring, the consumer kept processing messages with multiple threads.

But one day we still received a batch of alert emails in the middle of the day about a Kafka topic backlog. We were still looking for the cause when the product manager came over and said: "Another merchant is complaining that dishes are delayed, please take a look quickly." This time she looked a little impatient: we had optimized many times, and here was the same problem again.

From a layman's point of view: why does the same problem keep coming back?

They don't know the pain in an engineer's heart.

On the surface the symptom is always the same: dishes are delayed, and everyone knows it is caused by a message backlog. But the underlying causes differ each time, and there are many possible reasons for a backlog. This is probably a fact of life with message middleware.

I said nothing and set about locating the cause.

Later I checked the logs and found it was taking up to 2 seconds to consume a single message. It used to take 500 milliseconds; how could it now take 2 seconds?

Oddly enough, the consumer code hasn’t changed much either. Why is this happening?

I checked the online dish table: the single-table row count had unexpectedly reached tens of millions, and the other tables were the same. A single table was now holding far too much data.

Our team went through the business again: in fact, the client only displays dishes from the last three days.

That made things easy: the excess data in the tables could simply be archived. So the DBA archived the historical data for us, keeping only the last 7 days in the online tables.

After this adjustment, the message backlog problem was solved and calm was restored.

Primary key conflicts

Duplicate entry '6' for key 'PRIMARY'

This error occurs when two or more INSERT statements with the same primary key run at the same time: the first insert succeeds, and the second reports a primary key conflict, since a table's primary key must be unique.

I carefully examined the code and found that the logic first queried the table by primary key to see whether the order existed; if it did, the status was updated, and if not, the data was inserted.

This check-then-act logic works when concurrency is low. In a high-concurrency scenario, however, two requests may both find that the order does not exist; one inserts first, and the other then hits a primary key conflict when it inserts.

The most common way to solve this problem is to lock.

That was my first thought too. A pessimistic database lock was clearly out: the performance hit is too big. An optimistic lock based on a version number is generally used for update operations and is rarely applicable to an insert like this.

That leaves a distributed lock. Our system already uses Redis, so we could add a Redis-based distributed lock on the order number.

But then I thought about it:

  1. Adding a distributed lock might also slow down the consumer's message processing.
  2. The consumer would depend on Redis; if Redis had a network timeout, our message processing would be in trouble.

So I decided against the distributed lock as well.

Instead, I used MySQL's INSERT INTO ... ON DUPLICATE KEY UPDATE syntax:

INSERT INTO table_name (column_list)
VALUES (value_list)
ON DUPLICATE KEY UPDATE
  c1 = v1,
  c2 = v2, ... ;

It first tries to insert the row, and if the primary key (or a unique key) conflicts, it updates the specified fields instead.
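For illustration, here is roughly how such an upsert might be issued from Java with Spring's JdbcTemplate; the table and column names are made up for the example, not our real schema:

import org.springframework.jdbc.core.JdbcTemplate;

class OrderUpsertDao {

    private final JdbcTemplate jdbcTemplate;

    OrderUpsertDao(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Insert the order if it is new; if another concurrent message already
    // inserted the same primary key, fall back to updating the status.
    void saveOrUpdate(long orderId, int status) {
        jdbcTemplate.update(
            "INSERT INTO kitchen_order (id, status) VALUES (?, ?) "
                + "ON DUPLICATE KEY UPDATE status = ?",
            orderId, status, status);
    }
}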

After modifying the old INSERT statement, there are no more primary key conflicts.

Database master/slave delay

Soon after, I received another merchant complaint: after an order was placed, the order showed up on the kitchen display client, but the dishes were incomplete, and sometimes neither the order nor the dish data could be seen at all.

This problem was different from the previous ones. Based on experience, we first checked the Kafka topic for a backlog, but this time there was none.

Then I checked the service logs and found that some calls to the order system's interface returned empty data, and some returned only the order data without the dish data.

This was very strange, so I went straight to the order team. They combed through their service and found nothing wrong. We then had the same thought: it must be the database, so we went to the DBA together. Sure enough, the DBA found that, due to network issues, replication from the primary database to the replica was occasionally delayed, sometimes by as much as 3 seconds.

If less than 3 seconds elapse between the order message being sent and our system consuming it and calling the order-detail query interface, we may get no data back, or stale data.

This problem is serious: it directly leads to incorrect data on our side.

To solve it, we reused the retry mechanism: when the query interface returns empty data, or returns the order without its dishes, the message is added to the retry table.

After adjustment, the problem of merchant complaint is solved.

Repeated consumption

Kafka supports three modes for consuming messages:

  • At most once mode

At most once: the offset is committed before the message is consumed, so messages may be lost but are never duplicated.

  • At least once mode

At least once: the offset is committed only after the message has been processed successfully, so messages are never lost but may be duplicated.

  • Exactly once mode

Exactly once: the offset commit and the message processing are handled together atomically (treating the message as having a unique ID), so each message is processed exactly once, neither lost nor duplicated. This is the hardest to achieve.

Kafka's default mode is at least once, but this mode can cause repeated consumption, so our business logic must be idempotent to avoid duplicate data.
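As a reference point, a minimal at-least-once consumer sketch looks like this: auto-commit is disabled and the offset is committed only after the records have been processed, so a crash before the commit results in a redelivery (a duplicate) rather than a lost message. The group id and the handler body are illustrative.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "kitchen-display");
        props.put("enable.auto.commit", "false");          // commit manually
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("prod_order"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    handle(record);                         // must be idempotent
                }
                consumer.commitSync();                      // commit only after processing
            }
        }
    }

    static void handle(ConsumerRecord<String, String> record) {
        // business processing goes here
    }
}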

Our business scenario already saves data with INSERT INTO ... ON DUPLICATE KEY UPDATE (insert if absent, update if present), which naturally supports idempotency.

Multi-environment consumption problems

Our online setup was divided into a pre (pre-release) environment and a prod (production) environment, which shared the same database and the same Kafka cluster.

Note that the Kafka topics are prefixed to distinguish the environments: pre topics start with pre_, such as pre_order, and production topics start with prod_, such as prod_order, to prevent messages from leaking across environments.

However, when ops switched nodes in the pre environment and configured the topic, they made a mistake and pointed it at the prod topic. That very day we had a new feature deployed in pre. Unfortunately, some prod messages were consumed by the pre consumers, and because the message body format had been changed for the new feature, the pre consumers failed to process them.

As a result, the production environment effectively lost some messages. Fortunately, the production consumers recovered without much damage by resetting the offset and re-reading those messages.

Afterword

In addition to the above problems, I also encountered:

  • The Kafka consumer's use of automatic acknowledgement caused 100% CPU usage.
  • One of the broker nodes in the Kafka cluster went down, and went down again after being restarted.


These two problems are a bit more involved, so I won't go through them one by one here. Interested friends can follow my public account and chat with me on WeChat.

I'm very grateful for these two years of experience with Kafka as message middleware. I ran into many problems, stepped into many pits, and took quite a few detours, but I also accumulated a lot of valuable experience and grew quickly.

Kafka is a very good piece of message middleware, and most of the problems I encountered were not caused by Kafka itself (aside from the bug that led to 100% CPU usage).

One last word (please give me a follow instead of just freeloading)

If this article helped or inspired you, please scan the QR code and follow the account. Your support is the biggest motivation for me to keep writing.

I'd also appreciate the usual three: like, share, and mark as "looking".

Follow the public account [Su San Talks Tech]; reply "interview", "code artifact", "development manual", or "time management" there for reader benefits, and reply "join group" to chat and learn with seniors from big tech companies.