For daily operations and troubleshooting, how could you do without Didi's open source Didi LogiKM one-stop Kafka monitoring and control platform?


This article grew out of a question from a reader in our group.

To join the group, add me on WeChat: jjdlmn_

For the full index of the Kafka column, click here

Terminology: the Broker a replica is migrated from is called the OriginBroker; the Broker a replica is migrated to is called the TargetBroker.

Prerequisites

I encourage you to read the following articles first (if a link cannot be clicked, I have not published it yet):

[[Kafka source code] ReassignPartitionsCommand source code analysis (scaling up/down, data migration, replica reassignment, cross-path migration)] () [[Kafka operations] Scaling up/down, data migration, replica reassignment, cross-path migration] ()

Operation and maintenance control of Kafka’s soul mate Logi-KafkaManager (4) — cluster operation and maintenance (data migration and cluster online upgrade)

If you don’t want to read them all, just look at the diagram I have drawn below and you can work out for yourself what might go wrong and how to handle it.

All failure scenarios


1. If TargetBroker is not online, migration script execution will fail

If the TargetBroker is not online when the reassignment script is executed, the task will not even pass validation.

Scenario demonstration

| BrokerId | Role | State | Replicas |
| --- | --- | --- | --- |
| 0 | Ordinary Broker | Normal | test-0 |
| 1 | Ordinary Broker | Down | none |

Now migrate partition test-0 of topic test from Broker 0 to Broker 1:
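The config/reassignment-json-file.json referenced by the command below uses the standard reassignment plan format; a minimal sketch for this scenario (topic test, partition 0, moving the only replica to Broker 1) could be written like this:

    # Write a minimal reassignment plan: test-0 should end up with Broker 1 as its only replica
    cat > config/reassignment-json-file.json <<'EOF'
    {"version":1,"partitions":[{"topic":"test","partition":0,"replicas":[1]}]}
    EOF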


sh bin/kafka-reassign-partitions.sh --zookeeper xxxxxx:2181/kafka3 --reassignment-json-file config/reassignment-json-file.json  --execute --throttle 1000000

Execution fails with the following error:

Partitions reassignment failed due to The proposed assignment contains non-existent brokerIDs: 1
kafka.common.AdminCommandFailedException: The proposed assignment contains non-existent brokerIDs: 1
        at kafka.admin.ReassignPartitionsCommand$.parseAndValidate(ReassignPartitionsCommand.scala:348)
        at kafka.admin.ReassignPartitionsCommand$.executeAssignment(ReassignPartitionsCommand.scala:209)
        at kafka.admin.ReassignPartitionsCommand$.executeAssignment(ReassignPartitionsCommand.scala:205)
        at kafka.admin.ReassignPartitionsCommand$.main(ReassignPartitionsCommand.scala:65)
        at kafka.admin.ReassignPartitionsCommand.main(ReassignPartitionsCommand.scala)

2. The TargetBroker crashed during the migration, so the migration task stays in progress forever

This usually happens after the reassignment information has been written to the /admin/reassign_partitions node: one of the N TargetBrokers goes down, so that Broker can never create its new replica or sync from the Leader, and the whole task cannot move forward.

Scenario demonstration

To simulate this situation, we can manually write reassignment information to the /admin/reassign_partitions node (see the sketch after this list), for example:

  1. Create the node and write the following information, where Broker-1 is not online; this simulates a Broker crashing in the middle of the reassignment;

    {"version":1,"partitions":[{"topic":"test","partition":0,"replicas":[1]}]}
  2. Check the node under /brokers/topics/{topicName};
  3. Next, a LeaderAndIsr request should be sent to Broker-1 so that it creates the replica and syncs from the Leader; but Broker-1 is not online at this point, so the task stays in progress. If you try to start another reassignment, you will be prompted as follows:

    There is an existing assignment running.
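For step 1, a rough command-line sketch using the zookeeper-shell tool shipped with Kafka (assuming the ZK address xxxxxx:2181/kafka3 used above; on some versions you may need to run the command inside an interactive zookeeper-shell session instead):

    # Manually create the reassignment node to simulate a task that can never finish
    sh bin/zookeeper-shell.sh xxxxxx:2181/kafka3 create /admin/reassign_partitions '{"version":1,"partitions":[{"topic":"test","partition":0,"replicas":[1]}]}'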

The solution

Once you know what is going on, you have a clear idea of how to solve the problem. Just restart the Broker that failed.

3. The migrated replica cannot find a Leader, so the TargetReplica can never finish syncing

If the Leader of the partition being migrated is down and no new Leader has been elected, the new replica has nowhere to sync from.


This is similar to scenario 2; the difference is that here it may be a Broker other than the TargetBroker that has failed.

Scenario demonstration

| BrokerId | Role | State | Replicas |
| --- | --- | --- | --- |
| 0 | Ordinary Broker | Normal | none |
| 1 | Ordinary Broker | Down | test-0 |

Now migrate partition test-0 from Broker 1 to Broker 0:

{"version":1,"partitions":[{"topic":"test","partition":0,"replicas":[0],"log_dirs":["any"]}]}



As the figure above shows, the TargetReplica receives the LeaderAndIsr request, the replica is created, and the TargetBroker is written into the AR information in ZK;

Then it starts syncing data from the Leader. Who is the Leader at this point? It is test-0 on Broker-1 (the only replica). When the TargetReplica is ready to sync, the OriginBroker is not online, so it cannot sync anything; the TargetReplica has been created, but its data is never synced. See below:

  1. The TargetReplica has been created, but has no data; and because the OriginBroker is not online, the old replica has not been deleted either (in the screenshot below, kafka-logs-30 belongs to Broker 0 and kafka-logs-31 to Broker 1)

  2. Because the whole partition reassignment task has not completed, /admin/reassign_partitions has not been deleted:

    {"version":1,"partitions":[{"topic":"test","partition":0,"replicas":[0]}]}


  3. The node under /brokers/topics/{topicName} is updated as in the figure below, where the AR and the adding/removing replica information have not been cleared yet

  4. In the /brokers/topics/test/partitions/0/state node, you can see that the Leader is -1 and the TargetBroker has not been added to the ISR



    As long as the sync never succeeds, the whole reassignment stays in progress;
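To inspect this state yourself, you can read the relevant znodes with zookeeper-shell; a sketch, assuming the ZK address and the test topic used in this article:

    # Topic node: shows the AR and the pending adding/removing replica information
    sh bin/zookeeper-shell.sh xxxx:2181/kafka3 get /brokers/topics/test
    # Partition state node: leader:-1 plus an ISR without the TargetBroker indicates this scenario
    sh bin/zookeeper-shell.sh xxxx:2181/kafka3 get /brokers/topics/test/partitions/0/state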

The solution

Normally, if the OriginBroker goes down and one replica goes offline, another replica takes over as Leader. If there is only one replica, this exception occurs; you simply need to restart the OriginBroker.

4. Throttling causes the reassignment to never complete


When we run a partition replica reassignment task, we usually add a throttle value:


--throttle : the replication transfer rate between brokers during migration, in bytes/sec


Note that this value throttles the replication traffic between brokers as a whole, not just the traffic of the partitions being migrated; it also includes the normal follower-sync traffic of other topics. So if you set the throttle very low, the migration rate may end up even lower than the normal sync rate.

And if your sync rate is lower than the rate at which new messages are produced, the task will never complete!

Scenario demonstration

  1. Create a reassignment task with a throttle value of 1

     sh bin/kafka-reassign-partitions.sh --zookeeper xxxx:2181/kafka3 --reassignment-json-file config/reassignment-json-file.json  --execute --throttle 1 
  2. At this rate the task will basically never finish, and the /admin/reassign_partitions node stays there forever
  3. The throttle configuration can be seen in ZK (see the sketch after this list)
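For reference, the throttle settings live under the dynamic-config znodes and can be read directly; a sketch, assuming broker id 0 and the test topic:

    # Broker-level throttle rates (leader.replication.throttled.rate / follower.replication.throttled.rate)
    sh bin/zookeeper-shell.sh xxxx:2181/kafka3 get /config/brokers/0
    # Topic-level throttled replicas (leader.replication.throttled.replicas / follower.replication.throttled.replicas)
    sh bin/zookeeper-shell.sh xxxx:2181/kafka3 get /config/topics/test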

The solution

Set the throttle threshold higher: rerun the script above with a larger --throttle value.

     sh bin/kafka-reassign-partitions.sh --zookeeper xxxx:2181/kafka3 --reassignment-json-file config/reassignment-json-file.json  --execute --throttle 100000000

Be sure to run the script with --verify after the task completes to remove the throttle configuration! Otherwise the throttle will stay in place forever; for example:
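A sketch of the verify call, reusing the same plan file and ZK address as above; once all partitions report success, it also clears the throttle configuration:

    # Checks completion status and removes the throttle configs once every partition is done
    sh bin/kafka-reassign-partitions.sh --zookeeper xxxx:2181/kafka3 --reassignment-json-file config/reassignment-json-file.json --verify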

5. The data volume is too large and the sync is painfully slow

This situation is very common. It is not really an exception, and there is not much you can do about raw performance; but when we do a data migration we often overlook one thing: a lot of the data is already expired, and migrating it is pointless.


Check out my previous post:
Operation and maintenance control of Kafka’s soul mate Logi-KafkaManager (4) — cluster operation and maintenance (data migration and cluster online upgrade)





Cutting out the migration of useless, expired data can greatly improve the efficiency of a data migration;

The solution

Cut down the amount of data to migrate: if the Topic you want to migrate holds a large amount of data (retention defaults to 7 days), you can temporarily and dynamically lower retention.ms before the migration to reduce the data volume (a manual sketch appears below); of course, doing this by hand is tedious, and you can be smarter about it with:

Operation and maintenance control of Kafka’s soul mate Logi-KafkaManager (4) — cluster operation and maintenance (data migration and cluster online upgrade)

Visualized data migration and partition replica reassignment;

Set the throttle, reduce the amount of data to migrate, and automatically clean up the throttle settings once the migration is complete.
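If you do want to adjust retention by hand instead, a rough sketch with kafka-configs.sh, assuming the test topic, the ZK address used earlier, and a temporary retention of one hour; restore the original value after the migration:

    # Temporarily shrink retention so old segments are deleted before the migration
    sh bin/kafka-configs.sh --zookeeper xxxx:2181/kafka3 --entity-type topics --entity-name test --alter --add-config retention.ms=3600000
    # After the migration, remove the override so the topic falls back to its original retention
    sh bin/kafka-configs.sh --zookeeper xxxx:2181/kafka3 --entity-type topics --entity-name test --alter --delete-config retention.ms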

Troubleshooting approach

Above I have listed every problem I can think of together with its solution. So when you hit the situation below, how do you quickly locate and resolve it?

A reassignment that stays in progress forever, with the prompt There is an existing assignment running.

1. Look at the data in /admin/reassign_partitions

Suppose the task is as follows: there are two partitions; test-0 is assigned to Brokers [0,1] and test-1 to Brokers [0,2]

{" version ": 1," partitions ": [{" topic" : "test", "partition" : 0, "replicas" : [0, 1]}. {" topic ":" test ", "partition" : 1, "replicas" : [0, 2]}]}

If Broker 1 is down as in the figure, test-0 will not complete while test-1 will; the /admin/reassign_partitions node will then only contain:

{" version ": 1," partitions ": [{" topic" : "test", "partition" : 0, "replicas" : [0, 1]}]}


So the partition that has not completed is test-0, and its target Brokers are [0,1].
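A quick way to read that node from the command line, assuming the ZK address used earlier in this article:

    # Any partitions still listed here have not finished their reassignment
    sh bin/zookeeper-shell.sh xxxx:2181/kafka3 get /admin/reassign_partitions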

2. Check /brokers/topics/{topicName}/partitions/{partition}/state

From step 1 I know that test-0 has the problem, so I go straight to the data in the /brokers/topics/test/partitions/0/state node. There are two possible situations:

  1. As follows:

    {"controller_epoch":28,"leader":0,"version":1,"leader_epoch":2,"isr":[0]}

    The ISR is [0], containing only Broker 0; normally it should be [0,1] as configured above. The problem is that Broker 1's replica has not joined the ISR, so the next step is to find out why;

  2. As follows, the leader:-1 case:

    {"controller_epoch":28,"leader":-1,"version":1,"leader_epoch":2,"isr":[0]}

    leader:-1 means there is currently no Leader, so the newly added replica has nowhere to sync its data from. The next thing to check is whether all the other replicas of this TopicPartition are down. How do you find the other Brokers? Check whether the AR is normal; the AR data can be read from /brokers/topics/{topicName};

    Of course, you can also use the Didi LogiKM one-stop Kafka monitoring and control platform to make this check easier, as shown below (see also the command-line sketch after this list):
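If you prefer the command line, kafka-topics.sh shows the Leader, Replicas, and Isr for every partition in one view; a sketch, assuming the test topic and the ZK address used earlier:

    # Leader: -1, or an Isr that is missing the new replica, points to the cases above
    sh bin/kafka-topics.sh --zookeeper xxxx:2181/kafka3 --describe --topic test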

3. Based on step 2, determine whether the corresponding Broker is faulty

If a faulty Broker is found, restarting it resolves the problem.

4. Check the throttle value

If step 3 did not resolve the problem and no Broker is faulty, it is time to look at throttling:

  1. First, check whether throttle settings are configured on the /config/brokers/{brokerId} node (see the kafka-configs.sh sketch after this list);

  2. Then check the information on the /config/topics/{topicName} node;

  3. If the throttle configuration does not even cover the replica that has not joined the ISR, then the sync rate is not the problem
  4. If the throttle value you find is fairly small, you can increase it appropriately:

    
         sh bin/kafka-reassign-partitions.sh --zookeeper xxxx:2181/kafka3 --reassignment-json-file config/reassignment-json-file.json  --execute --throttle 100000000
          

  5. Re-run the reassignment task (after stopping the previous one)

    If the above still does not solve the problem, then perhaps you are migrating too much data, or the TargetBroker's network is poor, and the network transfer has hit its limit. That is a performance bottleneck; perhaps you should reconsider whether the reassignment plan is reasonable, or do the reassignment in the dead of night when traffic is low;
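Coming back to the throttle checks in steps 1 and 2 of this list: kafka-configs.sh can describe those settings without reading the znodes by hand; a sketch, assuming broker id 0 and the test topic:

    # Broker-level: leader.replication.throttled.rate / follower.replication.throttled.rate
    sh bin/kafka-configs.sh --zookeeper xxxx:2181/kafka3 --entity-type brokers --entity-name 0 --describe
    # Topic-level: leader.replication.throttled.replicas / follower.replication.throttled.replicas
    sh bin/kafka-configs.sh --zookeeper xxxx:2181/kafka3 --entity-type topics --entity-name test --describe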

Scenario demonstration

  1. Partition test-0, which previously lived only on Broker [0], is reassigned to [0,1]; --throttle 1 is used to simulate a slow network transfer rate, a performance bottleneck, and so on



The node stays there, the task stays in progress, and adding_replicas keeps showing [1]
  2. You can also see that Broker-1 is alive
  3. But it is not in the ISR
  4. We judge that the sync rate is probably poor: the TargetBroker may have a bad network or already be under heavy load itself, so we decide to change the TargetBroker
  5. Delete the /admin/reassign_partitions node directly, then re-run the reassignment task, this time reassigning to [0,2]

    {" version ": 1," partitions ": [{" topic" : "test", "partition" : 0, "replicas" : [0, 2]}]}



    You can see that the new allocation has been written to zk;

    But the AR and adding_replicas information on the Topic node has not changed



    This is because, although the Controller received the change notification for the /admin/reassign_partitions node, during validation it still holds the previous reassignment task in memory. To the Controller, the previous task has not ended properly, so it does not proceed with the subsequent steps.

  6. Re-elect the Controller so that /admin/reassign_partitions is reloaded; as I analyzed in [[Kafka source code] Controller startup process and election process source code analysis](), when the Controller is re-elected it reloads the /admin/reassign_partitions node and continues executing the task. After the switch, it looks like this, and the reassignment proceeds normally




    There is, of course, a simpler way: use the Didi LogiKM one-stop Kafka monitoring and control platform, as shown below:



    Designating a few relatively idle Brokers as Controller candidates, and being able to switch to them right away, is a wise choice;

The solution

  1. The data volume is too large because there is a lot of expired data; if you did not think about cleaning up stale data before reassigning, then simply reassign again.

    But only one reassignment task can run at a time, so you have to forcibly delete the /admin/reassign_partitions node first and then reassign;

    Note that when you reassign again, be sure to set a temporary, shorter data expiration time to reduce the amount of data migrated; and also switch the Controller;
  2. To sum up (a command-line sketch follows this list):

    ①. Delete the /admin/reassign_partitions node

    ②. Re-run the reassignment task

    ③. Re-elect the Controller
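A compact sketch of the three steps with the stock CLI tools, assuming the ZK address and plan file used throughout this article; deleting /controller is the usual way to force a Controller re-election in a ZooKeeper-based cluster:

    # ①. Force-delete the stuck reassignment task node
    sh bin/zookeeper-shell.sh xxxx:2181/kafka3 delete /admin/reassign_partitions
    # ②. Re-run the reassignment (ideally after temporarily lowering retention.ms, and with a reasonable throttle)
    sh bin/kafka-reassign-partitions.sh --zookeeper xxxx:2181/kafka3 --reassignment-json-file config/reassignment-json-file.json --execute --throttle 100000000
    # ③. Trigger a Controller re-election so the stale task cached in the old Controller's memory is dropped
    sh bin/zookeeper-shell.sh xxxx:2181/kafka3 delete /controller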

Troubleshooting tools and thoughts

Even after analyzing the problems above, troubleshooting this issue by hand is still quite tedious: you have to look at this node, that metric, and so on.




Since we now have a troubleshooting approach, turning it into something visual, automated, and tool-based is not that hard.


So I am going to raise an ISSUE in the Didi LogiKM one-stop Kafka monitoring and control platform to implement such a feature.


I will finish it when I have some free time; if you are interested, you are welcome to work on it together!

Real Case Analysis

On Friday, just as I was about to get off work, a reader asked me the following question, and I replied.


Later, to analyze the specifics, we pulled a small group together to look for clues.

To join the group, add me on WeChat: jjdlmn_



This reader's partition reassignment had been in progress for a very long time. He searched Baidu, found advice to delete the reassignment task node in ZK, and the node was deleted right away; only afterwards did they discover that one of the TargetBrokers being migrated to had failed. After it was restarted, the reassignment task continued, meaning the TargetBroker was able to complete the replica assignment normally.

Problem analysis

In fact, this is the second case analyzed above: 2. The TargetBroker crashed during the migration, so the migration task stays in progress forever.

The task cannot be completed because the TargetBroker is down. At this point, just restart the TargetBroker;

Although they directly removed the /admin/reassign_partitions node, that is not a big problem; the next time a reassignment task is started, the Controller's memory will still hold the previous task, so the new task will not be executed. But if you re-elect the Controller, everything proceeds normally, so it does not matter much;

In this case, even though they deleted the node and also started the next reassignment, restarting the TargetBroker allowed the original task to finish smoothly; so even without switching the Controller, the next reassignment will not be affected (the previous task the Controller was tracking has ended, because it ran to completion).

If you run into any other exceptions, or have other questions about Kafka, ES, Agent, and so on, feel free to contact me and I will add them to this article.



Give it a Star and let's build it together: Didi LogiKM one-stop Kafka monitoring and control platform


For the full index of the Kafka column, click here