
The fault phenomenon

In production, an application has 45 nodes, and each node embeds a ZK client; that is, 45 ZK clients watch the same ZK path. When the value of that path's node is updated, all 45 nodes should receive a change notification and do the related business processing.

The value of the ZK node was updated 6 times in quick succession (nominally in parallel, but there was still a small time gap between the setData calls). It turned out that some clients were not notified (we noticed because some application nodes produced unexpected results).

Since this was production, O&M restarted the 45 application instances so that they re-read the latest value of the watched path, which worked around the issue.

Later I checked the logs. Each node prints a log line when it receives a notification, so with 45 nodes watching 6 changes there should have been 45 * 6 = 270 log lines; in fact there were only somewhere between 100 and 200. For some of the changes, certain clients simply never received the notification. The last one or two notifications, however, did reach everyone.

Reproducing it locally

The fault cannot be reproduced directly in the production environment.

The listeners are built on the data-cache interface that the Curator API wraps around the native ZK client.
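That listener code is not shown in this post. As a rough illustration only, and assuming the cache in question is Curator's NodeCache recipe (an assumption on my part, since the exact class is not named), the setup on each application node could look something like this:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.cache.ChildData;
    import org.apache.curator.framework.recipes.cache.NodeCache;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    // Minimal sketch, assuming Curator's NodeCache; the connect string, path and
    // callback body are placeholders, not the real production code.
    public class ConfigWatcher {

        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk-test:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            final NodeCache cache = new NodeCache(client, "/app/config");
            cache.getListenable().addListener(() -> {
                ChildData data = cache.getCurrentData();
                if (data != null) {
                    // The business processing on the new value would go here
                    System.out.println("node changed: " + new String(data.getData()));
                }
            });
            cache.start(true); // build the initial cache before returning
        }
    }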

I could only simulate it locally, launching 50 ZK clients on my own machine. The ZK server was the one in the test environment: network factors might be part of the cause, and a ZK server started locally would not reproduce them.

It took a while to reproduce the problem, and I will not describe that process further.

Locating the problem

By tracing the relevant Curator source code and simulating a few other scenarios, I narrowed the problem down to three suspects: the native ZooKeeper client, the network, and the ZooKeeper server.

Next I tested with the native ZooKeeper client interface. The code and its logic are simple:

It just watches for node changes, gets the node data, re-registers the watcher, and prints one line of information, with no business logic at all. Only one such client is started. Then another client is started that updates the node value five times in a loop, roughly as in the sketch below.
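The original test code is not reproduced here; the following is a minimal sketch of what it might look like with the native client. The connect string, path, and values are placeholders of my own, not the real test:

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Minimal sketch (not the original test code): one client watches the path and
    // re-registers its watcher on every change; another client sets the value 5 times.
    public class NativeWatchTest {

        static final String PATH = "/test/notify"; // placeholder path

        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("zk-test:2181", 30000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            Watcher watcher = new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    if (event.getType() == Event.EventType.NodeDataChanged) {
                        try {
                            // Passing "this" re-registers the watcher for the next change
                            byte[] data = zk.getData(PATH, this, null);
                            System.out.println("notified, value = " + new String(data));
                        } catch (Exception ex) {
                            ex.printStackTrace();
                        }
                    }
                }
            };
            zk.getData(PATH, watcher, null); // initial registration
            Thread.sleep(Long.MAX_VALUE);    // keep watching
        }

        // The other client just updates the node value five times in a loop:
        static void setFiveTimes(ZooKeeper zk) throws Exception {
            for (int i = 0; i < 5; i++) {
                zk.setData(PATH, String.valueOf(i).getBytes(), -1);
            }
        }
    }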

Occasionally the watching client prints fewer than 5 log lines; in one run, for example, only three notifications came through.

Time for locating the problem was tight and I had other things to do; frankly, I did not want to dive into the source code yet. But at this point it was still unclear whether the issue was in the ZK client code or on the server.

So, to save time, the only thing I could think of was to capture packets and see whether the server never sent the notification, or the client never received it, or received it but mishandled it.

Since I could not easily log in to the server, I captured on the client side first (my own machine), which is easy to do and convenient to inspect with Wireshark. After several attempts, the problem finally showed up:

The client received fewer than five notifications: only 0, 1, 2, and 4 arrived. On the side that sets the data, everything looked normal.

What follows is the packet-capture analysis; here are the conclusions:

  1. The server actually sent only four notifications.
  2. Each time the client receives a notification, its next TCP request pulls the data (and that request re-registers the watcher).
  3. The pull request goes out, but the next thing the server sends back is the notification for the following data change.
  4. Only after that does the server return the data for the previously requested ZK path.

(Note the order.) The server sends a notification → the client receives it and requests the value of the path node → the server sends the notification for the next data change → the server returns the value for the client's previous request → the client receives the new notification and requests its data → ...

P.S. The client that sets the node value also runs on my machine, and in the analysis above I filtered out its setData packets. The first several rounds look normal. But when the watching client receives the notification that the node changed to 3 and sends the request to pull the data, the server returns the value 4... and there is never a notification for the change to 4. In other words, the client asked for the data after the change to 3 but got 4 back: the value 3 was effectively lost, and the change to 4 produced no notification at all.

(IP addresses are not shown, even though they are only intranet addresses.)

The captured packets, in order:

  1. The ZK client that sets the data sets path to the value 4.
  2. The ZK client watching the change requests the value of path (it was notified that the node changed to 3 and wants to fetch that value).
  3. Ignore this one.
  4. The ZK server returns the value 4 of path to the watching ZK client.

The ACKs after that can be ignored, and the next two transmissions are irrelevant. There are no further notifications or requests; finally the connection is closed.

At this point there was no way around it: the problem seemed to be on the ZK server side, so I had to dig into the source code.

Starting from the startup class, I eventually traced down to the WatchManager class. The following core code triggers the watchers:

    public Set<Watcher> triggerWatch(String path, EventType type, Set<Watcher> supress) {
        WatchedEvent e = new WatchedEvent(type,
                KeeperState.SyncConnected, path);
        HashSet<Watcher> watchers;
        synchronized (this) {
            // This is the watchTable variable declared earlier
            /*
            private final HashMap<String, HashSet<Watcher>> watchTable =
                    new HashMap<String, HashSet<Watcher>>();
            private final HashMap<Watcher, HashSet<String>> watch2Paths =
                    new HashMap<Watcher, HashSet<String>>();
            */

            // Listeners registered for each path are placed in the watchTable map
            // This code is inside the synchronized block; the method that adds a watcher
            // is also synchronized and competes with this code for the same lock.
            // The watcher is removed first (ZK watches are one-shot).
            // So if the node changes again before the client has re-registered its watcher,
            // the client will not be notified.
            watchers = watchTable.remove(path);
            if (watchers == null || watchers.isEmpty()) {
                if (LOG.isTraceEnabled()) {
                    ZooTrace.logTraceMessage(LOG,
                            ZooTrace.EVENT_DELIVERY_TRACE_MASK,
                            "No watchers for " + path);
                }
                return null;
            }
            for (Watcher w : watchers) {
                HashSet<String> paths = watch2Paths.get(w);
                if (paths != null) {
                    paths.remove(path);
                }
            }
        }
        for (Watcher w : watchers) {
            if (supress != null && supress.contains(w)) {
                continue;
            }
            // Notify the client
            w.process(e);
        }
        return watchers;
    }


Take a look at the comments I added in the code above.

Combining this with the packet capture: when the server sent the client the notification that the node value changed to 3, the client's watcher had already been removed on the server. About 0.05 ms later, the other client's request setting the value to 4 arrived. The watching client's get-data request (which also re-registers the watcher) only came in after that, which was too late: the data had already changed to 4, and at the moment that change was triggered no watcher was registered, so no notification was sent (triggering a watch and adding a watch synchronize on the same lock).
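For reference, the watcher-registration side lives in the same WatchManager class and locks the same object. Roughly, paraphrased from the 3.4.x source (treat it as illustrative rather than exact):

    public synchronized void addWatch(String path, Watcher watcher) {
        // Register the watcher under the path; synchronized on the same manager
        // object that triggerWatch locks, so the two operations are serialized.
        HashSet<Watcher> list = watchTable.get(path);
        if (list == null) {
            list = new HashSet<Watcher>(4);
            watchTable.put(path, list);
        }
        list.add(watcher);

        HashSet<String> paths = watch2Paths.get(watcher);
        if (paths == null) {
            paths = new HashSet<String>();
            watch2Paths.put(watcher, paths);
        }
        paths.add(path);
    }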

From the packets, the time from the client receiving a change notification to it sending the request that re-registers the watcher was about 30 ms on my machine.

From setting the node value to the notification packet going out to the watching client took about 33 ms; without capturing on the server side it is hard to estimate the transmission time more precisely.

In the real production environment, however, the nodes are under much more load and the server's processing latency cannot be guaranteed. So whenever the node changes after the old watcher has been consumed but before the new one is registered, the client naturally receives no notification.

The solution

Since the updates are so frequent, the new watcher may not yet be registered when the data is updated again, either because the re-registration request is still in transit or because the server has not processed it yet.

So update less aggressively and give things some time. What I did: dynamically fetch the number of listening application instances (ZK clients), convert the parallel updates into serial ones, and compute a reasonable delay (if necessary, insert a delay between two consecutive changes so that the client watchers are not too late to re-register), roughly as sketched below.
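A minimal sketch of that idea, assuming the raw ZooKeeper client API; the path, the values, and the delay are placeholders (in practice the delay would be derived from the number of listening clients and the observed re-registration latency):

    import org.apache.zookeeper.ZooKeeper;

    public class SerialUpdater {

        // Update the node serially, leaving a gap between writes so that every
        // watching client has time to re-register its watcher before the next change.
        static void updateSerially(ZooKeeper zk, String path, byte[][] values,
                                   long gapMillis) throws Exception {
            for (byte[] value : values) {
                zk.setData(path, value, -1); // -1 skips the version check
                Thread.sleep(gapMillis);     // placeholder delay between updates
            }
        }
    }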