Users often submit issues labeled “Help Wanted” on TDengine’s GitHub. Most of these issues are not bugs but usage problems caused by unfamiliarity with how TDengine works. We share common issues on a regular basis in the hope that you can learn something from them. In this installment, we share a way to achieve high availability for the TDengine client.

How does TDengine make the client highly available?

Recently, on TDengine’s GitHub, we met two cluster users who both raised the question of high availability for the TDengine client.

The fact that two users asked the same question shows how representative this problem is, so we picked it in the hope of prompting more thinking about product optimization from the user’s point of view.

We have distilled the question to: “When the TDengine client connects to the cluster with taos -h FQDN -P port, will the connection fail if the node specified by the FQDN is down?” The answer is clearly “yes”.

One user told us: “It’s a shame that the TDengine server is ready for high availability, and the client is not.”

It is true that the connection fails in this case. However, TDengine does not encourage users to connect to the cluster this way. Why? Let’s walk through it.

Suppose a user is connected to a TDengine cluster and the node they are connected to fails. At this point, two kinds of high availability are needed: one on the server side and one on the client side.

Server-side high availability means that when a TDengine node fails and does not respond within the specified time, the cluster immediately raises a system alarm and removes the failed node. At the same time, automatic load balancing is triggered, and the system automatically migrates the data on that data node to other data nodes.

Client-side high availability means that if a connection fails, TDengine immediately points the client to another available database server so that it can continue connecting.

The question is, can TDengine do this? Certainly.

This is the real reason we do not recommend specifying an FQDN in a URL or in a connection command such as taos -h, and recommend connecting to the cluster through the client configuration file taos.cfg instead. In the latter case, the client automatically connects to the node configured as firstEP. If that node is down at the moment of connection, the client connects to the node given by the secondEP parameter.
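The fallback order described above can be sketched in a few lines of Python. This is only an illustration of the behavior, not the TDengine client's actual code: the function name connect_with_fallback, the try_connect callback, and the example.com endpoints (including the port number) are assumptions made up for this example.

```python
from typing import Callable, Optional, Sequence


def connect_with_fallback(
    endpoints: Sequence[str],
    try_connect: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint that accepts a connection, or None."""
    for endpoint in endpoints:
        if try_connect(endpoint):
            return endpoint
    return None


# Hypothetical endpoints standing in for the firstEP and secondEP values.
FIRST_EP = "node1.example.com:6030"
SECOND_EP = "node2.example.com:6030"

# Simulate firstEP being down at the moment of connection.
down = {FIRST_EP}
chosen = connect_with_fallback([FIRST_EP, SECOND_EP], lambda ep: ep not in down)
print(chosen)  # node2.example.com:6030
```

Passing the connection attempt in as a callback keeps the sketch runnable without a real cluster; in practice the client library performs this step itself based on taos.cfg.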

It is worth noting that once either of the two nodes is connected successfully, the client is no longer at risk. firstEP and secondEP are used only at the moment of connection; they do not provide a complete service list, only an initial connection target. Within a few tenths of a second of connecting to the cluster, the client automatically obtains the address information of the management node. The probability of both nodes being down at the moment of connection is extremely low. Afterwards, even if both firstEP and secondEP go down, the client keeps working as long as the cluster still meets the basic conditions for providing service. This is how TDengine maintains high client availability.
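To make the distinction between "connection target" and "service list" concrete, here is a toy Python model. It is a sketch under stated assumptions, not TDengine's implementation: the class ClientSketch, its methods, and the node names are hypothetical, and the "address information" the client learns is simplified to a plain list of endpoints.

```python
from typing import List, Optional, Set


class ClientSketch:
    """Toy model: bootstrap endpoints are used once, then learned addresses take over."""

    def __init__(self, first_ep: str, second_ep: str) -> None:
        self.bootstrap = [first_ep, second_ep]
        self.learned_endpoints: List[str] = []

    def connect(self, alive: Set[str], cluster_nodes: List[str]) -> bool:
        """Try firstEP, then secondEP; on success, learn the cluster's addresses."""
        for ep in self.bootstrap:
            if ep in alive:
                self.learned_endpoints = list(cluster_nodes)
                return True
        return False

    def pick_node(self, alive: Set[str]) -> Optional[str]:
        """After connecting, any live learned endpoint can serve the client."""
        for node in self.learned_endpoints:
            if node in alive:
                return node
        return None


nodes = ["n1:6030", "n2:6030", "n3:6030"]  # hypothetical three-node cluster
client = ClientSketch("n1:6030", "n2:6030")

# firstEP (n1) is down at connection time, so secondEP (n2) is used.
connected = client.connect(alive={"n2:6030", "n3:6030"}, cluster_nodes=nodes)

# Later, both bootstrap nodes are down, but n3 can still serve.
print(connected, client.pick_node(alive={"n3:6030"}))  # True n3:6030
```

The point of the model is that the bootstrap list is consulted only inside connect(); everything afterwards relies on what the cluster reported.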

Of course, this is not the only way to maintain high client availability. Both users placed a layer of load balancing in front of the cluster. In the process, both ran into the same problem: when they set up layer-4 network load balancing, they forwarded only the TCP port, and the connection failed. Hence their shared question on GitHub about how TDengine achieves high availability on the client side.

Below, we analyze why their connections still failed behind the load balancer. The official TDengine documentation explains: considering that in Internet of Things scenarios the packets for writing data are generally small, RPC supports UDP connections in addition to TCP connections. When a packet is smaller than 15KB, RPC uses UDP; otherwise it uses TCP. In other words, packets of 15KB or more, as well as query operations, are transmitted over TCP.
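The rule quoted from the documentation can be written as a small decision function. This is a minimal sketch of the rule as stated in this article, not the RPC module's real code; the function name choose_transport and the choice of 1KB = 1024 bytes are assumptions.

```python
# The 15KB threshold quoted above, assuming 1KB = 1024 bytes.
UDP_LIMIT_BYTES = 15 * 1024


def choose_transport(packet_size: int, is_query: bool = False) -> str:
    """Pick 'udp' for small non-query packets, 'tcp' for everything else."""
    if is_query or packet_size >= UDP_LIMIT_BYTES:
        return "tcp"
    return "udp"


print(choose_transport(512))                  # udp: a small packet
print(choose_transport(20 * 1024))            # tcp: a large packet
print(choose_transport(512, is_query=True))   # tcp: queries always use TCP
```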

So that is the answer. The packet that establishes the connection is smaller than 15KB, so the connection is made over UDP. Once both users added UDP forwarding rules, their load balancing in front of the cluster worked. The benefit of such a setup is not only that the client becomes highly available, but also that the “development” and “operations” roles are more clearly separated and easier to manage.
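A quick way to catch this mistake is to check that a load balancer's forwarding rules cover both protocols for the service port. The Python sketch below is purely illustrative: the rule format, the helper missing_protocols, and the port number 6030 are assumptions for this example, not the configuration of any particular load balancer.

```python
from typing import Iterable, Set, Tuple

Rule = Tuple[str, int]  # (protocol, port), a hypothetical rule format


def missing_protocols(rules: Iterable[Rule], port: int) -> Set[str]:
    """Return which of TCP and UDP are not forwarded for the given port."""
    forwarded = {proto.lower() for proto, rule_port in rules if rule_port == port}
    return {"tcp", "udp"} - forwarded


# The situation both users started with: only TCP is forwarded.
tcp_only = [("tcp", 6030)]
print(sorted(missing_protocols(tcp_only, 6030)))  # ['udp']

# After adding a UDP forwarding rule, nothing is missing.
both = [("tcp", 6030), ("udp", 6030)]
print(sorted(missing_protocols(both, 6030)))  # []
```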

Interestingly, Yakir-Yang, who solved the problem first, went on to give a warm answer to the second questioner, Stringhuang, after discovering the other issue. Having just gone through exactly the same problem, he could immediately see the other user’s pain point. Thus three parties who did not know each other had a harmonious technical exchange in this open source community. The end result: once the questioners understood how TDengine works, they succeeded in building their familiar high-availability strategy on a new product.

That’s what we’re most happy to see.

Have you learned how to make the TDengine client highly available? If you run into any problems using TDengine, you can submit an issue on GitHub. Besides getting official technical support, you can also exchange ideas with many like-minded users.

https://github.com/taosdata/T…

About the author: Chen Yu, a former database manager at IBM, also has entrepreneurial experience in “We Media” in another industry. He is currently responsible for community technical support and related operations at Taos Data.