This post, based on a talk by Jun Song Liu of Doodle Intelligence at PingCAP DevCon 2021, covers TiDB's use in the IoT space, especially in the smart home industry.

About Doodle Intelligence

Doodle Intelligence is a global IoT development platform that sets the standard for interconnected development, connecting brands, OEMs, developers, retailers, and industries that need intelligent products. Built on a global public cloud, it interconnects smart scenes and smart devices. Its offering covers three areas: hardware development tools, the global public cloud, and an intelligent business development platform. It provides support from technology through to marketing channels, with the aim of building a neutral and open developer ecosystem.

At present, Doodle has more than 100,000 partners at home and abroad, and its IoT PaaS and IoT developer platform serve more than 320,000 ecosystem customers across manufacturing, retail, telecom operators, real estate, elderly care, hotels, and other sectors. Doodle serves European and American brands as well as Chinese brands, including Philips, Haier, and China's three major telecom operators.

Real-time response to massive data: how TiKV was selected

Doodle's devices generate 84 billion requests a day worldwide, with processing peaks averaging 1.5 million TPS and a required average response time of less than 10 milliseconds. Unlike traditional industries, the IoT business has no off-peak troughs and is write-heavy, so Doodle spent six years searching for the most suitable data architecture.
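As a rough back-of-the-envelope check on these figures, the short sketch below converts the daily request count into an average per-second rate and compares it with the quoted peak. It assumes that the 1.5 million TPS figure is a peak rate and that requests map one-to-one to transactions; neither is stated precisely in the talk.

```python
# Back-of-the-envelope check of the load figures quoted above.
# Assumption: one request corresponds to one transaction.
SECONDS_PER_DAY = 24 * 60 * 60

requests_per_day = 84_000_000_000   # 84 billion requests per day
peak_tps = 1_500_000                # quoted peak, transactions per second

# Average rate if the daily load were spread evenly over the day.
average_tps = requests_per_day / SECONDS_PER_DAY

# How far the peak sits above the average.
peak_to_average = peak_tps / average_tps

print(f"average rate: {average_tps:,.0f} requests/s")
print(f"peak / average: {peak_to_average:.2f}")
```

Under these assumptions the average comes out a little under one million requests per second, so the peak is only about one and a half times the average, which fits the observation that the workload has no real off-peak period.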

Doodle handles this much data because people use smart devices at home, such as smart lights and robot vacuum cleaners. Once connected to the network, these devices communicate with the Doodle platform. In addition, smart devices have various timed triggers: for example, home cameras on patrol and robot vacuums must report their location information to Doodle's Zeus platform. Zeus, the most important component of the Doodle platform, is responsible for handling data reporting. In the service topology, MQTT messages from smart devices are collected by the application gateway and sent to Kafka and NSQ, where the Zeus system consumes them for decryption, processing, and storage. This article mainly describes how the storage behind Zeus was selected.

AWS Aurora

Doodle used AWS Aurora in its early days. Aurora is similar to Alibaba Cloud's PolarDB, with storage separated from compute. Doodle ran steadily on Aurora for three years, and during that time Aurora fully met demand. In those early years IoT was not yet popular and smart home devices had few users. As the business expanded, however, the number of devices grew exponentially, by three to five times every year, and Aurora could no longer withstand the data volume, especially given the requirement that responses complete within 10 milliseconds. Even sharding databases and tables and splitting the cluster could not meet Doodle's business needs.

Apache Ignite

So Doodle started trying Apache Ignite, which is also a distributed KV system, similar to PingCAP's TiKV. It is Java-based and shards data into relatively large partitions: 1 GB of data per partition. Its scaling is also not as linear as TiKV's. If Doodle's volume doubled, machines would have to be shut down during expansion, with a risk of data loss. At that time we attached Aurora behind Ignite as a disaster-recovery backup, and data was written to Aurora simultaneously. As business volume exploded, however, even Ignite could not keep up: it needed expansion, and expansion required downtime, which an IoT business cannot tolerate.

TiDB 3.0 and 4.0

When Doodle set out to replace the Ignite cluster in 2019, the U.S. storage cluster had grown to 12 nodes. This coincided with PingCAP's TUG event in Hangzhou, so we ran a validation test of TiDB 3.0. However, TiDB 3.0 did not meet Doodle's requirements because of its high latency, and we abandoned it after several months of trying. In 2020, TiDB 4.0 was released. We tested it as well; it was a great improvement over 3.0, but the problems of high latency and insufficient throughput remained.

PingCAP's R&D team then analyzed the problem in depth and found that most of the time was spent in the SQL parser layer, while the TiKV storage layer sat almost completely idle. Doodle's workload is write-heavy with strict latency requirements, so this fell far short of expectations. Since the latency was all consumed in the SQL parser layer, and IoT writes have a high TPS but simple business logic, could we remove the SQL layer and write directly to TiKV? We consulted PingCAP's official API documentation for TiKV, which states that it supports Java, Go, and Rust, and began to experiment.

The results in production were a pleasant surprise and won recognition across the whole company. After that, we launched TiDB 4.0 in all regions of the world. After a year of testing, we found no problems in normal operation. Where we originally needed 12 machines, we now need only 3 with the same configuration; that is, the hardware costs a quarter of what it did before. When Doodle went live, throughput was already 200,000 TPS. In the North American cluster, which ran version 4.0.8 at the time, 99% of queries responded within 150 microseconds, and the write time was 360 microseconds (less than one millisecond).
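To make the idea of writing straight to a key-value layer more concrete, here is a minimal sketch of how device reports might be laid out as ordered keys without any SQL in between. It is not Doodle's schema and does not use the TiKV client: `SortedKV` is an in-memory stand-in for an ordered KV store, and the `report:` prefix, the `report_key` helper, and the device IDs are all hypothetical.

```python
import bisect
import struct


class SortedKV:
    """In-memory stand-in for an ordered key-value store (not the TiKV client)."""

    def __init__(self):
        self._keys = []     # keys kept in sorted byte order
        self._values = {}   # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes):
        return self._values.get(key)

    def scan_prefix(self, prefix: bytes):
        """Return (key, value) pairs whose key starts with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        rows = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            rows.append((key, self._values[key]))
        return rows


def report_key(device_id: str, ts_ms: int) -> bytes:
    """Hypothetical key layout: prefix, device ID, then a big-endian timestamp.

    Big-endian packing makes byte order match time order, so a prefix scan
    returns one device's reports chronologically. Assumes IDs contain no ':'.
    """
    return b"report:" + device_id.encode() + b":" + struct.pack(">Q", ts_ms)


kv = SortedKV()
kv.put(report_key("dev-a", 2000), b"on")
kv.put(report_key("dev-a", 1000), b"off")
kv.put(report_key("dev-b", 1500), b"on")

# All reports for one device, oldest first, with a single range read.
history = kv.scan_prefix(b"report:dev-a:")
```

With a layout like this, a write is a single put and a device's history is a single range read, which is the kind of simple access pattern that does not need a SQL parser in the path.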

New challenges: deployment across regions

But new challenges followed. On AWS, we deploy across three availability zones; in Frankfurt, for example, the deployment spans zones A, B, and C. Communication among the three replicas consumes traffic, and that traffic is billed. All of Doodle's applications are also deployed across the three zones and need to call into other zones, and TiKV does not have a same-zone-first call strategy like Dubbo's, so the cost is high even though Doodle runs only a quarter of the machines it used to. The current solution enables RPC-level compression to reduce network traffic. However, this only reduces Region replication traffic; the cross-zone traffic generated by application code is not reduced. We found that the cause is that the TiKV server does not perform server-side filtering: data stored in TiKV must be fetched to the application for filtering and then written back. We have discussed this with the TiKV R&D team. Later versions may introduce server-side filtering, which would reduce server load and might also bring traffic costs down.
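The difference between filtering in the application and filtering on the server can be illustrated with a small simulation. The sketch below does not model TiKV itself; it simply counts the bytes that would cross the network in each case, using a plain dictionary as the store and hypothetical device keys and status values.

```python
def client_side_filter(store, predicate):
    """Every record is shipped to the application, which then filters it."""
    transferred = 0
    matches = []
    for key, value in store.items():
        transferred += len(key) + len(value)   # all records cross the network
        if predicate(value):
            matches.append(key)
    return matches, transferred


def server_side_filter(store, predicate):
    """The store applies the predicate and ships only matching records."""
    transferred = 0
    matches = []
    for key, value in store.items():
        if predicate(value):
            transferred += len(key) + len(value)   # only matches cross the network
            matches.append(key)
    return matches, transferred


# Hypothetical data: 100 devices, one in ten currently online.
store = {
    f"dev-{i:03d}".encode(): (b"online" if i % 10 == 0 else b"offline")
    for i in range(100)
}


def is_online(value: bytes) -> bool:
    return value == b"online"


client_matches, client_bytes = client_side_filter(store, is_online)
server_matches, server_bytes = server_side_filter(store, is_online)

print(f"client-side filtering transferred {client_bytes} bytes")
print(f"server-side filtering transferred {server_bytes} bytes")
```

Both approaches return the same keys, but the client-side version pays for every record it reads. When the application and the data sit in different availability zones, that extra transfer is exactly the traffic that gets billed.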

Cost reduction and efficiency increase: Architecture upgrade from X86 to ARM

The IoT industry is focused on cost reduction because its gross margins are very low and we need to reduce the cost per module. In June 2020, AWS launched the C6g instance family, which claimed 40% better price performance than the previous-generation C5. We therefore tried AWS C6g, but found that with a direct TiUP deployment the response time was 6 to 7 times slower than on the x86 architecture. In other words, TiUP deploys a generically compiled build that is not tuned for the hardware. After testing and verification, we found that the TiKV version of the time did not support the SSE instruction set; more precisely, the RocksDB version used by TiKV 4.0 did not support it.

The SSE instruction set is mainly used for CRC checks, hashing, and floating-point operations. The compromise at that time was a mixed deployment: TiKV ran on x86 while the other nodes ran on ARM. This brought its own inconvenience: during version upgrades, the image being pointed to was sometimes x86 and sometimes ARM, which was very troublesome. So the whole deployment was cut back to x86. This year, TiKV 5.0 was released with support for the AArch64-optimized CRC32C instructions and the SSE 4.2 instruction set, which requires a RocksDB version later than 6.1.2. TiKV 5.0 uses RocksDB 6.4.6, and the TiKV optimizations for the SSE instruction set can be found in the TiKV project. In other words, TiKV 5.0 now fully supports the SSE instruction set, and this will be a focus of our testing in the second half of the year. That could bring costs down even further.

The business outlook

In the future, with the help of TiDB 5.0 and 5.1, Doodle is confident that it can absorb severalfold business growth, and TiKV traffic is expected to increase three to four times by the end of the year. The big data platform also uses TiDB to power its large-screen dashboards, and the IoT device pipeline is considering TiKV 5.1 as its storage to further improve ease of use. Deployment of the ARM version of TiDB is also planned for the second half of the year.