Xu Jia Qing, Bigo DBA, TUG Head of South China.

Founded in 2014, Bigo is a fast-growing technology company. Building on its audio and video processing, global real-time audio and video transmission, artificial intelligence, and CDN technologies, Bigo has launched a series of audio and video social and content products, including Bigo Live, Likee, IMO, and Hello Voice. It has nearly 400 million monthly active users worldwide, and its products and services cover more than 150 countries and regions.

Use of TiDB 4.0 at Bigo

We started using the TiDB 4.0 beta earlier this year and built a test-environment cluster that kept pace with each new TiDB release, so we recently upgraded it to 4.0 GA.

For the production launch of TiDB, we were bold enough to deploy two production clusters. Neither cluster is very large, and both serve mostly analytical workloads. One handles network monitoring analysis, which is characterized by fast data growth, analytics-leaning SQL, and certain response-time requirements. The other is a downstream store for big data: results of big data analysis land there so that online real-time services can consume them. Single tables are usually not small, and most of the workload is back-office summary reporting for operations. We used TiUP for cluster deployment, which is the officially recommended method. In short, TiUP is much better than TiDB Ansible and solved most of our problems.

We also use more of TiDB 4.0's components and features, including Pump, TiFlash, and others. Since Bigo's business spans the globe, we want to deploy services in every continent (or region), but cross-continent latency is unacceptable for some workloads, so we use Pump to replicate data between continents. As for TiFlash, I will share more hands-on experience in a moment; those of you who know me from the TiDB community know that I say "TiFlash is really great" on all kinds of occasions.

Why are we using TiDB 4.0?

On the one hand, the business has new requirements, and as DBAs we try to meet them.

For example, TiDB 4.0 lets us control case sensitivity through character-set collation rules. Before that we had no such control, so business colleagues often told us that after we deployed TiDB, string comparison and sorting felt "fake"; and honestly, before 4.0 it really was, because there was simply no way to control it.
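
As a minimal sketch of what this looks like in practice, assuming a TiDB 4.0 cluster bootstrapped with new_collations_enabled_on_first_bootstrap = true, a local SQL port of 4000, and the pymysql driver (the table and values are made up for illustration):

```python
import pymysql

# Hypothetical connection details for a local TiDB 4.0 instance.
conn = pymysql.connect(host="127.0.0.1", port=4000,
                       user="root", password="", database="test")

with conn.cursor() as cur:
    # With the 4.0 new-collation framework enabled at bootstrap, a *_ci
    # collation is genuinely case-insensitive instead of silently ignored.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS t_users (
            name VARCHAR(64)
        ) COLLATE utf8mb4_general_ci
    """)
    cur.execute("INSERT INTO t_users VALUES ('Alice')")
    cur.execute("SELECT COUNT(*) FROM t_users WHERE name = 'alice'")
    print(cur.fetchone()[0])   # 1: the comparison now ignores case

conn.commit()
```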

Some e-commerce and financial platforms also need pessimistic transactions. TiDB 4.0 implements pessimistic locking, so the business side no longer has to handle write conflicts and consistency issues on its own.
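
For illustration, a minimal sketch of the two ways to switch this on (the accounts table and amounts are placeholders):

```python
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=4000,
                       user="root", password="", database="test")

with conn.cursor() as cur:
    # Option 1: make pessimistic locking the default for new transactions.
    cur.execute("SET GLOBAL tidb_txn_mode = 'pessimistic'")

    # Option 2: start a single pessimistic transaction explicitly.
    cur.execute("BEGIN PESSIMISTIC")
    # The row stays locked until COMMIT, so a concurrent writer waits
    # instead of failing at commit time with a write conflict.
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    cur.execute("COMMIT")
```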

On the other hand, there are operations and maintenance requirements.

We now use TiUP's package-management approach to deploy TiDB and to roll out upgrades. With TiDB Ansible we had to do a lot of extra work, such as requesting authorizations, and it was not as flexible as TiUP's playground for spinning up clusters. TiUP is simply more flexible, and its package-management mechanism does not require the underlying system-administration team to grant permissions for every operation, which saves a lot of back-and-forth. We can also use TiUP to view the real status of the cluster.
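
For example, day-to-day TiUP operations boil down to a few commands; here they are driven from Python purely for illustration (the cluster name, versions, and topology file are placeholders; tiup must already be installed):

```python
import subprocess

def tiup(*args):
    # Thin wrapper around the tiup cluster sub-commands.
    subprocess.run(["tiup", "cluster", *args], check=True)

tiup("deploy", "bigo-demo", "v4.0.0", "topology.yaml")  # initial deployment
tiup("display", "bigo-demo")                            # view real cluster status
tiup("upgrade", "bigo-demo", "v4.0.2")                  # rolling upgrade
```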

There is also backup. Many of our core businesses have wanted to back up their TiDB data, but previously the only options were Mydumper or disk snapshots, which are not very friendly for DBA operations. TiDB 4.0 now ships a much more complete backup capability, so we will do everything we can to bring TiDB 4.0 to more businesses.
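
The 4.0 backup capability here is the distributed BR (Backup & Restore) tool; a hedged sketch of kicking off a full backup from Python (the PD address and storage path are placeholders, and the br binary is assumed to be on PATH):

```python
import subprocess

# Full backup of the cluster reachable via this PD endpoint.
subprocess.run(
    [
        "br", "backup", "full",
        "--pd", "127.0.0.1:2379",
        "--storage", "local:///data/backup/full-20200701",
    ],
    check=True,
)
```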

Of course, the most important and core reason is TiFlash. Architecturally, TiFlash maintains a columnar replica alongside the row-based data in TiKV. That replica supports consistent reads through MVCC and, more importantly, it is column storage.
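
Adding that columnar copy is a one-line DDL per table. A sketch (the orders table is a placeholder), with a query against information_schema to watch replication progress:

```python
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=4000,
                       user="root", password="", database="test")

with conn.cursor() as cur:
    # Ask TiDB to maintain one columnar (TiFlash) replica of the table.
    # The replica is fed through the Raft learner mechanism, so it stays
    # consistent with TiKV without any extra ETL pipeline.
    cur.execute("ALTER TABLE orders SET TIFLASH REPLICA 1")

    # AVAILABLE = 1 means the replica has caught up and can serve reads.
    cur.execute("""
        SELECT TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
        FROM information_schema.tiflash_replica
        WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'orders'
    """)
    print(cur.fetchall())
```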

We used to classify every request as either OLTP or OLAP: real-time requests from online business are OLTP, and big data aggregation is OLAP. But if we think about it carefully, does a real online business scenario generate only OLTP requests? And is all the data aggregated by big data really queried only as OLAP? Not necessarily. Many business scenarios have operational needs, such as real-time report queries. With the old big data stack the results are usually T+1 or T+N rather than real time. Yet when the online business needs aggregate queries for its management platform, those queries are sometimes thrown straight at the online TP storage, and some teams work around this by adding indexes or building heterogeneous copies of the data. TiFlash now gives us another option: solve the real-time analysis scenario simply by adding a columnar replica.

In the bottom right corner of the figure above there is a SQL statement (with sensitive information redacted). As you can see, it runs to more than 100 lines, joins a pile of tables, has GROUP BY, several filter conditions, and a SUM calculation. It is a real-time business request. Run on online MySQL, or on TiDB with only the TiKV engine, it can take minutes or longer. For this business we bravely tried TiFlash, and it turned out that simply switching to the TiFlash engine brought such requests down to about 50 seconds, which is much friendlier to the business, and the data stays real time: each TiFlash Region is replicated from the corresponding TiKV Region, with consistency guaranteed through Raft.
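
The real SQL cannot be shown here, but the way such a query is steered toward TiFlash looks roughly like this; the orders table and the aggregation below are simplified stand-ins for the 100-plus-line statement:

```python
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=4000,
                       user="root", password="", database="test")

with conn.cursor() as cur:
    # Let the optimizer pick TiKV or TiFlash per table for this session.
    cur.execute("SET SESSION tidb_isolation_read_engines = 'tidb,tikv,tiflash'")

    # Or pin a particular scan to the columnar engine with an optimizer hint.
    cur.execute("""
        SELECT /*+ READ_FROM_STORAGE(TIFLASH[o]) */
               o.region, SUM(o.amount) AS total
        FROM orders o
        WHERE o.created_at >= '2020-06-01'
        GROUP BY o.region
    """)
    print(cur.fetchall())
```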

In addition, TiDB can be combined with our existing big data stack: TiSpark reads directly from the engine layer underneath, both TiKV and TiFlash.
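
A rough PySpark sketch of that path (the TiSpark jar must be on the Spark classpath; the PD address, database, and table names are placeholders):

```python
from pyspark.sql import SparkSession

# TiSpark plugs into Spark SQL as an extension and scans TiKV/TiFlash
# directly, so no export/import pipeline is needed.
spark = (
    SparkSession.builder
    .appName("tispark-demo")
    .config("spark.sql.extensions", "org.apache.spark.sql.TiExtensions")
    .config("spark.tispark.pd.addresses", "127.0.0.1:2379")
    .getOrCreate()
)

spark.sql("USE test")
spark.sql("SELECT region, SUM(amount) FROM orders GROUP BY region").show()
```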

Of course, even after such optimizations the business side often looks at us and asks: can the DBAs do a bit more? So we turned to the PingCAP technical team: can you push a little harder? The answer was yes.

The TiDB 4.0 GA release adds two new capabilities for TiFlash: more operators can be pushed down, and more Regions can be merged into a single request. In other words, two new optimization switches are available. They are off by default, but we can turn them on with a setting, and once enabled the improvement was obvious: in our tests the query time more than halved, from about 25 seconds to 11 or 12 seconds. At that point it is not far from a real-time online query, which means we effectively have real-time online analytics.
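
If memory serves, these correspond to the broadcast-join push-down and batch-coprocessor switches; treat the exact variable names below as my assumption and check them against the release notes for your version:

```python
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=4000,
                       user="root", password="", database="test")

with conn.cursor() as cur:
    # Assumed names for the two 4.0 GA switches described above:
    #   tidb_opt_broadcast_join - push broadcast joins down to TiFlash
    #   tidb_allow_batch_cop    - merge coprocessor requests across Regions
    # Both ship disabled by default; verify the names for your release.
    cur.execute("SET SESSION tidb_opt_broadcast_join = 1")
    cur.execute("SET SESSION tidb_allow_batch_cop = 1")
```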

So now our businesses tend to choose TiFlash rather than exporting and re-importing data through the big data stack, because those import/export pipelines are long and hard to verify end to end. That is also why TiFlash won our final engine selection.

In general, every TiDB 4.0 cluster across Bigo's online business will carry at least one TiFlash replica. For DBAs, stability matters, and TiFlash guarantees that things at least do not get worse: even if the TiFlash replica dies and no other TiFlash replica is available, the operators can still run on TiKV. This is why we, as DBAs, chose TiDB 4.0 and TiFlash.

What else will we do with TiDB 4.0?

1. Trying TiCDC for synchronization scenarios

Our business needs multi-point synchronization, with writes happening on several continents. We started with Pump + Drainer, but multi-point writes require extra work on the Drainer side, and the Pump + Drainer deployment has usability issues, consumes a lot of resources, and produces intermediate binlogs we do not really want. So we are trying the new TiCDC data-synchronization tool to replicate between multiple TiDB clusters and other data sources. We may also do some development of our own, such as conflict resolution and data merging.
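
A hedged sketch of what a single TiCDC changefeed between two clusters looks like (the cdc binary must be available; the PD address, sink credentials, and changefeed id are placeholders):

```python
import subprocess

# Replicate changes from the source cluster (via its PD) to another TiDB.
subprocess.run(
    [
        "cdc", "cli", "changefeed", "create",
        "--pd=http://pd-eu.example.com:2379",
        "--sink-uri=mysql://sync_user:sync_pass@tidb-us.example.com:4000/",
        "--changefeed-id=eu-to-us",
    ],
    check=True,
)
```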

2. PD-based service discovery

In a typical TiDB deployment, TiKV and TiFlash form the engine layer while the TiDB server layer is a stateless service. Most businesses connect to that stateless layer through an extra proxy-like layer that forwards requests for load balancing. This is not very elegant: once the deployment moves into containers and TiDB servers are scaled out and in elastically, the proxy may not notice the back-end changes in time.

The good news is that PD exposes etcd-based interfaces from which we can directly discover the available TiDB nodes. When we scale elastically, for example when a traffic peak arrives, adding a stateless TiDB server is very fast, PD learns about the new instance quickly, and the client can discover that service directly and send traffic to it. A newly added server therefore goes online much faster, without extra changes on the business side. With the traditional approach we would have to re-register the new server with the proxy after it registers with the service, which slows down the whole process.
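
As a sketch of what the client-side discovery could look like (using the python-etcd3 client; the /topology/tidb/ key prefix is what TiDB 4.0 registers for its Dashboard, but treat the exact key layout and JSON fields as assumptions and verify them on your cluster):

```python
import json
import etcd3  # python-etcd3; PD embeds an etcd server on its client port

client = etcd3.client(host="pd.example.com", port=2379)

tidb_nodes = []
for value, meta in client.get_prefix("/topology/tidb/"):
    key = meta.key.decode()
    if key.endswith("/info"):
        info = json.loads(value)
        # Assumed field names; check the actual JSON on your cluster.
        tidb_nodes.append((info.get("ip"), info.get("listening_port")))

print(tidb_nodes)  # feed these endpoints straight into a client-side balancer
```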

3. More Dashboard-based exploration

Here is a hotspot heatmap taken from TiDB Dashboard:

As DBAs we sometimes end up in a stand-off with the business side: "I didn't send those requests, I didn't change anything, my data isn't hot, it's perfectly uniform." Now that we have TiDB Dashboard, we can see for ourselves what the data actually looks like. In this hotspot map, the horizontal axis is time and the vertical axis shows how the data is split into partitions and shards, so you can see how hot each piece of data is at a given moment. When business colleagues insist that their data has no hotspots, is completely uniform, and is perfectly fine, the DBA can pull out the heatmap and point precisely at which data is hot and which data has changed, and then ask whether some trend or change in the business is causing the hotspot.

In many cases, business colleagues cannot fully keep track of their own data and how it is trending, so having such a "complete data snapshot" makes our work much easier. Dashboard brings other benefits too, such as better slow-query inspection, a way to search the various logs, and even one-click capture of a perf flame graph to analyze what is wrong with the current cluster.