About the author: Sun Xiaoguang, head of Zhihu Technology Platform, teamed up with Ning Xue (@inke), Menglong Huang (@pingcap), and Bo Feng (@Zhihu) to take part in TiDB Hackathon 2019. Their project, TiSearch, won the CTO Special Award.

Search is one of the most important things people do in apps. For a product like Zhihu, which hosts a massive amount of high-quality content, helping users find what they are looking for quickly and accurately is essential. Full-text search is the basic capability hidden behind that simple search box.

We are gradually migrating more and more business data to TiDB. For now, the only way to retrieve content in TiDB is a simple SQL LIKE query. Even setting performance aside, LIKE cannot meet some common information retrieval requirements in search scenarios. For example, in the scenarios shown in the following figure, LIKE alone may return ambiguous results or fail to return records that do match the search criteria.
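As a rough illustration of these two failure modes, the sketch below runs LIKE queries against an in-memory SQLite table. SQLite is only a stand-in for TiDB here, and the table, rows, and the `like_search` helper are made up for the example; they are not the scenarios from the original figure.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for TiDB (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO posts (id, title) VALUES (?, ?)",
    [
        (1, "Modern art in Beijing"),
        (2, "How to start a party"),
        (3, "Full-text search in databases"),
    ],
)


def like_search(pattern):
    """Return the ids of posts whose title matches the LIKE pattern."""
    cursor = conn.execute(
        "SELECT id FROM posts WHERE title LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in cursor]


# Ambiguous result: the user means the word "art", but LIKE matches any
# substring, so "start" and "party" in row 2 match as well.
ambiguous = like_search("%art%")

# Missed result: the same words in a different order no longer match row 3.
missed = like_search("%search full-text%")

print(ambiguous, missed)
```

A search engine avoids both problems by splitting text into terms at index time and matching on terms rather than on raw substrings.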

Because TiDB currently lacks full-text search, we still have to synchronize data to a search engine in the traditional way. That means a lot of tedious data pipeline work to maintain full-text indexes of business data, tailored to the characteristics of each business.

To reduce this repetitive work, at this year’s TiDB Hackathon we tried to bring full-text search into TiDB, making it possible to search text data stored in TiDB anytime and anywhere. Here’s the final look:

The project design

Getting TiDB to support full-text search within a single day of Hackathon time was very difficult, so from the very beginning we chose a very safe design: use Elasticsearch (referred to below as ES) to extend TiDB with full-text search capability.

Why ES? First, we can take full advantage of the mature ES ecosystem and get Chinese word segmentation and query understanding out of the box. Second, the synergy that comes from integrating the two ecosystems is in line with TiDB’s value of community cooperation.

To limit the workload, we did not use the TiKV Raft Learner mechanism or TiDB Binlog to synchronize data to the full-text index. Instead, we used the most conservative option, double writing: the full-text index update is added directly to TiDB’s write path.

The structure is shown in the figure above. TiDB serves as a bridge between ES and TiKV, and all interaction with ES is embedded in TiDB and performed by it directly.
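The double-write idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the TiSearch implementation: the `DoubleWriter` class is hypothetical, and two plain dictionaries stand in for TiKV and the ES index.

```python
class DoubleWriter:
    """Hypothetical sketch of a write path that updates two stores in turn."""

    def __init__(self):
        self.kv = {}     # stand-in for rows stored in TiKV, keyed by primary key
        self.index = {}  # stand-in for documents in the ES index, keyed by primary key

    def write(self, pk, row, fulltext_columns):
        # First write: store the full row.
        self.kv[pk] = dict(row)
        # Second write: index only the columns covered by the full-text index.
        self.index[pk] = {c: row[c] for c in fulltext_columns if c in row}

    def delete(self, pk):
        # Deletes have to be applied to both stores as well.
        self.kv.pop(pk, None)
        self.index.pop(pk, None)


writer = DoubleWriter()
writer.write(1, {"title": "Full-text search", "views": 3}, ["title"])
writer.write(2, {"title": "Raft learner", "views": 7}, ["title"])
writer.delete(2)
print(writer.kv, writer.index)
```

The two writes are not atomic: if the second one fails after the first succeeds, the stores diverge. That is the usual weakness of double writing in general.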

In TiDB, we added metadata to the table to record its FULLTEXT indexes, and created the corresponding index and mapping on ES. Each text column in a FULLTEXT index is added to the mapping with the analyzer we need, so that full-text search can be run against those columns through the index.
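For readers unfamiliar with ES mappings, the sketch below builds the kind of index body this describes: one `text` field per indexed column, each with an analyzer. The `build_fulltext_mapping` helper and the column names are hypothetical, and the built-in `standard` analyzer is only a placeholder; Chinese word segmentation would need an analyzer suited to Chinese text, which the article does not name.

```python
import json


def build_fulltext_mapping(text_columns, analyzer="standard"):
    """Build an ES index body with one analyzed text field per column.

    The column names and the analyzer are assumptions for illustration.
    """
    properties = {
        column: {"type": "text", "analyzer": analyzer}
        for column in text_columns
    }
    return {"mappings": {"properties": properties}}


mapping = build_fulltext_mapping(["title", "content"])
print(json.dumps(mapping, indent=2))
```

A body of this shape is what would be sent to ES when the index is created for a table's FULLTEXT index.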

With the ES index in place, we only need to apply the corresponding update to the ES index whenever we write or update data, which keeps the data in TiDB and ES in sync. A query now flows as follows:

  1. TiDB parses the query sent by the user.
  2. If the query carries a full-text search hint, TiDB sends the request to ES and uses the ES index to look up the primary keys of the matching records.
  3. Once TiDB has all the primary keys, it reads the actual data internally to complete the read.
  4. TiDB returns the result to the user.
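The four steps above can be sketched as follows. Everything here is a simplified stand-in: `TinySearchIndex` is a hypothetical in-memory inverted index playing the role of ES, `rows` plays the role of the data stored in TiKV, and the boolean `fulltext_hint` replaces the real SQL hint, whose syntax is not shown in this article.

```python
class TinySearchIndex:
    """Hypothetical in-memory stand-in for the ES index: term -> primary keys."""

    def __init__(self):
        self.postings = {}

    def add(self, pk, text):
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(pk)

    def search(self, query):
        """Return the primary keys of records containing every query term."""
        terms = query.lower().split()
        if not terms:
            return []
        matches = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(matches)


def run_query(rows, index, search_text, fulltext_hint):
    # Steps 1-2: if the query carries the hint, ask the index for primary keys.
    if fulltext_hint:
        primary_keys = index.search(search_text)
    else:
        # Without the hint, fall back to a LIKE-style substring scan.
        primary_keys = sorted(
            pk for pk, row in rows.items()
            if search_text.lower() in row["title"].lower()
        )
    # Step 3: read the full rows by primary key.
    # Step 4: return them to the caller.
    return [rows[pk] for pk in primary_keys if pk in rows]


rows = {
    1: {"id": 1, "title": "TiDB full text search"},
    2: {"id": 2, "title": "Raft learner in TiKV"},
    3: {"id": 3, "title": "Search text with ES"},
}
index = TinySearchIndex()
for pk, row in rows.items():
    index.add(pk, row["title"])

with_hint = [row["id"] for row in run_query(rows, index, "text search", True)]
without_hint = [row["id"] for row in run_query(rows, index, "text search", False)]
print(with_hint, without_hint)
```

With the hint, both rows containing the two terms are found regardless of word order. The substring scan only finds the row where the words appear in exactly that order.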

Future plans

In the short 24 hours of the Hackathon, we were able to verify that integrating TiDB and ES is feasible. Of course, we are not satisfied with this double-write solution. In the future, we will follow the example of TiFlash and synchronize data changes to ES in real time based on Raft Learner, building TiDB into a true HTAP database that supports real-time full-text search, as shown in the figure below:

With Raft Learner, the write process is:

  • TiDB writes the data directly to the underlying TiKV.
  • TiKV synchronizes the written data to the ES Learner node through the Raft protocol, and the Learner node writes it to ES.

The read process is:

  • TiDB parses the query sent by the user and finds the full-text search hint.
  • TiDB sends the request to the ES Learner node.
  • The ES Learner node first uses the Raft protocol to make sure that it has the latest data and that this data has been written to ES.
  • The ES Learner node looks up the primary keys of the matching records in the ES index and returns them to TiDB.
  • TiDB retrieves the complete data using the primary keys and returns it to the client.

Compared with the earlier scheme, in which TiDB double-writes to ES and TiKV, TiDB no longer needs to interact with ES on writes. On reads, the Raft protocol guarantees that TiDB gets the latest data from ES, which ensures data consistency.
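The planned write and read paths can be sketched with a toy, single-process simulation. This is not Raft and not a design from the TiDB codebase: the `Leader` and `SearchLearner` classes are hypothetical, the leader's log stands in for data replicated by TiKV, and a dictionary stands in for ES. The one idea it illustrates is that the learner catches up to the leader's latest committed position before answering a search.

```python
class Leader:
    """Hypothetical stand-in for TiKV: an append-only log of committed writes."""

    def __init__(self):
        self.log = []

    def write(self, pk, text):
        self.log.append((pk, text))

    @property
    def commit_index(self):
        return len(self.log)


class SearchLearner:
    """Hypothetical stand-in for the ES Learner node."""

    def __init__(self, leader):
        self.leader = leader
        self.applied_index = 0
        self.index = {}  # pk -> text, stand-in for the ES index

    def catch_up(self, target_index):
        # Apply log entries to the search index until the target is reached.
        while self.applied_index < target_index:
            pk, text = self.leader.log[self.applied_index]
            self.index[pk] = text
            self.applied_index += 1

    def search(self, term):
        # Ask the leader how far the log is committed, then wait until
        # everything up to that point has been written to the index.
        read_index = self.leader.commit_index
        self.catch_up(read_index)
        return sorted(
            pk for pk, text in self.index.items()
            if term.lower() in text.lower().split()
        )


leader = Leader()
learner = SearchLearner(leader)

leader.write(1, "raft learner")
leader.write(2, "full text search")
first = learner.search("raft")

leader.write(3, "raft again")  # TiDB only writes to the leader
second = learner.search("raft")
print(first, second)
```

In this sketch the writer never touches the index, yet a search issued right after a write still sees it, because the learner applies the log up to the leader's committed position first.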

Of course, we need more help to implement all of this, and we hope to work with the community to build this very cool feature.

Closing thoughts

Thanks to my brief time on the Zhihu search team, I have a very direct sense of both the value of search and the amount of work it takes for a business to integrate with search. As more and more data lives in TiDB, being able to run full-text search on certain fields of business data at any time is very valuable. That value lies not only in doing things that used to be hard to do in SQL. More importantly, it gives business teams full-text search at close to no cost and builds a bridge for users between the relational database and the search engine, so that data can be written at any time and searched at any time. If you have any ideas about this, please contact me at [email protected].

Read the original article at https://pingcap.com/blog-cn/fulltext-search-with-tidb-and-elasticsearch/