Sun Xiaoguang, head of the Zhihu technology platform, teamed up with Xue Ning (@Inke), Huang Menglong (@PingCAP) and Feng Bo (@Zhihu) to take part in TiDB Hackathon 2019, where their project TiSearch won the CTO Special Award.

Search is one of the most important things users do in an app. For a product like Zhihu, with its huge volume of high-quality content, helping users find what they are looking for quickly and accurately is even more crucial. Behind the simple search box, full-text search is the essential capability that makes this possible.

We are gradually migrating more and more business data to TiDB, where the only tool for simple content retrieval is SQL LIKE. Even leaving performance aside, SQL LIKE cannot meet some common information retrieval requirements of search scenarios. In several scenarios, such as those shown in the following figure, using SQL LIKE alone either returns ambiguous results or fails to return results that do satisfy the search conditions.
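To make the two failure modes concrete, here is a minimal sketch. It uses a plain substring test as a stand-in for `LIKE '%term%'` and a deliberately naive tokenizer as a stand-in for a search-engine analyzer. The sample rows and helper names are hypothetical, and a real analyzer (especially for Chinese) is far more sophisticated.

```python
import re

# Hypothetical sample rows, not taken from the original article.
DOCS = {
    1: "The cat sat on the mat",
    2: "Please concatenate the two strings",
    3: "Cats are popular pets",
}


def like_match(term, docs):
    """Emulate `col LIKE '%term%'` as a lowercase substring test."""
    needle = term.lower()
    return sorted(pk for pk, text in docs.items() if needle in text.lower())


def tokenize(text):
    """Very naive analyzer: lowercase words with one trailing 's' removed."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w[:-1] if w.endswith("s") and len(w) > 1 else w for w in words}


def token_match(query, docs):
    """Return rows whose token set contains every token of the query."""
    wanted = tokenize(query)
    return sorted(pk for pk, text in docs.items() if wanted <= tokenize(text))


print(like_match("cat", DOCS))       # substring hits, including "concatenate"
print(token_match("cat", DOCS))      # whole-word hits only
print(like_match("mat cat", DOCS))   # word order defeats the substring test
print(token_match("mat cat", DOCS))  # token match ignores word order
```

The substring test returns row 2 because "concatenate" happens to contain "cat" (an ambiguous result), and it returns nothing for "mat cat" even though row 1 contains both words (a missed result).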

Because TiDB lacks full-text retrieval, we still have to synchronize data to a search engine in the traditional way. That means a lot of tedious, business-specific data pipeline work just to maintain a full-text index of business data.

To reduce this duplicated effort, at this year’s TiDB Hackathon we tried to bring full-text search to TiDB, so that text data stored in TiDB can be searched anytime, anywhere. Here is the final result:

Project design

Getting TiDB to support full-text retrieval natively within the 24 hours of a Hackathon would have been hard, so from the very beginning we chose a safe design: extend TiDB with full-text retrieval by integrating Elasticsearch (ES from here on).

Why ES? On the one hand, we can take full advantage of the mature ES ecosystem and get Chinese word segmentation and query understanding out of the box. On the other hand, the synergy that comes from integrating the two ecosystems fits TiDB’s value of community cooperation.

To keep the workload manageable, we did not build the data synchronization for the full-text index on the TiKV Raft Learner mechanism or on TiDB Binlog. Instead we chose the most conservative option, double write: the full-text index update is added directly to TiDB’s write path.
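The double-write idea can be sketched roughly as follows. This is a toy model, not the Hackathon code: two dictionaries stand in for TiKV and ES, the class and method names are hypothetical, and the rollback on failure is an illustrative assumption about how one might keep the two stores from diverging.

```python
class DoubleWriteStore:
    """Toy write path that updates rows and a text index together."""

    def __init__(self, fulltext_columns):
        self.fulltext_columns = list(fulltext_columns)
        self.rows = {}   # stand-in for TiKV: primary key -> full row
        self.index = {}  # stand-in for ES: primary key -> indexed text fields

    def put(self, pk, row):
        """Insert or update a row, then mirror its text columns into the index."""
        previous = self.rows.get(pk)
        self.rows[pk] = dict(row)
        try:
            self.index[pk] = {col: row[col] for col in self.fulltext_columns}
        except KeyError:
            # The index update failed: undo the row write so the two
            # stores do not diverge, then report the error.
            if previous is None:
                del self.rows[pk]
            else:
                self.rows[pk] = previous
            raise

    def delete(self, pk):
        """Remove a row and its index entry."""
        self.rows.pop(pk, None)
        self.index.pop(pk, None)


store = DoubleWriteStore(["title"])
store.put(1, {"title": "hello search", "views": 3})
print(store.index)
```

The sketch also hints at why double write is the conservative rather than the ideal choice: every write pays for two updates, and keeping them consistent when one side fails is the caller's problem.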

The architecture is shown in the figure above. TiDB acts as a bridge between ES and TiKV, and all interaction with ES happens inside TiDB.

Inside TiDB, we added metadata to the table to record its FULLTEXT index, and created the corresponding index and mapping on ES. Every text column in the FULLTEXT index is added to the mapping with the desired analyzer, so that the column can be searched in full text through the index.
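As a rough illustration of that last step, the sketch below turns a list of text columns into an ES index-creation body. The helper name and the column names are hypothetical. The body uses the typeless mapping layout of Elasticsearch 7 and later, and the built-in `standard` analyzer is only a placeholder for whichever analyzer a deployment would actually choose (for Chinese text this would normally come from an analysis plugin).

```python
import json


def build_index_body(text_columns, analyzer="standard"):
    """Build an ES create-index body that maps each column as analyzed text."""
    properties = {
        col: {"type": "text", "analyzer": analyzer} for col in text_columns
    }
    return {"mappings": {"properties": properties}}


# Hypothetical FULLTEXT index over two text columns.
body = build_index_body(["title", "content"])
print(json.dumps(body, indent=2))
```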

With the ES index in place, we only need to apply the corresponding update to it whenever data is written or updated, which keeps TiDB and ES in sync. For queries, the flow is now as follows:

  1. TiDB parses the query sent by the user.
  2. If the query carries a hint for full-text retrieval, TiDB sends the request to ES, which uses its index to find the primary keys of the matching records.
  3. Once TiDB has all the record primary keys, it reads the actual rows from its own storage to complete the read.
  4. TiDB returns the results to the user.
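The four steps above can be sketched with in-memory stand-ins. The source does not show the hint syntax, so the parsed query is modeled as a small dataclass with an optional search term. All names here are hypothetical, and the word-membership test is only a placeholder for a real ES query.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedQuery:
    table: str
    fulltext_term: Optional[str] = None  # set only when the hint is present


def search_index(index, term):
    """Stand-in for the ES request: primary keys whose text contains the word."""
    needle = term.lower()
    return sorted(pk for pk, text in index.items() if needle in text.lower().split())


def execute(query, rows, index):
    """Steps 2 to 4: consult the index if hinted, then read rows by primary key."""
    if query.fulltext_term is None:
        return [rows[pk] for pk in sorted(rows)]      # ordinary read, no ES
    pks = search_index(index, query.fulltext_term)    # step 2: ES returns keys
    return [rows[pk] for pk in pks if pk in rows]     # step 3: fetch real rows


ROWS = {
    1: {"id": 1, "title": "TiDB full text search"},
    2: {"id": 2, "title": "Raft learner notes"},
}
INDEX = {pk: row["title"] for pk, row in ROWS.items()}

result = execute(ParsedQuery("articles", fulltext_term="search"), ROWS, INDEX)
print(result)
```

Note that ES only ever returns primary keys in this design; the row data itself always comes from TiDB.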

Future plans

The 24 hours of the Hackathon were enough to verify that integrating TiDB and ES is feasible, but of course we will not settle for the double-write scheme. In the future we plan to follow TiFlash’s example and synchronize data changes to ES in real time through Raft Learner, turning TiDB into a true HTAP database that also supports real-time full-text retrieval, as shown in the figure below:

With Raft Learner, the write process is:

  • TiDB writes data directly to the underlying TiKV.
  • TiKV replicates the written data to the ES Learner node through the Raft protocol, and the Learner node writes it to ES.
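A heavily simplified sketch of this write path follows. A Python list stands in for the replicated Raft log and a dictionary stands in for ES; there is no election, quorum or network here, and the class names are hypothetical. The point is only that the writer appends to the log and never touches the index, while the learner applies log entries in order.

```python
class RaftLog:
    """Stand-in for the replicated log that TiKV writes go through."""

    def __init__(self):
        self.entries = []  # each entry is (op, primary_key, row_or_None)

    def append(self, op, pk, row=None):
        self.entries.append((op, pk, row))
        return len(self.entries)  # 1-based index of the new entry


class IndexLearner:
    """Stand-in for the ES Learner node: replays the log into an index."""

    def __init__(self, log):
        self.log = log
        self.applied = 0  # number of log entries already applied
        self.index = {}   # stand-in for ES: primary key -> row

    def catch_up(self):
        """Apply, in order, every log entry not yet applied."""
        while self.applied < len(self.log.entries):
            op, pk, row = self.log.entries[self.applied]
            if op == "put":
                self.index[pk] = row
            else:  # "delete"
                self.index.pop(pk, None)
            self.applied += 1


log = RaftLog()
learner = IndexLearner(log)
log.append("put", 1, {"title": "hello"})   # the writer only appends to the log
log.append("put", 2, {"title": "world"})
log.append("delete", 1)
learner.catch_up()                         # the learner updates the index
print(learner.index)
```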

For the read process:

  • TiDB parses the query sent by the user and finds the hint for full-text retrieval.
  • TiDB sends the request to the ES Learner node.
  • The ES Learner node first uses the Raft protocol to make sure that it holds the latest data and that this data has been written to ES.
  • The ES Learner node looks up the matching record primary keys in the ES index and returns them to TiDB.
  • TiDB fetches the complete rows by primary key and returns them to the client.
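The source does not say how the Learner confirms that it is up to date. One common approach, assumed here purely for illustration, is a read-index style check: ask the leader for its current commit index and apply the log up to that point before searching. The sketch below models this with in-memory objects; all class and function names are hypothetical, and the synchronous `apply_until` call stands in for what would be a wait on replication in a real system.

```python
class Leader:
    """Stand-in for the Raft leader holding the committed log."""

    def __init__(self):
        self.log = []  # committed (primary_key, text) entries

    def write(self, pk, text):
        self.log.append((pk, text))

    def commit_index(self):
        return len(self.log)


class ReadIndexLearner:
    """Stand-in for the ES Learner node with a read-index check."""

    def __init__(self, leader):
        self.leader = leader
        self.applied = 0
        self.index = {}  # stand-in for ES: primary key -> text

    def apply_until(self, target):
        """Apply log entries until `target` entries have been applied."""
        while self.applied < target:
            pk, text = self.leader.log[self.applied]
            self.index[pk] = text
            self.applied += 1

    def search(self, term):
        """Catch up to the leader's commit index, then query the index."""
        self.apply_until(self.leader.commit_index())
        needle = term.lower()
        return sorted(
            pk for pk, text in self.index.items() if needle in text.lower().split()
        )


def fulltext_read(term, learner, rows):
    """TiDB side: get primary keys from the learner, then fetch the rows."""
    return [rows[pk] for pk in learner.search(term)]


leader = Leader()
learner = ReadIndexLearner(leader)
rows = {1: {"id": 1, "title": "hello raft"}}
leader.write(1, "hello raft")  # the learner has not applied this yet
print(fulltext_read("raft", learner, rows))
```

Because the learner catches up before it answers, a read issued right after a write still sees that write, which is the consistency property the steps above rely on.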

As you can see, TiDB no longer needs to interact with ES on the write path: it writes only to TiKV, and the data reaches ES through Raft. On the read path, the Raft protocol lets TiDB get the latest data from ES, which ensures data consistency.

Of course, implementing all of this will take more help, and we hope to work with the community to complete this very cool feature.

Closing thoughts

Thanks to a short stint on the Zhihu search team, I have a very direct sense of both the value of search and the amount of work it takes for a business to integrate with it. Now that more and more data lives in TiDB, there is great value in being able to run full-text retrieval on selected fields of business data at any time. That value lies not only in doing things that were difficult to do with SQL alone. More importantly, it gives business teams full-text retrieval at close to zero cost and gives users a bridge between the relational database and the search engine, so that data can be searched as soon as it is written. If you have any ideas, please contact me at [email protected].

pingcap.com/blog-cn/ful…