preface

I built a Chinese full-text search system on PostgreSQL. After tuning the database configuration and the word segmentation, basic queries were fully supported, but I still ran into some annoying problems in day-to-day use, both in query quality and in query efficiency. Fortunately, they all got solved in the end.

I found the process quite instructive, so today I am summarizing it and sharing it here.

Reprinting this post is welcome; please include the source: http://www.cnblogs.com/zhenbianshu/p/8253131.html


Use B-tree indexes to improve search results

Word segmentation problem

At the beginning, the problem was word segmentation:

  • Chinese is rich and often ambiguous. No word segmentation plugin can segment classic ambiguous phrases like "table tennis auction" or "Nanjing Yangtze River Bridge" with 100% accuracy, and that of course includes the SCWS segmentation library we use;
  • What we search over are POI (point of interest) names, and many of these names carry little semantic meaning, which makes ambiguous segmentation even more likely;

SCWS supports fairly flexible segmentation levels. To split out as many words as possible so that the target results are covered, we set the SCWS segmentation level to 7 (see above if you are not familiar with this), but that introduced an even stranger problem: searching for "Tiananmen" did not find "Tiananmen Square"…

The reason is also puzzling at first:

  • For "Tiananmen Square", the segmentation result vector tsv is 'Tian'an':2 'Tiananmen Square':1 'Square':4 'Menguang':3;
  • The query vector tsq produced by to_tsquery('parser', 'Tiananmen') turns out to be 'Tiananmen' & 'Tian'an' & 'Anmen';
  • The query statement is SELECT * FROM table WHERE tsv @@ tsq. Because the tsv does not contain every lexeme in the tsq ('Anmen', for one, is missing), the match fails, as the sketch below illustrates.
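The behaviour can be reproduced with plain tsvector/tsquery literals; the lexemes below are romanized stand-ins for the Chinese ones:

    -- An '&' tsquery matches only if EVERY one of its lexemes is present in the tsvector,
    -- so a single missing lexeme is enough to break the match.
    SELECT 'tianan:2 tiananmen_square:1 square:4 menguang:3'::tsvector
           @@ 'tiananmen & tianan & anmen'::tsquery;   -- false: 'tiananmen' and 'anmen' are absent
    SELECT 'tianan:2 tiananmen_square:1 square:4 menguang:3'::tsvector
           @@ 'tianan'::tsquery;                        -- true: 'tianan' is present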

B-tree indexes

Common sense says that when searching for a place, most people type the beginning of its name first. Based on this, I added a B-tree index to the table to support prefix queries, working alongside the existing GIN index on the segmentation vector, and this solved the problem.

Like MySQL, PostgreSQL can use a B-tree index to serve a LIKE 'keyword%' condition, so the query becomes:

    SELECT * FROM table WHERE tsv @@ tsq OR name LIKE 'keyword%';
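A minimal sketch of the two-index setup (table and index names are illustrative; the text_pattern_ops operator class is what lets a B-tree index serve LIKE 'keyword%' prefix scans under a non-C locale):

    CREATE INDEX idx_poi_tsv  ON poi USING gin (tsv);
    CREATE INDEX idx_poi_name ON poi (name text_pattern_ops);

    SELECT * FROM poi
    WHERE tsv @@ to_tsquery('parser', 'keyword')
       OR name LIKE 'keyword%';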


Optimize query efficiency with subqueries

GIN index efficiency issues

Then came a new problem:

PostgreSQL's GIN index stores (key, posting list) pairs, where the posting list is the set of row IDs in which the key appears. For example, given the data:

Row ID   Segmentation vector
1        'test' 'segmentation'
2        'segmentation' 'result'

The index content is then 'test' => 1, 'segmentation' => 1,2, 'result' => 2. When we query for rows whose vector contains 'segmentation', we can quickly find rows 1 and 2.
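The same example as actual rows, in a tiny runnable sketch (table and lexeme names are illustrative):

    CREATE TABLE demo (id int, tsv tsvector);
    INSERT INTO demo VALUES
        (1, 'test segmentation'::tsvector),
        (2, 'segmentation result'::tsvector);
    CREATE INDEX ON demo USING gin (tsv);

    -- 'segmentation' has the posting list {1, 2}, so both rows come back:
    SELECT id FROM demo WHERE tsv @@ 'segmentation'::tsquery;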

However, this design also brings a problem: when the posting list for a key is very large, operating on it becomes slow. In our data, for example, a keyword like "restaurant" matches hundreds of thousands of rows, and one of our requirements is to sort the results by the score column in descending order. For such keywords the response time would reach our 3000 ms timeout.
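The kind of query that hits the timeout looks roughly like this (names are illustrative); every one of the hundreds of thousands of matching rows has to be fetched and sorted before anything is returned:

    SELECT * FROM poi
    WHERE tsv @@ to_tsquery('parser', 'restaurant')
    ORDER BY score DESC;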

Our target is for 90% of responses to come back within 50 ms. Statistics showed that 90% of requests did meet it, but the other 10% were completely unusable, which was unacceptable.

The next optimization is for these bad cases.

The cache

For this kind of timeout problem, the obvious answer is the all-purpose cache: put the results of queries that time out into a cache and check the cache first.

However, only a small fraction of queries time out, so the cache hit rate would be worryingly low. That small subset of queries would become usable, but every query would now pay for an extra cache lookup.

To improve the hit rate, I also computed each keyword length's share of search volume and its timeout rate, and found the following:

  • 1-byte (single letter) and 3-byte (single Chinese character) keywords have the highest timeout rate, but it is still no more than 30%;
  • 1-byte and 3-byte keywords account for about 30% of search volume;
  • Keywords of other lengths have a timeout rate of about 10%, which is awkwardly in between.

This ruled out my idea of caching only keywords of certain lengths.

And beyond the hit-rate problem, cache expiry and cache invalidation are big pitfalls of their own. Weighing all of this, the caching scheme was abandoned entirely.

Table splitting

If one approach does not work, change direction: since the result sets for some keywords are too large, make them smaller. The strategy we adopted first was to split the table.

Since POI sites all have a region attribute, we split the data into multiple tables by region ID. The largest keyword result set used to be in the hundreds of thousands; after the split, the largest result set in any single table was in the tens of thousands. Sorting performance improved accordingly, to roughly 100-200 ms.

At query time, the user is first located to a region by their position, the region ID determines which table to query, and the results are then fetched from that table.
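A rough sketch of the per-region lookup (the per-region table naming here is purely illustrative):

    -- region_id is resolved from the user's location by the application
    SELECT * FROM poi_region_42          -- hypothetical table for region 42
    WHERE tsv @@ to_tsquery('parser', 'keyword')
    ORDER BY score DESC;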

The disadvantages of this scheme are also numerous:

  • It depends heavily on positioning, and resolving the user's region takes time.
  • Searching near region boundaries is painful: a POI that is obviously very close may fall into a different region from the user and simply cannot be found.
  • Maintaining many tables is a chore.

The subquery

Finally, by thinking more flexibly about the business requirements, we introduced a subquery and arrived at a rather neat scheme:

When a user types a low-information keyword such as "hotel" or "guesthouse" into the search box, it is different from searching for "Haidilao": the user does not yet know exactly what they need and has no precise expectation of the results.

In that case there is no need to stubbornly pull out every POI in the country whose name contains "restaurant" or "hotel", sort them all, and return them; such a ranking would not necessarily satisfy the user anyway. We can return just a portion of the POIs; if the user is not satisfied, they will refine the keyword, and even a slightly more specific keyword shrinks the result set dramatically.

Subqueries are very effective for trimming the result set. For example, they can be used to filter out large amounts of useless data in large-offset pagination queries, as the sketch below shows.
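A sketch of that pagination trick (table and column names are illustrative): the inner query pages over a narrow scan first, and the outer query fetches full rows only for the page that survives.

    SELECT p.*
    FROM poi p
    JOIN (SELECT id FROM poi ORDER BY id LIMIT 20 OFFSET 100000) t
      ON p.id = t.id;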

In our case, we use LIMIT inside the subquery to cap the number of rows taken, which greatly reduces the sorting pressure:

    SELECT id FROM (
        SELECT * FROM table WHERE tsv @@ tsq OR name LIKE 'keyword%' LIMIT 10000
    ) AS tmp
    ORDER BY score DESC;

After this optimization, even the worst-case queries stabilized below 170 ms.


Replace the B-tree index to eliminate slow queries

Multi-index efficiency problems

I thought the optimization would end there, but one day, querying the two keywords "Zhongguancun" and "East", I could clearly feel the difference in response time; a gap of around 100 ms is very noticeable.

The subquery is the key to this SQL statement's efficiency, so I started analyzing the subquery for the keyword "East". First I tried adjusting the LIMIT value and found that even with a limit of only 1000, the response time was still over 100 ms.

After removing OR name LIKE 'keyword%', the total number of matches did not change much: the result set only shrank from about 130,000 to 110,000. Yet once LIMIT was added, the efficiency difference was huge:

SQL                                        Rows       Time      With LIMIT added                                       Time
WHERE tsv @@ tsq OR name LIKE 'keyword%'   130,000    2400 ms   WHERE tsv @@ tsq OR name LIKE 'keyword%' LIMIT 10000   170 ms
WHERE tsv @@ tsq                           110,000    1900 ms   WHERE tsv @@ tsq LIMIT 10000                           25 ms
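The difference can be inspected directly with EXPLAIN ANALYZE (a sketch; the table name and keyword are illustrative):

    EXPLAIN ANALYZE
    SELECT * FROM poi
    WHERE tsv @@ to_tsquery('parser', 'keyword')
       OR name LIKE 'keyword%'
    LIMIT 10000;

    EXPLAIN ANALYZE
    SELECT * FROM poi
    WHERE tsv @@ to_tsquery('parser', 'keyword')
    LIMIT 10000;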

The contrast makes it clear that the GIN index used for segmentation queries does not combine efficiently with the B-tree index used for prefix queries.

If you want to take 10,000 rows through a single index, you can simply take them as they come; to take 10,000 rows through two indexes, you have to worry about how many rows to pull from each index and then combine the results.

Replace the B-tree index

Having analyzed the problem, we need a way to solve it: how can the two indexes be merged into one? Turning the GIN index into a B-tree index is obviously impossible, so instead we try to use segmentation to replace the B-tree prefix index.

There were three options:

  • Modify the open-source segmentation library SCWS, adding a feature that also emits prefix terms. But I was worried about introducing bugs, and the PostgreSQL word parser plugin would also have to change to accommodate the new SCWS parameters.
  • Use a PostgreSQL array column (text[]) to store the segmentation result, and flexibly append prefix words to that field. But populating the array means calling SELECT to_tsvector('parser', 'name'), post-processing the result with a script, and then writing it into the array, which is cumbersome.
  • Modify the tsvector segmentation column directly, manually appending the prefix lexemes to the field. The catch is that a segmentation vector is not plain text and cannot simply be spliced together as a string.

The last option changes the least, so of course it is the best. Digging through the PostgreSQL documentation, I found a way to splice vectors: cast a string into a vector with ::tsvector, then concatenate it onto the original segmentation vector with ||. The SQL looks something like SELECT to_tsvector('parser', 'keyword') || 'prefix'::tsvector.
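A minimal sketch of applying that splice to the stored column and then querying through the single GIN index (table and column names and the prefix lexemes are illustrative, not the production schema):

    -- append prefix lexemes to the stored segmentation vector
    UPDATE poi
    SET tsv = to_tsvector('parser', name) || 'tian tiana tianan'::tsvector;

    -- prefix searches now go through the same GIN index as segmentation searches
    SELECT id FROM (
        SELECT * FROM poi
        WHERE tsv @@ to_tsquery('parser', 'keyword')
        LIMIT 10000
    ) AS tmp
    ORDER BY score DESC;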

When querying, a prefix can now be matched directly with WHERE tsv @@ to_tsquery('parser', 'keyword'). With this in place, the response time of the subquery drops sharply, to around 50 ms, and it can be pushed down further by lowering the LIMIT value.

After that, the B-tree index could be retired.


summary

That is my journey optimizing PostgreSQL keyword queries, from result quality to efficiency, and both now fully meet our targets. Of course, the user experience can still be improved, for example by adding typo correction and recognition of pinyin initials. Polishing a product is never easy, and further effort is needed.

PostgreSQL is an open-source product, and it does have quite a few pitfalls.

Undeniably, many open-source products and tools have their pitfalls, but there is no need to give up eating for fear of choking. Linux and Git, which we use every day, are also open-source products, yet how many of us could live without them? And are closed-source products free of problems? PostgreSQL is admittedly niche, but it has its own strengths, and its share has been growing in recent years. Who knows what the future holds.

If you have any questions about this article, please leave a comment below. If you found it helpful, you can click the recommend button below to support me. The blog is updated regularly; feel free to follow.