Preface

Chinese search has always been a headache. You have to work out how to segment the words and, at the same time, how to rank the results, because in a real business you cannot simply sort by term match: you also need to consider product attributes such as number of visits and sales. In the earliest days we used SQL LIKE for simple searches, but it does no word segmentation, so a search containing two words could not be matched. To solve this we switched to Elasticsearch with the IK analyzer. The next problem was how to segment. IK ships with its own basic dictionary, but every business domain has its own terminology. I work in the insurance industry, where the names of insurance companies, insurance types, and products are not in the basic dictionary, so an extension dictionary has to be maintained as well. That undoubtedly adds to the maintenance workload of us developers. Is there a product that solves all of this, or that offers a foolproof feature that operations staff and product managers can use themselves? Alibaba’s OpenSearch addresses this pain point. So what we will look at next is a pitfall that OpenSearch brings along with that relief.

Word segmentation

OpenSearch segments words differently from IK: it combines a dictionary with semantic analysis based on Alibaba’s machine learning. The following demonstrates a problem this can cause.

Let’s first search for “a million yuan bao 2021 edition” and look at the segmentation result.

Looking at the three characters “wan yuan bao”, we find that in this context they are split into two words: “wan yuan” and “bao”.

Now let’s look at the segmentation result for “wan yuan bao” on its own.

This time it is split into two different words: “wan” and “yuan bao”.

In practice, customers tend to type short queries. When they search for “wan yuan bao”, the terms the query is split into are not in the inverted index, so the key results cannot be found, which causes a production problem.
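
To make the mismatch concrete, here is a minimal sketch of a toy inverted index in Python. It does not call OpenSearch: the token lists are hard-coded to mirror the two segmentation results described above, the document id and the helper names are hypothetical, and the assumption that every query term must match (AND semantics) is mine.

```python
from collections import defaultdict

# Hypothetical segmentations, hard-coded to mirror the two results above.
# Document 1 is the product title as segmented in its full context.
INDEX_TIME_TOKENS = {
    1: ["wan yuan", "bao", "2021", "edition"],
}
# The short query segmented on its own.
QUERY_TIME_TOKENS = ["wan", "yuan bao"]


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def search_all_terms(index, query_tokens):
    """Return ids of documents containing every query token (assumed AND semantics)."""
    if not query_tokens:
        return set()
    postings = [index.get(token, set()) for token in query_tokens]
    return set.intersection(*postings)


index = build_inverted_index(INDEX_TIME_TOKENS)
missed = search_all_terms(index, QUERY_TIME_TOKENS)   # query split as "wan" + "yuan bao"
found = search_all_terms(index, ["wan yuan", "bao"])  # query split the same way as the title
print(missed, found)
```

Neither “wan” nor “yuan bao” was ever written to the index, so the first lookup comes back empty even though the document plainly contains the three characters the customer typed.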

The solution

So what’s the solution? OpenSearch is a mature product precisely because it can fill such gaps with its other features. A closer look reveals that custom words take precedence over semantic analysis. We can define a custom analyzer and then use dictionary management to maintain the word “wan yuan” (if you change the dictionary on an existing analyzer, remember to rebuild the index). Let’s see what happens.

The query is now split into “wan yuan” and “bao”, so the search no longer misses results because the terms are absent from the inverted index. Problem solved.
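
The idea that custom words take precedence can be illustrated with a small greedy segmenter. This is only a sketch under stated assumptions, not OpenSearch’s actual algorithm: it swaps the semantic analysis for plain longest-match lookup, works on romanized syllables, and the dictionaries `BASE_WORDS` and `CUSTOM_WORDS` are hypothetical.

```python
def segment(syllables, base_words, custom_words=frozenset()):
    """Greedy forward segmentation; custom words are tried before base words."""
    tokens = []
    pos = 0
    while pos < len(syllables):
        match = None
        for dictionary in (custom_words, base_words):
            # Within each dictionary, try the longest candidate first.
            for end in range(len(syllables), pos, -1):
                candidate = " ".join(syllables[pos:end])
                if candidate in dictionary:
                    match = (candidate, end)
                    break
            if match:
                break
        if match is None:
            # Unknown syllable: emit it as a single-syllable token.
            match = (syllables[pos], pos + 1)
        tokens.append(match[0])
        pos = match[1]
    return tokens


# Hypothetical dictionaries chosen to reproduce the behaviour described in the text.
BASE_WORDS = {"wan", "yuan bao", "bao"}
CUSTOM_WORDS = {"wan yuan"}

query = ["wan", "yuan", "bao"]
without_custom = segment(query, BASE_WORDS)
with_custom = segment(query, BASE_WORDS, CUSTOM_WORDS)
print(without_custom, with_custom)
```

Once the custom word is consulted first, the short query is split the same way as the product title, which is what lets the lookup in the inverted index succeed.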