Introduction: The large-scale short-text clustering system is designed to accurately and efficiently summarize massive volumes of search queries, condensing them into "demands" with cohesive meaning and clear expression. This not only helps serve users' needs better, but also reveals where existing content falls short. Ensuring high clustering accuracy while improving system efficiency has been the focus of our team. We gradually improved the system's quality and performance through multi-level splitting, precise semantic-similarity matching, and error correction. Based on our practical working experience, this article shares the design and practice of large-scale short-text clustering.

**1. Background and problem introduction**

Search is the scenario where users most clearly express their needs. Baidu Search receives a massive number of requests every day, and query expressions are highly diverse. The core purpose of the large-scale short-text clustering system is to summarize the short texts represented by queries: to accurately and efficiently condense short texts with the same meaning but different expressions into "demands" with cohesive meaning and clear expression. This not only reduces the volume of short text, but also helps serve users' needs better and helps content providers produce better content. At present, the system assists content production for Baidu UGC products.

In this section, we first introduce the short-text clustering problem and common clustering algorithms, then analyze the difficulties and challenges of clustering in search scenarios.

**1.1 Short text clustering problem**

Clustering is a commonly used unsupervised technique. It divides a data set into classes or clusters according to some distance measure, so that objects within the same cluster are as similar as possible, while objects in different clusters are as dissimilar as possible.

In general, search queries are dominated by short texts. The short-text clustering problem is to aggregate, from a large number of short texts, all texts with the same meaning (as far as possible) into "demand clusters".

**For example:**

Query = "What if the screen is broken?" and Query = "Help, my phone screen is broken" express the same demand and can be aggregated into one cluster. Similarly, query = "what is the meaning of February 2nd" and query = "what is the meaning of the dragon raising its head" (both referring to the same traditional Chinese festival, Longtaitou, on the second day of the second lunar month) express the same demand and can be clustered into one cluster.

It can be seen that short-text clustering in the search query scenario differs from conventional clustering in one obvious way: **the connotation of each "cluster" is highly concentrated.** That is, after aggregation the number of clusters is large, and the data within each cluster are very close to each other.

**1.2 Common algorithms and frameworks**

**Simhash**

In the field of text clustering, Simhash is a commonly used locality-sensitive hashing algorithm, often used for web page deduplication in search engines. It maps similar documents to the same hash value with high probability, thus achieving the purpose of clustering (deduplication).

The Simhash algorithm generally segments the document into words, assigns each word a weight, computes a hash value for each word, sums the weighted hash bits of all words, and then binarizes the result to reduce the document to a fixed-length fingerprint. Because words are weighted by importance, important words have a greater effect on the final document fingerprint and less important words a smaller one, making it more likely that similar documents produce the same hash value.

Simhash is an efficient algorithm for long-document clustering (deduplication), but for short texts its effectiveness is greatly reduced because of the much shorter text length, and the accuracy of its output cannot be guaranteed.
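As an illustration, here is a minimal Python sketch of the weighted-hash scheme described above; the md5 token hash, the 64-bit width, and the example weights are illustrative choices, not the production setup:

```python
import hashlib

def simhash(weighted_tokens, bits=64):
    """Weighted Simhash fingerprint from (token, weight) pairs."""
    acc = [0.0] * bits
    for token, weight in weighted_tokens:
        # Stable per-token hash; md5 is just an illustrative choice.
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            # Add the weight where the token's hash bit is 1, subtract where it is 0.
            acc[i] += weight if (h >> i) & 1 else -weight
    # Binarize the accumulated sums into the final fingerprint.
    return sum(1 << i for i in range(bits) if acc[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Near-duplicate long documents tend to land within a small Hamming radius of each other; for short queries the signal is much weaker, which is exactly the limitation noted above.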

**Vectorization clustering**

Another common approach is to vectorize the text first and then apply a conventional clustering method. There are many ways to vectorize text, from early TF-IDF, to the convenient word2vec, to text vectors produced by pre-trained models such as BERT and ERNIE. In general, vectorization methods rely on the distributional hypothesis, producing text vectors that can handle synonyms to a certain extent.

There are also various clustering methods, such as k-means, hierarchical clustering, single-pass, and so on. Special attention should be paid to the vectorization method when designing the distance measure for the clustering algorithm: different vectorization methods call for different distance measures.
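To make the vectorize-then-cluster idea concrete, here is a toy sketch combining TF-IDF vectors with the single-pass strategy mentioned above; whitespace tokenization and the 0.5 cosine threshold are illustrative assumptions:

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Sparse TF-IDF vectors (token -> weight) with smoothed idf."""
    docs = [t.split() for t in texts]
    df = Counter(tok for d in docs for tok in set(d))
    n = len(docs)
    return [{tok: c * (math.log((1 + n) / (1 + df[tok])) + 1.0)
             for tok, c in Counter(d).items()} for d in docs]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass_cluster(texts, threshold=0.5):
    """Assign each text to the first cluster whose seed is similar enough."""
    vecs = tfidf_vectors(texts)
    clusters = []  # list of (seed_vector, member_indices)
    for i, v in enumerate(vecs):
        for seed, members in clusters:
            if cosine(seed, v) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((v, [i]))
    return [members for _, members in clusters]
```

Single-pass is attractive here because it needs no preset cluster count, but as the next list notes, accuracy still hinges on both the vectorization and the threshold.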

**There are three problems with this kind of algorithm:**

1. Clustering algorithms represented by k-means have the hyperparameter "number of clusters", which can only be set by experience and auxiliary metrics;

2. For short text clustering, the number of clusters is often very large, which will lead to a serious decline in the performance of the clustering algorithm;

3. The accuracy of the clustering results is affected by both the vectorization and the clustering algorithm, and fine-grained differences in the data are hard to express.

**Other algorithms and comparisons**

Although short-text clustering has many application scenarios in industry, it is still a relatively niche branch in the research field. Improvements mainly target the vectorized representation: for example, taking the first principal component of the weighted word-vector matrix as the sentence vector instead of direct weighting, or using cluster-guided vector iteration to improve the representation (e.g., SIF+Aut). But whichever improvement is chosen, it adds a large computational overhead and is not practical in industry.

**1.3 Difficulties in large-scale short-text clustering**

The problem we face is large-scale short-text clustering, which also contains three implicit constraints:

1. The accuracy of clustering results is strictly required;

2. The number of short texts is very large;

3. The data must be produced within a short time.

It is not difficult to see that conventional algorithms can hardly meet all three requirements in terms of efficiency and accuracy. Industry does not have a mature framework for this problem, and there is still some distance between academic algorithms and industrial application. We needed a new way of thinking to solve it.

**When designing the algorithm, the following issues need to be considered:**

1. Large data magnitude and high system throughput: the volume of search queries is self-evident. Efficiently computing over hundreds of millions of data points tests both algorithm design and engineering architecture.

2. The clustering algorithm has high accuracy requirements: clustering is an unsupervised algorithm that uses a distance measure in a vector space to evaluate how well the results cohere; in essence it has no concept of accuracy. In the search query scenario, however, there is a clear measurement: with unified text-similarity assessment criteria, we can evaluate a posteriori whether data in the same cluster are similar and data in different clusters are dissimilar, thereby measuring clustering accuracy. This puts a "magic spell" on the short-text clustering algorithm: not every clustering algorithm is suitable, and we need one that combines closely with text similarity.

"Combines closely" has to be considered from the definition of distance in a vector space: the score given by a similarity model is not necessarily a "distance" in a vector space. Specifically, a well-defined distance on a vector space must satisfy the triangle inequality:

distance(A, B) + distance(B, C) >= distance(A, C),

whereas there is no stable quantitative relationship between similarity(A, B) + similarity(B, C) and similarity(A, C). Therefore the similarity model cannot be plugged into a clustering algorithm directly; it can only serve as "off-site guidance", with the clustering algorithm implemented under its direction. This is why general clustering algorithms cannot be used directly for short-text clustering.
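A tiny numeric example (with made-up similarity scores) shows how naively converting similarity into a "distance" can violate the triangle inequality:

```python
def to_distance(sim_score):
    """Naive 'distance' derived from a similarity score; not a true metric."""
    return 1.0 - sim_score

# Illustrative scores a similarity model might output for three queries:
# A and B are near-paraphrases, B and C are near-paraphrases, but A and C
# have drifted apart in meaning (chained paraphrase drift).
sim = {("A", "B"): 0.9, ("B", "C"): 0.9, ("A", "C"): 0.1}

lhs = to_distance(sim[("A", "B")]) + to_distance(sim[("B", "C")])  # ~0.2
rhs = to_distance(sim[("A", "C")])                                 # ~0.9
triangle_holds = lhs >= rhs  # False: the triangle inequality fails
```

Because such a "distance" is not a metric, algorithms that assume metric geometry (e.g., k-means on it) can behave arbitrarily badly, which motivates using similarity only as guidance.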

3. Text similarity requires high precision and low latency: text similarity is a scenario-dependent problem, and the same query pair may be judged completely differently in different scenarios. In the search scenario, the bar for similarity is very high: often a single word's difference means a completely different demand, so the model must accurately capture fine-grained textual differences. On the other hand, we want to aggregate queries with the same demand into one cluster as completely as possible, to reduce missed recalls. In other words, short-text clustering demands both high accuracy and high recall from text similarity. In addition, to support large-scale calls, the similarity service must respond quickly and scale out easily.

4. Complex text representation: text representation converts text into a semantic vector for subsequent clustering, and it can be produced in various ways. Simhash uses weighted hashing to express text information; word2vec word-vector weighting can also produce a text vector; and text categories, keywords, and other signals are likewise important parts of the representation. The focus of vectorization should be how similarity is embodied in the vector space. Text representation is an important, foundational choice: different representations determine different similarity measures, which directly affects the choice of clustering model and indirectly affects the final result.

5. Error discovery and repair: from text representation to text similarity to the clustering algorithm, errors accumulate at every step and affect the final result. For large-scale text clustering this problem is especially pronounced and easily overlooked.

**1.4 The general idea of large-scale short-text clustering**

Considering the above difficulties, we adopted the overall approach of multi-level splitting and tackling each piece in turn.

**Figure 1: The general idea of large-scale short-text clustering**

1. We first split the large volume of short text at multiple levels, basically ensuring that queries with the same meaning enter the same bucket with high probability:

1.1 First-level split: the split items must be mutually exclusive at the semantic level; that is, queries with the same meaning must enter the same bucket.

1.2 Second-level split: after the first-level split, the query magnitude in each bucket is still large, so a second-level split is needed to keep the subsequent computation magnitude controllable.

2. Perform fine semantic aggregation for the queries in the same bucket, and merge the queries with the same meaning into a cluster.

3. For semantic clusters in the same first-level bucket, perform error check and merge the clusters to be merged.

**Description:**

1. Why split? Suppose the number of queries is N. In the worst case, on the order of N² pairwise similarity computations are needed to complete the text clustering. If instead we split the data into k buckets that are mutually exclusive with high probability, each holding roughly N/k queries, then only about k · (N/k)² = N²/k similarity computations are needed, a magnitude far less than N²;
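The arithmetic above can be checked with a few lines (the sizes n and k are illustrative):

```python
def pairwise_calls(n):
    """All-pairs similarity computations for n items."""
    return n * (n - 1) // 2

n, k = 1_000_000, 1_000                 # illustrative sizes
full = pairwise_calls(n)                # one big bucket: ~5e11 calls
split = k * pairwise_calls(n // k)      # k mutually exclusive buckets of n/k
# Splitting cuts the similarity calls by roughly a factor of k.
```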

2. Why multi-level splitting? If the original data were split flat in one pass, the number of similarity calls would drop; but the finer the flat split, the less semantic exclusivity can be guaranteed, leading to a proliferation of error checks. Multi-level splitting keeps the top-level split accurate, so the amount of data requiring error checking is reduced.

3. How to conduct semantic clustering? After data splitting, text clustering needs to be carried out in each bucket, which is also the core of the whole scheme. Currently, there are three ways to solve this problem:

3.1 Retrieval clustering based on text literals: although the amount of data in one bucket is greatly reduced, pairwise similarity over the whole bucket is still expensive. We found that ordinary keyword retrieval recalls most similar queries, so we only need to compute similarity between each query and its recalled candidates.

3.2 Clustering based on text representation: keyword retrieval covers most similar queries, but its recall is incomplete. To address the misses caused by synonyms and sentence-pattern changes, we use a representation-based method: fine-grained clustering of the vectorized queries, after which similarity only needs to be computed within each class.

3.3 Retrieval plus clustering based on text representation: since fine-grained clustering is hard to control precisely, vector retrieval can also be introduced, using text-vector recall to mitigate incomplete recall.
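A minimal sketch of the keyword-retrieval idea in 3.1, assuming whitespace tokenization: an inverted index recalls candidate pairs that share at least one token, and only those pairs need a similarity call:

```python
from collections import defaultdict

def keyword_recall_pairs(queries):
    """Build an inverted index (token -> query ids), then emit the candidate
    pairs that share at least one token; only these pairs go to the
    similarity service, instead of all n*(n-1)/2 pairs in the bucket."""
    index = defaultdict(set)
    for qid, q in enumerate(queries):
        for tok in set(q.split()):
            index[tok].add(qid)
    pairs = set()
    for qids in index.values():
        ids = sorted(qids)
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs
```

Pairs scoring above a similarity threshold would then be merged (e.g., with union-find); queries sharing no keywords are never compared, which is the recall gap that 3.2 and 3.3 address.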

**Validity analysis**

1. Solving the large-data, short-deadline problem: the current process splits the data into smaller blocks that can be assigned to different computing units to reduce elapsed time. We estimated that clustering 100 million data points by traditional pairwise comparison would take 58 years; the layered approach takes only four days. Hierarchical vectorized clustering alone (without the similarity model) could reduce this to 2 days, but with low accuracy and recall.

2. Similarity optimization and service performance: we customized the short-text similarity model and deployed multiple instances, improving the throughput of the similarity service by a factor of 100;

3. High clustering accuracy: the error-correction mechanism improves the accuracy of the overall short-text clustering from 93% to 95%, and the recall rate from 71% to 80%;

Based on the above analysis, we struck a compromise between computational performance and quality, adopting hierarchical vectorized clustering plus the optimized similarity model as the final scheme.

**2. Evolution and thinking of large-scale short-text clustering algorithm**

The above is the solution to large-scale short-text clustering that we arrived at after a period of exploration. Along the way, we went through two major release iterations.

**2.1 V1.0: Clear split thinking**

At the beginning, we faced two candidate technical routes: one was to explore a representation suited to short text and turn the problem into a special vector-clustering problem, as Simhash does; the other was to split the data set, shrinking the data size so that similarity computation becomes tractable. After analysis, we concluded that the first route offers no intermediate metric to guide the choice of text representation, making iteration difficult. We therefore settled on splitting the data set.

First, according to query classification and other business requirements, we split the queries to ensure that the semantics of the split results are mutually exclusive.

The second step is the second-level split. To ensure that each bucket holds a moderate amount of data after this split, convenient for fine semantic aggregation, hierarchical bucketing is adopted:

1. Compute high-level semantic features for each query and binarize them, producing a 256-dimensional vector of 0s and 1s.

2. First hash on the leading 50 dimensions. If the number of queries in a bucket exceeds a limit, extend the prefix to 100 dimensions and coarsely cluster again, continuing until the dimension reaches 256 or the bucket falls below the limit.
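The bucketing procedure above can be sketched as follows; the prefix widths follow the text (50, 100, 256), while the bucket-size limit and recursion shape are illustrative:

```python
from collections import defaultdict

def hierarchical_buckets(codes, prefix_steps=(50, 100, 256), max_size=3):
    """Recursively bucket 256-d binary codes by growing prefixes.

    `codes` is a list of 0/1 tuples of length 256; any bucket larger than
    `max_size` is re-split on the next, longer prefix (sizes illustrative).
    """
    def split(ids, step_idx):
        if step_idx >= len(prefix_steps) or len(ids) <= max_size:
            return [ids]
        width = prefix_steps[step_idx]
        buckets = defaultdict(list)
        for i in ids:
            buckets[codes[i][:width]].append(i)
        out = []
        for sub in buckets.values():
            if len(sub) > max_size:
                out.extend(split(sub, step_idx + 1))
            else:
                out.append(sub)
        return out
    return split(list(range(len(codes))), 0)
```

Note how the scheme implicitly assumes that longer prefixes mean finer semantics; section 2.1 below explains why this assumption is not guaranteed to hold.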

The third step is fine semantic aggregation of the queries in each bucket, merging queries with the same meaning into one cluster.

**Pros and cons of V1.0**

Thanks to the data-splitting idea, once the data is bucketed, the semantics of different buckets are mutually exclusive, so fine semantic clustering can proceed independently within each bucket. This makes parallel computation convenient and contributes greatly to shortening the computation time.

**In the process, we also found that some areas need to be improved:**

1. The accuracy of the clustering results depends strongly on the similarity model, and it must be a fine-grained one. A coarse-grained similarity model leads to false recalls, so a model that can distinguish fine-grained degrees of similarity is needed.

2. Hierarchical bucketing for the second-level split does not guarantee that semantically similar data land in the same bucket. The split implicitly assumes that the binarized query vector expresses semantics at progressively finer granularity as more dimensions are taken; but nothing in the process that produces the query vectors enforces this assumption, so it is not guaranteed to hold. In V2.0 we changed the second-level split to overcome this defect;

3. Lack of error discovery and correction: no matter how accurate the similarity model or the bucketing of short texts, errors will occur; we need mechanisms to detect and correct them.

**2.2 V2.0: Fine-grained similarity is introduced**

We made three changes to address the issues identified in V1.0:

**Introduce fine-grained similarity**

A typical use of this clustering is to merge search queries at the semantic level: queries expressing the same demand in different ways are merged into one "demand". The criterion for judging queries similar is therefore quite strict, and the existing similarity model was no longer applicable. Through detailed analysis of queries, we found that features such as syntactic analysis and phrase collocation greatly improve the accuracy and recall of the similarity model. We therefore developed a similarity model integrating syntactic analysis, phrase collocation, and other features, and obtained good gains.

**Introducing a clustering model into the second-level split**

In V1.0 the accuracy of hierarchical bucketing could not be guaranteed, so a more accurate second-level split was needed. The second-level split only has to ensure that similar short texts fall into the same bucket with high probability; it does not require that any two short texts in a bucket be similar. This setting suits a vectorize-then-cluster approach well. Considering the performance cost of pre-trained models, we use traditional word-vector weighting to produce short-text vectors, then cluster the short texts of each first-level bucket with k-means, thereby ensuring the accuracy of the second-level split.
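A toy sketch of this second-level split, with hand-made 2-D "word vectors" standing in for real word2vec embeddings and a plain k-means implementation; the weights, vectors, and k are all illustrative:

```python
import random

def text_vector(text, word_vecs, weights=None):
    """Weighted average of word vectors; out-of-vocabulary words are skipped."""
    weights = weights or {}
    acc, total = None, 0.0
    for tok in text.split():
        if tok not in word_vecs:
            continue
        w = weights.get(tok, 1.0)
        v = word_vecs[tok]
        acc = [a + w * x for a, x in zip(acc, v)] if acc else [w * x for x in v]
        total += w
    return [a / total for a in acc] if acc else None

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on dense vectors; returns a cluster id per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update: move each center to the mean of its members.
        for c in range(k):
            members = [points[i] for i, a in enumerate(assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign
```

Each resulting k-means cluster becomes a second-level bucket; only queries sharing a bucket then go through fine semantic aggregation with the similarity model.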

One might ask: why not use vector recall here? In essence, vector recall is vector clustering supplemented by an efficient vector index to enable lookup. In the second-level split no vector lookups are needed, and introducing them would add time overhead, so we did not use vector recall.

**Error correction**

In V1.0, errors accumulated layer by layer and were never corrected. To overcome this, we introduced an error-correction operation as the last step before output.

There are two main types of error. One is incomplete clustering: data that should be aggregated together is not. This error is mainly caused by the multi-level split and can be addressed by cross-level verification. The other is inaccurate clustering, mainly due to similarity computation. We focus on incomplete clustering.

To reduce the amount of data to be checked, we limited error checking to within each second-level bucket. We first obtain the clustering results by fine semantic aggregation, then verify the correlation between cluster centers in the same second-level bucket; if the correlation is high, further validation is performed before merging.
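The merge step can be sketched with a union-find over cluster representatives; the `similarity` scorer and the 0.8 threshold below are placeholders for the real similarity service and its tuned cutoff:

```python
def merge_similar_clusters(clusters, similarity, threshold=0.8):
    """Merge clusters whose representative queries a pairwise `similarity`
    scorer rates at or above `threshold` (both knobs are illustrative).
    `clusters` maps a representative query to its member queries."""
    reps = list(clusters)
    parent = list(range(len(reps)))

    def find(x):
        # Union-find with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Pairwise check over cluster centers; this stays cheap because it runs
    # on cluster representatives within one bucket, not on raw queries.
    for i in range(len(reps)):
        for j in range(i + 1, len(reps)):
            if similarity(reps[i], reps[j]) >= threshold:
                parent[find(i)] = find(j)

    merged = {}
    for i, rep in enumerate(reps):
        merged.setdefault(find(i), []).extend(clusters[rep])
    return list(merged.values())
```

Union-find makes merging transitive: if center A matches B and B matches C, all three clusters collapse into one, which is exactly the "should have been aggregated" case this step repairs.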

**The effect of V2.0**

After V2.0 went live, it could complete fine-grained clustering of large-scale short texts in a very short time, with clustering accuracy reaching 95% and recall reaching 80%. It already serves content production for Baidu's UGC business lines.

**Continued optimization**

V2.0 basically realizes fine-grained clustering of large-scale short text, but there is still much room for improvement, for example persisting clustering results, eliminating duplicate computation, and adopting more efficient aggregation algorithms. We will continue to optimize in the future to provide more efficient and stable algorithm support.

**3. Summary**

Search is where users most clearly express their needs. Deeply understanding and summarizing search queries not only provides users with a better search experience, but also reveals gaps in content satisfaction. Through multi-level splitting and fine aggregation, we realized large-scale short-text clustering, assisting Baidu's UGC business lines in content production and improving production efficiency.
