Robben is a senior engineer at Tencent.

Introduction:

Search functions are ubiquitous in Internet products. When your project runs at the scale of Baidu Search or WeChat's public search, developing your own search engine and layering on all kinds of custom requirements and optimizations is a natural thing to do. But for small and medium-sized projects at ordinary or startup teams, reusing an existing wheel is the more reasonable choice.

ElasticSearch is exactly such a search-engine wheel. More importantly, beyond regular full-text search it also offers basic statistical analysis capabilities (most commonly aggregation), which makes it more powerful and useful.

Are you still using the database LIKE operator for full-text retrieval in your product? Ditch it and use ElasticSearch!

ElasticSearch (ES) is an open source search engine based on Lucene. Lucene is an open source document retrieval library written in Java, providing basic building blocks such as tokenization, documents, fields, inverted indexes, segments, and relevance scoring; ES uses these building blocks to construct a search engine product that can be used directly. Intuitively, Lucene sells car parts, while ES sells whole cars.

The birth of ES is an interesting story in itself. Of Shay Banon, the author of ES: "A few years ago he was an unemployed engineer who had followed his new wife to London. His wife wanted to study to be a chef there, and he wanted to build her an app to easily search for recipes, which is how he came across Lucene. Building search directly on Lucene was problematic and involved a lot of repetitive work, so Shay abstracted over Lucene to make it easier for Java programs to embed search. After a while he had his first open source project, Compass. Shay then found a new job in a high-performance distributed development environment, where he saw a growing need for an easy-to-use, high-performance, real-time, distributed search service. He decided to rewrite Compass from a library into a standalone server and renamed it Elasticsearch."

Quoted from: www.infoq.com/cn/news/201…

You can see how much love there is in programmers' tinkering, even though the recipe search Shay Banon promised his wife is reportedly still not available…

This article briefly introduces the principles of ES and WeTest's experience using it. Since ES itself covers a very wide range of features and knowledge, the focus here is on the key points most likely to come up in real projects.

Important concepts


Cluster: ES is a distributed search engine, usually deployed across multiple physical machines. These machines are configured with the same cluster name, discover each other, and organize themselves into a cluster.

Node: a single Elasticsearch host in the cluster.

Primary shard: a physical subset of an index (described below). The same index can be physically split into multiple shards distributed across different nodes. Each shard is implemented as a Lucene index.

Note: the number of shards of an ES index is specified at index creation and cannot be changed afterwards. So when you start building an index, estimate the data size and set the shard count to a reasonable value.

Replica shard: each primary shard can have one or more replicas, and the number of replicas is set by the user. ES tries to distribute different shards of the same index across different nodes to improve fault tolerance: an index can keep working as long as not all machines holding a given shard are down. (Figure: the relationship between primary shards, replicas, and nodes.)

Index: a logical concept, a collection of retrievable document objects, similar to a database in a DB. Multiple indexes can be created in the same cluster; for example, a common practice in production environments is to create one index per month of data, keeping the size of any single index manageable. Documents in ES are organized in the logical hierarchy Index -> Type -> Document.

Type: the level below an index, roughly equivalent to a table in a database; an index can contain multiple types. In practice the type level often isn't used much: a common pattern is to create a single type per index and store and search the document collection under it.

Document: a document in the search-engine sense, and the basic retrievable unit in ES, equivalent to a row, or record, in a database.

Field: corresponds to a column in a database. In ES, every document is actually stored as JSON, so a document can be viewed as a collection of fields. An article, for example, might contain a title, abstract, body, author, date, and so on; each of these is a field, and they are finally assembled into one JSON string and written to disk.

Mapping:

The mapping of Elasticsearch can be explicitly specified, or created automatically from document data.

Elasticsearch has a nice RESTful API that lets you do everything directly through HTTP requests. For example, add a document with id 1 to the index twitter under the type tweet:
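The original post shows this request as a screenshot. A minimal curl equivalent (the field values here are illustrative) might look like this:

```bash
# Index a document into index "twitter", type "tweet", with id 1.
curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '
{
    "user":      "kimchy",
    "post_date": "2009-11-15T14:12:12",
    "message":   "trying out Elasticsearch"
}'
```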

Accordingly, retrieve documents by the user field:
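Again as a sketch, a query matching on the user field might be:

```bash
# Retrieve tweets whose "user" field matches "kimchy".
curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '
{
    "query": {
        "match": { "user": "kimchy" }
    }
}'
```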

Key Configuration Items

1. index.number_of_shards:

The number of shards should preferably match the number of nodes. Theoretically it is best to have no more than two shards of the same index on a single machine, so that each query runs as parallel as possible. However, since the shard count of an ES index is fixed once set and data tends to grow quickly, it is fine to allocate more shards up front. Another common idea is to split ES indexes along the time dimension (such as by month), because the shard count can be adjusted for each newly created index. In other cases, such as the WeTest aggregation example shown below, many shards (200) were defined because the data needed to be split as finely as possible by channel; but too many shards are usually not recommended, since managing them costs ES resources.
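As a sketch, the shard count is fixed in the settings at index creation time (index name and values here are illustrative):

```bash
# number_of_shards cannot be changed after the index is created;
# number_of_replicas can be adjusted at any time.
curl -XPUT 'http://localhost:9200/comments_v1' -d '
{
    "settings": {
        "number_of_shards":   200,
        "number_of_replicas": 1
    }
}'
```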

2. Heap memory (ES_HEAP_SIZE):

The official recommendation is half of the available machine memory, set via an environment variable in the environment where ES is started, e.g. export ES_HEAP_SIZE=10g.

3. cluster.name:

The logical name of the cluster. Only machines with the same cluster name can logically form a cluster. For example, five ES instances on the intranet can form several ES clusters that do not interfere with each other.

4. discovery.zen.minimum_master_nodes:

The minimum number of master-eligible machines required for the cluster's distributed decision making. As with common distributed coordination algorithms, more than half of the machines, n/2+1, is recommended in order to avoid split-brain.

5. discovery.zen.ping.unicast.hosts:

The list of machines in the ES cluster. Note that a single ES node does not have to be configured with the full list of machines in the cluster. It works like a connected graph: as long as each machine is configured with some of the others, and these configurations chain together, ES will eventually discover all machines and form the cluster. For example: ['111.111.111.0', '111.111.111.1', '111.111.111.2'].
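Putting the cluster-level items together, a minimal elasticsearch.yml sketch (the cluster name and IPs are illustrative) might read:

```yaml
# elasticsearch.yml -- illustrative values
cluster.name: my-es-cluster
# more than half of the master-eligible nodes: n/2+1 (here n=3)
discovery.zen.minimum_master_nodes: 2
# need not list every node, as long as the configured graph is connected
discovery.zen.ping.unicast.hosts: ["111.111.111.0", "111.111.111.1", "111.111.111.2"]
```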

Mapping

Mapping is similar to a database's table schema, and defining a mapping is how you create an index by hand. Unlike a database, an index does not strictly require an explicitly created mapping: in the example above of inserting a document into the twitter index, if the index does not yet exist when the request executes, ES automatically creates the index and its mapping based on the document's fields and content. However, fields indexed this way may not behave the way we need, so it is better to define the mapping manually and create the index yourself beforehand.

Here is an example of creating a mapping. It creates mappings for the user and blogpost types under my_index; under properties are the definitions of the various fields, including strings, numerics, dates, and so on.
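The original shows the request as a screenshot; a sketch of such a mapping (field names assumed for illustration) might be:

```bash
curl -XPUT 'http://localhost:9200/my_index' -d '
{
    "mappings": {
        "user": {
            "properties": {
                "name":    { "type": "string" },
                "user_id": { "type": "string", "index": "not_analyzed" },
                "dob":     { "type": "date", "format": "yyyy-MM-dd||epoch_millis" }
            }
        },
        "blogpost": {
            "properties": {
                "title":   { "type": "string" },
                "body":    { "type": "string" },
                "user_id": { "type": "string", "index": "not_analyzed" },
                "created": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss||epoch_millis" }
            }
        }
    }
}'
```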

There are two things to note about this example:

1. user_id is a string, but its index attribute is not_analyzed. By default, a string field is analyzed: the original document is tokenized, and the resulting terms are used to build the inverted index. At query time, the user's query is tokenized as well, and the postings lists of the resulting terms are fetched from the inverted indexes, merged, and relevance-ranked to produce the final result.

However, for some string fields you do not want an inverted index built from tokens; you just want exact matching. For a user's name, for example, you want only the person whose name is exactly "Zhang San", not the "Zhang Si" and "Li San" that tokenization would also match. In this case you need to set the field's index attribute, which takes one of three values: no, analyzed, and not_analyzed. no means the field is not indexed at all; analyzed means it is tokenized and full-text searchable; not_analyzed means exact-match keyword search.

2. Date type: when creating the mapping, you can specify multiple possible time formats in format, and when a document is created, ES automatically determines which one the input field matches. Intuitively, though, specifying an explicit time format when creating a document should give a slight performance boost, since it skips ES's dynamic format detection. Also, epoch_second and epoch_millis should not be mixed; if they must be, specify which one each inserted value is. epoch_second is a timestamp in seconds, but ES preferentially interprets a bare number as epoch_millis, so the time comes out 1000 times too small and a recent timestamp turns into some time in 1970.
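For example, to sidestep the epoch_second/epoch_millis ambiguity, the format can pin down exactly which representations are accepted (hypothetical field name):

```bash
# Accept an explicit date string or a millisecond timestamp, never bare seconds.
curl -XPUT 'http://localhost:9200/my_index/_mapping/blogpost' -d '
{
    "properties": {
        "publish_time": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss||epoch_millis" }
    }
}'
```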

The original article includes a figure listing the data types, built-in fields, and parameters supported by mapping in the current version of ES; they are not covered in detail here for space reasons.

Two key built-in fields and two mapping parameters deserve special attention, as all of them directly affect the performance of the final index:

1) _source: ES stores all fields of a document together as one raw JSON on disk, so _source can be understood as the full original data. It cannot be used for indexing, but it can be returned when needed. Be careful about disabling it; for example, updating documents via script is no longer supported once it is disabled.

2) _all: a "pseudo" field used to implement fuzzy full-text search. At index time, all fields are concatenated into a single big string, which is then tokenized and inverted; the field itself is discarded afterwards and never actually written to disk. When a full-text search does not specify a query field such as title or body (a common case), document postings are pulled from this big inverted index. As you can imagine, tag- or value-type fields that are meaningless for full-text retrieval, such as dates and scores, can be excluded from _all, while text fields such as the title and body can be included. All of this can, and ideally should, be specified when building the mapping.
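As a sketch, inclusion in _all is controlled per field with include_in_all at mapping time (field names assumed):

```bash
curl -XPUT 'http://localhost:9200/my_index/_mapping/blogpost' -d '
{
    "properties": {
        "title": { "type": "string",  "include_in_all": true  },
        "body":  { "type": "string",  "include_in_all": true  },
        "score": { "type": "integer", "include_in_all": false },
        "date":  { "type": "date",    "include_in_all": false }
    }
}'
```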

3) doc_values: doc_values and fielddata are the mechanisms behind aggregation (described below) and sorting statistics, and doc_values is enabled by default. Sorting and aggregation work across the whole document set, for which an inverted index is clearly unsuitable. So for not_analyzed (i.e., non-tokenized) fields, doc_values stores the forward (document-to-value) data in a columnar format (compare HBase) for global statistics. doc_values lives on disk, and can be disabled if you know a field is only for display and never for statistics. doc_values never applies to analyzed fields; those fall back to fielddata, described next.
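For a not_analyzed field that is only ever displayed and never sorted or aggregated on, doc_values can be switched off in the mapping (hypothetical field):

```bash
curl -XPUT 'http://localhost:9200/my_index/_mapping/blogpost' -d '
{
    "properties": {
        "session_id": {
            "type":       "string",
            "index":      "not_analyzed",
            "doc_values": false
        }
    }
}'
```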

4) fielddata: analyzed text fields sometimes have statistical requirements too (ES does support aggregating documents by keywords, though the common approach for such tasks is to compute them offline, e.g. with Hadoop or a standalone analysis job, and push the results to the online index when ready; computing them directly in ES honestly feels a bit odd for a search engine). If you do sort or aggregate on an analyzed field, ES dynamically loads the field's data into fielddata, in memory. If you think about it, this is a very memory-intensive operation that can easily eat up the JVM heap! By default ES enables fielddata but does not load it; it loads into memory lazily, only when you actually sort or aggregate on an analyzed field. So try not to open this Pandora's box with your queries, or simply turn it off altogether.
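To guarantee an analyzed field can never trigger fielddata loading, it can be disabled outright; with the 2.x-era mapping syntax (field name assumed), a sort or aggregation on the field then fails fast instead of eating heap:

```bash
curl -XPUT 'http://localhost:9200/my_index/_mapping/blogpost' -d '
{
    "properties": {
        "body": {
            "type":      "string",
            "fielddata": { "format": "disabled" }
        }
    }
}'
```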

Aggregation

Who says search engines only search? ES can not only search, but also compute statistics directly over the result set of a search. At present, the stable, non-experimental aggregations in ES fall into two main categories: Metrics Aggregations and Bucket Aggregations.

Metrics aggregation mainly refers to general mathematical and statistical operations over a set, such as this example from the official guide: find all the red cars traded and compute their average price:
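The original shows the query as a screenshot; following the official guide's cars example, it looks roughly like this:

```bash
curl -XGET 'http://localhost:9200/cars/transactions/_search' -d '
{
    "size": 0,
    "query": { "match": { "color": "red" } },
    "aggs": {
        "avg_price": { "avg": { "field": "price" } }
    }
}'
```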

The result is something like this:
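An abridged response of roughly this shape (the values match the example data described below):

```json
{
    "hits": { "total": 4 },
    "aggregations": {
        "avg_price": { "value": 32500 }
    }
}
```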

Magical~ Metrics aggregations also include others such as maximum, minimum, sum, count, geo-coordinate operations, and so on. Bucket Aggregation is aggregation into buckets: documents are grouped by a given field, further aggregations may be performed within each group, and bucket-level results are returned. Intuitive examples include histograms, per-time-interval statistics, and so on. The following example is a terms aggregation: documents are divided into buckets by exact matches on the color field, and each bucket further nests an average-price aggregation and another bucket aggregation by manufacturer.
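A sketch of such a nested query, again following the guide's cars example:

```bash
curl -XGET 'http://localhost:9200/cars/transactions/_search' -d '
{
    "size": 0,
    "aggs": {
        "colors": {
            "terms": { "field": "color" },
            "aggs": {
                "avg_price": { "avg":   { "field": "price" } },
                "make":      { "terms": { "field": "make"  } }
            }
        }
    }
}'
```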

The result is similar to the following: there are 4 red cars with an average price of 32,500, including 3 Hondas and 1 BMW:
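An abridged response, showing only the red bucket:

```json
{
    "aggregations": {
        "colors": {
            "buckets": [
                {
                    "key": "red",
                    "doc_count": 4,
                    "avg_price": { "value": 32500 },
                    "make": {
                        "buckets": [
                            { "key": "honda", "doc_count": 3 },
                            { "key": "bmw",   "doc_count": 1 }
                        ]
                    }
                }
            ]
        }
    }
}
```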

That was a simple example. In our WeTest public opinion product there is a "hot forum posts" feature, which computes in real time the TopN posts with the most replies in a given data source (such as Baidu Tieba), a given forum (such as the King of Glory bar), and within a given period (such as 3 months).

The current online implementation of this feature is not described in detail here; roughly, it scans the relevant data from the database and HBase and maintains a heap to obtain the TopN. On the one hand this takes time; on the other hand, the large volume of requests can put pressure on the DB and HBase. So we wanted to find an alternative, and we thought of ES.

To do bucket aggregation with ES, we first had to design how to store the documents (that is, all user comments). Because of the large data volume (billions of records), our first idea was to split the documents into different indexes by time (such as by month), and then, for a given time range (such as 3 months), aggregate the top posts with the most comments across those indexes. But this is problematic: when aggregating over multiple ES indexes, ES does not compute the TopN over the combined results of all indexes; it computes a TopN within each index separately. This is a pitfall to watch out for. As a result, a TopN aggregated directly over multiple indexes is not the real TopN (for example, over 3 months, the true Top1 may be the Top1 in no single month, yet add up to the Top1 across all three): the local optimum does not equal the global optimum.

So cutting by time was basically a dead end; we had to partition by space instead. With billions of documents and hundreds of gigabytes of data, no slicing would mean searching through hundreds of gigabytes of files on every query; just imagine how slow that would be. Partitioning by space, we had to consider two questions:

1) How to hash data into shards.

2) How many shards to split into. For the first question: since our aggregate statistics are always within a single channel (i.e., forum) and never across channels, we shard by channel ID, hashing all of a forum's data into one shard. Then, every time an aggregate result is requested for a channel, the request is routed by channel ID to the corresponding shard (see the sketch below). For the second question, it depends on scale. We have hundreds of gigabytes of data and thousands of data sources, so we want each shard to hold as little as possible, to keep aggregation on a single shard fast; but the shard count must also not be too large, or it introduces heavy management overhead for ES. Weighing the two, we chose 200 shards.
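A sketch of custom routing by channel ID (the index, type, and field names are assumptions): passing the same routing value at index time and query time keeps each channel's documents, and the queries against them, on a single shard.

```bash
# Index a comment, routed by its channel ID.
curl -XPUT 'http://localhost:9200/comments/comment/1?routing=channel_42' -d '
{
    "channel_id": "channel_42",
    "topic_id":   "t_10086",
    "content":    "..."
}'

# The query hits only the shard that holds channel_42 data.
curl -XGET 'http://localhost:9200/comments/comment/_search?routing=channel_42' -d '
{
    "query": { "term": { "channel_id": "channel_42" } }
}'
```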

Unfortunately, ES can only hash-route on the key you specify, so the data does not spread evenly across shards: our largest shard exceeds 10GB while the smallest is only a few tens of MB. It would be nice if ES someday opened up custom routing rules or a way to rebalance shard data.

One thing ES is often criticized for is slow indexing; it took a few days to index a billion records. This is easy to understand: there is no free lunch, and read and write performance are often at odds. Fast reads and retrieval require a lot of pre-built index and auxiliary data, which inevitably makes writes slow. The choice depends on the actual business scenario. After the index is built, the interface is called as follows to aggregate the top posts within a specified period in a forum:
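The original shows the call as a screenshot; a sketch of such a request (field names and values assumed) might be:

```bash
# Top 30 hottest topics in one channel over the last 3 months,
# routed to the single shard holding that channel.
curl -XGET 'http://localhost:9200/comments/comment/_search?routing=channel_42' -d '
{
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                { "term":  { "channel_id": "channel_42" } },
                { "range": { "post_time": { "gte": "now-3M" } } }
            ]
        }
    },
    "aggs": {
        "hot_topics": {
            "terms": { "field": "topic_id", "size": 30 }
        }
    }
}'
```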

We then tested ES against the existing online service by repeatedly computing the Top30 hot posts over the hottest TopN channels (for several values of N):

Five result graphs in the original compare the time consumed by the conventional statistics method currently used in WeTest public opinion with that of ES aggregation.

From the results we can roughly infer how ES performs a statistical aggregation: it first retrieves all data matching the filter conditions, then sorts and aggregates in memory. In other words, the larger the volume of matching data, the slower the aggregation. With this principle in mind, the result graphs become easier to understand:

1) In the continuous hot-post aggregation over the hottest Top1000 channels, ES performs better than the existing implementation most of the time. This is because most channels in the Top1000 hash into very small shards, some only a few MB, and aggregation over such small data volumes is very fast.

2) Along the time dimension, ES is slower than the existing method in most cases, but faster for a one-month or one-day range. Under the 3-month condition, the volume of matching data grows (the largest topic has 30,000 comments), and ES's computing efficiency drops sharply.

3) From Top1000 down to Top10, the total time of ES gradually becomes worse than that of the existing method. Along the space dimension, the Top10 channels have very large volumes of matching data, so ES's computing efficiency drops significantly.

The takeaway from this experiment: ES aggregation on WeTest's head data sources is no faster than the current implementation, but it does better on the middle and long tail, which shows that ES aggregation is strongly affected by the size of the candidate set; whether to switch over has not been finally decided. Still, the experiment demonstrated the power of ES aggregation: completing statistical operations over such a large volume of data through nothing but interface calls, without writing any code, is very convenient, and the performance is good. Offloading requests through ES aggregation is also a good option when a self-implemented statistical operation would increase the pressure on the DB.


WeTest public opinion: one-stop insight into your product's reputation and user preferences.

Visit wetest.qq.com/bee to experience it now!


For commercial reprints, please contact Tencent WeTest for authorization; for non-commercial reprints, please credit the source.

Original link: wetest.qq.com/lab/view/30…