From the platform to China | Elasticsearch in ant gold practical experience

Shan Ren is the person in charge of general search product of Ant Financial service. General Search currently has trillions of documents and serves hundreds of business parties. It is the largest search product within Ant. His ant Middleware search team focuses on building simple and trusted search products, and is the largest search service provider in Ali Economy. At present, we focus on abstract search solutions under various complex scenarios, and strive to make search available to everyone and everyone will use it.

This post is based on his comments at the 2018 Elastic China Developer Conference

Hello, everyone. I am Shanren from the middleware team of Ant Financial, and currently I am the person in charge of the general search product of Ant. Today I’d like to share with you the theme of Elasticsearch in Ant Financial’s Middle Platform Experience.

Generic search based on Elasticsearch is the largest search product within Ant, with over a trillion documents serving hundreds of businesses. The development of general search can be divided into two stages: platform and platform.

Today I’m going to talk about what pain points we addressed for the business during these two phases of development and how we addressed those pain points.

Source power: Complex architecture, difficult operation and maintenance

Like most large enterprises, Ant also has a self-developed internal search system, which we call master search.

However, due to the high customization of the system, the general service access is more complicated and the cycle is longer. For a large number of emerging small and medium-sized businesses, iteration speed is particularly critical, so it is difficult to meet the main search.

The main search can not meet, and the actual business to use, how to do? You’ll have to build your own. In the past few years there have been many small search systems inside the ants, including ES, Solr, and even lucene itself.

Business pain points

However, due to the fast iteration speed of the business, it is very expensive to operate and maintain these search systems. Like the ES, although it’s simple to build, it takes a lot of expertise to use in a real production environment. As a business department, it is difficult to invest manpower in operation and maintenance.

And because of the ant’s own business characteristics, many businesses need high availability guarantee. As we all know, the high availability of ES itself can only be deployed across machine rooms at present. Not to mention the allocation strategy of cross-machine room deployment, it is difficult to complete the business just by accessing a nearby point.

For these reasons, such scenarios are rarely highly available, and the business layer would rather write two sets of code and prepare one for the bottom line. It is easier to degrade directly in disaster recovery than to have high availability.

Architecture (

From overall architecture level, each business to build a search engine caused the chimneys, all kinds of repetitive construction, and the small and medium-sized business data volume is compared commonly small, often a business a three-node cluster only tens of thousands of data, the overall resource utilization is very low, and as the search engine chosen version, deployment way are not consistent, is also difficult to ensure the quality. At the architectural level it can only be assumed that there is no search capability.

The “low cost, high availability, less operation and maintenance” Elasticsearch platform was created

Based on these pain points, we came up with the idea of building a standard search platform that would free the business from operations and unify the infrastructure from an architectural level. Provide a simple and trusted search service.

Architecture:

How to do “low cost, high availability, less maintenance”? Let’s first take a look at the overall architecture, as shown in the figure, first of all, we represent the two boxes of two machine rooms, our whole is a multi-machine room architecture, to ensure high availability

The top layer is the user access layer, which includes API, Kibana and Console. Users can directly use our products just as they can use native ES APIS.
In the middle is the Router layer, which is responsible for sending user requests to the corresponding cluster, and is responsible for some intervention logic.
Each machine room has a Queue, which is responsible for peak load shifting and Dr Multiple write.
There are multiple ES clusters in each machine room, and the data of users finally falls into a real cluster, or a group of peer high availability clusters;
The red on the right is Meta, a one-stop automated operation and metadata management for all components;
At the bottom is Kubernetes. All of our components run on k8s as containers. This frees up a lot of physical machine operations and makes rolling and restarting things very easy.

Low cost: Multi-tenant

After looking at the whole, here is a point by point introduction of how we do, the first goal is low cost. At the architecture level, cost optimization is an annual topic. So what does it mean to reduce costs? It’s actually increasing resource utilization. There are many ways to improve resource utilization, such as improving compression ratio and reducing query overhead. But a simple and effective way to do this on a platform is to multi-tenancy.

Today I will focus on our multi-tenancy solution: the key to multi-tenancy is tenant isolation. Tenant isolation is divided into logical isolation and physical isolation.

Logic isolation

First of all, let’s introduce our logical isolation scheme. Logical isolation is to let the business still use the same usage as before, that is, transparent access, but actually access only a part of the real cluster belongs to their own data, and can not see the data of others, that is, to ensure the level of permission. And one of the things that ES is good for for logical isolation is that ES accesses are actually indexed. So the question of our logical isolation becomes how to make users see only their own tables.

We save the mapping between the user and the table through the console, and then intervene during access through the router, the routing layer described earlier, so that the user can only access his index. Specifically, routing layer OpenResty + Lua implementation, we will request process is divided into the four steps of right, Dispatch, the Filter, the Router, Reprocess

In the Dispatch phase we’re structuring the request, pulling out its user, app, index, action data
Then enter the Filter phase, which includes write filtering and rewriting. The Filter is divided into three steps

Access does Access interception like limiting traffic and checking rights,
Action intercepts and processes specific operations, such as DDL, that is, creating tables, deleting tables, and modifying structures. We forward these operations to the Console for processing. On the one hand, it is convenient to record the corresponding information of index and APP, and on the other hand, because creating and deleting tables still affects the performance of the cluster. We can further restrict the user by forwarding it to the Console to prevent malicious behavior from affecting the system.
Params is a request to rewrite, where we do the rewriting according to the specific index and action. For example, remove indexes that the user does not have access to, for example, change the Kibana index to the user’s own unique Kibana index to achieve multi-tenancy for Kibana, for example, simple compatibility between different versions of ES. There are a lot of things you can do in this step, but there are two things you need to be careful about. First, try not to parse the body. Resolving the body is a very performance affecting behavior and should be avoided except for special rewriting. And use ES itself to turn off the index function in the body, so that rewriting can be much faster. For _all and getMapping which access all indexes, if we replace all indexes of the user, the URL will be too long. We create an alias with the same name as the application name, and then rewrite it into this alias

After filtering, we go to the real Router layer, which makes the actual routing requests based on the Filter results, either to the real cluster or to our other microservices.

Finally, there is Reprocess, which is the final processing after we get the business response, where we rewrite some of the results and log asynchronously

The above four steps are the general logic of our routing layer. Level permissions are controlled through the permission relationship between APP and index, and routes are rewritten through index to share the cluster.

Physical isolation

After the logical isolation, we can guarantee the level permissions of the business, so is it ok? Obviously not. In fact, there are still great differences in the access of different businesses. Only logical isolation will often cause mutual influence between businesses. This is where physical isolation is needed. However, we have not found a very good plan for physical isolation at present. Here we share with you some of our attempts

First, we use the method of service stratification, that is, to separate different purposes, different importance of the business, for the key master link business can even monopolize the cluster. For others, we mainly divide them into two types, log type with more write and less search and retrieval type with more search and less write, and allocate them to our preset clusters according to their different requirements and traffic estimation. However, it should be noted that there is always a difference between the declared and the actual, so we also have a regular inspection mechanism, which will carry out cluster migration of online services according to their real traffic

After service stratification, we can basically solve the scenario where low-importance businesses affect high-importance businesses. However, there are still some businesses at the same level, such as marketing activities, which cause sudden traffic. What to do about this kind of problem?

Generally speaking, this is a fully restricted stream, but since all of our accesses are long connections, this is not easy to do. As shown in the figure on the right, the user accesses multiple routers through one LVS, and then we access multiple ES nodes through the LVS. We need to limit the flow, that is, to ensure the total number of tokens on all routers. Generally limited flow has two kinds of schemes, one is current limit dimension will play on the same instance all requests, also is to the same table all access to play on a machine, but in the ES traffic so high scenario, this is not appropriate, and because we have been in front of a layer of the LVS load balancing, and then do a routing layer will appear too complex.

The second solution is to evenly divide tokens, but because of the long connection problem, some nodes are already restricted, while others have little traffic.

So what to do?

Since token usage is unbalanced, let’s make it unevenly distributed as well. So we adopted a fully confined flow scheme based on feedback. What is feedback based? That is, we use inspection to collect the dosage regularly, and give more to the more, and give you less to the less. So what’s the standard for giving more and giving less? Then we need decision making unit to deal with, at present we take the solution is simple proportioning. One thing to note here is that when a new machine comes in, the final state is not reached at the beginning, but gradually. Therefore, some strategies need to be set for this convergence period. At present, because our machine has good performance and is not afraid of burst burr, we set all release, and then carry out current limiting after stability.

Keepalive_timeout specifies the timeout period for a long connection. The default value is 75s. However, there is an optional value for this parameter, which specifies the timeout period written in the response header. If this parameter is not written, the client will reuse the Connection immediately after the server releases it, causing a Connection Reset or NoHttpResponse problem. The occurrence frequency is not high, but it really affects the user experience. Because of the random low frequency, we always thought it was the client problem, but later found that it was the original release order problem.

Now that service stratification and full restricted flow are complete, can you have a good sleep? Unfortunately, no, because the ES syntax is very flexible, and there are many costly operations, such as aggregating hundreds of billions of pieces of data, or using wildcards to do an infix query, and writing a complex script that can overwhelm our entire cluster, so what about this situation? At present, we are still in the exploratory stage. At present, a more useful way is to recover after the fact. That is, we find some time-consuming tasks through inspection, and then punish their subsequent operations, such as downgrade or even circuit breaker. In this way, you can avoid lasting effects on the entire cluster. However, a flash of RT rise is inevitable, so we are also trying to intercept in advance, but this is more complicated, interested students can communicate with each other offline

High availability: Peer to peer multiple clusters

Low cost brings us to our second goal, high availability. As I mentioned before, ES itself actually provides a cross-machine room deployment scheme, which can be carried out by marking, and then ensure nearby service query through preference. I won’t go into the details here. However, this solution requires two and three centers, and many of our external output scenarios do not have three centers due to cost considerations, only two and two centers, so how to ensure high availability of dual-machine rooms is a challenge for us. In the following, I will mainly share with you our peer-to-peer multi-machine room based high availability solution, we provide two types, a total of three solutions for different business scenarios.

We have two types: write multiple read and write multiple read

We adopt the scheme of cross-cluster replication. By modifying ES, we increase the ability of translog to push data from the primary cluster to the standby cluster. It is similar to CCR 6.5, but we use push mode instead of pull mode, because we have done tests before, for massive data write, push performance is much better than pull. During Dr, the active/standby data is exchanged, and data in transit is restored. The upper layer ensures single-write, multi-read, and failover logic. This scheme is synchronized through THE T ranslog of ES itself, with simple deployment structure and accurate data. Similar to the standby database, it is more suitable for high availability scenarios with no high requirements for writing RT.

Write more and read more. We offer two options

The first solution is more convenient, because many key link business scenarios are synchronized from DB to search, so we get through the data channel, can automatically write from DB to search, users do not need to care. So for the high availability of such users, we use the high availability of DB to build two data pipelines and write to different clusters respectively. This allows for high availability, with absolute assurance of final consistency.
The second solution is in the case that there is a strong requirement for writing RT and there is no data source, we will adopt the middle layer of multiple write to achieve high availability. We use the message queue as the middle layer to implement the double write. When a user writes a file, the write succeeds only when the queue is also written successfully. If one file fails, the entire file fails. The queue then guarantees push to another peer cluster. Use the external version number to ensure consistency. However, due to the middle tier, consistency assurance for Delete by Query is somewhat impotent. Therefore, it is only suitable for specific business scenarios.

Finally, in terms of high availability, I also want to say that for platform products, the business actually does not care about what technical solutions are and how to achieve them. The business only cares about whether they can access nearby to reduce RT and ensure the availability of automatic switching during disaster recovery. So we block out these complex types of high availability and these applicable scenarios on the platform, leaving it up to our backend to make it easy for users to self-access. In addition, we also move the read-write control and DISASTER recovery operation to our own system, which is not aware of the user. Only when users can be so transparent and have high availability will our platform truly become a high availability search platform.

Less operations

The last goal is to reduce operation and maintenance. We have already shared a lot about operation and maintenance today, so I will not repeat it here. Just a brief introduction to the four principles that we developed during the process of building the overall o&M system. Self-contained, componentized, one-stop-all, automated.

Self-contained ES does a pretty good job, a JAR can be started, and our entire system should be just like a single ES, a simple command to start, with no external dependencies, so it’s easy to output.
Componentization means that each of our modules should be plugable to fit different business scenarios, such as those that do not need multi-tenancy or those that do not need peak filling.
One station refers to the control of all our components, such as Router, queue, ES, and many micro-services, should be managed in one system. It is absolutely not possible for each component to have its own set of controls.
Let’s forget about automation. Everybody understands that
On the right is our platter page, showing the router, ES, and queue accesses. Of course, this is the data of the mock

Look back at business: no operation, but still unhappy

Now that we have a “low cost, high availability, less operation” Elasticsearch platform, we’ve solved the business pain points mentioned above, so are users happy with it? We spent the better part of a month conducting a survey of our business and found that although the business had been liberated from operation and maintenance, it was still shackled by search.

We are mainly divided into two types of users, data analysis and full text search.

The data analyser mainly thinks that the configuration is too complicated, he just wants to import a log data, he has to learn a lot of field configuration, and he only needs to use it once for a long time. Every time he forgets it, he needs to relearn it, which is very time-consuming. Second, irrelevant logic heavy, because the data analysis class is generally reserved for many days of data, expired data can be deleted, in order to achieve this function, the data analysis of students to write a lot of code, but also to control different aliases, is very troublesome.
And the full text retrieval class of students have three main pain points, one is the word configuration is complex, two is difficult to modify the field, reindex is too complex, but also to create their own alias, and then control seamless switching. The third point is that debugging is difficult. Although it has been explained now, students who have used it should understand. If you want to tease out the specific reasons for scoring, you still need to open a large chunk of cache in your mind. For those of you who are not familiar with ES.

These pain points can be classified into two pain points: high learning cost and atomic interface.

Search in the middle: Abstract logic, free business

High learning cost and atomic interface are pain points for business, but they are advantages for ES itself. Why high learning cost? Because of the rich features. And why interface atoms, in order to allow the upper layer to be flexible.

This is great for expert users, but it’s a real hassle for the business.

So we started our second phase, the search center. What is the middle desk? It is to move some common business logic down to reduce the business logic and let the business focus on the business itself.

Why can’t businesses do that? Of course you can do it. But as the saying goes, “the world’s martial arts, only fast not broken”, the lighter the front desk, the more can adapt to this very fast changing business demands.

Therefore, the main objectives of our search platform are two points: one is to reduce the cost of business learning and speed up the speed of getting started. This time we will introduce how to reduce the learning cost of low-frequency operations such as configuration classes; The second is to abstract complex logic to speed up business iteration. This time we will mainly introduce which two kinds of business logic are abstrused.

Reduce learning costs

Reduce the cost of learning. How do you do that? As we all know, a black screen turns to a white screen. But a lot of the whiteness is just putting the command on the Web, the carriage return button. Does this really reduce learning costs for users? I think it goes without saying that this is not going to work.

We’ve tried a lot of visualizations and taken a lot of detours, but there are three things we need to understand to really reduce learning costs for users

The user layer, distinguish the small white users and expert users, do not let the opinions of expert users affect the overall minimalist design of the product, for the small white users must be the less the better, the fewer choices, the shorter the path, the more timely feedback, the better the effect. Like the so-called silent majority, many small white users do not actively speak out, and will only give up with complex configuration. The following figure shows the different pages when configuring the table structure for expert users and small white users. For expert users, it basically visualizes all the functions of ES to speed up the use. For small white users, it is to completely block these function points, so that it can be used directly.

Wizarding configuration, wizarding configuration is really just a restriction that determines what the next step is based on the user’s previous input. Avoid a page that opens up to a bunch of configuration items that can overwhelm the user, not to mention cost of learning. The wizard-style configuration is used to reduce the user’s choice and memory cost. Restrictions do not necessarily mean constraints on users, appropriate restrictions can reduce the cost of understanding users. For example, the picture on the right is one of our word segmentation configuration. It is very simple to guide the user to select the corresponding dictionary after selecting the Chinese dictionary

And third, deep structure leveling, what deep structure leveling is, it’s like a word participle, it’s like a similarity, it’s all at the index level, we abstract it out and make it global. Users can create their own global word segmentation, similarity, and can share it with others, as a resource. And then in index, you refer to that participle. Although what is done here is only to change the word segmentation from index level to global level, it really reduces a lot of business operations, because in a business scenario, there are often multiple tables, which often use the same set of word segmentation. With this global segmentation the user only needs to modify one place to work in all places.

Abstract complex logic

Ok, having said some lessons about white screen, I’d like to share with you two new table structures for abstract encapsulation of complex logic. The two scenarios are data analysis scenarios, in which we abstract the log type table, and full-text retrieval scenarios, in which we abstract the alias type table.

Type log table just as its name implies is to save the log, which is said before for business data analysis, often only a few days, for example, we now have a business scenario, there is a es log tables, just want to keep three days, so we gave him create indexes according to the day, and then write the index mount to today, the query index all mount, Using router to automatically rewrite the alias, the user still passes in es, but the write operation is actually executed in ES_write, and the query is executed in ES_read. Of course we don’t actually build indexes by day, we use Rollover to create many indexes to ensure the speed of massive writes. But the overall logic is the same as this

For full-text retrieval, the main pain point is the change of table structure and word segmentation, dictionary changes, need to rebuild the index. So we abstracted a table structure called an alias table. The user creates a table es, and what is actually created is an alias for ES, which we will match to our actual index. With this alias, we can automatically help the user to complete the index reconstruction operation. For index reconstruction, we have two methods. One is when the user has configured the data source, we will rebuild directly from the data source and switch directly after reconstruction. In addition, if there is no data source and the data is written directly to the API, we currently use the REindex of ES and the message playback of our message queue to achieve. To be specific, we first submit the reindex, and the data starts to be forwarded to the queue. After the reindex is completed, the queue will play back from the reindex, and the alias will be changed

conclusion

To sum up the content shared this time, we first built a “low cost, high availability, less operation and maintenance” ES platform to release the business from operation and maintenance, and then further built the search center platform, by reducing the cost of business learning, sinking the common business logic to accelerate business iteration, enabling business. Of course, the search platform introduced here is only the most basic capabilities, we are still further exploring some complex scenarios how to abstract to reduce business costs, that is, vertical search products, interested students welcome to join us, together to build, let the world is not difficult to use the search. We are continuously recruiting all kinds of positions, product, design, front end, back end, operation and maintenance are welcome to join. E-mail: [email protected]

PPT Address:

http://www.sofastack.tech/posts/2018-11-12-01

Or click “Read full text” at the bottom of the text to go directly

Long press attention to get distributed architecture dry goods

Welcome to jointly create SOFAStack https://github.com/alipay