Background

Idle Fish (Xianyu) is in a period of rapid growth: business scenarios are expanding quickly, and more and more business data needs search capability. Building a separate search engine service for each business in the conventional way would carry a huge development and maintenance cost.

Can a single search engine system support the data produced by different business scenarios? How can heterogeneous data from different scenarios coexist in one engine? Driven by these real business needs, Xianyu built a general search system to solve this problem.

Principles of Search

The search engine used by Xianyu is Alibaba's HA3 engine, together with its upper-layer management system Tisplus2. The overall system can be divided into the following subsystems:

1. Dump: The first step of connecting to the search system is to run the DB data through some business logic (the merge and join processes are described in detail later) and write it, in a file format recognized by the engine's BuildService, to a file system or message queue for BS to build the index. This process is divided into full and incremental dumps.

2. BuildService (BS for short): Builds index files from the dump output. Only after the index files generated by BS are loaded can the Searcher machines serve inverted, forward, and summary queries.

3. Search service gateway: The service layer encapsulates a unified service interface and shields business callers from the low-level details of the search system.

4. Search Planner (SP for short): Combines the various capabilities of the search center, calling algorithm services to perform rewriting, prediction, and scoring computations on the query string coming in from the gateway, so as to implement functions such as multi-path recall, hierarchical querying, paging and down-weighting, and returns the results that come back from QRS.

5. Engine online service: the QRS and Searcher roles. SP's query request is sent to QRS, which forwards it to multiple Searcher machines, then collects the results returned by the Searchers for merging, scoring, and sorting before returning them.

A simplified version of the entire search system is shown below
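In code terms, the relationship between these roles can be sketched roughly as below; the interfaces and method names are hypothetical illustrations, not the real HA3 or Tisplus APIs.

    import java.util.List;

    // Hypothetical illustration of how the online roles relate; not the real HA3/Tisplus APIs.
    interface Searcher {
        List<String> recall(String engineQuery);   // serve inverted/forward/summary lookups on the loaded index
    }

    interface Qrs {
        List<String> query(String engineQuery);    // fan the request out to Searchers, then merge, score and sort
    }

    interface SearchPlanner {
        List<String> plan(String gatewayQuery);    // rewrite/predict, multi-path recall, hierarchical query via QRS
    }

    interface SearchGateway {
        List<String> search(String bizQuery);      // unified entry point that hides the layers above from callers
    }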

Building a search system from scratch for every business scenario is complex and time consuming. We wanted to provide a general search system: when a new business needs search capability, the business developers should not have to understand search system internals; they only register which data needs to be searchable in our system and can complete the access on their own with almost no development, connecting to search in roughly ten minutes. The data of multiple businesses is stored in one set of search engine services and isolated from each other.

There are two issues involved:

1. How to write heterogeneous data from different business DBs into the same engine for index building, with the writing process fully automated and transparent, requiring no development by the business side;

2. How to make sure that developers of different businesses do not perceive the existence of other businesses' data while using search and recall, as if they were using a set of engine services built exclusively for their own business.

Our solution

For the two problems above, our solution is to build a general search system in advance, implementing the basic capabilities of dump, BS, search, Search Planner and the gateway layer up front. A business selects the capabilities it needs (such as keyword rewriting, category prediction, PV log printing, hierarchical recall, etc.) through optional input parameters when invoking the service. The dump process is automated through an intermediate layer, and both the dump and search processes are translated, with the results wrapped accordingly.

The process of building the basic search engine service is not much different from the conventional way and is not described in detail here. Next, I will walk through the four technical points this scheme breaks down into: (1) the generic search reservation table; (2) the metadata registry; (3) the two-layer dump; (4) the online query service.

Generic search reservation table

Normally, we would give the wide-table fields produced by dump semantically clear names such as itemId, title, price, and userId. If multiple scenarios share one engine, however, this cannot be done: not every scenario's data has itemId, title and price fields, and a scenario might need a color field that our engine never defined, so that scenario could not be supported. Since the root of the problem is that field definitions carry semantics, our approach is to make all fields in the engine completely semantics-free, carrying only type information. As shown in the figure below, we predefined two dimensions, each with two MySQL tables and two ODPS tables (this definition already covers most scenarios), called the generic search reservation tables.

Various types of fields are reserved in each reservation table, and each field is named after its dimension, the table's position within the dimension, and the field type (a code sketch of this naming rule follows the list below).

1. The fields of the first reservation table in the first dimension are named dimAPk, dimAAIntR1, dimAATextMultiLevelR1, and dimADimBJoinKey.

2. The fields of the second reservation table in the first dimension are named dimABInnerMergeKey, dimABIntR1, dimABIntR2, dimABLongR1, etc.

3. The fields of the first reservation table in the second dimension are named dimBPk and dimBAIntR1.

4. The fields of the second reservation table in the second dimension are named dimBBInnerMergeKey, dimBBIntR1, dimBBLongR1, etc.
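As a concrete illustration, the naming rule can be read as "dim" + dimension + table position + type + "R" + index; below is a minimal sketch under that assumption, not actual system code.

    // Minimal sketch of the assumed naming rule for reserved fields, e.g. dimAAIntR1, dimBBLongR1.
    final class ReservedFieldNamer {
        // dim: "A" or "B"; table: "A" (first table in the dimension) or "B" (second); type: Int/Long/Text...; idx: 1, 2, ...
        static String name(String dim, String table, String type, int idx) {
            return "dim" + dim + table + type + "R" + idx;
        }

        public static void main(String[] args) {
            System.out.println(name("A", "A", "Int", 1));             // dimAAIntR1
            System.out.println(name("A", "A", "TextMultiLevel", 1));  // dimAATextMultiLevelR1
            System.out.println(name("B", "B", "Long", 1));            // dimBBLongR1
        }
    }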

These reservation tables are then connected to the engine's native dump system according to the structure shown in the figure above, and the index-building information is configured. Once the engine service is up, if you insert a few rows directly into the generic search reservation tables, you can already look the data up through the engine's online query interface. However, such a search system is not yet usable by business developers: the business source tables look nothing like our reservation tables, so it would be hard for business developers to migrate all of their source data into the reservation tables in our predefined format, and a query condition like "dimAATextMultiLevelR1='iPhone6S'" is equally unacceptable to them. The following sections solve these problems.

Metadata registry

We designed a metadata registry. When a new business needs search capability, it only has to fill in the business-related registration information (the business scenario label, database, table names, fields and other basic information that needs search capability), and the system assigns it a unique business identifier. This identifier is the most important key for isolating multiple businesses throughout the dump, BS, and query processes.

Metadata registry structure:

The metadata registry provides the ability to register a new business through a web interface. When the user fills in the business's database name, all table names are automatically pulled via middleware. The user selects the tables to be connected; the page then lists all fields of those tables alongside all the preset reserved fields of the generic search reservation tables. The user draws the mapping between source-table fields and reserved fields with the mouse and clicks Submit. The system checks the validity of the mapping and writes it into the DB according to the metadata registry structure above. This registration data is used later in dump, query, and so on.
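For intuition, one registration record could be modeled roughly as in the sketch below; the class and field names are assumptions for illustration, not the real registry schema.

    import java.util.Map;

    // Rough sketch of one metadata registration record; names are assumptions, not the real schema.
    final class UnisearchRegistration {
        long bizId;                        // unique business identifier assigned by the system
        String bizLabel;                   // business scenario label
        String sourceDb;                   // business database name
        String sourceTable;                // business source table name
        Map<String, String> fieldMapping;  // source field -> reserved field, e.g. "title" -> "dimAATextMultiLevelR1"
    }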

Two-layer dump

The semantics of each business source table are usually narrow; multiple tables together form the complete picture of a business scenario. For example:

1. The basic commodity information table stores the commodity ID, title, description, pictures, seller ID and other basic information;

2. The extended commodity information table uses the commodity ID as its primary key to store extended information such as SKU information and extended labels;

3. The seller's basic information table uses the user ID as its primary key to store the user's nickname, profile picture and other basic information;

4. The seller's credit information table uses the user ID as its primary key to store the user's Sesame Credit rating.

Take a typical search request: a user searches for "iPhone6S" and, besides the basic commodity information in the results such as title, description and pictures, also wants to see the extended information stored in the extension table, such as SKU and extended labels, as well as user-dimension information such as the seller's nickname, profile picture and credit rating. How can a single recall return all the information about the same item that is spread across multiple tables? This requires that, during dump, the multiple tables be organized in a certain way, assembled into the desired wide-table format, and then written into persistent storage for the engine to build indexes from.

During dump, we merge and join the tables of a business scenario by primary key. Tables in the same dimension are merged into a wide table by primary key: tables 1 and 2 are merged on the commodity ID, giving M1; tables 3 and 4 are merged on the user ID, giving M2. M1 now contains a column holding the seller's user ID, and M2's primary key is the user ID, so M1 and M2 are joined on the user ID to get the final wide table. Any row of this wide table contains the complete scenario information from tables 1-4.

When we built the generic search reservation tables, the intra-dimension merges were already set up via dimAPk + dimABInnerMergeKey and dimBPk + dimBBInnerMergeKey, and the connection between the reservation tables and BuildService is completed by the inter-dimension join via dimADimBJoinKey + dimBPk. As long as the business developer migrates the source-table data into the reservation tables correctly, the complex dump process described above just works. The migration must not only move all existing data in the source tables but also keep up with real-time incremental changes online, and the data has to be transformed along the way according to the field mapping in the metadata registry, which makes the process fairly involved.
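To make the two-layer idea concrete, here is an assumption-heavy sketch of the merge and join steps using plain Java maps in place of the engine's dump flow; it is an illustration of the technique, not the dump system's code.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustration only: merge rows of the same dimension by key, then join dimension A onto dimension B.
    final class TwoLayerDumpSketch {
        // Merge two tables of one dimension: rows sharing the same key are folded into one wide row.
        static Map<String, Map<String, Object>> merge(List<Map<String, Object>> t1,
                                                      List<Map<String, Object>> t2,
                                                      String key1, String key2) {
            Map<String, Map<String, Object>> out = new HashMap<>();
            for (Map<String, Object> row : t1) {
                out.computeIfAbsent(String.valueOf(row.get(key1)), k -> new HashMap<>()).putAll(row);
            }
            for (Map<String, Object> row : t2) {
                out.computeIfAbsent(String.valueOf(row.get(key2)), k -> new HashMap<>()).putAll(row);
            }
            return out;
        }

        // Join dimension A onto dimension B: each A row pulls in the B row whose primary key matches its join key.
        static void join(Map<String, Map<String, Object>> dimA,
                         Map<String, Map<String, Object>> dimB, String joinKey) {
            for (Map<String, Object> aRow : dimA.values()) {
                Map<String, Object> bRow = dimB.get(String.valueOf(aRow.get(joinKey)));
                if (bRow != null) {
                    aRow.putAll(bRow);
                }
            }
        }
    }

Under these assumptions, merging the two dimension-A tables on dimAPk/dimABInnerMergeKey, merging the two dimension-B tables on dimBPk/dimBBInnerMergeKey, and then joining A onto B via dimADimBJoinKey yields the wide rows handed to BuildService.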

Our implementation is a secondary development on Alibaba's internal middleware platform Jingwei: we wrote an independent-consumer JAR, uploaded it to the Jingwei operation platform, and create the appropriate migration task for each business according to its registration information. All of this was finished while we developed the general search system and is completely transparent to the business developers who connect later. Jingwei supports both full and incremental migration tasks. A full migration task can be roughly understood as repeatedly executing statements like "select * from tablexxx where id > m and id <= n" against the source table, scanning it in primary-key ranges.
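Inside the independent consumer, each source row presumably has to be re-mapped onto reserved fields before being written to the reservation table; a hedged sketch follows (hypothetical names, reusing the UnisearchRegistration sketch above).

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of the Jingwei independent consumer's row transform; not the actual task code.
    final class ReservationRowTransformer {
        // Convert one source-table row into a reservation-table row using the registered field mapping.
        static Map<String, Object> transform(Map<String, Object> sourceRow, UnisearchRegistration reg) {
            Map<String, Object> reservedRow = new HashMap<>();
            reg.fieldMapping.forEach((sourceField, reservedField) ->
                    reservedRow.put(reservedField, sourceRow.get(sourceField)));
            reservedRow.put("biz_id", reg.bizId);  // assumed: tag each row with the business identifier for isolation
            return reservedRow;
        }
    }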

Online query service

Since the fields of the data produced by dump carry no semantics, the index that BuildService builds from it is equally semantics-free.

This seems to solve writing heterogeneous data from multiple scenarios into the same engine service in a semantics-free way, but it is far too unfriendly to business developers. When they invoke the search service during business development, they expect a call with natural business semantics, as in the following code snippet:


     

    param.setTitle("iPhone6S");

    param.setSellerId(1234567L);

    result = searchService.doSearch(param);


But with semantics-free fields, their development complexity increases sharply and the code becomes hard to maintain: a month later nobody remembers what "param.setDimAALongR1(1234567L)" means. Is it querying by user ID or by item ID? Although at the bottom we put multiple businesses' data into one engine service, we want the online query service offered to business developers (the users of our system) to feel the same as an engine built just for them. Therefore a translation layer is needed: when the general search system receives the query condition "title=iPhone6S", it must automatically translate it, according to the mapping in the metadata registry, into "dimAATextMultiLevelR1=iPhone6S" before issuing the request to the engine, and the semantics-free fields in the DO returned by the engine must be translated back into the semantic fields of the source table.

In other words, through the search gateway package we provide, business developers can set query conditions semantically, such as "param.setTitle("iPhone6S")", and the semantics-free fields returned by the engine are automatically packaged back into semantic fields. The business developer is completely unaware of the conversion process; it is as if they were using a search engine service built solely for them. Since every business has different source-table field definitions, a single, fixed set of search gateway code obviously cannot deliver this.
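Conceptually, the translation layer is a two-way lookup over the registered mapping; a minimal sketch under that assumption:

    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch of the field translation, assuming the mapping comes from the metadata registry.
    final class FieldTranslator {
        private final Map<String, String> sourceToReserved = new HashMap<>();  // e.g. "title" -> "dimAATextMultiLevelR1"
        private final Map<String, String> reservedToSource = new HashMap<>();

        FieldTranslator(Map<String, String> fieldMapping) {
            fieldMapping.forEach((src, reserved) -> {
                sourceToReserved.put(src, reserved);
                reservedToSource.put(reserved, src);
            });
        }

        String toEngineField(String sourceField) {   // "title" -> "dimAATextMultiLevelR1"
            return sourceToReserved.get(sourceField);
        }

        String toSourceField(String reservedField) { // "dimAATextMultiLevelR1" -> "title"
            return reservedToSource.get(reservedField);
        }
    }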

Our solution: when a user registers a new business in the metadata registry, the system automatically generates a customized client package (two-party package) for that business in the background, containing the query input-parameter class, the result DO and the query service interface. Take POI data access as an example: the developer of the POI business domain declared in the metadata registry that he needed fuzzy text matching on poiName and exact include/exclude queries on poiCode. From this registration information we automatically generate the query-service input parameters for the POI scenario. Each parameter name is composed according to fixed rules, so once the online gateway service receives the parameter it can translate it into a concrete query string by those naming rules. The figure below shows the naming rules for parameters.

The Demo code for the input parameter is as follows:


     

    import java.util.Set;

    public class UnisearchBiz1001SearchParam extends IdleUnisearchBaseSearchParams {

    private Set<Long> unisearch_includeCollection_prefix_poiCode;  // exact query: poiCode must be in this set

    private Set<Long> unisearch_excludeCollection_prefix_poiCode;  // exact query: poiCode must not be in this set

    private String unisearch_keywords_poiName;                     // fuzzy text matching on poiName

    }


The user passes query conditions in through the online query service. The query service detects that the input parameter is an IdleUnisearchBaseSearchParams subclass and, using reflection and the naming rules, determines that unisearch_includeCollection_prefix_poiCode means an include query on the poiCode field of this business's source table. It then looks up the reserved field mapped to poiCode in the metadata registry, dimAALongR1, builds the Search Planner query string and performs the subsequent query. When the engine returns the result, the gateway query service again uses reflection and the metadata registration information to translate the engine result DO, packaging it into the POI-specific DO shown below and returning it to the business developer (a sketch of the naming-rule parsing follows the DO below).


     

    public class UnisearchBiz1001SearchResultDo extends IdleUnisearchBaseSearchResultDo {

    private Long poiId;

    private Long poiCode;

    private String poiName;

    }

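A hedged sketch of how such naming-rule parsing might look with reflection; the prefixes, registry lookup and the condition syntax being built are assumptions based on the description above, not the actual gateway code.

    import java.lang.reflect.Field;
    import java.util.Map;

    // Illustrative only: walk the generated param object, interpret field names by the assumed naming rule,
    // and translate source-field names into reserved engine fields via the registered mapping.
    final class NamingRuleParser {
        static void buildQuery(Object param, Map<String, String> sourceToReserved, StringBuilder query)
                throws IllegalAccessException {
            for (Field f : param.getClass().getDeclaredFields()) {
                f.setAccessible(true);
                Object value = f.get(param);
                if (value == null) {
                    continue;
                }
                String name = f.getName();  // e.g. unisearch_includeCollection_prefix_poiCode
                if (name.startsWith("unisearch_includeCollection_prefix_")) {
                    String sourceField = name.substring("unisearch_includeCollection_prefix_".length());
                    String reserved = sourceToReserved.get(sourceField);   // e.g. poiCode -> dimAALongR1
                    query.append(reserved).append(" IN ").append(value).append(" AND ");
                } else if (name.startsWith("unisearch_keywords_")) {
                    String sourceField = name.substring("unisearch_keywords_".length());
                    String reserved = sourceToReserved.get(sourceField);   // e.g. poiName -> a reserved text field
                    query.append(reserved).append(":'").append(value).append("' AND ");
                }
                // ... other prefixes (excludeCollection, etc.) handled analogously
            }
        }
    }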

There are eight generic search reservation tables in total, and all their fields add up to a lot. If every field were recalled, most of them would be empty fields the business never registered, and the returned data would be dozens of times larger than actually needed; the network transmission, the deserialization of masses of empty fields, and the DO field conversion would together drive up the RT of the online query service. The fix is relatively simple: we split the online recall into two stages. In the first stage, only the primary keys (rawPk) matching the user's query conditions are recalled from the engine. In the second stage, when fetching the summary (i.e. all field information) for the rawPk list, we use the DL syntax supported by the engine to require that only the reserved fields registered by this business be returned. This, too, is implemented in advance in the gateway code of the general search system and is transparent to business developers.
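Sketched as two gateway-side steps (interface and method names are assumptions; the engine-specific syntax is omitted):

    import java.util.List;
    import java.util.Set;

    // Assumed two-stage recall in the gateway; illustrative interfaces, not real engine APIs.
    interface TwoStageRecall {
        // Stage 1: recall only the primary keys (rawPk) that match the translated query conditions.
        List<Long> recallRawPks(String engineQueryString);

        // Stage 2: fetch summaries for those pks, restricted to the reserved fields this business registered,
        // so unregistered (empty) fields are never transferred or deserialized.
        List<Object> fetchSummaries(List<Long> rawPks, Set<String> registeredReservedFields);
    }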

A special solution to the incremental problem

So far everything looks perfect: the system seems to automatically wrap up data import, conversion, BS, and query, and all business developers have to do is register on the interface of our business registry.

But beneath the surface lies a bigger problem: large increments. Since BuildService is connected directly to the generic search reservation tables, whose structure is fixed, the original dump-layer data source structure cannot change; the only thing that changes is the data written from the business source tables into the reservation tables through Jingwei. When a new business comes in with a billion-level source table, Jingwei's migration speed means the update TPS on the generic search reservation tables can reach around 50,000, and this pressure of 50,000 updates per second hits the real-time BS system directly; that is, the engine would need to update 50,000 docs per second to keep the search results consistent with the source tables. The engine's real-time build capacity depends on its real-time memory, so such a large incremental TPS exhausts the real-time memory in a short time. Subsequent updates from the source tables then can no longer be built into the index, and the search system cannot find the latest business data (adds, updates, deletes). This is the incremental delay problem.

Multiple businesses share this engine service, and businesses already serving online search traffic cannot tolerate incremental delay. For a newly connected business, on the other hand, it is perfectly acceptable that its data cannot be searched during the period when it is first being synchronized to the generic reservation tables.

We therefore devised a way for the incremental data of existing online businesses to keep feeding the engine in real time, while the increments triggered by the full migration of a new business's data are dropped. The implementation consists of the following steps:

1. According to the DB table design specification, each generic search reservation table has a gmt_modified field of type datetime; whenever a row of a reservation table changes (insert, delete, update), gmt_modified is updated with the current timestamp. This logic is implemented in the DAO layer of the Jingwei migration tasks.

2. We add an extra datetime field to every generic search reservation table and name it gmt_drop_inc_tag. When a Jingwei full-migration task is pumping the data of a new business, we put "dropIncTag=true" in the task's startup parameters. When our Jingwei independent-consumer code sees this flag, the DO produced by the DAO-layer data conversion gets gmt_drop_inc_tag assigned the same value as gmt_modified before being written to the DB. The full and incremental Jingwei tasks of existing (non-new) businesses do not set dropIncTag=true in their startup parameters, so when they write to the DB they only update gmt_modified and leave gmt_drop_inc_tag untouched.

3. In the engine's native dump layer, we add a layer of UDTF logic between each generic search reservation table and the subsequent merge node. The UDTF is an extension point that the dump layer opens to the engine's access side for special processing of the dump flow; every upstream record passes through the UDTF logic before being output downstream for merge and join and finally to the BS system. Our logic is as follows (a sketch follows this list): in a full dump, clear the gmt_drop_inc_tag of the record flowing through and pass it downstream; in an incremental dump, compare the record's gmt_modified and gmt_drop_inc_tag fields: if they are equal, apply the drop logic to the record; if they differ, clear its gmt_drop_inc_tag and pass it downstream. The dump system discards every record hit by the drop logic and never writes it to the final data file.
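A minimal sketch of that drop decision, assuming each record exposes its gmt_modified and gmt_drop_inc_tag values as strings; the real UDTF interface of the dump layer is not shown.

    import java.util.Objects;

    // Illustration of the drop decision described above; not the dump layer's actual UDTF API.
    final class DropIncTagUdtfSketch {
        // Returns true if the record should be dropped (only possible during incremental dump).
        static boolean shouldDrop(boolean isFullDump, String gmtModified, String gmtDropIncTag) {
            if (isFullDump) {
                return false;  // full dump: keep everything (the caller also clears gmt_drop_inc_tag downstream)
            }
            // incremental dump: records written by a new business's full migration carry
            // gmt_drop_inc_tag == gmt_modified and must not reach BS in real time.
            return Objects.equals(gmtModified, gmtDropIncTag);
        }
    }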

In this way, the incremental data of existing businesses still reaches the online search service in real time through the normal dump process and the BS (BuildService) system, while the data of a newly connected business is only migrated from its source tables to the search reservation tables, with the BS system completely unaware of it. Once all of the new business's data has been migrated from the source tables to the reservation tables, we trigger a full build for the engine service: all data in the generic search reservation tables is dumped again with the full-dump logic to produce complete HDFS files, the offline BS system builds indexes from that HDFS data in batch, and the indexes are loaded onto the Searcher machines to serve online traffic. Finally, the Jingwei incremental migration task for the new business is started, so that subsequent changes in its source tables are reflected in the engine in real time.

The effect

Xianyu's general search system currently serves three businesses online, and each new business can be connected within 10 to 30 minutes. Before this system existed, a business side wanting search capability had to raise a development request to the team's search owner, someone proficient in search fundamentals, and wait roughly a week for a development slot; only after the search owner finished building a set of engine services could the business developers start business development. With this system we eliminate the single-point bottleneck of the search owner and free up that productivity through automation.

Looking forward

Xianyu will continue to explore automation and efficiency improvements, so that developers can free themselves from heavy repetitive work and devote their time to more creative work. There are many deep and challenging projects at Xianyu this year; we look forward to your joining us and creating miracles together.

The Idle Fish team is the industry leader in the new Flutter + Dart FaaS integration technology, right now! Client-side, server-side Java, architecture, front-end and quality engineers are all welcome. Based in Alibaba Xixi Park, Hangzhou, we work together to create community products with creative space, dig deep into top open source projects, push technological boundaries and pursue the ultimate!

* Send resumes to small idle fish →[email protected]


For more series articles, open source projects, key insights and in-depth interpretations, please follow Xianyu Tech.