Index design for very-large-scale retrieval

I. Background

1.1 Business background

In precision advertising, the common approach to audience targeting is to attach rich labels (Tags) to each User according to various rules. Meanwhile, the advertiser (Member) selects an audience according to its own rules, and the system likewise attaches a variety of Tags to each advertisement (AD). When an AD and a User carry the same Tag, the AD is considered to have targeted that User, i.e., the AD will participate in bidding for impressions shown to that User.

The difficulty is that, in this business scenario, we need to know exactly which users an AD has targeted, while the only information we have is the group of Tags (the Tag List) attached to that AD. We therefore need a storage-plus-index product that can quickly return the IDs of all users (the User List) mounted under each Tag. In addition, advertising is highly latency-sensitive: the system must ensure that a user's real-time behavior takes effect in the delivery system quickly. When some behavior of a User causes Tags to be added to or removed from that User, the User Lists mounted under all affected Tags must change accordingly.



Data size in the business:

  • The total number of users is about 400 million
  • The total number of Tags is about 1 million
  • On average, about 1 million users are mounted under each Tag
  • On average, about 2,000 Tags are queried per second
  • On average, about 50,000 users trigger Tag updates per second, with roughly 10 Tags changed per update

As you can see, the storage scale, the volume of results retrieved per second, and the volume of data updated per second are all very large. One point worth noting is that the User ID is a UINT64 number; later optimizations will take advantage of this.

1.2 Problems in technology selection

We need a powerful search engine that supports both data retrieval and updates. It must support a data structure of the form “Tag -> multiple numeric user IDs” (a minimal interface sketch follows the list below):

  • On query, given one or more Tags, return all user IDs mounted under each Tag
  • On update, apply the change using only the changed Tag and the user IDs to be added or deleted
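For concreteness, the contract we are looking for can be sketched as follows. This is only an illustration; all names and signatures are hypothetical and do not belong to any existing engine.

    # Minimal sketch of the storage/retrieval contract we need (hypothetical API).
    from abc import ABC, abstractmethod
    from typing import Dict, Iterable, List

    class TagUserStore(ABC):
        """Stores the mapping "Tag -> list of numeric user IDs (UINT64)"."""

        @abstractmethod
        def query(self, tags: Iterable[str]) -> Dict[str, List[int]]:
            """Return all user IDs mounted under each requested Tag."""

        @abstractmethod
        def update(self, tag: str,
                   add_user_ids: Iterable[int] = (),
                   del_user_ids: Iterable[int] = ()) -> None:
            """Apply a change described only by the Tag and the user IDs to add or delete."""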

However, when we carried out the technology selection, we found that no existing option supports this data structure directly. Most existing search engines are built around “doc search”: they abstract the data into a forward index of the form “doc id -> doc” and an inverted index of the form “keyword -> doc id list”, and every inverted index is built from the forward data. For example, you must first have forward documents like these:

doc id   doc content
1        I am eating a meal
2        The sheep is eating grass
3        You are eating a meal

Only then can an inverted index like the following be built:

keyword     doc id list
I           1
am/is/are   1, 2, 3
eating      1, 2, 3
meal        1, 3
sheep       2
grass       2
you         3

As you can see, what we actually want is a data structure like the inverted list, but in order to obtain it we are forced to build a forward table that we do not need.
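To make the “doc search” paradigm concrete, here is a minimal sketch, not tied to any particular engine, of how the inverted index is derived from the forward documents above:

    # Sketch: a doc-oriented engine builds its inverted index from the forward index.
    from collections import defaultdict

    # Forward index: doc id -> doc content.
    forward = {
        1: "I am eating a meal",
        2: "the sheep is eating grass",
        3: "you are eating a meal",
    }

    # Inverted index: keyword -> doc id list, always derived from the forward data.
    inverted = defaultdict(list)
    for doc_id, content in sorted(forward.items()):
        for keyword in dict.fromkeys(content.split()):   # unique words, order kept
            inverted[keyword].append(doc_id)

    print(inverted["eating"])   # [1, 2, 3]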

1.3 Problems in the design of index structure

Given the business background, in order to obtain the chain structure “Tag -> multiple numeric user IDs”, we have to build a forward index with the user ID as the doc ID and all Tags of that user as the doc content:

userid   tag list
1        women's wear, men's wear, pet
2        men's wear, electronics, audio/video
3        books, audio/video, women's wear

We then build the inverted index to obtain the table structure we need:

tag            userid list
women's wear   1, 3
men's wear     1, 2
pet            1
electronics    2
audio/video    2, 3
books          3

In this way we can retrieve the User List from a Tag. It looks as if we have solved the data-structure problem at the cost of some redundant storage, but in reality we have only satisfied the query side and have not yet considered updates. This index structure has a serious problem: the inverted index is derived from the forward index, so the inverted list can only be updated by updating the forward index, and the Tag information, being the “doc content”, must be rewritten as a whole. For example, if a new “electronics” Tag is added to the user with user ID 1, we must rewrite the entire row for user 1 in the forward table as

1   women's wear, men's wear, pet, electronics

which in turn causes the “electronics” row in the inverted index to become

electronics   1, 2

This does not satisfy the requirement stated at the beginning of this section: “on update, apply the change using only the changed Tag and the user IDs to be added or deleted.” With this structure, updating the User List mounted under a Tag requires supplying all the user IDs under that Tag. Since a Tag can have millions of user IDs, the network bandwidth and CPU overhead of this update mode are so large as to be unacceptable.
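A rough back-of-envelope estimate, using the figures from Section 1.1 and assuming 8 bytes per UINT64 user ID, shows how heavy this update mode is:

    # Rough cost of rewriting a whole Tag's user list for a single-user change.
    users_per_tag = 1_000_000           # average user IDs mounted under one Tag
    bytes_per_id = 8                    # UINT64 user ID
    payload = users_per_tag * bytes_per_id
    print(payload / 10**6)              # ~8 MB shipped per Tag change, for a one-ID delta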

II. Analysis of technical difficulties

2.1 Preliminary design of index structure

This section introduces a clever index structure that realizes the chain structure “Tag -> multiple numeric user IDs” under the constraints of the available technology, at the cost of some extra storage. Although this design does not truly solve retrieval at the data scale described above, it is still a useful reference for small and medium-sized scenarios. When we run into the problem described in Section 1.3, one natural direction is to “split the doc granularity of the forward table”. Try breaking the forward table of Section 1.3 into the following structure:

primary key (userid_tag)   userid   tag
1_women's wear             1        women's wear
1_men's wear               1        men's wear
1_pet                      1        pet
2_men's wear               2        men's wear
2_electronics              2        electronics
2_audio/video              2        audio/video
3_books                    3        books
3_audio/video              3        audio/video
3_women's wear             3        women's wear

Then build the inverted index on the tag field (ignoring the primary key):

tag            userid list
women's wear   1, 3
men's wear     1, 2
pet            1
electronics    2
audio/video    2, 3
books          3

The resulting inverted index is identical to the one in Section 1.3 and meets the query requirement. Now consider updates: when we need to add an “electronics” Tag to the user with user ID 1, we only need to insert one row into the forward table:

1_electronics   1   electronics

The “electronics” row in the inverted index then becomes

electronics   1, 2

In this way, both retrieval and update issues are solved.
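A minimal in-memory sketch of this split design (not a real engine) shows why updates become cheap: adding or removing a Tag on a user touches exactly one tiny doc.

    # Sketch: each (userid, tag) pair is its own tiny doc, keyed by "userid_tag".
    from collections import defaultdict

    forward = {}                     # primary key "userid_tag" -> (userid, tag)
    inverted = defaultdict(set)      # tag -> set of user IDs

    def add_tag(userid: int, tag: str) -> None:
        forward[f"{userid}_{tag}"] = (userid, tag)   # insert one small doc
        inverted[tag].add(userid)

    def remove_tag(userid: int, tag: str) -> None:
        forward.pop(f"{userid}_{tag}", None)         # delete one small doc
        inverted[tag].discard(userid)

    add_tag(2, "electronics")
    add_tag(1, "electronics")                        # the update from the example above
    print(sorted(inverted["electronics"]))           # [1, 2]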

2.2 The problem of data scale

The table-splitting approach of Section 2.1 essentially breaks “Tag -> multiple numeric user IDs” into many individual “userid_tag” records and builds an inverted index over them. As noted in Section 1.1, there are about 1 million distinct Tags and on average about 1 million user IDs under each Tag, so the number of “userid_tag” records is roughly 1 trillion. Although each record is tiny (around 64 bytes), retrieval kernels are usually designed with an upper threshold on the number of docs a single machine can hold. A record count this large far exceeds that threshold, and even with clustered storage the per-machine limit leads to a huge amount of machine resources being consumed and wasted.

For example, suppose a single machine can store at most 2 billion docs. With horizontal sharding, 500 shards (“columns”) are needed to hold 1 trillion records, and with primary/backup disaster recovery at least 1,000 machines are required. Yet none of these machines has a full disk: each uses only about 128 GB (2 billion x 64 B = 128 GB).
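These figures follow from simple arithmetic; the 2-billion-doc single-machine limit and the 64-byte record size are the assumptions stated above.

    # Back-of-envelope scale estimate for the split-record design.
    tags = 1_000_000                       # ~1 million Tags
    users_per_tag = 1_000_000              # ~1 million user IDs per Tag on average
    records = tags * users_per_tag         # ~1e12 "userid_tag" records

    docs_per_machine = 2_000_000_000       # assumed single-machine doc limit
    record_bytes = 64                      # approximate size of one record
    columns = records // docs_per_machine  # 500 shards ("columns")
    machines = 2 * columns                 # 1000 machines with primary/backup replicas
    disk_used = docs_per_machine * record_bytes / 10**9   # 128 GB actually used per machine

    print(records, columns, machines, disk_used)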

To solve this problem, we need to work out exactly what causes such a large doc count. Setting aside the business facts we cannot change, the main cause of the doc explosion is the huge number of user IDs under the huge number of Tags, which yields an astonishing number of userid_tag primary keys after the split. However, the user ID lists mounted under different Tags overlap heavily; if we can eliminate this duplication, we can greatly reduce the explosion.



An easy idea is grouping. Since the business requires retrieval by Tag, we can only group the user IDs. After grouping, each Tag has UserGroupIDs mounted under it, and each UserGroup in turn has a number of user IDs mounted under it. The doc primary key then becomes Tag_UserGroupID, and the expected number of docs can be reduced significantly (note the word “expected”, which is explained in Section 2.4).
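In code, the change is simply in how the doc key is formed. The grouping function below (taking the high bits of the user ID) is a hypothetical placeholder; Section 2.4 discusses how to design it.

    # Sketch: the doc key becomes "tag_userGroupId"; many user IDs now share one doc.
    def group_of(userid: int) -> int:
        # Hypothetical grouping rule; designing a good one is the topic of Section 2.4.
        return userid >> 20                  # users sharing their high bits share a group

    def doc_key(tag: str, userid: int) -> str:
        return f"{tag}_{group_of(userid)}"   # replaces the old "userid_tag" primary key

    # User IDs 1 and 2 fall into the same group, so Tag "electronics" needs only one doc:
    print(doc_key("electronics", 1), doc_key("electronics", 2))   # electronics_0 electronics_0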



One detail needs attention: although UserGroup1 may be mounted under both Tag0 and Tag1, the users of UserGroup1 that carry Tag0 (say user1 and user9) can differ from the users of UserGroup1 that carry Tag1 (say user5 and user7). In other words, the same UserGroup mounted under different Tags may contain different users. Since our primary key is Tag_UserGroupID rather than the UserGroupID alone, each Tag keeps its own copy of the group's membership, and this problem is handled naturally.

Continuing the example from Section 2.1, assume user1 and user2 belong to group1 and user3 belongs to group2. The forward table is then built as follows:

primary key (groupid_tag)   userid list   tag
group1_women's wear         1             women's wear
group1_men's wear           1, 2          men's wear
group1_pet                  1             pet
group1_electronics          2             electronics
group1_audio/video          2             audio/video
group2_books                3             books
group2_audio/video          3             audio/video
group2_women's wear         3             women's wear

The corresponding inverted index has the following structure:

tag            primary key list (groupid_tag)             userid list (after merging)
women's wear   group1_women's wear, group2_women's wear   1, 3
men's wear     group1_men's wear                          1, 2
pet            group1_pet                                 1
electronics    group1_electronics                         2
audio/video    group1_audio/video, group2_audio/video     2, 3
books          group2_books                               3

As you can see, the retrieved userid results are the same as in Section 2.1.

So far the retrieval function works. This approach essentially splits one wide table into simple tables related by a virtual foreign key. From an engineering point of view, it trades time for space: each query pays extra computation (merging the user IDs of the group docs) in exchange for relief on storage. Next, consider the update path. In this scenario, the most critical issue in index design is the doc granularity at update time. As explained in Section 1.3, because updates happen at doc granularity, a doc whose content is too large cannot be updated efficiently. After the split, the doc content is “the user ID list under the current Tag_UserGroupID”; its size falls between the schemes of Section 1.3 and Section 2.1 and has a certain randomness. The more user IDs there are under a Tag_UserGroupID, the greater the update pressure and the lower the efficiency.
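Both sides of the trade-off can be seen in a small in-memory sketch (hypothetical, continuing the example above): a query merges several “tag_userGroupId” docs, while an update rewrites the whole user-ID list of one such doc.

    # Sketch: query-time merge vs. doc-granularity update under the grouped design.
    from collections import defaultdict

    # tag -> { group id -> user IDs in that group carrying the tag }
    index = defaultdict(dict)
    index["womens_wear"] = {1: {1}, 2: {3}}      # group1 holds user 1, group2 holds user 3

    def query(tag: str) -> list:
        # Time for space: the union over group docs is recomputed on every query.
        return sorted(uid for group in index[tag].values() for uid in group)

    def add_user(tag: str, group_id: int, userid: int) -> None:
        # Update granularity: the full user-ID list of one "tag_group" doc is rewritten.
        doc = set(index[tag].get(group_id, set()))
        doc.add(userid)
        index[tag][group_id] = doc

    add_user("womens_wear", 1, 5)
    print(query("womens_wear"))                  # [1, 3, 5]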

2.3 Optimizing the storage of numeric data

To solve the low update efficiency caused by the large number of user IDs under a Tag_UserGroupID, we need a storage format that compresses the multi-valued user IDs. The approach described here is bitmap compression.

When the data is numeric and its values are relatively concentrated, a bitmap can be used: each bit indicates whether a particular user ID exists (0 or 1), and the user ID itself serves as the index that addresses the bit. This is again trading time for space: decoding the bitmap costs computation, but the storage footprint shrinks.
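A minimal sketch of such a bitmap for the user IDs of one group, assuming the IDs of a group fall into a contiguous range starting at some base value:

    # Sketch: store the user IDs of one group as a bitmap instead of a raw ID list.
    def to_bitmap(user_ids, base, size):
        bits = bytearray((size + 7) // 8)        # one bit per possible user ID in the group
        for uid in user_ids:
            offset = uid - base                  # the user ID addresses the bit position
            bits[offset // 8] |= 1 << (offset % 8)
        return bytes(bits)

    def from_bitmap(bits, base):
        # Decoding ("inverse solution"): turn set bits back into explicit user IDs.
        return [base + i * 8 + b
                for i, byte in enumerate(bits)
                for b in range(8) if byte >> b & 1]

    bm = to_bitmap([1048577, 1048580], base=1048576, size=65536)
    print(len(bm), from_bitmap(bm, base=1048576))   # 8192 bytes, [1048577, 1048580]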



The key factor affecting compression efficiency is whether the bits are densely distributed after the mapping. If the distribution is too scattered, the bitmap will contain many holes. When there are many holes, there are usually two remedies:

  • Optimize the mapping so that the mapped values are as concentrated as possible
  • Since an ideal mapping cannot always be found, it is more common to compress the bitmap again. There are many ways to do this, and readers can design their own; two simple methods are described here and sketched in code right after this list. (1) Adaptive storage: when there are many holes, a bitmap may actually be worse than brute-force storage of the raw ID list, so compute the break-even point from the business characteristics and adaptively choose which representation to use. (2) Bitmap compression: when many whole bytes are empty, store only the non-empty bytes as <byte id, byte value> pairs.
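A rough sketch of both fallbacks; the 8-bytes-per-raw-ID break-even test and the byte-level sparse encoding are illustrative assumptions, not the only possible choices.

    # Sketch of the two remedies when the bitmap has many holes.
    def adaptive_store(user_ids, bitmap):
        # (1) Adaptive storage: compare with brute-force storage and keep the smaller form.
        raw_bytes = len(user_ids) * 8             # brute-force list of UINT64 user IDs
        if raw_bytes <= len(bitmap):
            return ("raw", sorted(user_ids))      # sparse group: keep the plain ID list
        return ("bitmap", bitmap)                 # dense group: keep the bitmap

    def compress_sparse(bitmap):
        # (2) Bitmap compression: keep only non-empty bytes as <byte id, byte value> pairs.
        return [(i, byte) for i, byte in enumerate(bitmap) if byte]

    def decompress_sparse(pairs, size_bytes):
        bitmap = bytearray(size_bytes)
        for i, byte in pairs:
            bitmap[i] = byte
        return bytes(bitmap)

    print(compress_sparse(b"\x00\x12\x00\x00"))   # [(1, 18)]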



2.4 Data grouping rule design

As stated in Section 2.2, the core purpose of grouping user IDs into UserGroups is to reduce the number of docs, so the grouping rule has to be designed carefully. An extreme (and useless) case would be to split 400 million users into 400 million UserGroups, one user per group. To reduce the doc count as much as possible, the rule should exploit the bit-distribution characteristics of the user ID: extract the common part as the UserGroupID and place the users sharing that common part into the same group.

This is essentially a process of “feature extraction”: extracting as much commonality as possible from numeric samples.
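One concrete, hypothetical instance of such feature extraction: treat the shared high bits of the UINT64 user ID as the common part (the UserGroupID), and let the low bits address the bit position inside the group's bitmap.

    # Sketch: extract the common high bits of the user ID as the UserGroupID.
    LOW_BITS = 16                                  # illustrative group width: 65536 IDs per group

    def split_userid(userid: int):
        group_id = userid >> LOW_BITS              # shared "feature" of every user in the group
        offset = userid & ((1 << LOW_BITS) - 1)    # this user's bit position in the group bitmap
        return group_id, offset

    print(split_userid(0x12340007))                # (4660, 7): nearby user IDs share a group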

III. Summary

This concludes the discussion of index design for very-large-scale retrieval. To summarize the key points of the analysis:

  1. In index design, consider not only whether retrieval meets the functional and performance requirements, but also whether updates do, and give a solution that covers both.
  2. When a doc is too large to update efficiently, consider splitting the table, i.e., reducing the doc granularity.
  3. When numeric data must be grouped by rule, do not rely only on simple schemes such as integer division or modulo; design grouping rules that fit the characteristics of the data.
  4. When brute-force storage takes up too much space, consider bitmap compression; if the bitmap contains many holes, consider compressing it further.
