Perhaps 99% of students have never built a search engine, but 99% of students must have implemented a search feature. What technology actually lies behind search and retrieval? I hope this article gives you some insight.

What do the architecture and workflow of a whole-network search engine look like?

The macro architecture of a whole-network search engine is shown in the figure above. The core subsystems fall into three parts (the pink parts):

(1) Spider crawler;

(2) Search&Index, the system that builds and queries the index, which in turn has two parts:

  • build_index, which generates the index data

  • search_index, which queries the index data

(3) Rank scoring system;

The core data falls into two parts (the purple parts):

(1) Web page library;

(2) Index data;

The business characteristics of a whole-network search engine dictate that it is a system where "writes" and "searches" are separated.

How is writing implemented?

System composition: completed by the spider and Search&Index systems.

Input: web pages generated by webmasters across the Internet.

Output: forward-index and inverted-index data.

Flow: steps 1, 2, 3, 4 in the architecture diagram:

(1) Spiders grab Internet pages;

(2) Spiders store the Internet pages in the web page library (this demands a lot of storage: a mirror of almost the entire World Wide Web);

(3) build_index reads data from the web library and completes word segmentation;

(4) build_index generates inverted index;

How is retrieval implemented?

System composition: completed by the Search&Index and Rank systems.

Input: the user's search term.

Output: Sorted first page search results.

Flow: steps a, b, c, d in the architecture diagram:

(a) search_index receives the user's search term and completes word segmentation;

(b) search_index queries the inverted index to obtain the "literally matching" pages, which form the preliminary result set;

(c) Rank scores the preliminary results;

(d) Rank returns the first page of sorted results;

What do the architecture and flow of a site search engine look like?

After all, only a handful of companies build whole-network search. What most companies actually need is site search. Taking search over 58.com's 10 billion posts as an example, its overall architecture is as follows:

The macro architecture of the site search engine is shown in the figure above. Compared with the macro architecture of the whole-network search engine, the only difference lies in the write path:

(1) Whole-network search must rely on spiders to pull (crawl) the data;

(2) In site search, the data is generated by internal systems; for example, the "publishing system" actively pushes newly generated posts to the build_data system;

Voiceover: The difference looks "small", but the difficulty of implementation differs enormously: it is very hard for whole-network search to obtain the "complete" set of pages in "real time", while site search can easily obtain all of its data in real time.

For spider, search&index, and rank:

(1) Spider and Search&Index are relatively pure engineering systems;

(2) Rank is a system closely tied to business, strategy, and algorithms. Differences in search experience mainly come from here, and optimizing business and strategy takes time to accumulate. Two implications follow:

Google's experience is better than Baidu's because the former's rank is superior

It is very difficult for a domestic Internet company (such as 360) to build a search engine whose experience surpasses Baidu's in a short time; it genuinely takes time to accumulate

The content so far has been rather macro-level. For the majority of students who have not worked on search engines, the data-structure and algorithm part starts from scratch, with the forward index and the inverted index.

What is a forward index?

In short, a forward index is what serves the process of querying an entity by its key.

For example, the user table:

t_user(uid, name, passwd, age, sex)

The process of querying an entire row by uid is a forward-index query.

Another example is web library:

t_web_page(url, page_content)

The process of querying an entire web page by URL is likewise a forward-index query.

After the page content is word-segmented, page_content corresponds to a list of segmented terms (items).

A simplified forward index can be understood as:

Map<url, list<item>>

a data structure that quickly finds a page's content by its URL.

Voice-over: The time complexity can be thought of as O(1).

What is an inverted index?

The process of querying keys by item, the opposite of a forward index, uses an inverted index.

For web search, the inverted index can be understood as:

Map<item, list<url>>

a data structure that quickly finds the pages containing a given query term.

Voice-over: Time complexity is also O(1).

For example, suppose there are three web pages:

url1 -> "I love Beijing"

url2 -> "I love home"

url3 -> "Home beautiful"

The forward index is:

Map<url, page_content>

After word segmentation:

url1 -> {I, love, Beijing}

url2 -> {I, love, home}

url3 -> {home, beautiful}

The forward index after word segmentation is:

Map<url, list<item>>

The inverted index after word segmentation:

I -> {url1, url2}

love -> {url1, url2}

Beijing -> {url1}

home -> {url2, url3}

beautiful -> {url3}

This Map<item, list<url>> is the inverted index.

Voice-over: See, this mapping from word to URL is exactly the inverted index.
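The construction above can be sketched in a few lines. This is a toy illustration of the idea, not the article's actual build_index code: the inverted index is produced simply by inverting the forward index.

```python
# Toy sketch: build a forward index and an inverted index
# from the three example pages above.

pages = {
    "url1": ["I", "love", "Beijing"],
    "url2": ["I", "love", "home"],
    "url3": ["home", "beautiful"],
}

# Forward index: Map<url, list<item>> -- already given by `pages`.
forward_index = pages

# Inverted index: Map<item, list<url>>, built by inverting the forward index.
inverted_index = {}
for url, items in forward_index.items():
    for item in items:
        inverted_index.setdefault(item, []).append(url)

print(inverted_index["love"])   # ['url1', 'url2']
print(inverted_index["home"])   # ['url2', 'url3']
```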

The forward index and inverted index are data structures built in advance by the spider and build_index systems. Why these two data structures? Because they can quickly satisfy the "retrieve web pages by keyword" requirement.

Voice-over: business requirements dictate the architecture implementation, and here queries must be fast.

What is the retrieval process like?

Suppose the search term is "I love":

(1) Word segmentation: "I love" is segmented into {I, love}; the time complexity is O(1);

(2) For each segmented term, query the inverted index for the list of pages containing it; the time complexity is also O(1):

I -> {url1, url2}

love -> {url1, url2}

The intersection, {url1, url2}, is the final query result;

Voice-over: The retrieval process is also simple: word segmentation, inverted index lookup, result set intersection.
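The three steps above can be sketched as follows. This is a minimal illustration with assumed names (`search`, whitespace splitting as a stand-in for real word segmentation), not a production retrieval path:

```python
# Toy retrieval flow: segmentation -> inverted-index lookup -> set intersection.

inverted_index = {
    "I": ["url1", "url2"],
    "love": ["url1", "url2"],
    "Beijing": ["url1"],
    "home": ["url2", "url3"],
    "beautiful": ["url3"],
}

def search(query):
    terms = query.split()  # stand-in for a real word-segmentation step
    url_sets = [set(inverted_index.get(t, [])) for t in terms]
    if not url_sets:
        return []
    return sorted(set.intersection(*url_sets))

print(search("I love"))   # ['url1', 'url2']
```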

Is that all? In fact, word segmentation and the inverted-index lookups are both O(1); the time complexity of the whole search hinges on "intersecting the lists". The problem thus becomes: how to find the intersection of two sets.

String URLs are not convenient to store or compute with, so in practice each URL is identified by a numeric url_id; for convenience, list<url> is replaced by list<url_id> below.

Given list1<url_id> and list2<url_id>, how do we find their intersection?

Solution 1: nested for loops, the brute-force method, time complexity O(n*n)

Each search term can hit a large number of pages, so O(n*n) complexity is clearly unacceptable. However, the inverted index can be sorted at creation time, which turns the problem into intersecting two ordered lists, a much more convenient one.

Voice-over: a rather clumsy method.
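For completeness, the brute-force double loop can be sketched as (a toy illustration only):

```python
# Brute force: compare every pair of elements, O(n*n).
def intersect_naive(list1, list2):
    result = []
    for x in list1:
        for y in list2:
            if x == y:
                result.append(x)
    return result

print(intersect_naive([1, 3, 5, 7, 8, 9], [2, 3, 4, 5, 6, 7]))  # [3, 5, 7]
```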

Solution 2: intersection of ordered lists, the "zipper" method

Ordered set 1: {1,3,5,7,8,9}

Ordered set 2: {2,3,4,5,6,7}

Two pointers start at the first elements and compare the elements they point to:

(1) If the elements are equal, put one into the result set and advance either pointer;

(2) Otherwise, advance the pointer at the smaller value; repeat until either list is exhausted;

The benefits of this approach are:

(1) Each set element is compared at most once, so the time complexity is O(n);

(2) Several ordered sets can be processed at once, which suits intersecting the url_id lists of multiple terms' items;

The two lists mesh element by element like the two sides of a zipper closing, hence the name "zipper method";

Voice-over: the inverted index is sorted right when it is built, precisely to exploit this "ordered" property.
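The zipper method can be sketched as follows (a minimal sketch; `intersect_zipper` is an assumed name, not the article's code):

```python
# "Zipper" intersection of two sorted url_id lists:
# two pointers, each element compared at most once, O(n).
def intersect_zipper(list1, list2):
    i, j, result = 0, 0, []
    while i < len(list1) and j < len(list2):
        if list1[i] == list2[j]:
            result.append(list1[i])  # equal: keep it, advance one pointer
            i += 1
        elif list1[i] < list2[j]:
            i += 1                   # advance the pointer at the smaller value
        else:
            j += 1
    return result

print(intersect_zipper([1, 3, 5, 7, 8, 9], [2, 3, 4, 5, 6, 7]))  # [3, 5, 7]
```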

Solution 3: bucketing for parallelism

If list1<url_id> and list2<url_id> are split into several bucket ranges, each range can be intersected in parallel on its own thread, and the union of the per-thread result sets is the final result set. This can greatly reduce execution time.

For example:

Ordered set 1: {1,3,5,7,8,9,10,30,50,70,80,90}

Ordered set 2: {2,3,4,5,6,7,20,30,40,50,60,70}

To find the intersection, split into buckets first:

bucket 1 covers the range [1, 9]

bucket 2 covers the range [10, 100]

bucket 3 covers the range [101, max_int]

So:

Set 1 splits into:

set a: {1,3,5,7,8,9}

set b: {10,30,50,70,80,90}

set c: {}

Set 2 splits into:

set d: {2,3,4,5,6,7}

set e: {20,30,40,50,60,70}

set f: {}

The amount of data per bucket is greatly reduced, and no element repeats across buckets, so the buckets can be intersected in parallel with multiple threads:

in bucket 1, the intersection of sets a and d is x = {3,5,7}

in bucket 2, the intersection of sets b and e is y = {30,50,70}

in bucket 3, the intersection of sets c and f is z = {}

Finally, the intersection of sets 1 and 2 is the union of x, y, and z, i.e. {3,5,7,30,50,70}.

Voiceover: Multithreading and horizontal segmentation are common optimization methods.
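The bucketed parallel intersection can be sketched as below. This is a toy model: the bucket bounds are taken from the example above, `intersect_bucketed` and `bucket_slice` are assumed names, and a thread pool stands in for the real system's parallel workers:

```python
# Bucket-parallel intersection: split both sorted lists into value ranges,
# intersect each bucket on its own thread, then union the per-bucket results.
from concurrent.futures import ThreadPoolExecutor
import bisect

def bucket_slice(lst, lo, hi):
    # Elements of sorted lst falling within [lo, hi].
    return lst[bisect.bisect_left(lst, lo):bisect.bisect_right(lst, hi)]

def intersect_bucketed(list1, list2, buckets):
    def one_bucket(bounds):
        lo, hi = bounds
        a = bucket_slice(list1, lo, hi)
        b = bucket_slice(list2, lo, hi)
        return sorted(set(a) & set(b))
    with ThreadPoolExecutor(max_workers=len(buckets)) as pool:
        parts = list(pool.map(one_bucket, buckets))
    # Buckets cover disjoint ranges, so concatenation is the union.
    return [x for part in parts for x in part]

set1 = [1, 3, 5, 7, 8, 9, 10, 30, 50, 70, 80, 90]
set2 = [2, 3, 4, 5, 6, 7, 20, 30, 40, 50, 60, 70]
print(intersect_bucketed(set1, set2, [(1, 9), (10, 100), (101, 2**31 - 1)]))
# [3, 5, 7, 30, 50, 70]
```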

Solution 4: bitmap optimization

Once the data is horizontally split into buckets, the data in each bucket is guaranteed to lie within a fixed range. If a set satisfies this property, it can be represented as a bitmap:

As shown in the figure above, assuming all elements of set1 {1,3,5,7,8,9} and set2 {2,3,4,5,6,7} lie within the bucket range [1, 16], 16 bits suffice to describe each set: if element x is in the original set, bit x of the 16-bit map is 1. Intersecting the two sets then reduces to a bitwise AND of the two bitmaps; bits 3, 5, and 7 of the result bitmap are 1, so the intersection of the original sets is {3,5,7}.

Horizontal bucketing plus bitmaps can greatly improve the efficiency of intersection, but the time complexity is still O(n), and a bitmap needs a large block of contiguous space, consuming a lot of memory.

Voiceover: Bitmaps can represent sets, and it is very fast to find the intersection of sets.
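The bitmap trick can be sketched with Python integers standing in for fixed-width bit arrays (a toy illustration; `to_bitmap`/`from_bitmap` are assumed names):

```python
# Bitmap intersection: each set within the bucket range is encoded as the
# bits of an integer, and set intersection becomes a single bitwise AND.
def to_bitmap(s):
    bm = 0
    for x in s:
        bm |= 1 << x          # bit x set to 1 means "x is in the set"
    return bm

def from_bitmap(bm):
    # Recover the set: positions of the 1 bits.
    return [x for x in range(bm.bit_length()) if bm >> x & 1]

set1 = {1, 3, 5, 7, 8, 9}
set2 = {2, 3, 4, 5, 6, 7}
print(from_bitmap(to_bitmap(set1) & to_bitmap(set2)))   # [3, 5, 7]
```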

Solution 5: skip lists (skiplist)

The skip list is a very common data structure; it can reduce the complexity of intersecting ordered sets from O(n) to close to O(log(n)).

Set 1: {1,2,3,4,20,21,22,23,50,60,70}

Set 2: {50,70}

With the plain zipper method, 1,2,3,4,20,21,22,23 are all traversed in vain: every element is compared, and the time complexity is O(n). Can some elements be "skipped" during comparison?

Enter the skip list:

For set 1 {1,2,3,4,20,21,22,23,50,60,70}, a skip list is built: the top level contains only the three elements {1,20,50}, while the bottom level is the same as an ordinary linked list.

For set 2 {50,70}, since it has few elements, only an ordinary single-level linked list is built.

During the "zipper" intersection, the pointer of set 1 can jump from 1 to 20 and then to 50, skipping many elements in between without comparing them one by one; the time complexity of skip-list intersection is approximately O(log(n)). This is a common algorithm in search engines.
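The skipping idea can be sketched in simplified form. This is not a full linked skip list: it uses evenly spaced skip pointers over a sorted array (an assumed simplification) to show how the set-1 pointer jumps over runs of elements:

```python
# Simplified "skip pointer" intersection: for each element of the small set,
# first jump ahead along every `skip`-th element of the big sorted list
# (the "top level"), then walk element by element (the "bottom level").
def intersect_with_skips(big, small, skip=4):
    result, i = [], 0
    for x in small:
        # Top level: jump while the next skip pointer is still <= x.
        while i + skip < len(big) and big[i + skip] <= x:
            i += skip
        # Bottom level: ordinary linked-list walk.
        while i < len(big) and big[i] < x:
            i += 1
        if i < len(big) and big[i] == x:
            result.append(x)
    return result

set1 = [1, 2, 3, 4, 20, 21, 22, 23, 50, 60, 70]
set2 = [50, 70]
print(intersect_with_skips(set1, set2))   # [50, 70]
```

With the skip pointers, finding 50 takes two jumps (1 → 20 → 50) instead of eight element-by-element comparisons.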

A quick summary:

(1) A whole-network search engine consists of the spider, Search&Index, and Rank subsystems;

(2) A site search engine differs from a whole-network search engine mainly in the absence of the spider subsystem;

(3) Spider and Search&Index are engineering-oriented systems, while optimizing the Rank system requires long-term tuning and accumulation;

(4) The forward index supports quickly finding a page's segmented content list<item> by its url_id;

(5) The inverted index supports quickly finding the list<url_id> of pages containing a given item;

(6) The user retrieval process: segment the query, find the list<url_id> for each item, and finally intersect the sets;

(7) The intersection of ordered sets can be obtained by:

  • double for loop, time complexity O(n*n)

  • zipper method, time complexity O(n)

  • horizontal bucketing, intersected in parallel with multiple threads

  • bitmap, which greatly improves the parallelism of the operation, time complexity O(n)

  • skip list, time complexity O(log(n))

Voice-over: That should be enough for the interview.

Most engineers may never touch a "search kernel", but almost every Internet business involves a "search" feature. Again taking 58.com's post business as an example: post titles and content face strong user search demand. As the business, traffic, and concurrency grow stage by stage, how should the search requirement be implemented?

Primitive stage - LIKE

In the startup phase, things are often done the quick way.

Data might be stored in a database like this:

t_tiezi(tid, title, content)

To meet the requirements of title and content retrieval, LIKE can be implemented:

select tid from t_tiezi where content like '%tiezi%'

This satisfies the business need quickly, but the problems are obvious:

(1) Low efficiency: every query scans the full table; the computation is heavy, and under high concurrency the CPU easily hits 100%;

(2) Word segmentation is not supported;

Primary stage - full-text index

How can we quickly improve efficiency and support word segmentation while disturbing the original system architecture as little as possible? The first thing that comes to mind is building a full-text index:

alter table t_tiezi add fulltext(title,content)

Queries on the indexed fields are then implemented with match ... against.

A full-text index can quickly satisfy the business's word-segmentation needs and quickly improve performance (term inversion: at least no more full table scans), but it has several problems:

(1) It only applies to the MyISAM engine;

(2) Because the full-text index rides on database features, search traffic and ordinary CURD traffic are coupled inside the database: when retrieval concurrency is high, CURD requests may be affected; when CURD concurrency is high, retrieval becomes very slow;

(3) When the data volume reaches the million level, performance drops significantly, and query latency grows too long for the business to accept;

(4) It is hard to scale horizontally;

Intermediate stage - open-source external index

To overcome the limitations of full-text indexing, an external index should be considered once the data volume grows to the million or ten-million level. The core idea of an external index is to separate the index data from the raw data: the former serves search needs, the latter serves CURD needs, and consistency between the two is guaranteed by some mechanism (double-write, notification, or periodic rebuild).

The raw data can stay in MySQL. How is the external index implemented? Solr, Lucene, and ES are the common open-source solutions, and ES (ElasticSearch) is by far the most popular.

Lucene is good, but its potential drawbacks are:

(1) Lucene is only a library; to get complex features such as high availability, scalability, and load balancing, you have to build a service around it yourself;

(2) Lucene only supports Java; to support other languages, again you have to build a service yourself;

(3) Lucene is unfriendly to use, which is fatal: it is very complex, and users often need a deep understanding of search to understand how it works; to shield that complexity, once more you have to build a service yourself;

To address Lucene's shortcomings, the solution was to "wrap it in a service with a friendly interface that shields the underlying complexity", hence ES:

(1) ES is a service with Lucene as its kernel; it implements search and exposes a RESTful interface;

(2) ES supports storing large volumes of data and serving highly concurrent search requests;

(3) ES supports clustering, shielding users from complex features such as high availability, scalability, and load balancing;

At present, Kuaigou Taxi uses ES as its core search service to implement the various search needs of the business, among which:

(1) The requirement with the largest data volume, "interface latency data collection", holds about 1 billion records;

(2) The requirement with the highest concurrency, "longitude/latitude geo search", sees an average online concurrency of about 2,000 and a stress-tested concurrency of about 8,000;

So ES fully satisfies ordinary search needs at the scale of 1 billion records and 5K throughput.

Advanced stage - self-developed search engine

When the data volume grows further, to billions or tens of billions of records, when concurrency rises further, to a throughput of 100,000 per second, and when business-specific needs keep multiplying, it becomes necessary to develop your own search engine and customize the search kernel.

At the stage of a customized, self-developed search engine, the design focus is the large data volume and high concurrency. To satisfy "unlimited capacity, unlimited concurrency" needs, the architecture must be designed around "scalability", striving for: add machines, and both capacity and concurrency scale up.

The preliminary architecture of 58.com's self-developed search engine e-Search is as follows:

(1) The upper-layer proxy (pink) is the access cluster and the external entry point, accepting search requests; being stateless, the proxy cluster's capacity can be scaled by adding machines;

(2) The middle-layer merger (light blue) is the logic cluster, mainly responsible for merging search results and ranking by score; business-specific rank logic lives in this layer; being stateless, the merger cluster can likewise be scaled by adding machines;

(3) The bottom-layer searcher (dark red) is the retrieval cluster; the service and its index data are deployed on the same machine, so the service can load the index into memory at startup and serve requests from memory, which is very fast:

  • To scale data capacity, the index data is sharded horizontally; increasing the number of shards scales capacity without limit, as shown in the figure above

  • To scale the read performance of each shard, the same shard is replicated; in theory, adding machines scales performance without limit. As shown in the figure above, each searcher group holds two redundant copies

With this design, adding machines genuinely increases the data volume the system can carry and the concurrency it can sustain.

A quick summary:

To meet the needs of the search business, the search architecture generally goes through the following phases as the volume of data and concurrency increases:

(1) Primitive stage - LIKE;

(2) Primary stage - full-text index;

(3) Intermediate stage - open-source external index;

(4) Advanced stage - self-developed search engine;

Finally, one advanced topic, on the real-time nature of search:

How can Baidu retrieve, in real time, news published 15 minutes ago? How can 58.com retrieve, in real time, posts published 1 second ago?

What are the key points of real-time search engine system architecture?

For a search engine with a large data volume and high concurrency, there are two keys to guaranteeing real-time freshness in the architecture design:

(1) tiered indexes;

(2) dump&merge;

First of all, with a very large data volume, to keep inverted-index retrieval efficient, no data update modifies the index in real time.

Voice-over: because once fragmentation appears, retrieval efficiency drops sharply.

Since the index data cannot be modified in real time, how can the newest pages still be retrieved?

The index is tiered into a full library, a daily incremental library (day library), and an hourly incremental library (hour library).

As described above:

(1) The full index library holds 30 billion records;

(2) The day library holds the 10 million records modified within the last day;

(3) The hour library holds the 500,000 records modified within the last hour;

When a modification request arrives, only the lowest-level index, e.g. the hour library, is written.

When a query request arrives, all index tiers are queried at the same time and the results are merged to obtain the latest data:

(1) The full library is a tightly stored index without fragmentation, and is fast;

(2) The day library is tightly stored and fast;

(3) The hour library holds little data and is fast;
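The query-time merge across tiers can be sketched as below. This is a toy model with assumed structures: the real system merges inverted-index results across tiers, while here a newest-tier-first lookup over single documents models the same "freshest version wins" preference:

```python
# Toy tiered-index lookup: hour library first, then day library,
# then full library, so the freshest version of a record wins.

full_lib = {"doc1": "v1", "doc2": "v1", "doc3": "v1"}   # rebuilt rarely
day_lib  = {"doc2": "v2"}                               # last day's changes
hour_lib = {"doc2": "v3", "doc3": "v2"}                 # last hour's changes

def lookup(doc_id):
    for tier in (hour_lib, day_lib, full_lib):   # newest tier first
        if doc_id in tier:
            return tier[doc_id]
    return None

print(lookup("doc2"))   # 'v3' -- the hour library overrides the older tiers
print(lookup("doc1"))   # 'v1' -- present only in the full library
```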

Tiered indexing guarantees real-time freshness. But then a new question arises: when does hour-library data enter the day library, and when does day-library data enter the full library?

Dump&merge, the exporting and merging of indexes, is done by two asynchronous tools:

Dumper: exports online data.

Merger: merges offline data into the higher-tier index.

The hour library is merged into the day library once an hour;

The day library is merged into the full library once a day;

This keeps the hour library and the day library from growing particularly large;

If the data volume and concurrency are even larger, weekly and monthly libraries can be added as further buffers.

A quick summary:

The two keys to a real-time search engine architecture:

(1) tiered indexes;

(2) dump&merge;

Have you picked up some new knowledge about "search" and "retrieval"?

The Architect’s Path – Share technical ideas
