First, a sneak peek

Search engine project address: Dolphin search www.haiteem.com

Second, the preface

Objectively speaking, a full whole-web search engine is very complex: it has to handle billions of pages or even more, and that volume of data needs serious hardware support. An individual only needs to understand the technical process and its logic, and to be able to implement search over a large dataset (ten million records or more).

Let’s start with the basic components of a search engine:

Third, the crawler

A web crawler is a program or script that automatically fetches information from the World Wide Web according to certain rules. It is widely used by Internet search engines and similar sites. It automatically collects every page it can reach in order to obtain or update the content of those websites. Functionally, a crawler is generally divided into three parts: data collection, processing, and storage.

Crawl strategy

Because the Internet is so large, a crawler needs a suitable crawl strategy, generally either depth-first or breadth-first. Depth-first means crawling a site's home page, then the pages linked from it, and so on, before moving on to the next site. This places very high demands on the target site: if the crawl frequency is too high, an ordinary website will come under heavy load and may even stop working normally. Breadth-first means first crawling the home pages of a set of sites, then crawling the links produced by that pass, and so on. The advantage is that the load is spread across sites, which is friendlier to each of them.
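
A minimal breadth-first crawl sketch in Python (the seed URLs, fetcher, and link extractor here are simplified assumptions for illustration, not the actual project code):

```python
from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def bfs_crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: finish each frontier level before going deeper."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue
        pages[url] = resp.text
        # Extract out-links and append them to the back of the queue,
        # so the whole current level is crawled before the next one.
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```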

When crawling a large number of pages, a single-threaded crawler is inefficient, because most of the time is spent waiting on the network. So use multithreaded crawling whenever possible, since the network overhead dominates data fetching.
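
A small sketch of concurrent fetching with a thread pool (the URL list and worker count are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch(url):
    """Download one page; the network wait dominates, so threads overlap well."""
    try:
        return url, requests.get(url, timeout=5).text
    except requests.RequestException:
        return url, None

urls = ["https://example.com/page%d" % i for i in range(100)]  # placeholder URLs
with ThreadPoolExecutor(max_workers=16) as pool:
    results = dict(pool.map(fetch, urls))
```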

robots.txt

In the root directory of a website there is usually a robots.txt file, in which the site administrator writes crawl rules according to the site's needs: which pages a crawler may fetch and which it may not.
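
Python's standard library can read these rules directly; a small sketch (the site and user agent are placeholders):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")  # placeholder site
rp.read()

# True only if the robots.txt rules allow this user agent to fetch this URL.
allowed = rp.can_fetch("MyCrawler", "https://example.com/some/page.html")
print(allowed)
```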

Web page type

Web pages can be divided into static and dynamic pages. In a static page, the data is complete once the page loads, so it is relatively easy to crawl. In a dynamic page, the data is filled in after the page loads: JavaScript makes further requests to the server and renders the results on the page. An ordinary HTTP fetch cannot get this dynamic data, so it needs an extra component (for example a headless browser or a JavaScript-rendering plug-in).

Data deduplication

Once the crawl reaches a certain scale, the number of pages grows rapidly and inevitably produces many duplicate links and duplicate content, so deduplication has to be considered. For a search engine's massive data, deduplicating on the fly at crawl time is generally preferable: it greatly reduces how much duplicate content is fetched, and it works better than crawling everything first and deduplicating in one batch afterwards. On-the-fly deduplication needs a link library: each new link is checked against it, skipped if it already exists, and crawled otherwise. After a crawl task finishes, the results are deduplicated once more and then written to the database.
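
A minimal sketch of an in-memory link library (a real system would use a persistent store or a Bloom filter; everything below is an assumption for illustration):

```python
import hashlib

class LinkLibrary:
    """Keeps a fingerprint of every link seen so far."""
    def __init__(self):
        self.seen = set()

    def should_crawl(self, url):
        # Store a fixed-size digest instead of the full URL to save memory.
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        if digest in self.seen:
            return False          # duplicate: skip
        self.seen.add(digest)
        return True               # new link: crawl it

library = LinkLibrary()
for url in ["https://a.com/1", "https://a.com/1", "https://a.com/2"]:
    print(url, library.should_crawl(url))
```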

Content similarity algorithm

1. Cosine similarity

Cosine similarity measures how similar two vectors are by the cosine of the angle between them. The cosine of a 0° angle is 1, the cosine of any other angle is no greater than 1, and its minimum value is -1. So the cosine of the angle between two vectors indicates whether they point in roughly the same direction: when the two vectors point the same way, the cosine similarity is 1; when the angle between them is 90°, it is 0; when they point in opposite directions, it is -1. It does not depend on the length of the vectors, only on their direction.

For two n-dimensional vectors A = (A1, A2, A3, ..., An) and B = (B1, B2, B3, ..., Bn), the cosine similarity is:

cos(θ) = (A1·B1 + A2·B2 + ... + An·Bn) / ( sqrt(A1² + ... + An²) × sqrt(B1² + ... + Bn²) )
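
A short sketch of cosine similarity over term-frequency vectors (the whitespace tokenizer is a naive assumption, purely for illustration):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("search engine crawler", "search engine index"))
```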

2. SimHash algorithm

Steps: word segmentation, hashing, weighting, merging, dimensionality reduction

SimHash can estimate the similarity between texts. We compute a simHash value for each document with the simHash algorithm, then judge similarity by the Hamming distance between the simHash values: the smaller the Hamming distance, the greater the similarity. For typical long texts, a distance <= 3 can be treated as a duplicate.

1.1 Word segmentation:

Choose a suitable word segmentation library to split the text into words. For example, "Welcome to China" -> (after segmentation) "Welcome", "Come", "China".

1.2 Hashing:

For each word, compute a hash value: an n-bit signature made up of binary 0s and 1s. Suppose "Welcome" hashes to 100101, "Come" to 101011, and "China" to 101011.

1.3 Weighting:

For a given text, the weight of a word is how often it appears after segmentation. All feature vectors are weighted, i.e. W = hash × weight. Here we assume the weights of the three words are 4, 5, and 9. The rule is: where a hash bit is 1, use +weight; where it is 0, use -weight. For example, weighting the "Welcome" hash 100101 with weight 4 gives W(Welcome) = 4 -4 -4 4 -4 4; weighting the "Come" hash 101011 with weight 5 gives W(Come) = 5 -5 5 -5 5 5; the rest are calculated by the same rule.

1.4 Merging:

Add up the weighted feature vectors of all words, position by position, to get a single sequence. Taking just the first two feature vectors, "Welcome" = 4 -4 -4 4 -4 4 and "Come" = 5 -5 5 -5 5 5, adding them gives 4+5, -4+(-5), -4+5, 4+(-5), -4+5, 4+5, i.e. 9 -9 1 -1 1 9.

1.5 Dimensionality reduction:

For each bit of the n-bit sum, set it to 1 if it is greater than 0 and to 0 otherwise; the result is the simHash value of the text. Finally, similarity between texts is judged by the Hamming distance between their simHash values. For example, reducing the "9 -9 1 -1 1 9" computed above (a positive value becomes 1, a non-positive value becomes 0) gives the bit string 1 0 1 0 1 1, which is the simHash signature.

Hamming distance: in information coding, the number of bit positions at which two codes of equal length differ is called the code distance, also known as the Hamming distance. For example, comparing 10101 and 00110 from the first bit, the first, fourth and fifth bits differ, so the Hamming distance is 3.
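
A compact sketch of the whole simHash pipeline (the whitespace tokenizer and term-frequency weighting are naive assumptions; a real system would use a proper segmenter and 64-bit hashes):

```python
import hashlib
from collections import Counter

def simhash(text, bits=64):
    """Segment, hash, weight, merge, and reduce to a single fingerprint."""
    words = text.split()                 # naive segmentation for illustration
    weights = Counter(words)             # weight = term frequency
    v = [0] * bits
    for word, weight in weights.items():
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

d = hamming_distance(simhash("the crawler crawls the website"),
                     simhash("the crawler crawls this website"))
print("duplicate" if d <= 3 else "distinct", d)
```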

3. Other methods:

*Manhattan distance, Euclidean distance, Jaccard similarity coefficient, etc.*

Content filtering

A considerable part of Internet content is useless, so the crawled content needs to be de-noised: stripping all kinds of tags, illegal content, advertisements and so on. After processing, the structured data is stored in the database.

Fourth, build an index

Word segmentation

The prerequisite for building an index is word segmentation. For English, segmentation is easy because words are separated by spaces; Chinese has no separators, so it needs a dedicated Chinese word segmentation plug-in.

The logic of Chinese word segmentation is generally as follows:

Prepare a Chinese dictionary that contains most words and short phrases. For a search engine, all possible words in a sentence should be extracted to improve recall. For example:

Sentence                        Segmentation result
You look beautiful today        You, today, innocent, beautiful

So the sentence is traversed, substrings are taken, each substring is compared against the Chinese dictionary, and it is kept if it is a dictionary word.

The logic of precise segmentation is as follows:

Segment the sentence as accurately as possible. For example, "I believe tomorrow will be better" would be precisely segmented as:

Sentence                              Segmentation result
I believe tomorrow will be better     I, believe, tomorrow, will be better

There are many ready-made word segmentation plug-ins for various languages (PHP, Java, Python, etc.) available online, and you can use them directly. If you are strongly interested, you can also explore and develop a segmentation plug-in of your own.
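
For instance, with the popular Python segmenter jieba (the sample sentence is only an illustration), full mode extracts every candidate word for recall, while precise mode picks one best segmentation:

```python
import jieba

sentence = "我相信明天会更好"  # "I believe tomorrow will be better"

# Full mode: extract all candidate words to maximise recall (useful when indexing).
print(jieba.lcut(sentence, cut_all=True))

# Precise mode: one best segmentation.
print(jieba.lcut(sentence))

# Search-engine mode: precise mode plus extra splits of long words.
print(jieba.lcut_for_search(sentence))
```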

Inverted index

An index is a storage structure created in advance from the content of the target information in order to speed up lookup. For example, a book without a table of contents can still be read, but once you close it, finding the same place again takes more time. With a few pages of table of contents, you can quickly see the overall distribution of content and the page location of each chapter, and search the contents much more efficiently. A book's table of contents is a simple index of its contents.

An inverted index is an indexing technique in which the index is built from the key attribute values (terms) of the information items, mapping each value back to the items that contain it.

For example, there are two documents:

id    Document title                                          Segmentation result
1     The search engine crawler's website crawl strategy      search engine, crawler, crawl, website, strategy
2     The workflow of the search engine crawler               search engine, crawler, the, work, process

The (simple) inverted index structure is:

word             Document id
search engine    1, 2
crawler          1, 2
crawl            1
website          1
strategy         1
work             2
the              2
process          2

If a user searches for "crawler", the set of document IDs containing "crawler", {1, 2}, is returned directly.

This is the simplest form of inverted list. More complex ones also carry a word's frequency, positions, and other information within each document.
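
A minimal sketch of building and querying such an index (the documents and the whitespace tokenizer are illustrative assumptions):

```python
from collections import defaultdict

docs = {
    1: "search engine crawler crawl website strategy",
    2: "search engine crawler the work process",
}

# Build: map every word to the set of document ids that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

# Query: look the word up directly instead of scanning every document.
print(sorted(index["crawler"]))   # -> [1, 2]
print(sorted(index["strategy"]))  # -> [1]
```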

Index update

Index updates generally come in two forms: full and incremental. A full index rebuild re-indexes all content every time and takes a relatively long time. An incremental index only indexes the newly added content each time and then merges it with the old index to form the new index; its advantage is that it takes much less time.
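
A toy sketch of merging an incremental posting list into an existing index (both indexes are plain word-to-doc-id-set dicts, an assumption for illustration):

```python
def merge_index(old_index, delta_index):
    """Fold the incremental (delta) postings into the existing index."""
    merged = {word: set(ids) for word, ids in old_index.items()}
    for word, ids in delta_index.items():
        merged.setdefault(word, set()).update(ids)
    return merged

old = {"crawler": {1, 2}, "strategy": {1}}
delta = {"crawler": {3}, "index": {3}}   # postings from newly crawled documents
print(merge_index(old, delta))
```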

Five, the search

As search technology has matured, the search interface has settled into a relatively fixed pattern. Beyond basic search, it will probably cover the following areas.

Autocomplete: suggestions are offered continuously as the user types the query into the search box; for Chinese, suggestions are also shown while the user is still typing pinyin. Because completions must be fed back in real time as the user types, this places high demands on the server and on the data structure; the latency should be at most about 0.2 seconds.
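
A minimal sketch of prefix completion over a sorted term list with binary search (the term list is a placeholder; production systems usually use a trie or a dedicated suggestion index):

```python
import bisect

terms = sorted(["search engine", "search engine crawler", "search ranking", "simhash"])

def complete(prefix, limit=5):
    """Return up to `limit` indexed terms that start with the typed prefix."""
    start = bisect.bisect_left(terms, prefix)
    out = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break
        out.append(term)
        if len(out) == limit:
            break
    return out

print(complete("search e"))  # -> ['search engine', 'search engine crawler']
```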

Related search terms: when users are not satisfied with the current results, changing the search term may yield more useful information, so several related suggestions are usually shown based on the current query. This can be seen as a specific application of collaborative filtering to search terms.

Search term preprocessing

Sometimes the query the user enters is complete, but sometimes it is incomplete or even contains errors, which is where the error-correction function comes in. The engine also analyses the user's search intent and automatically recognises what is implied but not stated, in order to return the best results. This part involves natural language processing.

Search result ranking (very important)

For a search engine, the most important and most critical thing is ranking the results so that the content that best matches the query comes first. The ranking algorithm is something you need to explore yourself; in general, each document is scored along multiple dimensions, the importance of each query word to each document is also scored, and the results are then sorted. Of course, this is only the general logic; after all, this is a search engine company's core technology.
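
One common, well-documented scoring scheme is TF-IDF; a toy sketch (the corpus and tokenizer are placeholders, and real engines combine many more signals):

```python
import math
from collections import Counter

docs = {
    1: "search engine crawler crawl website strategy",
    2: "search engine crawler work process",
    3: "cosine similarity and simhash for duplicate detection",
}
tokenized = {d: text.split() for d, text in docs.items()}

def score(query, doc_id):
    """TF-IDF score of a document for a query: rarer query words matter more."""
    words = tokenized[doc_id]
    tf = Counter(words)
    total = 0.0
    for q in query.split():
        df = sum(1 for w in tokenized.values() if q in w)  # document frequency
        if df == 0:
            continue
        idf = math.log(len(docs) / df)
        total += (tf[q] / len(words)) * idf
    return total

query = "crawler strategy"
ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
print(ranked)  # documents ordered by relevance to the query
```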

Google’s MapReduce is well documented on the Internet, and you can look at the literature.

Advanced search features

Of course, user searches sometimes have specific needs, such as searching within a specific site, excluding certain terms, restricting results to a specific time period, and so on.

There is no need to execute the same search repeatedly: search results can be cached with various caching tools (Redis, MG, etc.). The cache design should avoid cache avalanche.
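
A small sketch of result caching with redis-py, using a randomized TTL so entries do not all expire at once, which is one common way to avoid a cache avalanche (the key scheme and timings are assumptions):

```python
import json
import random
import redis

r = redis.Redis()  # assumes a local Redis instance

def cached_search(query, search_fn, base_ttl=600):
    """Return cached results if present, otherwise run the search and cache it."""
    key = "search:" + query
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)              # serve from cache
    results = search_fn(query)              # fall back to the real search
    # Jitter the TTL so many keys do not expire at the same moment.
    ttl = base_ttl + random.randint(0, 120)
    r.setex(key, ttl, json.dumps(results))
    return results
```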

Example search engine (personal project) address:

The project completed by following the steps above: Dolphin Search www.haiteem.com