This article was first published on my blog: Hexo Personal Blog SEO Optimization (1): Introduction to Search Engine Principles

Hexo Personal Blog SEO Optimization (1): Introduction to Search Engine Principles

Hexo Personal Blog SEO Optimization (2): On-Site Optimization

Hexo Personal Blog SEO Optimization (3): Adapting Your Blog to Improve Your Articles' Search Rankings

Recently I was handed the job of optimizing a company's official website. To get it done I was thrown in at the deep end and had to learn SEO systematically from scratch. After a few days of study I have at least found my way in, and I have come to realize how meaningful and important SEO is, for personal sites included (a blog is a personal site too). Some people may wonder whether, in this era of self-publishing, WeChat official accounts, and vertical sites such as Zhihu, Nuggets, and SF, it is still worth studying SEO and applying it to a blog. My view is that if you want to maintain your blog over the long term, the necessary SEO knowledge is a long-half-life investment, one where a little effort goes a very long way.

You do not need to spend a huge amount of effort: do the on-site optimization once, then keep publishing posts on a regular schedule (and do some off-site optimization if needed), and you can raise your blog's search engine rankings and bring it more traffic. Whether you want to build a reputation or simply help your articles reach as many people as possible, more traffic gets you there.

What follows is the learning summary of a complete SEO beginner. If any SEO veterans read it, please go easy on me, and do point out anything I have gotten wrong.

Search engine

Some readers may not know what SEO is. SEO stands for Search Engine Optimization: optimizing a website to improve its ranking in search engines and bring it more visits. Before introducing specific SEO elements and techniques, we need a basic understanding of how search engines work, so that the individual SEO measures make sense.

Basic Principles

How a search engine works is extremely complex; here we can only use the simplest possible outline to describe how it arrives at a page ranking. The work of a search engine can be divided into three stages:

  1. Crawling and fetching: spiders follow links to visit web pages, fetch the HTML code of each page, and store it in a database.
  2. Preprocessing: the indexing program performs text extraction, Chinese word segmentation, indexing, and other processing on the fetched page data, preparing it for the ranking program to use.
  3. Ranking: after a user enters a keyword, the ranking program queries the index database, calculates relevance, and generates the search results page in a particular format.

Step 1: Crawling and fetching

Crawling and fetching is the first step of a search engine's work; it accomplishes the task of data collection.

Spiders

The programs that search engines use to crawl and fetch pages are called spiders, also known as bots. A spider is similar to the browser an ordinary user visits web pages with: after the spider requests a page, the server returns the HTML code, and the spider stores the received HTML in the original page database. To speed up crawling and fetching, search engines usually run many spiders in parallel.

When a spider visits any website, it first requests the robots.txt file at the root of the site. This file tells the spider which files or directories may or may not be fetched. Just as different browsers send different user-agent strings, spiders from different vendors identify themselves with their own agent names.
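
To make this concrete, here is a minimal sketch, using Python's standard urllib.robotparser, of the check a spider performs before fetching a page. The example.com URLs and the "MySpider" agent name are made up for illustration, not anything a real search engine uses.

```python
from urllib import robotparser

# Minimal sketch: consult robots.txt before fetching, as a spider would.
# The example.com URLs and the "MySpider" agent name are hypothetical.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the site's robots.txt

for url in ["https://example.com/posts/hello-world/",
            "https://example.com/admin/"]:
    if rp.can_fetch("MySpider", url):
        print("allowed:", url)   # the spider may go on to request this page
    else:
        print("blocked:", url)   # robots.txt disallows it for this agent
```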

Following links

To crawl as many pages on the web as possible, spiders follow the links on a page and crawl from one page to the next, just as a spider crawls across its web. There are two strategies for this traversal:

  • Depth-first: the spider follows one link forward, page after page, until it reaches a page with no further links, and only then backtracks
  • Breadth-first: the spider finds multiple links on a page, visits all of the first-level links on that page, and then moves on to the second level

Programmers will recognize these two traversal algorithms. In theory, given enough time, a spider could crawl every link on the web, whether it goes depth-first or breadth-first. In practice that does not happen: because resources of every kind are limited, search engines crawl and collect only part of the Internet.

So in practice spiders use a mix of depth-first and breadth-first crawling.
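
As an illustration of the two traversal orders, here is a toy sketch over a made-up in-memory link graph; real spiders obviously work over the network and juggle far more constraints.

```python
from collections import deque

# Toy link graph: page -> pages it links to (purely illustrative data).
links = {
    "home":   ["posts", "about"],
    "posts":  ["post-1", "post-2"],
    "about":  [],
    "post-1": ["post-2"],
    "post-2": [],
}

def crawl(start, breadth_first=True):
    """Return pages in the order a spider would reach them."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # BFS takes the oldest discovered page, DFS the newest one.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order

print(crawl("home", breadth_first=True))   # ['home', 'posts', 'about', 'post-1', 'post-2']
print(crawl("home", breadth_first=False))  # ['home', 'about', 'posts', 'post-2', 'post-1']
```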

Attracting spiders

From the introduction above we know that spiders cannot include every page, so part of SEO is using various means to attract spiders to crawl and include more pages of your own site. Since not every page can be fetched, spiders try to fetch the important pages. How does a spider decide which pages are important? Several factors come into play:

  • Site and page weight. High-quality, well-established old sites are treated as high weight.
  • Freshness of updates. Sites that update frequently are visited and crawled more often.
  • Inbound links. For a page to be fetched by a spider, an inbound link must lead to it, whether external or internal. High-quality inbound links also tend to increase the crawl depth of the outbound links on the page.
  • Click distance from the home page. Generally the home page carries the highest weight of a site: most external links point to it, and it is the page spiders visit most often. So the closer a page is to the home page, the higher its weight tends to be and the greater its chance of being crawled.

The address library

To avoid discovering and fetching the same URLs over and over, search engines maintain an address library that records both pages that have been discovered but not yet crawled and pages that have already been crawled. URLs enter the address library from several sources (a small sketch follows the list):

  • Seed sites entered manually
  • URLs the spider parses out of crawled pages and compares against the address library; any that are not already in it are added
  • URLs submitted by webmasters through the search engine's submission page (common for personal blogs and websites)
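
A rough sketch of what an address library keeps track of might look like the following. The URL normalization shown (dropping fragments and trailing slashes) is only an illustrative assumption; real engines normalize far more aggressively.

```python
from urllib.parse import urldefrag

class AddressLibrary:
    """Minimal sketch of an address library: URLs that were discovered but
    not crawled yet, and URLs that have already been crawled."""

    def __init__(self):
        self.to_crawl = []     # discovered, waiting to be crawled
        self.crawled = set()   # already crawled
        self.known = set()     # every URL ever seen, crawled or not

    @staticmethod
    def normalize(url):
        # Drop the #fragment and trailing slash so trivially different
        # spellings of the same page are not stored twice.
        url, _ = urldefrag(url)
        return url.rstrip("/")

    def add(self, url):
        url = self.normalize(url)
        if url not in self.known:       # only brand-new URLs are queued
            self.known.add(url)
            self.to_crawl.append(url)

    def next_url(self):
        url = self.to_crawl.pop(0)
        self.crawled.add(url)
        return url

lib = AddressLibrary()
lib.add("https://example.com/posts/seo/")
lib.add("https://example.com/posts/seo#comments")  # same page, different spelling
print(lib.to_crawl)    # ['https://example.com/posts/seo'] -> stored only once
print(lib.next_url())  # hands the URL to a spider and marks it crawled
```
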
File storage

The data the spider fetches is stored in the original page database. The stored page data is exactly the same HTML the user's browser receives.

Detection of copied content while crawling

Detecting and removing copied content mainly happens in the preprocessing step, but spiders also perform a degree of copy detection while crawling. If a site has low weight and a large share of its content is copied from elsewhere, the spider may stop crawling it. This is one reason a website needs original content.
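
As a toy illustration of fetch-time copy detection, the sketch below fingerprints the normalized page text and flags exact repeats; real engines rely on near-duplicate techniques such as shingling or simhash rather than a plain hash.

```python
import hashlib

seen_fingerprints = set()

def looks_copied(page_text):
    """Toy duplicate check: fingerprint the normalized text and see whether
    the same fingerprint was crawled before."""
    normalized = " ".join(page_text.lower().split())
    fingerprint = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    if fingerprint in seen_fingerprints:
        return True
    seen_fingerprints.add(fingerprint)
    return False

print(looks_copied("Hexo blog SEO guide"))        # False, first time seen
print(looks_copied("  Hexo   blog SEO guide  "))  # True, same text reformatted
```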

Step 2: Preprocessing

Preprocessing is sometimes simply called indexing, because indexing is its most important step. Preprocessing involves several steps (a simplified indexing sketch follows the list):

  1. Text extraction. The search engine extracts the visible text of the page, plus special code that carries text information, such as meta tags, image alt attributes, and hyperlink anchor text.
  2. Word segmentation. Different languages require different segmentation, for example Chinese word segmentation versus English tokenization. The text extracted in the first step is split into words; different search engines use different segmentation algorithms, so the results differ.
  3. Stop word removal. In both Chinese and English there are very frequent words that contribute nothing to the content, such as modal particles, interjections, prepositions, and adverbs. Search engines remove these words before indexing pages.
  4. Noise elimination. The vast majority of pages also contain content that contributes nothing to the page's topic, such as copyright notices and navigation bars. Take blogs as an example: almost every page carries article categories, archive navigation, and other information irrelevant to the topic of the page. These are noise and need to be removed.
  5. Deduplication. The same article often appears repeatedly on different websites, and at different URLs on the same site. Search engines do not like such repeated content, so they deduplicate it.
  6. Forward indexing, often simply called indexing. After text extraction, word segmentation, noise elimination, and deduplication, what remains is unique, word-level content that reflects the substance of the page. The indexing program then extracts the keywords produced by segmentation, turns the page into a set of keywords, and records each keyword's frequency, occurrence count, format, and position on the page. These page-and-keyword structures are stored in the index database.
  7. Inverted indexing. A forward index cannot be used directly for keyword ranking, so the search engine rebuilds it into an inverted index, turning the file-to-keyword mapping into a keyword-to-file mapping. When a keyword is searched, the ranking program looks it up in the inverted index and immediately finds all the files containing that keyword.
  8. Link relationship calculation. This is also an important part of preprocessing. After fetching page content, the search engine must compute ahead of time which other pages each page links to, which inbound links each page has, and what anchor text those links use. These complex link relationships form the link weight of sites and pages; Google PageRank is the best-known expression of them. Other search engines use similar techniques, even if they do not call it PR.
  9. Special file handling. Besides HTML, search engines can also crawl, fetch, and index a variety of text-based file types, such as PDF, Word, PPT, and TXT.
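
The sketch below walks through a much-simplified version of steps 2, 3, 6, and 7: whitespace tokenization stands in for real word segmentation, a tiny stop-word list stands in for the real one, and the toy pages are invented for illustration.

```python
from collections import defaultdict

# Toy corpus: page id -> extracted text (step 1 is assumed already done).
pages = {
    1: "SEO is search engine optimization for a blog",
    2: "a Hexo blog needs SEO for search traffic",
}

STOP_WORDS = {"is", "a", "for", "the"}   # tiny stop-word list for illustration

def tokenize(text):
    # Stand-in for real word segmentation (Chinese segmentation is much harder).
    return [w for w in text.lower().split() if w not in STOP_WORDS]

# Forward index: page -> {keyword: frequency}
forward_index = {}
for page_id, text in pages.items():
    freq = defaultdict(int)
    for word in tokenize(text):
        freq[word] += 1
    forward_index[page_id] = dict(freq)

# Inverted index: keyword -> {page: frequency}, rebuilt from the forward index.
inverted_index = defaultdict(dict)
for page_id, freqs in forward_index.items():
    for word, count in freqs.items():
        inverted_index[word][page_id] = count

print(inverted_index["seo"])   # {1: 1, 2: 1} -> every page containing "seo"
```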

Step 3: Ranking

This step processes the user's input: based on the keywords the user enters, the ranking program calls on the index, computes the ranking, and displays the results to the user. The process breaks down into the following steps:

Search term processing

The keywords the user enters are processed: word segmentation, stop word removal, search instruction handling, and so on.
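
A toy sketch of search term processing might look like this; treating a leading "-" as an exclude instruction is only an assumed example of what search instruction handling can mean, not how any particular engine does it.

```python
STOP_WORDS = {"is", "a", "for", "the"}

def process_query(raw_query):
    """Toy query processing: split the query into terms, drop stop words,
    and treat a leading '-' as an exclude instruction (assumed example)."""
    include, exclude = [], []
    for term in raw_query.lower().split():
        if term.startswith("-") and len(term) > 1:
            exclude.append(term[1:])
        elif term not in STOP_WORDS:
            include.append(term)
    return include, exclude

print(process_query("the Hexo blog SEO -wordpress"))
# (['hexo', 'blog', 'seo'], ['wordpress'])
```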

File matching

Using the keywords, find all the files that match them.
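
In terms of the inverted index built during preprocessing, file matching boils down to intersecting posting lists, as in this toy sketch (the index contents are invented):

```python
# Toy inverted index: keyword -> set of page ids that contain it.
inverted_index = {
    "hexo": {1, 3, 7},
    "blog": {1, 2, 3},
    "seo":  {1, 3, 9},
}

def match_files(terms):
    """Pages that contain every query term: intersect the posting lists."""
    postings = [inverted_index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(match_files(["hexo", "blog", "seo"]))   # {1, 3}
```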

Initial subset selection

The file-matching stage produces an enormous number of files, far too many to display or fully rank, so a subset must first be selected on the basis of page weight.

Relevance calculation

With the subset selected, the relevance of the pages in it has to be calculated. Relevance calculation is the most important step of the ranking process. The main factors affecting relevance are listed below (a toy scoring sketch follows the list):

  1. How common the keyword is. Words used very frequently across all content contribute less to the meaning of the search term; rarer words contribute more.
  2. Word frequency and density. In the absence of keyword stuffing, the more often keywords appear on a page and the higher their density, the greater the relevance. The importance of this factor, however, keeps decreasing.
  3. Keyword position and format. As mentioned in the indexing section, a page's title tag, bold text, and H1 are the important positions.
  4. Keyword proximity. A contiguous, complete match of the segmented keywords indicates the strongest relevance. For example, if you search for "Hexo blog SEO" and the phrase "Hexo blog SEO" appears on a page consecutively and in full, that page is the most relevant.
  5. Link analysis and page weight. Beyond the factors on the page itself, the links and weight relationships between pages also affect keyword relevance, the most important being anchor text: the more inbound links a page has whose anchor text is the search term, the more relevant the page.
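
The toy scoring sketch below combines a few of these factors: rarer terms weigh more, body frequency has diminishing returns, and title hits and page weight add a bonus. Every coefficient here is made up for illustration; real ranking uses vastly more signals.

```python
import math

def relevance(page, query_terms):
    """Toy relevance score: rarity-weighted term frequency with diminishing
    returns, plus a title bonus, scaled by page weight (all made-up numbers)."""
    score = 0.0
    for term in query_terms:
        tf = page["body"].lower().split().count(term)
        in_title = term in page["title"].lower().split()
        rarity = page["rarity"].get(term, 1.0)        # higher = rarer term
        score += rarity * (math.log1p(tf) + (2.0 if in_title else 0.0))
    return score * page["weight"]                     # page/link weight factor

page = {
    "title": "Hexo blog SEO guide",
    "body": "seo tips for a hexo blog seo checklist",
    "rarity": {"hexo": 3.0, "seo": 2.0, "blog": 1.0},
    "weight": 1.2,
}
print(round(relevance(page, ["hexo", "seo"]), 2))     # 17.13
```
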
Ranking filtering and adjustment

After the steps above there is a rough ranking. The search engine may then run filtering algorithms that adjust it slightly; the most important filtering is the application of penalties, which push down pages that cheat.

Displaying rankings

Once all the rankings are determined, the ranking program takes the title, the description meta tag, and other information from the original pages and displays them on the results page.

Search cache

User searches consist largely of repeats, so a portion of search results is cached.

Query and click log

The search engine records the user's IP address, search keywords, search time, and which result was clicked, and assembles them into search statistics logs. These logs help the search engine judge the quality of its results, tune its algorithms, and predict search trends.

Following the three steps above should give you a deeper understanding of how search engines work, and will make the next posts on on-site optimization and blog optimization in practice easier to follow.