How the Google spider crawls: the working principles of Googlebot

    • What is a crawler?
    • How does the spider work?
    • How does a spider view a page?
      • Mobile and desktop rendering
      • HTML and JavaScript rendering
    • What affects crawler behavior?
      • Internal links and backlinks
      • Click depth
    • Sitemap
    • Indexing instructions
    • Are all pages crawlable?
    • When will my site show up in a search?
    • Duplicate content problems
    • URL structure problems
    • Conclusion

First, the Google spider looks for new pages. Google then indexes these pages to understand their content and ranks them based on the retrieved data. Crawling and indexing are two different processes, but they are both performed by spiders.

What is a crawler?

Crawlers (also known as search robots or spiders) are software that Google and other search engines use to scan web pages. Simply put, a crawler “crawls” from page to page, looking for content that Google has not yet added to its database or that has changed since the last visit.

Every search engine has its own spiders. As for Google, there are more than 15 different types of crawlers, and Google’s main crawler is called Googlebot. Googlebot performs both crawling and indexing, so let’s take a closer look at how it works.

How does the spider work?

There is no central registry of URLs that gets updated every time a new page is created. This means Google is not automatically “alerted” to new pages; it has to find them on the web. Googlebot is constantly roaming the Internet, searching for new pages and adding them to Google’s database of existing pages.

Once Googlebot finds a new page, it renders (visualizes) the page in the browser, loading all HTML, third-party code, JavaScript, and CSS. This information is stored in the search engine’s database, which is then used to index and rank pages. If a page has been indexed, it is added to The Google Index – a super-giant Google database.

How does a spider view a page?

The spider renders a page in the latest version of the Chromium browser (the engine behind Google Chrome). In a perfect scenario, the crawler “sees” a page the same way you designed and assembled it. In reality, things can be more complicated.

Mobile and desktop rendering

Googlebot “sees” your pages with two subtypes of crawlers: Googlebot Desktop and Googlebot Smartphone. This split is needed to index pages for both the desktop and the mobile SERPs.

A few years ago, Google used the desktop spider to access and render most pages. But that changed with the introduction of mobile-first indexing. Google decided that the web had become mobile-friendly enough and began using Googlebot Smartphone to crawl, index, and rank the mobile versions of sites for both the mobile and desktop SERPs.

However, implementing mobile-first indexing turned out to be harder than expected. The Internet is huge, and most sites appear to be poorly optimized for mobile devices. This made Google apply the mobile-first concept to crawling and indexing new sites and those old sites that have become fully optimized for mobile. If a site is not mobile-friendly, it is crawled and rendered first by Googlebot Desktop.

Even if your site has been switched to mobile-first indexing, some of your pages will still be crawled by Googlebot Desktop, because Google wants to check how your site performs on desktop. Google doesn’t say directly that it will index your desktop version if it differs significantly from the mobile one. However, it is logical to assume so, since Google’s main goal is to provide users with the most useful information, and Google hardly wants to lose that information by blindly following the mobile-first concept.

Note: In any case, your site will be visited by both Googlebot Smartphone and Googlebot Desktop. So it’s important to take care of both versions of your site and consider using a responsive layout if you haven’t already done so.
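
If you go the responsive route, a reasonable starting point (just a minimal illustration, not a complete setup) is the viewport meta tag in the page’s <head>, which tells mobile browsers to scale the page to the device width:

<meta name="viewport" content="width=device-width, initial-scale=1">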

How do you know whether Google crawls and indexes your site with mobile-first in mind? You will receive a special notification in Google Search Console.

HTML and JavaScript rendering

Googlebot may have some problems handling and rendering cumbersome code. If your page code is cluttered, the crawler may not render it correctly and consider your page empty.

As for JavaScript rendering, keep in mind that JavaScript is a rapidly evolving language and Googlebot may sometimes fail to support its latest versions. Make sure your JS is compatible with Googlebot, or your page may render incorrectly.

Pay attention to your JS load time. If the script takes more than 5 seconds to load, Googlebot will not render and index the content generated by the script.

Note: If your site is filled with a lot of JS elements and you can’t live without them, Google recommends server-side rendering. This will make your site load faster and prevent JavaScript errors.

To see which resources on the page cause rendering problems (and to check whether you have any problems at all), log in to your Google Search Console account, go to URL Inspection, enter the URL you want to check, click the Test Live URL button, and then click “View Tested Page”.

Then go to the “More Info” section and open the Page resources and JavaScript console messages lists to see the resources Googlebot failed to render.



You can now show the site administrator a list of problems and ask them to investigate and fix the errors.

What affects crawler behavior?

Googlebot’s behavior isn’t chaotic — it’s determined by complex algorithms that help crawlers navigate the web and set rules for processing information.

However, the fact that this behavior is determined by algorithms doesn’t mean you can do nothing and hope for the best. Let’s take a closer look at what affects crawler behavior and how to optimize page crawling.

Internal links and backlinks

If Google already knows about your site, Googlebot periodically checks your home page for updates. Therefore, it is important to place links to new pages on authoritative pages of your site. Ideally, on the home page.

You can enrich your home page with a block featuring the latest news or blog posts, even if you have separate pages for news and the blog. This will let Googlebot find your new pages much faster. This advice may seem fairly obvious, but many site owners still ignore it, which leads to poor indexing and low positions in search.
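
For illustration, such a block can be as simple as a short list of links near the top of the home page HTML (the URLs and titles below are hypothetical):

<section>
  <h2>Latest from the blog</h2>
  <ul>
    <li><a href="/blog/how-googlebot-works/">How Googlebot works</a></li>
    <li><a href="/blog/crawl-budget-basics/">Crawl budget basics</a></li>
  </ul>
</section>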

In terms of crawling, backlinks work the same way. So, if you add a new page, don’t forget external promotions. You can try guest posts, launch an advertising campaign, or try any other way to get Googlebot to view the URL of a new page.

Note: Links should be dofollow so that Googlebot can follow them. Although Google recently said that nofollow links can also be treated as a hint for crawling and indexing, we still recommend using dofollow links for pages you want crawled, just to make sure the crawler actually sees the page.
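
To make the difference concrete, here is what the two kinds of links look like in HTML (example.com is a placeholder):

<!-- Regular ("dofollow") link: Googlebot will follow it -->
<a href="https://example.com/new-page/">New page</a>

<!-- Nofollow link: treated only as a hint for crawling and indexing -->
<a href="https://example.com/new-page/" rel="nofollow">New page</a>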

Click depth

Click depth shows how far a page is from the home page. Ideally, any page of the site should be reachable within 3 clicks. Greater click depth slows down crawling and hardly benefits the user experience.
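
As a quick illustration with made-up URLs, click depth is counted as the minimum number of clicks needed to reach a page from the home page:

example.com/                               -> home page, depth 0
example.com/blog/                          -> 1 click from home
example.com/blog/seo/                      -> 2 clicks
example.com/blog/seo/googlebot-explained/  -> 3 clicks (already at the limit)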

You can use a site audit tool to check whether your site has problems with click depth. Crawl the site with the tool, go to Site Structure > Pages, and pay attention to the Click depth column.



If you see that important pages are too far from the home page, reconsider your site structure. A good structure should be simple and scalable, so you can add as many new pages as you need without hurting that simplicity.

Sitemap

A sitemap is a document that contains the full list of pages you want Google to know about. You can submit the sitemap to Google through Google Search Console (Index > Sitemaps) to let Googlebot know which pages to visit and crawl. A sitemap also tells Google whether there are any updates on your pages.

Note: A sitemap does not guarantee that Googlebot will use it when crawling your site. Crawlers can ignore your sitemap and keep crawling the site however they decide. Still, no one has ever been penalized for having a sitemap, and in most cases it proves useful. Some CMSs even generate a sitemap automatically, update it, and submit it to Google, making your SEO process faster and easier. Consider submitting a sitemap if your site is new or large (has more than 500 URLs).
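
For reference, a minimal XML sitemap follows the sitemaps.org protocol and lists one <url> entry per page; the URLs and the date below are placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/new-post/</loc>
    <lastmod>2021-06-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/about/</loc>
  </url>
</urlset>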

Indexing instructions

When crawling and indexing your pages, Google follows certain instructions, such as robots.txt, the noindex tag, the robots meta tag, and the X-Robots-Tag.

Robots.txt is a file in the site’s root directory that restricts some pages or content elements from Google. Once Googlebot discovers your page, it looks at the robots.txt file. If the page is restricted from crawling by robots.txt, Googlebot stops crawling and loading any content and scripts from that page, and the page will not show up in search.
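
A small robots.txt sketch, with hypothetical paths, looks like this; the file sits at the root of the site (e.g. example.com/robots.txt):

# Applies to all crawlers, including Googlebot
User-agent: *
Disallow: /admin/
Disallow: /internal-search/

# Optional, but helps crawlers find your sitemap
Sitemap: https://example.com/sitemap.xml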

The noindex tag, robots meta tag, and X-Robots-Tag are tags used to restrict crawlers from crawling or indexing a page. The noindex tag restricts all types of crawlers from indexing the page. The robots meta tag is used to specify how a particular page should be crawled and indexed; this means you can keep certain types of crawlers from accessing the page while leaving it open to others. The X-Robots-Tag can be used as an element of the HTTP header response to restrict how crawlers index or browse the page. This tag lets you target individual types of crawling robots (if specified); if no robot type is specified, the instruction applies to all types of crawlers.
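
In practice, the meta tags go into the page’s <head>, while the X-Robots-Tag is sent as an HTTP response header. A minimal illustration:

<!-- Keep all crawlers from indexing this page -->
<meta name="robots" content="noindex">

<!-- Target only Googlebot and also ask it not to follow links on the page -->
<meta name="googlebot" content="noindex, nofollow">

And the header equivalent, sent by the server alongside the page:

X-Robots-Tag: noindex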

Note: the robots.txt file does not guarantee that a page will be excluded from the index. Googlebot treats this document as a suggestion rather than an order. This means Google can ignore robots.txt and index a page for search. If you want to ensure that pages are not indexed, use the Noindex tag.

Are all pages crawlable?

No, not all of them. Some pages may be unavailable for crawling and indexing. Let’s take a closer look at these types of pages:

  • Password-protected pages. Googlebot simulates the behavior of an anonymous user who has no credentials to reach protected pages. So if a page is protected with a password, it won’t be crawled, because Googlebot won’t be able to access it.
  • Pages excluded by indexing instructions. These are pages restricted by robots.txt, the noindex tag, the robots meta tag, or the X-Robots-Tag.
  • Orphan pages. An orphan page is a page that is not linked to from any other page of the site. Googlebot is a spider, meaning it discovers new pages by following the links it finds. If there are no links to a page, the page will not be crawled and will not show up in search.

Some pages are deliberately restricted from crawling and indexing. These are usually pages that are not meant to show up in search: pages with personal data, policies, terms of use, test versions of pages, archived pages, internal search results pages, and so on.

However, if you want your pages to be crawlable and to bring traffic, make sure you don’t protect public pages with passwords, take care of linking (internal and external), and double-check the indexing instructions.

To check the crawlability of your pages in Google Search Console, go to the Index > Coverage report. Pay attention to pages flagged Error (not indexed) and Valid with warnings (indexed, though with problems).

Note: If you don’t want Googlebot to find or update any pages (for example, old pages or pages you no longer need), remove them from the sitemap, set a 404 Not Found status for them, or mark them with the noindex tag.

When will my site show up in a search?

Obviously, your pages will not show up in search immediately after you build your site. If your site is brand new, Googlebot will need some time to find it on the web. Keep in mind that in some cases this “some time” can take up to six months.

If Google already knows about your site and you make updates or add new pages, how quickly the changes appear in search depends on your crawl budget.

Crawl budget is the amount of resources Google spends on crawling your site. The more resources Googlebot needs, the slower your pages make it into search.

Crawl budget allocation depends on the following factors:

  • Site popularity. The more popular a site is, the more crawling resources Google is willing to spend on it.
  • Update rate. The more often you update your pages, the more crawling resources your site gets.
  • Number of pages. The more pages you have, the bigger your crawl budget.
  • Server capacity to handle crawling. Your hosting server must be able to respond to crawler requests on time.

Note that the crawl budget is not used equally for every page, because some pages consume more resources (because JavaScript and CSS are too heavy, or because HTML is disorganized). Therefore, the allocated crawl budget may not be enough to crawl all the pages as quickly as you might expect.

In addition to serious code problems, some of the most common causes of poor crawling and an irrationally spent crawl budget are duplicate content problems and poorly structured URLs.

Duplicate content problems

Duplicate content means there are several pages with mostly similar content. This can happen for a number of reasons, such as:

  • The page can be reached in different ways: with or without www, via HTTP or HTTPS (see the example URLs after this list);
  • Dynamic URLs, when many different URLs lead to the same page;
  • A/B testing of page versions.
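
For example, all of the following hypothetical addresses may return exactly the same page, yet look like four different URLs to a crawler:

http://example.com/pickles/
https://example.com/pickles/
https://www.example.com/pickles/
https://www.example.com/pickles/?sessionid=12345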

If not fixed, duplicate content can cause Googlebot to crawl the same page several times, because it considers them to be different pages. Crawling resources are thus wasted in vain, and Googlebot may fail to find other meaningful pages of your site. Besides, duplicate content lowers the positions of the pages in search, since Google may decide that the overall quality of your site is lower.

The truth is that in most cases you can’t get rid of the things that may cause duplicate content. However, you can prevent duplicate content problems by setting up canonical URLs. The canonical tag signals which page should be considered the “master” one, so the rest of the URLs pointing to the same content will not be indexed and your content will not be seen as duplicated. You can also restrict robots from crawling dynamic URLs with the help of the robots.txt file.
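
A minimal sketch of the canonical tag, with a placeholder URL: it goes into the <head> of every duplicate or variant page and points to the version you want indexed.

<link rel="canonical" href="https://www.example.com/pickles/">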

URL structure problems

Both humans and machine algorithms appreciate user-friendly URLs, and Googlebot is no exception. Googlebot can get confused trying to understand long, parameter-rich URLs, and more crawling resources get spent as a result. To prevent this, make your URLs user-friendly.

Make sure your URLs are clear, follow a logical structure, have proper punctuation, and don’t include complex parameters. In other words, your URL should look like this:

http://example.com/vegetables/cucumbers/pickles
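
For contrast, a hypothetical parameter-heavy URL of the kind that confuses crawlers and wastes crawl budget might look like this:

http://example.com/index.php?id=45&cat=7&sessionid=a84jd72h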

But to be honest, you only need to worry about all of this if you own a large site (1 million+ pages) or a medium-sized site (10,000+ pages) whose content changes frequently (daily or weekly). In other cases, you just need to optimize your site properly for search and fix indexing problems on time.

Conclusion

Google’s main crawler, Googlebot, operates under complex algorithms, but you can still “navigate” its behavior to the benefit of your site. In addition, most crawling process optimization steps repeat the standard SEO steps we’re all familiar with.


Putting this together wasn’t easy, so don’t forget to ❤ or 📑 to show your support!