
What is a crawler

Crawler: a program that automatically fetches web page data. Under the hood, a search engine is essentially a crawler. See Baidu Encyclopedia: Web crawler.

Here’s what we need to learn about Python crawlers:

  1. Python syntax (the basics)
  2. HTML page content fetching (data acquisition)
  3. HTML page data extraction (data cleaning)
  4. The Scrapy framework and the scrapy-redis distributed strategy (third-party frameworks)
  5. Crawler (Spider), anti-crawler (Anti-Spider), anti-anti-crawler (Anti-Anti-Spider)…

Why we need crawlers

Let's first consider a question. The mobile Internet era has passed; just look at the current iOS and Android markets. Now is the era of big data. Everyone keeps talking about "big data", but in this so-called big data era, in what ways can data actually be obtained? Baidu search and shopping on Taobao are the ways that come to mind immediately, so let's list the main channels:

  1. User data generated by enterprises: large Internet companies have huge user bases, so they have a natural advantage in accumulating data. Data-conscious SMEs have also begun to accumulate data. (Baidu Index, Ali Index, Sina Weibo Index)

  2. Data management/consulting companies: such companies usually have large data collection teams and gather data through market research, questionnaires, fixed-sample testing, cooperation with companies in various industries, and expert interviews, eventually producing research results after years of data accumulation. (McKinsey, iResearch)

  3. Public data provided by governments/institutions: governments aggregate the statistics reported by local governments; institutions here means authoritative third-party organizations. (National Bureau of Statistics of China, the World Bank, the United Nations)

  4. Purchase from third-party data platforms: the data needed by various industries can be bought on data trading platforms, at prices that depend on how hard the data is to obtain. (DataTang, the Guoyun data market.) Guiyang, the capital of Guizhou and one of China's big data hubs, is home to the Guiyang Big Data Exchange, which enjoys a geographical advantage.

  5. Crawled data: if the market does not have the data we need, or the price is too high, we can hire a crawler engineer (or become one) and collect the data from the Internet ourselves. (Search for "crawler engineer" job postings online.)

How does a crawler fetch web page data

Any web page we open in a browser in fact has the following three general characteristics.

Three features:

  1. Every web page has its own unique URL (Uniform Resource Locator) that locates it (see the small example after this list).
  2. Web pages use HTML (Hypertext Markup Language) to describe page content.
  3. Web pages use the HTTP/HTTPS protocol (Hypertext Transfer Protocol) to transfer HTML data.
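As a quick illustration of feature 1 (and of the HTTPS scheme from feature 3), here is a minimal sketch that splits an example URL into its parts with the standard library's urllib.parse; the URL itself is just a placeholder.

```python
# Minimal sketch: the parts of a URL, using the Python standard library.
# The URL below is only an example.
from urllib.parse import urlparse

parts = urlparse("https://www.baidu.com/s?wd=python#results")
print(parts.scheme)    # 'https'         -> the protocol (feature 3: HTTP/HTTPS)
print(parts.netloc)    # 'www.baidu.com' -> the host
print(parts.path)      # '/s'
print(parts.query)     # 'wd=python'
print(parts.fragment)  # 'results'
```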

The design idea of a crawler:

  1. First determine the URL address of the web page to be crawled.
  2. Fetch the corresponding HTML page over the HTTP/HTTPS protocol.
  3. Extract the useful data from the HTML page: save the data we need; for any other URLs found in the page, go back to step 2 (a minimal sketch of this loop follows the list).
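A minimal sketch of that three-step loop, assuming the third-party requests library is installed; the seed URL is a placeholder and the "useful data" extracted here is just the page title.

```python
import re
from collections import deque

import requests  # third-party HTTP library, assumed installed

to_crawl = deque(["https://example.com/"])  # step 1: URL(s) to crawl (placeholder seed)
crawled = set()                             # URLs already visited

while to_crawl:
    url = to_crawl.popleft()
    if url in crawled:
        continue
    resp = requests.get(url, timeout=10)    # step 2: fetch the HTML over HTTP/HTTPS
    crawled.add(url)
    html = resp.text

    # step 3a: extract "useful data" -- here, just the <title> as a demo
    title = re.search(r"<title>(.*?)</title>", html, re.S)
    if title:
        print(url, "->", title.group(1).strip())

    # step 3b: any other URLs found in the page go back to step 2
    for link in re.findall(r'href="(https?://[^"]+)"', html):
        if link not in crawled:
            to_crawl.append(link)
```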

Can only Python be used for crawlers?

First, a correction: both so-called big data analysis and crawlers can also be implemented in other programming languages, such as PHP, Java, C/C++, and so on.

So why do we choose Python? Let's compare these languages:

  1. PHP may be "the best language in the world", but it was not born for this job: its multithreading and asynchronous support are not good enough, so its concurrent processing ability is weak. A crawler is a tool program with relatively high demands on speed and efficiency.
  2. Java also has a well-developed web crawler ecosystem and is Python's biggest competitor for crawlers. But the Java language itself is clunky and verbose; refactoring costs are high, and any change leads to a lot of code changes, while crawlers often need to modify parts of their collection code.
  3. C/C++ has close to the best runtime efficiency and performance, but it is expensive to learn and slow to write. Building a crawler in C/C++ shows off one's ability, but it is not the right choice: overkill.
  4. Python has elegant syntax, concise code, and high development efficiency, and it supports many modules, including rich HTTP request modules and HTML parsing modules, plus the powerful Scrapy crawler framework and the mature, efficient scrapy-redis distributed strategy. It is also very convenient for calling other interfaces (a "glue language"). A short example of how concise this can be follows the list.
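To illustrate point 4, here is a small sketch of fetching and parsing a page in just a few lines, assuming the third-party requests and beautifulsoup4 packages are installed; the URL is a placeholder.

```python
import requests                 # HTTP request module (third-party)
from bs4 import BeautifulSoup   # HTML parsing module (third-party, beautifulsoup4)

resp = requests.get("https://example.com/", timeout=10)   # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")

print(soup.title.string)                  # print the page <title>
for a in soup.find_all("a", href=True):   # list every link on the page
    print(a["href"])
```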

Classification of crawlers

By usage scenario, crawlers are divided into general-purpose crawlers and focused crawlers.

1. General-purpose crawler: the crawler system used by search engines.

Goal:

To download as many of the web pages on the Internet as possible and keep copies on local servers, then process those pages (extract keywords, remove advertisements), and finally provide users with a search interface.

Crawling process:

  1. First select a subset of existing URLs and place them on the to-be-crawled queue.

  2. Take these URLs off the queue, resolve DNS to get the host IP, download the HTML page from that server, and save it to the search engine's local servers. The crawled URL is then placed in the crawled queue.

  3. Analyze the page content, find the other URL links in the page, and repeat from step 2 until the stop condition is met.

How do search engines get the URLs of new websites

  1. The site owner actively submits the site to the search engine, e.g. Baidu's link submission page (http://zhanzhang.baidu.com/linksubmit/url).
  2. Links to the new site are set up on other websites.
  3. The search engine cooperates with DNS service providers to index new websites quickly. (DNS is the technology that resolves domain names into IP addresses. For example, running ping www.baidu.com in CMD returns Baidu's IP, and entering that IP directly in the browser can generally reach Baidu; a small demonstration follows this list.)
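A tiny demonstration of that DNS idea, using only the Python standard library; the domain is the example from the text.

```python
import socket

# Resolve a domain name to an IP address, the same thing `ping` shows in CMD.
ip = socket.gethostbyname("www.baidu.com")
print(ip)  # entering this IP in the browser can generally reach Baidu as well
```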

A general-purpose crawler cannot just crawl whatever it wants; it has to follow rules:

The Robots protocol: this protocol specifies which pages a general-purpose crawler is allowed to access. robots.txt is only a suggestion; not all crawlers comply, and generally only the crawlers of the large search engines do. A crawler we write ourselves does not have to care. It is like the priority seats on a bus for the sick and pregnant: a courtesy, not an enforced rule.

Here is a brief introduction to the Robots protocol (also called the crawler protocol, robot protocol, etc.); its full name is the Robots Exclusion Protocol. Through it, websites tell search engines which pages may be crawled and which may not, for example:

Taobao: https://www.taobao.com/robots.txt

Tencent: http://www.qq.com/robots.txt
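A small sketch of checking a robots.txt file from Python, using the standard library's urllib.robotparser; the URLs are the examples above, and the user agent strings and paths are just illustrative.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.taobao.com/robots.txt")
rp.read()  # download and parse the robots.txt rules

# can_fetch(user_agent, url) asks whether that user agent may crawl the URL
print(rp.can_fetch("Baiduspider", "https://www.taobao.com/article"))
print(rp.can_fetch("*", "https://www.taobao.com/"))
```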

General-purpose crawler workflow:

Crawl web pages – store data – content processing – provide search/ranking services

Search engine rankings:

  1. PageRank: based on a website's traffic (clicks/views/popularity); the higher the traffic, the more valuable the site is considered and the higher it ranks.
  2. Bidding ranking: whoever pays more ranks higher.

Disadvantages of general-purpose crawlers

  1. General-purpose search engines return whole pages, and in most cases 90% of the content on those pages is useless to the user.
  2. Users from different fields and backgrounds often have different retrieval purposes and needs, and general search engines cannot provide results tailored to a specific user.
  3. As Web data has become richer and network technology has developed, large amounts of data in different forms (images, databases, audio, video and other multimedia) have appeared; general search engines handle these files poorly and cannot discover or retrieve them well.
  4. Most general search engines offer keyword-based retrieval, which struggles with queries based on semantic information and cannot accurately understand users' specific needs.

To solve these problems, the focused crawler emerged.

2. Focused crawler: a crawler written by a crawler developer for specific content.

Also called a topic-oriented or requirement-oriented crawler: it crawls information about specific content and keeps that information as relevant to the requirement as possible. What we will learn later is the focused crawler.

The techniques you now need to master for crawlers:

  1. Basic Python syntax
  2. How to fetch HTML pages: HTTP request handling with urllib, urllib2, requests
  3. How to parse the server's response: re, XPath, BeautifulSoup4 (bs4), jsonpath, pyquery, etc.
  4. How to collect dynamic HTML and handle CAPTCHAs: for general dynamic pages, Selenium + PhantomJS (headless) simulates a real browser loading JS, AJAX and other non-static page data; Tesseract, a machine-learning based image recognition system, can handle simple CAPTCHAs, while complex CAPTCHAs are entered manually or sent to a dedicated captcha-solving platform.
  5. Crawler frameworks: Scrapy, PySpider (a minimal Scrapy sketch follows this list)
  6. The scrapy-redis distributed strategy
  7. The struggle between crawler, anti-crawler and anti-anti-crawler (in the end the crawler is bound to win, because a crawler only simulates a user's actions: as long as your site lets users see the content, a crawler can scrape it)
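To give item 5 some shape, here is a minimal, hedged sketch of a Scrapy spider; the spider name, start URL and selectors are placeholders, not a definitive implementation.

```python
import scrapy


class TitleSpider(scrapy.Spider):
    name = "title_spider"                    # placeholder spider name
    start_urls = ["https://example.com/"]    # placeholder start URL

    def parse(self, response):
        # Extract the page title as a demo "item".
        yield {"url": response.url, "title": response.css("title::text").get()}

        # Follow the links found on the page (a focused crawler would filter these).
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Saved as, say, title_spider.py, it can be run with scrapy runspider title_spider.py -o titles.json, and Scrapy handles the scheduling, deduplication and concurrency that we wrote by hand in the earlier sketch.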