Purpose

Learn how to get data from the Internet. One of the must-have skills of data science.

The third-party libraries used in this article are requests, Parsel, and Selenium.

requests is responsible for sending HTTP requests and getting the response, Parsel for parsing the response text, and Selenium for rendering JavaScript.

What is a web crawler

A web crawler is a program or script that automatically scrapes website information according to certain rules.

How to crawl site information

Before writing a crawler, we must make sure that we can actually crawl the target site's information.

To do that, we first need to answer three questions:

  1. Does the site already provide an API?
  2. Is the site static or dynamic?
  3. Does the site have anti-crawling countermeasures?

Scenario 1: Site with open API

If a site has its API open, you can GET its JSON data directly.

xkcd, for example, provides an API that lets you download comic data directly:

import requests
requests.get('https://xkcd.com/614/info.0.json').json()
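
The returned dictionary can then be used like any other. A minimal follow-up, assuming the usual xkcd JSON fields such as title and img:

import requests

data = requests.get('https://xkcd.com/614/info.0.json').json()
print(data['title'])  # comic title
print(data['img'])    # direct URL of the comic image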

So how do you tell if a site has an open API? There are three ways:

  1. Look for an API entry point on the site itself
  2. Use a search engine to search for “site name + API”
  3. Capture packets. Some sites load their data via Ajax (such as Guokr’s waterfall-style articles), but you can still capture the JSON data in the XHR requests.

How to capture packets: press F12, open the Network tab, and refresh the page with F5. Tools such as Fiddler also work.
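
Once you have found the Ajax endpoint in the Network panel, fetching it is no different from Scenario 1. A minimal sketch, using a hypothetical endpoint copied from the XHR tab:

import requests

# hypothetical endpoint copied from the DevTools Network panel (XHR tab)
api = 'https://example.com/api/articles'
res = requests.get(api, params={'page': 1})
data = res.json()  # the same JSON the page loads via Ajax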

Scenario 2: A site with no open API

If the site is a static page, you can use the requests library to send requests and an HTML parsing library (lxml, Parsel, etc.) to parse the response text.

For the parsing library, I strongly recommend Parsel: it uses CSS selector syntax, it is fast, and Scrapy uses it under the hood.

You need to understand CSS selector syntax (XPath works too) and learn to inspect elements in the browser's developer tools.
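
As a quick illustration, here is the same attribute extracted with a CSS selector and with the equivalent XPath (a minimal sketch on an inline HTML snippet):

from parsel import Selector

html = '<div><a class="directlink" href="https://example.com/full.png">img</a></div>'
tree = Selector(text=html)
tree.css('a.directlink::attr(href)').get()          # CSS selector syntax
tree.xpath('//a[@class="directlink"]/@href').get()  # equivalent XPath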

For example, to get all the original image links on Konachan:

import requests
from parsel import Selector

res = requests.get('https://konachan.com/post')
tree = Selector(text=res.text)
imgs = tree.css('a.directlink::attr(href)').extract()  # hrefs of the original images
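
Continuing from the snippet above, saving the images is just another round of requests (the filename handling here is my own assumption, not part of the original):

import os

for url in imgs:
    filename = os.path.basename(url)  # naive filename choice, adjust as needed
    with open(filename, 'wb') as f:
        f.write(requests.get(url).content)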

If the site is a dynamic page, use Selenium to render the JavaScript first, then parse the driver’s page_source with the HTML parsing library.

For example, to get data from hitomi.la (Chrome runs in headless mode here):

from parsel import Selector
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get('https://hitomi.la/type/gamecg-all-1.html')
tree = Selector(text=driver.page_source)  # parse the rendered HTML
gallery_content = tree.css('.gallery-content > div')

Scenario 3: Sites with anti-crawling measures

At present, common anti-crawling measures include CAPTCHAs, mandatory login, and IP blocking.

CAPTCHA: crack it with a captcha-solving service (or, if that fails, train your own model on captcha images with OpenCV or Keras).

Login: simulate the login with a requests POST, or simulate the user with Selenium.
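
A minimal sketch of the requests approach, with a hypothetical login endpoint and form fields (inspect the real login form to find them); a Session keeps the cookies for later requests:

import requests

session = requests.Session()
# hypothetical login URL and form field names
session.post('https://example.com/login',
             data={'username': 'me', 'password': 'secret'})
res = session.get('https://example.com/protected')  # sent with the login cookies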

IP blocking: buy proxies (free proxies generally don't work) and pass them to requests via the proxies parameter.
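
Passing proxies to requests looks like this (the proxy address is a placeholder):

import requests

proxies = {
    'http': 'http://12.34.56.78:8888',   # placeholder proxy address
    'https': 'http://12.34.56.78:8888',
}
res = requests.get('https://example.com', proxies=proxies, timeout=10)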

Other anti-crawling countermeasures: spoof the User-Agent and disable cookies.

fake-useragent is recommended for spoofing the User-Agent:

from fake_useragent import UserAgent
headers = {'User-Agent': UserAgent().random}
res = requests.get(url, headers=headers)

How to write a structured crawler

If you can successfully crawl a site's information, then you are already more than halfway there.

The crawler structure is simple: build a task list, then call the crawl function on each task in it.

For dynamic pages whose URL never changes, consider packet capture first rather than clicking through pages with Selenium. If you need speed, consider libraries such as concurrent.futures or asyncio.

import requests
from parsel import Selector
from concurrent import futures

domain = 'https://www.doutula.com'

def crawl(url):
    res = requests.get(url)
    tree = Selector(text=res.text)
    imgs = tree.css('img.lazy::attr(data-original)').extract()
    # save the imgs ...

if __name__ == '__main__':
    tasklist = [f'{domain}/article/list/?page={i}' for i in range(1, 551)]
    with futures.ThreadPoolExecutor(50) as executor:
        executor.map(crawl, tasklist)
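
If you prefer the asyncio route, the same crawler can be sketched with aiohttp (aiohttp is not used in the original article; this is just an illustrative alternative):

import asyncio
import aiohttp
from parsel import Selector

domain = 'https://www.doutula.com'

async def crawl(session, url):
    async with session.get(url) as res:
        tree = Selector(text=await res.text())
    imgs = tree.css('img.lazy::attr(data-original)').extract()
    # save the imgs ...

async def main():
    tasklist = [f'{domain}/article/list/?page={i}' for i in range(1, 551)]
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(crawl(session, url) for url in tasklist))

asyncio.run(main())

In practice you would also cap the concurrency (for example with asyncio.Semaphore) instead of firing all 550 requests at once.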

For data storage, it depends on your needs; data is generally stored in a database, and you only need to be familiar with the corresponding driver:

pymysql (MySQL), pymongo (MongoDB)
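
For example, the extracted image links could be dumped into MongoDB with pymongo (the database and collection names here are made up for illustration):

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017')
collection = client['crawler']['images']      # hypothetical database/collection names
imgs = ['https://example.com/1.jpg']          # e.g. the links extracted by crawl()
collection.insert_many([{'url': url} for url in imgs])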

If you need a framework

At this point in the article, you should have a clear idea of the basic structure of a crawler, so you can move on to a framework.

Lightweight framework (looter): github.com/alphardex/l…

Scrapy: github.com/scrapy/scra…


About the author

Alphardex, Pythonista && Otaku, a surveyor trying to switch careers.

Address: zhihu.com/people/ban-zai-liu-shang
