1. Introduction

Crawling has always been one of Python's big use cases. You can write a crawler in almost any language, but programmers love Python for its concise syntax, which makes writing a crawler very easy. This article introduces Python crawlers through a few very simple examples.

2. Web crawler

If the Internet is a vast spider web, then a crawler is a spider that we send crawling across the web in search of valuable prey. A web crawler is built on top of the network, so its foundation is the network request. In daily life we browse the web with a browser: we enter an address in the URL bar, press Enter, and a page appears a few seconds later.

We appear to click a few buttons, but the browser actually does several things for us, including the following:

  1. The browser sends a network request to the server
  2. The server receives and processes the request
  3. The server returns the requested data
  4. The browser parses the data and presents it as a web page

We can compare the above process to everyday shopping:

  1. You tell the shopkeeper you want a bubble tea
  2. The shopkeeper checks whether the shop has what you want
  3. The shopkeeper takes out the ingredients for the bubble tea
  4. The shopkeeper makes the ingredients into a bubble tea and hands it to you

Although the bubble tea example above is not a perfect fit, I think it explains well what a network request is.

Now that we know what a network request is, let's look at what a crawler is. A crawler is, in essence, also a network request: usually we make requests through a browser, while a crawler simulates the process in a program. But a bare network request is not quite a crawler; a crawler usually has a purpose. For example, if I want to write a crawler that downloads pictures, I need to filter and match the data I request to find what is valuable to me. The whole process, from network request to data extraction, makes up a complete crawler.

Sometimes a website's anti-crawling measures are weak, and we can find its API directly in the browser's developer tools; obtaining the data we need through the API is much simpler.
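For example, once we have found such an API in the browser's developer tools, fetching it might look like the sketch below (the endpoint here is only a placeholder, not a real API):

import requests

# Hypothetical endpoint for illustration only; find the real one in the
# browser's developer tools (Network tab) while the page loads
api_url = 'https://example.com/api/articles?page=1'
response = requests.get(api_url)
# Many such APIs return JSON, which Requests can decode directly
data = response.json()
print(data)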

3. Simple crawlers

A simple crawler is just a simple network request, perhaps with some simple processing of the returned data. Python provides the built-in network request module urllib, as well as the third-party Requests module that wraps it. Requests is easier to use than urllib directly, so this article uses Requests for network requests.
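As a quick comparison, here is the same GET request written first with the built-in urllib and then with Requests; both work, but the Requests version is shorter and more readable:

import urllib.request
import requests

# With the standard-library urllib
with urllib.request.urlopen('http://www.baidu.com') as resp:
    html = resp.read()

# With Requests
response = requests.get('http://www.baidu.com')
html = response.content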

3.1 Crawl a simple web page

When we send a request, the data returned can take many forms, including HTML code, JSON data, XML data, and binary streams. Let’s take the Baidu home page as an example and crawl it:

import requests
# Send the request with the get method
response = requests.get('http://www.baidu.com')
# Open a file for binary writing
f = open('index.html', 'wb')
# Write the byte stream of the response to the file
f.write(response.content)
# Close the file
f.close()

Let’s open the crawled page and see what it looks like:

This is the familiar Baidu page, and it looks fairly complete. Other websites can give a very different result. Let’s take CSDN as an example:

You can see that the layout of the page is completely broken and a lot of things are missing. If you have learned some front-end development, you know that a web page is made up of an HTML document plus many static files. When we crawl, we only download the HTML code; static resources linked from the HTML, such as CSS styles and image files, are not fetched, which is why the page looks so strange.

3.2 Crawling pictures from web pages

First, let's be clear: when we crawl simple web pages, crawling pictures or videos means matching the URLs contained in the page, then downloading the resource from each specific URL. That completes the image crawl. Take the following URL: https://img-blog.csdnimg.cn/2020051614361339.jpg. The code to download the demo picture from this URL is:

import requests
# The image url
url = 'https://img-blog.csdnimg.cn/2020051614361339.jpg'
# Send a get request
response = requests.get(url)
# Open the image file for binary writing
f = open('test.jpg', 'wb')
# Write the byte stream to the file
f.write(response.content)
# Close the file
f.close()

As you can see, the code is the same as above except that the file is opened with a .jpg suffix. In fact, binary mode is the right choice for files such as pictures, videos, and audio, while for text such as HTML code we usually read response.text directly. Once we have the text, we can match the image URLs in it. Let’s take topit.pro as an example:

import re
import requests
# Site to crawl
url = 'http://topit.pro'
# Get the source code of the web page
response = requests.get(url)
# Match image resources in the source code
# (the exact pattern depends on the site's markup; a simple one is used here)
results = re.findall('<img src="(.*?)"', response.text)
# Counter used to name the downloaded files
name = 0
# Iterate over the results
for result in results:
    # The image paths in the source are absolute, so the full URL is site + path
    img_url = url + result
    # Download the image
    f = open(str(name) + '.jpg', 'wb')
    f.write(requests.get(img_url).content)
    f.close()
    name += 1

With that, we have crawled a whole site's images. For matching we used a regular expression. Regular expressions are a large topic, so we won't expand on them here; interested readers can look them up. Python's regex support comes from the re module, whose findall function matches all occurrences of a pattern in a text. It takes two parameters: the first is the regular expression, the second is the string to match. Even if you know nothing about regular expressions, it is enough to know that the pattern above extracts the src content from each image tag.
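To make the two parameters of findall concrete, here is a tiny self-contained example; the pattern is deliberately simple and would need adjusting for real-world markup:

import re

html = '<img src="/images/a.jpg"><img src="/images/b.jpg">'
# First parameter: the regular expression; second: the string to match.
# (.*?) is a non-greedy group that captures the text between the quotes.
results = re.findall('<img src="(.*?)"', html)
print(results)  # ['/images/a.jpg', '/images/b.jpg']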

4. Parse HTML using BeautifulSoup

BeautifulSoup is a module for parsing XML and HTML files. Above we used regular expressions for pattern matching, but writing regular expressions by hand is tedious and error-prone. Handing the parsing over to BeautifulSoup greatly reduces our workload, so let's install it before using it.

4.1 BeautifulSoup installation and simple use

We install it directly with pip:

pip install beautifulsoup4

Module import is as follows:

from bs4 import BeautifulSoup

Let’s take a look at BeautifulSoup in action, using the following HTML file:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <img class="test" src="1.jpg">
    <img class="test" src="2.jpg">
    <img class="test" src="3.jpg">
    <img class="test" src="4.jpg">
    <img class="test" src="5.jpg">
    <img class="test" src="6.jpg">
    <img class="test" src="7.jpg">
    <img class="test" src="8.jpg">
</body>
</html>

This is a very simple HTML page whose body contains 8 img tags. Now we want to get their src attributes, as follows:

from bs4 import BeautifulSoup

# Read the HTML file
f = open('test.html', 'r')
html_str = f.read()
f.close()

# Create a BeautifulSoup object: the first argument is the string to parse,
# the second is the parser
soup = BeautifulSoup(html_str, 'html.parser')

# Find all img tags whose class is test
img_list = soup.find_all('img', {'class': 'test'})

# Iterate over the tags
for img in img_list:
    # Get the src attribute of the img tag
    src = img['src']
    print(src)

The parsed results are as follows:

1.jpg
2.jpg
3.jpg
4.jpg
5.jpg
6.jpg
7.jpg
8.jpg

That’s exactly what we need.
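Besides find_all, BeautifulSoup also provides find, which returns only the first match, and select, which accepts CSS selectors. Reusing the soup object from above, the same extraction can be sketched as:

# find returns the first matching tag (or None if there is none)
first_img = soup.find('img', {'class': 'test'})
print(first_img['src'])

# select takes a CSS selector and returns a list of tags
for img in soup.select('img.test'):
    print(img['src'])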

4.2 BeautifulSoup in practice

Once we can parse a page and extract src values, we can crawl images and other resources. Below we use Pear Video as an example of video crawling. The home page is www.pearvideo.com/. Right-clicking and inspecting the page, we can see the following:

We can first click 1 and then pick the element to crawl, such as 2; the right side jumps to the corresponding position. We can see that the element is wrapped in an a tag. In practice, clicking at position 2 navigates the page, and on analysis the target of that navigation is the href value of the a tag. Since the href value starts with a /, the full URL is the host plus the href value. Knowing this, we can proceed to the next step:

import requests
from bs4 import BeautifulSoup
# The main site
url = 'https://www.pearvideo.com/'
# Simulate browser access
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
# Send the request
response = requests.get(url, headers=headers)
# Get the BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')
# Parse the a tags that meet the requirements
video_list = soup.find_all('a', {'class': 'actwapslide-link'})
# Iterate over the tags
for video in video_list:
    # Get the href and assemble it into a full URL
    video_url = video['href']
    video_url = url + video_url
    print(video_url)

The following output is displayed:

https://www.pearvideo.com/video_1674906
https://www.pearvideo.com/video_1674921
https://www.pearvideo.com/video_1674905
https://www.pearvideo.com/video_1641829
https://www.pearvideo.com/video_1674822
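A small caveat on assembling the URL: if the href value starts with a / and the base URL also ends with one, plain string concatenation produces a double slash. urljoin from the standard library resolves the reference correctly, so a safer sketch is:

from urllib.parse import urljoin

# urljoin resolves the href against the base URL and
# handles leading/trailing slashes for us
video_url = urljoin('https://www.pearvideo.com/', '/video_1674906')
print(video_url)  # https://www.pearvideo.com/video_1674906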

We only need to crawl one of them. Looking at the source of the first URL, we found this line:

var contId="1674906",liveStatusUrl="liveStatus.jsp",liveSta="",playSta="1",autoPlay=!1,isLiving=!1,isVrVideo=!1,hdflvUrl="",sdflvUrl="",hdUrl="",sdUrl="",ldUrl="",srcUrl="https://video.pearvideo.com/mp4/adshort/20200517/cont-1674906-15146856_adpkg-ad_hd.mp4",vdoUrl=srcUrl,skinRes="//www.pearvideo.com/domain/skin",videoCDN="//video.pearvideo.com";

The srcUrl variable contains the URL of the actual video file, but since it sits inside JavaScript rather than an HTML tag, we can't extract it with BeautifulSoup. We can use a regular expression instead:

import re
import requests
# Get the source of the single video page (video_url from the previous step)
response = requests.get(video_url)
# Match the video url in the page source
results = re.findall('srcUrl="(.*?)"', response.text)
# Output the result
print(results)

The results are as follows:

['https://video.pearvideo.com/mp4/adshort/20200516/cont-1674822-14379289-191950_adpkg-ad_hd.mp4']

Then we can download this video:

with open('result.mp4', 'wb') as f:
    f.write(requests.get(results[0], headers=headers).content)
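For large video files, it may be better not to hold the whole response in memory at once. Requests supports streaming downloads; here is a sketch of the same download done in chunks:

# Stream the video in chunks instead of loading it all into memory
response = requests.get(results[0], headers=headers, stream=True)
with open('result.mp4', 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)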

The complete code is as follows:

import re
import requests
from bs4 import BeautifulSoup
# The main site
url = 'https://www.pearvideo.com/'
# Simulate browser access
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
# Send the request
response = requests.get(url, headers=headers)
# Get the BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')
# Parse the a tags that meet the requirements
video_list = soup.find_all('a', {'class': 'actwapslide-link'})
# Take the href of the first matching tag and assemble the full URL
video_url = url + video_list[0]['href']

response = requests.get(video_url)

results = re.findall('srcUrl="(.*?)"', response.text)

with open('result.mp4', 'wb') as f:
    f.write(requests.get(results[0], headers=headers).content)

So far we have built several different crawlers, from simple web pages to pictures and videos.

5. Conclusion

The process above can be roughly divided into three phases, sketched together below:

  1. Request the page
  2. Parse the page
  3. Save the data
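Put together, a minimal skeleton of the three phases might look like this (the URL and the data we extract are placeholders):

import requests
from bs4 import BeautifulSoup

# 1. Request the page (placeholder URL)
response = requests.get('https://example.com')

# 2. Parse the page: here we simply collect every link's href
soup = BeautifulSoup(response.text, 'html.parser')
links = [a['href'] for a in soup.find_all('a') if a.has_attr('href')]

# 3. Save the data
with open('result.txt', 'w') as f:
    f.write('\n'.join(links))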

If you think this article is good, please like, share, and leave a comment, because that is the strongest motivation for me to keep producing more high-quality articles!