Today I'm going to do some crawler work in Python, something I haven't done before, so I'm starting from scratch. I'm on Windows 10; Python was already installed on my machine at some point (version 3.6), and it works fine. For a code editor, I'll start with VS Code. Let's do a simple demo of crawling a web page. Python 3 provides the urllib standard library, which can be used without installing anything:

from urllib import request

Then we can use urllib to send GET requests. The code is as follows:

from urllib import request

req_news = request.Request('https://news.163.com/20/0522/13/FD82KGA1000189FH.html')
req_news.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36')
html_news = request.urlopen(req_news).read()

print('------------------------')
print(html_news.decode('gbk'))

User-agent (UA for short) is a special string header that enables the server to identify the operating system and version used by the customer, CPU type, browser and version, browser rendering engine, browser language, and browser plug-in. Here this is added in the request header to simulate normal browser requests, many server will do inspection, found that is not a normal browser requests will be refused to directly (many sites may not have the test, but this is the most basic of the crawler strategy), and of course the real inside a browser sends a request header is not only a UA. Screenshot on the browser.

Note: request.urlopen(req_news).read() returns binary (bytes) data, which needs converting. Since this page's Chinese text is GBK-encoded, we turn the bytes into a string with decode('gbk'); the string can then be written out as UTF-8. Conversely, if you start from a string, encode() converts it to bytes.
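A minimal illustration of that bytes/string round trip (a standalone example, not taken from the page above):

raw = '网易新闻'.encode('gbk')      # str -> GBK bytes
text = raw.decode('gbk')           # bytes -> str
utf8_bytes = text.encode('utf-8')  # str -> UTF-8 bytes
print(text)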

Why GBK? The meta tag in the page's head carries a charset attribute that identifies the page's encoding. We can extract it and use it for decoding. Note: we default to GBK here, so if the extracted charset is empty, set it to 'gbk'.

from urllib import request
import re

req_news = request.Request('https://news.163.com/20/0522/13/FD82KGA1000189FH.html')
req_news.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36')
html_news = request.urlopen(req_news).read()

# Pull the charset attribute out of the raw bytes of the page
# (one simple way to obtain char_set)
char_set = re.findall(rb'charset=["\']?([\w-]+)', html_news)
# Default to GBK if no charset was found
if len(char_set) == 0:
    char_set = [b'gbk']
html_news = html_news.decode(char_set[0].decode())
print('------------------------')
print(html_news)

Then, write the page content to a file. We'll use os to build the file path, so import it first:

import os

File reading and writing in Python is simpler than in C. For details, see: www.w3school.com.cn/python/pyth…

# Get the absolute path to the current folder
base_dir = os.getcwd()
file_name = os.path.join(base_dir, 'data', 'news.html')

# Make sure the data folder exists before writing into it
os.makedirs(os.path.join(base_dir, 'data'), exist_ok=True)

# Open the news.html file at file_name in write mode:
# if the file does not exist, create it; if it does, empty it and write to it
my_open = open(file_name, 'w', encoding="utf-8")
# Write the page content to the file (html_news was already decoded above)
my_open.write(html_news)
my_open.close()
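For completeness, a short sketch of reading the file back, mirroring the write above ('r' opens the file in read mode):

# Read the saved page back from news.html
my_read = open(file_name, 'r', encoding="utf-8")
content = my_read.read()
my_read.close()
# Print only the first 200 characters as a sanity check
print(content[:200])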

The complete read and write code is as follows:

from urllib import request
# from lxml import etree
# import re
import os

req_news = request.Request('https://news.163.com/20/0522/13/FD82KGA1000189FH.html')
req_news.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36')
html_news = request.urlopen(req_news).read().decode('gbk')

print('------------------------')
print(html_news)

# Get the absolute path to the current folder
base_dir = os.getcwd()
file_name = os.path.join(base_dir, 'data', 'news.html')

# print(file_name)

# Make sure the data folder exists before writing into it
os.makedirs(os.path.join(base_dir, 'data'), exist_ok=True)

# Open the news.html file at file_name in write mode:
# if the file does not exist, create it; if it does, empty it and write to it
my_open = open(file_name, 'w', encoding="utf-8")
# Write the page content to the file
my_open.write(html_news)

my_open.close()

This is only the first step of the crawler; next we need to parse the page data and extract the content we want.
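As a small taste of that next step, here is a sketch that pulls the page title out of the decoded HTML with a regular expression (assuming html_news still holds the decoded page from the code above; lxml and XPath would be the more robust tools, hence the commented-out import):

import re

# Grab whatever sits between <title> and </title>; re.S lets . match newlines
match = re.search(r'<title>(.*?)</title>', html_news, re.S)
if match:
    print(match.group(1))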