1. Preface

  • Anything we can see on a web page can be crawled; rich text is the exception.
  • The data crawling process is generally divided into two stages:

Stage 1: Initiate the request

Stage 2: Parse the data using regular expressions or a third-party library (e.g. Beautiful Soup).
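
As a minimal end-to-end sketch of the two stages (the URL here is only a placeholder):

import requests
from bs4 import BeautifulSoup

# Stage 1: initiate the request (hypothetical target URL)
response = requests.get("https://example.com")

# Stage 2: parse the response, here with Beautiful Soup
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title)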

2. Initiate a request

2.1 Request header masquerading (a common anti-crawler strategy is to verify what is in the request header)

from fake_useragent import UserAgent  # Crawler request header masquerade
import requests
import json
from urllib3.exceptions import InsecureRequestWarning
from urllib3 import disable_warnings
disable_warnings(InsecureRequestWarning)  # Disable HTTPS (SSL) certificate warnings

ua = UserAgent()  # Crawler request header masquerade

# proxies_ip is an "ip:port" string for an available proxy
proxies = {
    "http": "http://" + proxies_ip,   # HTTP type
    "https": "http://" + proxies_ip   # HTTPS type
}

# Custom request headers
my_headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'User-Agent': ua.chrome,
}
# uri is the URL of the target page
response = requests.get(uri, headers=my_headers, proxies=proxies)

Generally speaking, with the request headers configured this way you should be able to crawl most websites normally.

2.2 Some websites require you to be logged in while crawling, so you need to set the cookie in the request header yourself

Reference: docs.python-requests.org/zh_CN/lates…
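
For example, a minimal sketch of both approaches (the cookie name, value, and URL below are placeholders; copy the real ones from your logged-in browser session via the developer tools):

import requests

# Hypothetical cookie copied from a logged-in browser session
my_cookies = {"sessionid": "your-session-id-here"}
response = requests.get("https://example.com/protected", cookies=my_cookies)

# Alternatively, put the raw Cookie string directly into the request headers
my_headers = {"Cookie": "sessionid=your-session-id-here"}
response = requests.get("https://example.com/protected", headers=my_headers)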

2.3 If the target website presents a verification code (captcha) during crawling

Image captchas can be handled by sending the image to a third-party recognition API. However, if you run into captchas, my suggestion is simply to lower your crawl frequency!
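
As a rough illustration only, here is a minimal sketch that downloads a simple image captcha and runs it through local OCR. The captcha_url is hypothetical, it assumes Pillow and pytesseract are installed, and real-world captchas usually need a commercial recognition service instead:

import requests
from io import BytesIO
from PIL import Image
import pytesseract  # local OCR as a stand-in for a third-party captcha API

captcha_url = "https://example.com/captcha.jpg"  # hypothetical captcha image URL
img = Image.open(BytesIO(requests.get(captcha_url, timeout=5).content))
captcha_text = pytesseract.image_to_string(img).strip()
print("Recognized captcha:", captcha_text)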

2.4 Common Exceptions

  • When your program throws a ConnectionError, it is usually because the target site has detected that you are a crawler; slow down and retry (see the sketch below).
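
A minimal sketch of handling it (the URL, retry count, and back-off times are just example values): catch the exception, wait, and retry at a lower frequency.

import time
import requests

url = "https://example.com"  # hypothetical target page
response = None
for attempt in range(3):
    try:
        response = requests.get(url, timeout=5)
        break
    except requests.exceptions.ConnectionError:
        # The site may have flagged us as a crawler; back off before retrying
        print("ConnectionError, waiting before retry...")
        time.sleep(10 * (attempt + 1))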

3. Content parsing: find_all() usage (see the Beautiful Soup 4.9.0 documentation)

Method signature: find_all(name, attrs, recursive, string, limit, **kwargs)

The find_all() method looks through a tag's descendants and retrieves all descendants that match your filters. I gave a few examples in "Kinds of filters", but here are several more:

soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p"."title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re
soup.find(string=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'

Some of these should look familiar, but others are new. What does it mean to pass in a value for string, or id? Why does find_all("p", "title") find a <p> tag with the CSS class "title"? Let's look at the arguments to find_all().

The name argument

Pass in a value for name and you tell Beautiful Soup to only consider tags with certain names. Text strings are ignored, as are tags whose names don't match. This is the simplest usage:

soup.find_all("title")
# [<title>The Dormouse's story</title>]

Recall from "Kinds of filters" that the value of name can be a string, a regular expression, a list, a function, or the value True.

The keyword arguments

Any argument that is not recognized will be turned into a filter on one of a tag's attributes. If you pass in a value for an argument called id, Beautiful Soup will filter against each tag's 'id' attribute:

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass in a value for href, Beautiful Soup will filter against each tag's 'href' attribute:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

You can filter an attribute based on a string, a regular expression, a list, a function, or the value True. This code finds all tags whose id attribute has a value, regardless of what that value is:

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

You can filter multiple attributes at once by passing in more than one keyword argument:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Some attributes, like the data-* attributes in HTML 5, have names that can't be used as keyword argument names:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

You can use these attributes in a search by putting them into a dictionary and passing the dictionary to find_all() as the attrs argument:

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]

You can't use a keyword argument to search for HTML's "name" element, because Beautiful Soup uses the name argument to hold the name of the tag itself. Instead, you can give a value for 'name' in the attrs argument:

name_soup = BeautifulSoup('<input name="email"/>', 'html.parser')
name_soup.find_all(name="email")
# []
name_soup.find_all(attrs={"name": "email"})
# [<input name="email"/>]
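
For reference, the soup object used in the examples above is built from the "three little sisters" sample document used throughout the Beautiful Soup documentation; a minimal setup sketch:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.find_all("a"))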

4. Source code reference

4.1 Obtaining proxies (free proxies)

Free proxies – 3-hour free proxies – 4-hour free proxies

  • In my demo I used Kuaidaili, and the target site did not seem to have a strong anti-crawler mechanism, so I did not disguise the request headers. However, only one or two proxies out of 100 are usable; even with multithreading it is painfully slow (a small helper for filtering dead proxies is sketched after the code below).
#! /usr/bin/env python
# _*_ coding:utf-8 _*_  
#  
# @Version : 1.0  
# @Time : 2020/10/24
# @Author : wjt
# @File : parsing_html
# @description: Get the set of fast proxy IP addresses

from bs4 import BeautifulSoup
import requests
import re
import time


def get_html(url):
    """Fetch the HTML page at the given url and return its source text."""
    try:
        pattern = re.compile(r'//(.*?)/')
        host_url = pattern.findall(url)[0]
        headers = {
            "Host": host_url,
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
            "Accept-Encoding": "gzip, deflate",
            "Connection": "keep-alive",
        }
        res = requests.get(url, headers=headers, timeout=5)
        # res.encoding = res.apparent_encoding  # apparent_encoding auto-detects the page encoding, but it can produce garbled text, so it is commented out for now
        print("Fetching proxy IP Html page succeeded " + url)
        return res.text     # Return only the source code of the page
    except Exception as e:
        print("Failed to fetch proxy IP Html page " + url)
        print(e)

def get_kuaidaili_free_ip(begin_page_number):
    """Crawl one page of the Kuaidaili free proxy list, starting at begin_page_number, and return a list of "ip:port" strings."""
    ip_list_sum = []    # Proxy IP list
    a = 1
    while a <= 1:  # Page counter (only one page is fetched per call)
        # Start to crawl
        r = get_html("https://www.kuaidaili.com/free/inha/" + str(begin_page_number + a) + "/")
        # print("-10"+"\\"+"n")
        if not r or r == "-10\n":
            # Either the request failed or we have crawled the proxy site too often
            print("Too many proxy IP crawling operations!")
            return ip_list_sum
        # Page parsing
        soup = BeautifulSoup(r, "html.parser")
        tags_ip = soup.tbody.find_all(attrs={"data-title": "IP"})
        tags_port = soup.tbody.find_all(attrs={"data-title": "PORT"})
        min_index = 0
        max_index = len(tags_ip) - 1
        while min_index <= max_index:
            ip_info = tags_ip[min_index].get_text() + ":" + tags_port[min_index].get_text()
            ip_list_sum.append(ip_info)
            min_index += 1
        a += 1
    return ip_list_sum

# if __name__ == "__main__":
#     get_kuaidaili_free_ip(1)
    
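Since only a few of the free proxies actually work, a small filter like the sketch below (the test URL and timeout are arbitrary example choices) can be run on the list returned by get_kuaidaili_free_ip() before handing it to the crawler:

import requests

def filter_alive_proxies(ip_list, test_url="https://www.baidu.com", timeout=3):
    """Return only the proxies from ip_list that can complete a request."""
    alive = []
    for proxies_ip in ip_list:
        proxies = {"http": "http://" + proxies_ip, "https": "http://" + proxies_ip}
        try:
            if requests.get(test_url, proxies=proxies, timeout=timeout).status_code == 200:
                alive.append(proxies_ip)
        except requests.exceptions.RequestException:
            pass  # dead proxy, skip it
    return alive

# Example: alive_ips = filter_alive_proxies(get_kuaidaili_free_ip(1))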

4.2 Crawling the target website: Baidu keyword search

#! /usr/bin/env python
# _*_ coding:utf-8 _*_  
#  
# @Version : 1.0  
# @Time : 2020/10/24
# @Author : wjt
# @File : my_reptiles.py
# @description: Crawl the target website via Baidu keyword search
import requests
from bs4 import BeautifulSoup
import re
import json
import time
import datetime
import threading  # multithreaded
import os  # file manipulation
import parsing_html   # Import the module that fetches proxies
from fake_useragent import UserAgent  # Crawler request header masquerade

ip_list = []  # Proxy IP address set
begin_page_number = 0  # Page number at which to start crawling the proxy IP source

# Search by keyword
def get_baidu_wd(my_wd, proxies_ip):
    # Build the query parameters
    my_params = {'wd': my_wd}

    proxies = {
        "http": "http://" + proxies_ip,   # HTTP type
        "https": "http://" + proxies_ip   # HTTPS type
    }
    try:
        ua = UserAgent()  # Crawler request header masquerade
        # Custom request headers
        my_headers = {
            "User-Agent": ua.random,
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
            "Accept-Encoding": "gzip, deflate",
            "Connection": "close",
        }
        r = requests.get('https://www.baidu.com/s?ie=UTF-8',
                         params=my_params, headers=my_headers, proxies=proxies, timeout=2, verify=False)
    except (requests.exceptions.ConnectTimeout, requests.exceptions.ProxyError, Exception):
        print(proxies_ip + " Time out!")
    else:
        if r.status_code == 200:
            print(proxies_ip + " Success!")
    finally:
        pass
    
    
# Start fetching tasks
def newmethod304():
    global ip_list
    global begin_page_number
    while 1 == 1:
        if len(ip_list) == 0:
            time.sleep(1)
            ip_list = parsing_html.get_kuaidaili_free_ip(begin_page_number)
        while len(ip_list) != 0:
            proxies_ip = ip_list.pop().replace('\n', '')  # pop() removes (and returns) the last element of the list
            # Create a new thread
            myThread1(proxies_ip).start()
        begin_page_number += 1

# Thread task
class myThread1(threading.Thread):
    def __init__(self,proxies_ip):
        threading.Thread.__init__(self)
        self.proxies_ip = proxies_ip
    def run(self):
        print("Start thread:" + self.proxies_ip)
        get_baidu_wd('Jay Chou',self.proxies_ip)  


if __name__ == '__main__':
    newmethod304()

    

5. References

The Requests official reference document

Beautiful Soup 4.9.0 documentation, content parsing reference

fake-useragent reference