The crawler's steps:

  • Prepare the proxy IP addresses we need (see blog.csdn.net/qq_38251616…).
  • Determine the URL to crawl.
  • Fetch the page at that URL.
  • Extract and organize the information from the fetched page.
  • Save it to a local directory.

Specific steps:

  • Get the web page with a proxy IP and requests.get()
  • Parse the page with BeautifulSoup() (for the BeautifulSoup functions, see www.jianshu.com/p/41d06a4ed.)
  • find_all() finds the corresponding tags
  • Get the text content of a tag with .get_text()
  • urlretrieve() downloads images locally (text can be written directly to a local file; a short sketch of these last two steps follows this list)
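
The main example below does not use .get_text() or urlretrieve(), so here is a minimal, hypothetical sketch of those two steps (the URL and tag names are placeholders, and the image src is assumed to be an absolute URL):

import requests
from bs4 import BeautifulSoup
from urllib.request import urlretrieve

req = requests.get('https://example.com')            # fetch the page
soup = BeautifulSoup(req.text, 'lxml')               # parse the HTML
for p in soup.find_all('p'):                         # find the tags we care about
    print(p.get_text())                              # extract their text content
img = soup.find('img')
if img is not None and img.get('src'):
    urlretrieve(img.get('src'), 'first_image.jpg')   # download an image to a local file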

Code examples:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"}
# Pick a random proxy IP
proxies = get_random_ip(ip_list)
req = requests.get(url=url, headers=headers, proxies=proxies)
soup = BeautifulSoup(req.text, 'lxml')
targets_url_1 = soup.find('figure')
targets_url = soup.find_all('noscript')

Complete code:

This is the code of a tutorial for crawling Zhihu images; it relies on a proxy IP file (ip.txt).

import requests, random, os, time
from bs4 import BeautifulSoup
from urllib.request import urlretrieve

# Read the proxy IP list from ip.txt
def get_ip_list():
    f = open('ip.txt', 'r')
    ip_list = f.readlines()
    f.close()
    return ip_list

# Pick a random proxy IP from the list
def get_random_ip(ip_list):
    proxy_ip = random.choice(ip_list)
    proxy_ip = proxy_ip.strip('\n')
    proxies = {'https': proxy_ip}
    return proxies

def get_picture(url, ip_list):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"}
    # Fetch the page through a random proxy IP
    proxies = get_random_ip(ip_list)
    req = requests.get(url=url, headers=headers, proxies=proxies)
    soup = BeautifulSoup(req.text, 'lxml')
    targets_url_1 = soup.find('figure')
    targets_url = soup.find_all('noscript')
    # Collect the image addresses
    list_url = []
    for each in targets_url:
        list_url.append(each.img.get('src'))
    for each_img in list_url:
        # Create the output folder if it does not exist yet
        if not os.path.exists('library'):
            os.makedirs('library')
        # Switch to a new random proxy for every download
        proxies = get_random_ip(ip_list)
        picture = '%s' % time.time()
        req = requests.get(url=each_img, headers=headers, proxies=proxies)
        with open('library/{}.jpg'.format(picture), 'wb') as f:
            f.write(req.content)
        print('{} download complete!'.format(picture))

def main():
    ip_list = get_ip_list()
    url = 'https://www.zhihu.com/question/22918070'
    get_picture(url, ip_list)

if __name__ == '__main__':
    main()

Screenshot after success:

Introduction to crawlers:

  • Introduction to crawlers:

A web crawler, also known as a spider, is a web robot that automatically browses the World Wide Web. A crawler starts from a list of URLs (uniform resource locators) called seeds. As it visits these URLs, it identifies all the hyperlinks on each page and adds them to a "to-do list" known as the crawl frontier. URLs on the frontier are then visited in turn according to a set of policies. If the crawler archives sites as it goes, it stores copies of the pages so that they can easily be viewed later; these stored pages are also called "snapshots". The sheer size of the Web means a crawler can only download a limited number of pages in a given time, so it has to prioritize its downloads. The high rate of change means pages may already have been updated or replaced by the time they are visited. URLs generated by some server-side software also make it difficult for crawlers to avoid retrieving duplicate content. (From: Wikipedia)
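
To make the seed/frontier loop concrete, here is a minimal sketch of such a crawler (the seed URL, page limit, and breadth-first policy are arbitrary choices for illustration):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(seeds, max_pages=10):
    frontier = list(seeds)                  # the crawl frontier ("to-do list")
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.pop(0)               # visit URLs in the order they were found
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        # identify every hyperlink on the page and add it to the frontier
        for a in BeautifulSoup(html, 'lxml').find_all('a', href=True):
            frontier.append(urljoin(url, a['href']))
    return visited

print(crawl(['https://example.com']))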

  • Crawler analysis:

The crawler accesses a web page through code and saves the page content locally. The URL is the key identifier a crawler uses to locate a page. The page's HTML is retrieved with requests.get(url), and the content is then extracted by parsing that HTML with BeautifulSoup.
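
As a minimal sketch of that flow (the output file name and the lxml parser are arbitrary choices):

import requests
from bs4 import BeautifulSoup

url = 'https://www.zhihu.com/question/22918070'    # the URL identifies the page
req = requests.get(url)                            # retrieve the page's HTML
soup = BeautifulSoup(req.text, 'lxml')             # parse the HTML
with open('page.txt', 'w', encoding='utf-8') as f:
    f.write(soup.get_text())                       # save the extracted content locally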

Additional notes:

About Headers in a crawler: if headers are not set, the User-Agent identifies the request as coming from a Python script, and a site with anti-crawler measures may well reject such connections. Modifying the headers so that the crawler script masquerades as a normal browser request avoids this problem.
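
For example, a browser-like User-Agent can be passed to requests like this (the exact User-Agent string below is just one common example):

import requests

# Without this header, requests announces itself as python-requests/x.y.z
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}
req = requests.get('https://www.zhihu.com', headers=headers)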

About IP/proxies in a crawler: once the User-Agent is set, another issue remains. A program runs very fast, so when a crawler hits a website from a single fixed IP address, the access frequency is far higher than any human could produce, since a person cannot send requests every few milliseconds. Many sites therefore set a threshold on per-IP access frequency; an IP that exceeds it is assumed to be a crawler rather than a person. So when we need to crawl a large amount of data, a mechanism for constantly switching IPs is essential, and the ip.txt file in my code exists for exactly this purpose.
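A minimal sketch of that rotation mechanism, assuming ip.txt holds one proxy address per line (e.g. https://1.2.3.4:8080):

import random
import requests

def get_random_proxy():
    with open('ip.txt', 'r') as f:
        ip_list = [line.strip() for line in f if line.strip()]
    return {'https': random.choice(ip_list)}

# Switch to a fresh proxy before each request so no single IP is hit too often
for page_url in ['https://www.zhihu.com/question/22918070']:
    req = requests.get(page_url, proxies=get_random_proxy(), timeout=5)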

About BeautifulSoup: in a nutshell, BeautifulSoup is a Python library whose main purpose is to extract data from web pages.

  • Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that extracts the data users need by parsing documents, and because it is simple, a complete application takes very little code to write (a short sketch follows this list).
  • Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not specify one, in which case Beautiful Soup cannot detect the encoding automatically and you only need to state the original encoding.
  • Beautiful Soup supports the HTML parser in the Python standard library as well as third-party parsers such as lxml and html5lib, giving users the flexibility to choose among different parsing strategies or trade robustness for speed.
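
A small sketch of these points, with a made-up HTML snippet for illustration:

from bs4 import BeautifulSoup

html = '<p class="intro">Hello, <a href="/next">next page</a></p>'

# Parser choice: 'html.parser' is built in; 'lxml' and 'html5lib' are optional third-party parsers
soup = BeautifulSoup(html, 'html.parser')

print(soup.p.get_text())        # navigating and extracting text: "Hello, next page"
print(soup.find('a')['href'])   # searching: "/next"
soup.a['href'] = '/changed'     # modifying the parse tree
print(soup.prettify())          # output comes back as Unicode text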

BeautifulSoup installation:

pip install beautifulsoup4