Python captures 500 "beautiful" images at once.

Source: IT journey (ID: Jake_Internet). To reprint, please contact the author for authorization (WeChat ID: Hc220088).


1. Crawling a single page of images

Extracting image data with a regular expression


Set the response encoding to GBK to fix the garbled text
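To see why the encoding matters, here is a minimal sketch that needs no network access. It assumes the page bytes are GBK-encoded and that the client decodes them as ISO-8859-1, which is what requests typically falls back to for text responses when the server declares no charset. The sample title is a made-up string, not text taken from the site.

```python
# Hypothetical title as it might appear on a GBK-encoded page
title = "风景壁纸"

# The bytes the server would send for that title
raw_bytes = title.encode("gbk")

# Decoding with the wrong codec yields garbled text (mojibake)
garbled = raw_bytes.decode("iso-8859-1")

# Decoding with GBK recovers the original characters
fixed = raw_bytes.decode("gbk")

print(repr(garbled))
print(fixed)
```

Setting response.encoding = 'GBK' before reading response.text makes requests perform the second kind of decoding.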

Code implementation:

    import re

    import requests

    # Directory where the images are saved
    path = 'D:\\test\\picture_1\\'
    # Target URL
    url = "http://pic.netbian.com/4kmeinv/index.html"
    # Disguise the request as a browser so the crawler is not blocked
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
        "Referer": "http://pic.netbian.com/4kmeinv/index.html"
    }

    # Send the request and get the response
    response = requests.get(url, headers=headers)
    # Set the encoding to GBK to avoid garbled text
    response.encoding = 'GBK'

    # Regex match to extract each image's link and name
    img_info = re.findall('img src="(.*?)" alt="(.*?)" /', response.text)

    for src, name in img_info:
        # Prepend the site root to get the full image URL
        img_url = 'http://pic.netbian.com' + src
        img_content = requests.get(img_url, headers=headers).content
        img_name = name + '.jpg'
        with open(path + img_name, 'wb') as f:
            print(f"Downloading image: {img_name}")
            f.write(img_content)
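The regular expression can be tried offline before pointing it at the site. The sketch below applies the same pattern to a small HTML fragment. The fragment, its paths, and its alt texts are invented for illustration and only mimic the shape the pattern expects.

```python
import re

# Hypothetical fragment shaped like the listing page's markup
SAMPLE_HTML = (
    '<li><a href="/tupian/1.html">'
    '<img src="/uploads/a.jpg" alt="first wallpaper" /></a></li>'
    '<li><a href="/tupian/2.html">'
    '<img src="/uploads/b.jpg" alt="second wallpaper" /></a></li>'
)

# Same pattern as in the article: two non-greedy groups, src then alt
IMG_PATTERN = re.compile(r'img src="(.*?)" alt="(.*?)" /')

# With two groups, findall returns a list of (src, alt) tuples
img_info = IMG_PATTERN.findall(SAMPLE_HTML)

for src, name in img_info:
    print(src, name)
```

The non-greedy `(.*?)` groups stop at the first closing quote, which keeps each match inside a single img tag. The pattern depends on the exact attribute order and spacing, so it breaks if the site changes its markup.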

Extracting image data with XPath

Code implementation:

    import requests
    from lxml import etree

    # Directory where the images are saved
    path = 'D:\\test\\picture_1\\'
    # Target URL
    url = "http://pic.netbian.com/4kmeinv/index.html"
    # Disguise the request as a browser so the crawler is not blocked
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
        "Referer": "http://pic.netbian.com/4kmeinv/index.html"
    }

    # Send the request and get the response
    response = requests.get(url, headers=headers)
    # Set the encoding to GBK to avoid garbled text
    response.encoding = 'GBK'
    html = etree.HTML(response.text)

    # Use XPath to extract the image links and names
    img_src = html.xpath('//ul[@class="clearfix"]/li/a/img/@src')
    # The list comprehension builds the full image URLs
    img_src = ['http://pic.netbian.com' + x for x in img_src]
    img_alt = html.xpath('//ul[@class="clearfix"]/li/a/img/@alt')

    for src, name in zip(img_src, img_alt):
        img_content = requests.get(src, headers=headers).content
        img_name = name + '.jpg'
        with open(path + img_name, 'wb') as f:
            print(f"Downloading image: {img_name}")
            f.write(img_content)
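The path expression can also be explored without lxml or a network connection. The sketch below swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath and requires well-formed markup, so it is not a drop-in replacement for the lxml code above. The sample fragment and its contents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like the listing markup
SAMPLE = """
<div>
  <ul class="clearfix">
    <li><a href="/tupian/1.html"><img src="/uploads/a.jpg" alt="first wallpaper" /></a></li>
    <li><a href="/tupian/2.html"><img src="/uploads/b.jpg" alt="second wallpaper" /></a></li>
  </ul>
</div>
"""

root = ET.fromstring(SAMPLE)

# ElementTree cannot select attributes with /@src, so select the
# img elements first and read their attributes afterwards
imgs = root.findall(".//ul[@class='clearfix']/li/a/img")
pairs = [(img.get("src"), img.get("alt")) for img in imgs]

for src, alt in pairs:
    print(src, alt)
```

Reading src and alt from the same element keeps each link paired with its own name, whereas two separate attribute queries joined with zip rely on both lists having the same length and order.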

2. Crawling multiple pages for batch download

Single-threaded version

    import datetime
    import time

    import requests
    from lxml import etree

    # Directory where the images are saved
    path = 'D:\\test\\picture_1\\'
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
        "Referer": "http://pic.netbian.com/4kmeinv/index.html"
    }
    start = datetime.datetime.now()


    def get_img(urls):
        for url in urls:
            # Send the request and get the response
            response = requests.get(url, headers=headers)
            # Set the encoding to GBK to avoid garbled text
            response.encoding = 'GBK'
            html = etree.HTML(response.text)
            # Use XPath to extract the image links and names
            img_src = html.xpath('//ul[@class="clearfix"]/li/a/img/@src')
            # The list comprehension builds the full image URLs
            img_src = ['http://pic.netbian.com' + x for x in img_src]
            img_alt = html.xpath('//ul[@class="clearfix"]/li/a/img/@alt')
            for src, name in zip(img_src, img_alt):
                img_content = requests.get(src, headers=headers).content
                img_name = name + '.jpg'
                with open(path + img_name, 'wb') as f:
                    print(f"Downloading image: {img_name}")
                    f.write(img_content)
            # Pause between pages to avoid hammering the server
            time.sleep(1)


    def main():
        # List of page URLs to request
        url_list = ['http://pic.netbian.com/4kmeinv/index.html'] + [f'http://pic.netbian.com/4kmeinv/index_{i}.html' for i in range(2, 11)]
        get_img(url_list)
        delta = (datetime.datetime.now() - start).total_seconds()
        print(f"Total time: {delta}s")


    if __name__ == '__main__':
        main()

The program ran successfully, downloading 210 images from 10 pages in 63.682837 s.
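The list of page URLs follows a simple pattern: the first page is index.html, and page i (for i from 2 upward) is index_i.html. As a small sketch, the helper below builds that list for any page count. The function name build_page_urls and the constant BASE are hypothetical names introduced here, not part of the original script.

```python
# Listing root taken from the article's URLs
BASE = "http://pic.netbian.com/4kmeinv/"


def build_page_urls(last_page):
    """Return the listing URLs for pages 1..last_page (hypothetical helper)."""
    first = [BASE + "index.html"]
    rest = [f"{BASE}index_{i}.html" for i in range(2, last_page + 1)]
    return first + rest


urls = build_page_urls(10)
for u in urls:
    print(u)
```

Calling build_page_urls(10) yields the same ten URLs that the single-threaded script requests, and build_page_urls(50) yields the fifty pages used in the multi-threaded version.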

Multi-threaded version

    import datetime
    import random
    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests
    from lxml import etree

    # Directory where the images are saved
    path = 'D:\\test\\picture_1\\'
    # Pool of User-Agent strings; one is picked at random for each page
    user_agent = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        # "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, ...
    ]
    start = datetime.datetime.now()


    def get_img(url):
        headers = {
            "User-Agent": random.choice(user_agent),
            "Referer": "http://pic.netbian.com/4kmeinv/index.html"
        }
        # Send the request and get the response
        response = requests.get(url, headers=headers)
        # Set the encoding to GBK to avoid garbled text
        response.encoding = 'GBK'
        html = etree.HTML(response.text)
        # Use XPath to extract the image links and names
        img_src = html.xpath('//ul[@class="clearfix"]/li/a/img/@src')
        # The list comprehension builds the full image URLs
        img_src = ['http://pic.netbian.com' + x for x in img_src]
        img_alt = html.xpath('//ul[@class="clearfix"]/li/a/img/@alt')
        for src, name in zip(img_src, img_alt):
            img_content = requests.get(src, headers=headers).content
            img_name = name + '.jpg'
            with open(path + img_name, 'wb') as f:
                print(f"Downloading image: {img_name}")
                f.write(img_content)
        # Random pause after each page
        time.sleep(random.randint(1, 2))


    def main():
        # List of page URLs to request
        url_list = ['http://pic.netbian.com/4kmeinv/index.html'] + [f'http://pic.netbian.com/4kmeinv/index_{i}.html' for i in range(2, 51)]
        # Six worker threads each process one page at a time
        with ThreadPoolExecutor(max_workers=6) as executor:
            executor.map(get_img, url_list)
        delta = (datetime.datetime.now() - start).total_seconds()
        print(f"Total time: {delta}s")


    if __name__ == '__main__':
        main()

The program ran successfully, downloading 1047 images from 50 pages in 56.71979 s. That is about 18.5 images per second, compared with about 3.3 images per second for the single-threaded run (210 images in 63.682837 s), so multithreading greatly improves crawling efficiency.
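The effect of the thread pool can be reproduced without touching the network. In the sketch below, fake_download is a hypothetical stand-in that only sleeps to simulate waiting on a server. The pool size of 6 matches the article, while the sleep duration and page count are arbitrary choices for the demonstration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(page):
    """Stand-in for get_img: sleeps instead of making HTTP requests."""
    time.sleep(0.05)
    return f"page {page} done"


def run_sequential(pages):
    # One page after another, as in the single-threaded version
    return [fake_download(p) for p in pages]


def run_threaded(pages, max_workers=6):
    # executor.map returns results in input order; list() consumes them,
    # which also re-raises any exception raised inside a worker
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fake_download, pages))


pages = list(range(1, 7))

t0 = time.perf_counter()
sequential_results = run_sequential(pages)
sequential_time = time.perf_counter() - t0

t0 = time.perf_counter()
threaded_results = run_threaded(pages)
threaded_time = time.perf_counter() - t0

print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

Because the work is dominated by waiting rather than computation, the threads overlap their waits and the threaded run finishes in a fraction of the sequential time. Note that calling executor.map without iterating over its result, as the script above does, silently discards any exception raised in a worker. Wrapping the call in list() surfaces such errors.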

