Learning should start with interest. One of my main reasons for learning Python was to write simple scripts that could be useful in everyday life, so I decided to start by crawling some wallpapers from my favorite Onmyoji site.

The code can be copied and run directly; just remember to run pip install requests bs4 first to download the dependencies.

The final version

# Use the regular expression library to match image links
import re
# Use the network request library
import requests

# Image save directory
path = 'D:/Onmyoji'
# Onmyoji wallpaper site
html_doc = "https://yys.163.com/media/picture.html" 

# request
requests_html_doc = requests.get(html_doc).text
# Match all href addresses pointing to 2732x2048 wallpapers
regex = re.compile(r'href="(.*?2732x2048\.jpg)"')
urls = regex.findall(requests_html_doc)

# A set prevents duplicate images from being downloaded
result = set()
for i in urls:
    result.add(i)

# counter is used for image naming
num = 0

# File path, open mode, encoding
# Open the file that records the image links
f = open(r'result.txt', 'w', encoding='utf-8')
for a in result:
    try:
        f.write(a + '\n')  # Record each link in result.txt
        image_data = requests.get(a).content
        image_name = '{}.jpg'.format(num)  # Give each image a name
        save_path = path + '/' + image_name  # Save path of the picture
        with open(save_path, 'wb') as img:
            img.write(image_data)
            print(image_name, '=======================> Download successful!!')
            num = num + 1  # Move on to the next image number
    except:
        pass
# Close the result file
f.close()
print("\r\nScan results have been written to result.txt\r\n")

Process

Reference code

I started from zero, had no clue, and didn't have a good command of Python, so I began by borrowing other people's code. The first reference code is as follows:

# Introduce the system class library
import sys
# Use document parsing libraries
from bs4 import BeautifulSoup
# Use the network request libraries
import urllib.request
import requests
path = 'D:/Onmyoji'

html_doc = "https://yys.163.com/media/picture.html"
# fetch request
req = urllib.request.Request(html_doc)
# Open the page
webpage = urllib.request.urlopen(req)

# Read the page content
html = webpage.read()
# Parse into a document object
soup = BeautifulSoup(html, 'html.parser')  # document object

# Invalid URL 1
invalidLink1 = '#'
# Invalid URL 2
invalidLink2 = 'javascript:void(0)'
# A set prevents duplicate links from being downloaded
result = set()
# Counter used for image naming
num = 0
# Find all a tags in the document
for k in soup.find_all('a'):
    # print(k)
    # Find the href attribute
    link = k.get('href')
    # Skip links that were not found
    if link is not None:
        # Filter illegal links
        if link == invalidLink1:
            pass
        elif link == invalidLink2:
            pass
        elif link.find("javascript:") != -1:
            pass
        else:
            result.add(link)

for a in result:
    # File path, open mode, encoding
    f = open(r'result.txt', 'w', encoding='utf-8')
    # image_data = urllib.request.get(url=a).content
    image_data = requests.get(url=a).content
    image_name = '{}.jpg'.format(num)  # Give each image a name
    save_path = path + '/' + image_name  # Save address of picture
    with open(save_path, 'wb') as f:
        f.write(image_data)
        print(image_name, '=======================> Download successful!! ')
        num = num+1  # Add one to the next image
        f.close()

print("\r\n scan results have been written to result. TXT \r\n")

Thoughts on urllib.request and requests

The reference code uses urllib.request to make its requests. Some of the code examples I saw at the beginning also used urllib.request to initiate requests, and then I saw some code that used requests instead. Personally, requests felt much easier and took fewer lines of code, so I looked them up to see the difference.
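For a concrete comparison, here is a minimal sketch (not from the reference code) that fetches the same wallpaper page both ways:

import urllib.request
import requests

url = "https://yys.163.com/media/picture.html"

# urllib.request: build a Request object, open it, then read and decode by hand
req = urllib.request.Request(url)
with urllib.request.urlopen(req) as page:
    html_from_urllib = page.read().decode('utf-8')

# requests: a single call, and .text takes care of the decoding
html_from_requests = requests.get(url).text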

BeautifulSoup

After seeing BeautifulSoup praised in the comments of several articles, I went into its documentation and looked up how to use it. It changed my earlier impression that walking a document's element nodes in Python to pull out their attributes was difficult.

Beautiful Soup 4.4.0 documentation
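As a minimal sketch of what sold me on it (the HTML snippet here is made up), extracting tags and their attributes is just a couple of calls:

from bs4 import BeautifulSoup

# A made-up fragment standing in for the real page
html = '<div><a href="https://yys.163.com/media/picture.html">wallpapers</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# Find every a tag, then read its href attribute and text
for tag in soup.find_all('a'):
    print(tag.get('href'))  # https://yys.163.com/media/picture.html
    print(tag.get_text())   # wallpapers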

Processing optimizations

The reason for adding the regular match is that an empty string among the image links threw an error the moment the download started, and the whole program would hang. Besides, the invalidLink1 and invalidLink2 checks in the reference code look genuinely uncomfortable. So I added re to keep the links valid from the source, and wrapped the download code in try/except so the program doesn't die when something goes wrong.
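As a minimal sketch of the filtering idea (the HTML fragment and URL are made up), the pattern only ever captures links ending in the wallpaper resolution, so empty and javascript: hrefs never reach the download step:

import re

# Made-up fragment: two junk hrefs and one valid wallpaper link
html = '''<a href=""></a>
<a href="javascript:void(0)"></a>
<a href="http://example.com/wallpaper/2732x2048.jpg"></a>'''

regex = re.compile(r'href="(.*?2732x2048\.jpg)"')
print(regex.findall(html))  # ['http://example.com/wallpaper/2732x2048.jpg']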

The reference code also opens and closes the result file on every single download, so I moved the open and close out to the outermost layer and kept the download logic inside. As expected, downloads visibly sped up.
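In miniature, the change looks like this (loop bodies elided):

# Before: the file is reopened (and truncated by 'w') on every iteration
for a in result:
    f = open('result.txt', 'w', encoding='utf-8')
    ...  # download one image
    f.close()

# After: open once, keep the download logic inside, close once
f = open('result.txt', 'w', encoding='utf-8')
for a in result:
    ...  # download one image
f.close()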

Conclusion

Wallpapers really don’t poke, so hopefully the next script will be more interesting