I like collecting wallpapers, and under the weimei (aesthetic) category of the Bi'an desktop wallpaper site (netbian.com) I found plenty that I liked, so I wrote a crawler. I then noticed that the structure of the whole site is basically the same everywhere, so with a little extra code I crawled the high-resolution wallpapers of the entire site.

Section 1: Overview

On your computer, create a folder to store the wallpapers crawled from the netbian site

Under this folder there are 25 subfolders, one for each category

Section 2: Preparing the environment

  • Environment preparation – recommended article: How to write Python code with VSCode? (Win10 x64 OS)

You will also need three third-party packages (see their official documentation if you're interested):

  • requests: sends HTTP requests for pages (official documentation)
  • lxml: a Python parsing library that supports HTML and XML, including XPath; parsing is efficient (official documentation)
  • Beautiful Soup 4: extracts data from HTML and XML files (official documentation)

Install them by entering the following pip commands in the terminal:

python -m pip install beautifulsoup4
python -m pip install lxml
python -m pip install requests
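A quick, optional sanity check that all three packages installed correctly (a minimal sketch; the printed version numbers will vary):

# optional: confirm the three packages import correctly
import requests
import lxml
import bs4

print('requests', requests.__version__)
print('beautifulsoup4', bs4.__version__)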

Section 3: Analyzing the page structure

  • Since my monitor's resolution is 1920 × 1080, the images I crawl are at that resolution
  • The netbian wallpaper site offers many categories to browse: calendar, anime, scenery, beauties, games, film and TV, dynamic, aesthetic, design…
    • The 4K wallpaper category is an important source of revenue for the site, and I have no need for 4K wallpapers, so I don't crawl it
    • CSS selector: #header > div.head > ul > li:nth-child(1) > div > a locates the a tags that wrap the categories

I will use the weimei (aesthetic) wallpaper category to explain how the images are crawled

  1. There are 73 pages in total, with 18 images on each page except the last one

    • But in the code we need to get the total page count automatically. Fortunately, this wallpaper site's structure is pleasantly consistent: the HTML around the page numbers is basically the same on every page

    • CSS selector: div.page a locates the a tags that wrap the page numbers; there are only 6 of them

  2. The third image on each page is always the same ad, which needs to be filtered out in the code

  3. It is also clear that each page links to www.netbian.com/weimei/inde… and that the X in index_X.htm is exactly the page number of the page

  4. Note: the images seen under a category are scaled down and low-resolution; to get the image at 1920 × 1080 resolution, two jumps are required

    • The following picture shows an example

On the category page we can get an image's URL directly, but unfortunately its resolution is not satisfactory. On inspection, it is clear that each image displayed on the category page points to another hyperlink

CSS selector: div#main div.list ul li a
First jump

CSS selector: div#main div.endpage div.pic div.pic-down a

Click the Download Wallpaper (1920 × 1080) button and a second jump leads to a new page that displays the image at 1920 × 1080 resolution

After these twists and turns, we finally reach the 1920 × 1080 HD version of the image

CSS selector: div#main table a img locates the img tag of the image

In my own crawling tests, only a very small number of images failed to download, due to assorted minor problems; a few fail because the site shows a 1920 × 1080 download button that actually serves a different resolution

Section 4: Code analysis

  • Please adjust the values highlighted below (shown in red bold in the original post) to fit your own situation

Step 1: Set up global variables

index = 'http://www.netbian.com' # root address of the site
interval = 10 # seconds to wait between image downloads
firstDir = 'D:/zgh/Pictures/netbian' # root path for storing the images
classificationDict = {} # stores each category subpage's URL and folder path
  • index: the root address of the site to crawl; the code needs it to splice relative hrefs into complete URLs
  • interval: when crawling a site's content, we should consider the load on its server; fetching a large amount of content in a short time puts great pressure on it, so we set an interval (in seconds) between downloads. Since I wanted to crawl all the HD images on the site, a concentrated burst would both strain the server and likely get my connection cut, so I set the interval to 10 seconds per image; if you are only crawling a few images, you can use a shorter interval
  • firstDir: the root path on your computer for storing the crawled images; under it, the code creates one folder per category of the site and saves the images into them
  • classificationDict: stores, for each category, the URL of its subpage on the site and the corresponding folder path
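For reference, the later steps assume imports along these lines at the top of the script (UserAgent is the helper module bundled with the download in Section 6; everything else is either standard library or installed above):

import os
import random
import time

import requests
from bs4 import BeautifulSoup

import UserAgent # helper module shipped with the download; provides get_headers()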

Step 2: Get the filtered content list of the page

Write a function that returns the filtered content list of a page

  • Pass in two parameters: url, the URL of the page, and select, the selector (it works seamlessly with CSS selectors, which I love, for locating elements in the HTML)
  • Returns a list
def screen(url, select):
    html = requests.get(url = url, headers = UserAgent.get_headers()) # get a random set of headers
    html.encoding = 'gbk'
    html = html.text
    soup = BeautifulSoup(html, 'lxml')
    return soup.select(select)
  • headers pretends to be a real user visiting the site; a random headers value is picked for each page request to improve the crawler's success rate (a sketch of such a module follows below)
  • encoding is the site's character encoding (gbk here)
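The UserAgent module itself is not shown in this article (it is the second file in the download from Section 6). Purely as an assumption of what it might contain, here is a minimal sketch that just rotates User-Agent strings:

# UserAgent.py - hypothetical minimal sketch; the real module ships with the download
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15',
]

def get_headers():
    # pick a random User-Agent so consecutive requests look less uniform
    return {'User-Agent': random.choice(USER_AGENTS)}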

Step 3: Get the URLs of all categories

# Store each category subpage's info in a dictionary
def init_classification():
    url = index
    select = '#header > div.head > ul > li:nth-child(1) > div > a'
    classifications = screen(url, select)
    for c in classifications:
        href = c.get('href') # retrieve the relative address
        text = c.string # get the category name
        if(text == '4k wallpaper'): # the 4K category cannot be crawled due to permissions, skip it
            continue
        secondDir = firstDir + '/' + text # category directory
        url = index + href # category subpage URL
        global classificationDict
        classificationDict[text] = {
            'path': secondDir,
            'url': url
        }
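A usage sketch (hypothetical; the bundled script may differ): call init_classification(), then create the per-category folders that the directory layout from Section 1 implies:

import os

init_classification()
for text, info in classificationDict.items():
    os.makedirs(info['path'], exist_ok=True) # one folder per category
    print(text, '->', info['url'])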

In the code that follows, I again use the weimei (aesthetic) wallpaper category to explain how the two jumps lead to the HD images

Step 4: Get the URLs of all pages under the category page

Selecting div.page a and taking the sixth element of the list returned by screen gives the last page number we need

However, some categories have fewer than 6 pages, for example:

So you need to write a second filter function that reads the page number from a sibling element instead

# Get the last page number
def screenPage(url, select):
    html = requests.get(url = url, headers = UserAgent.get_headers())
    html.encoding = 'gbk'
    html = html.text
    soup = BeautifulSoup(html, 'lxml')
    return soup.select(select)[0].next_sibling.text # the element right after the matched span holds the last page number

Get the URLs of all the pages under the category page

url = 'http://www.netbian.com/weimei/'
select = '#main > div.page > span.slh'
pageIndex = screenPage(url, select)
lastPagenum = int(pageIndex) # get the last page number
for i in range(lastPagenum):
    if i == 0:
        url = 'http://www.netbian.com/weimei/index.htm'
    else:
        url = 'http://www.netbian.com/weimei/index_%d.htm' %(i+1)
    # ... then fetch and handle the images on this page (Steps 5 and 6)

Because the HTML structure of the site is very clear, the code is straightforward to write

Step 5: Get the image URLs on each page

The obtained URLs are relative addresses, which need to be converted to absolute addresses

select = 'div#main div.list ul li a'
imgUrls = screen(url, select)

The values in the list obtained from these two lines look something like this:

<a href="/desk/21237.htm" target="_blank" title="Starry Night Girls look at beautiful wallpaper update: 2019-12-06"><img alt="Starry Girl looking at beautiful Night wallpaper" src="http://img.netbian.com/file/newc/e4f018f89fe9f825753866abafee383f.jpg"/><b>Starry girl looks at beautiful night wallpaper</b></a>
  • The obtained list needs further processing
  • Take the value of the href attribute of each a tag and convert it to an absolute address, as sketched below; this is the URL needed for the first jump
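A short sketch of that conversion (handleImgs in the next step starts with the same logic; firstJumpUrl is just an illustrative name):

for a in imgUrls:
    href = a.get('href') # e.g. '/desk/21237.htm', a relative address
    firstJumpUrl = index + href # absolute address for the first jump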

Step 6: Locate the 1920 × 1080 resolution image

# Locate the 1920 × 1080 resolution image
def handleImgs(links, path):
    for link in links:
        href = link.get('href')
        if(href == 'http://pic.netbian.com/'): # filter out the image ad
            continue

        # First jump
        if('http://' in href): # a very few images do not provide a correct relative address
            url = href
        else:
            url = index + href
        select = 'div#main div.endpage div.pic div.pic-down a'
        link = screen(url, select)
        if(link == []):
            print(url + ' no such image, crawl failed')
            continue
        href = link[0].get('href')

        # Second jump
        url = index + href

        # Get the image
        select = 'div#main table a img'
        link = screen(url, select)
        if(link == []):
            print(url + ' this image requires login, crawl failed')
            continue
        name = link[0].get('alt').replace('\t', ' ').replace('|', ' ').replace(':', ' ').replace('\\', ' ').replace('/', ' ').replace('*', ' ').replace('?', ' ').replace('"', ' ').replace('<', ' ').replace('>', ' ')
        print(name) # print the file name of the image being downloaded
        src = link[0].get('src')
        if(requests.get(src).status_code == 404):
            print(url + ' this image download link is 404, crawl failed')
            print()
            continue
        print()
        download(src, name, path)
        time.sleep(interval)

Step 7: Download the image

# Download operation
def download(src, name, path):
    if(isinstance(src, str)):
        response = requests.get(src)
        path = path + '/' + name + '.jpg'
        while(os.path.exists(path)): # if a file with this name already exists
            path = path.split(".")[0] + str(random.randint(2, 17)) + '.' + path.split(".")[1]
        with open(path, 'wb') as pic:
            for chunk in response.iter_content(128):
                pic.write(chunk)
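The top-level loop that ties these steps together is not shown in the article (it lives in the downloadable script from Section 6). A hypothetical sketch, assuming each category URL stored by init_classification ends with a trailing slash as in http://www.netbian.com/weimei/:

# hypothetical driver, assembled from the pieces above
init_classification()
for text, info in classificationDict.items():
    lastPagenum = int(screenPage(info['url'], '#main > div.page > span.slh'))
    for i in range(lastPagenum):
        if i == 0:
            url = info['url'] + 'index.htm'
        else:
            url = info['url'] + 'index_%d.htm' % (i + 1)
        links = screen(url, 'div#main div.list ul li a')
        handleImgs(links, info['path'])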

Section 5: Fault tolerance of the code

One: filtering the image ad

if(href == 'http://pic.netbian.com/'): # filter out the image ad
    continue

Two: the first-jump page doesn't contain the link we need

On the netbian wallpaper site, the links on the first-jump page are relative addresses

But a very few images point directly to an absolute address, or to a category page, so two checks are needed

if('http://' in href):
    url = href
else:
    url = index + href

...

if(link == []):
    print(url + ' no such image, crawl failed')
    continue

Next is a problem that occurs on the second-jump page

Three: images that can't be crawled due to permission problems

if(link == []):
    print(url + ' this image requires login, crawl failed')
    continue

Four: the img alt value is used as the downloaded file's name, but it may carry \t or contain characters that file names do not allow:

  • In Python, '\t' is the escape sequence for a tab character (the code replaces it with a space)
  • In Windows, a file name cannot contain any of these nine special characters: \ / : * ? " < > |
name = link[0].get('alt').replace('\t', ' ').replace('|', ' ').replace(':', ' ').replace('\\', ' ').replace('/', ' ').replace('*', ' ').replace('?', ' ').replace('"', ' ').replace('<', ' ').replace('>', ' ')
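As an aside, the same sanitizing can be done in a single call with a regular expression from the standard re module (an equivalent sketch, not the article's code):

import re

# strip tabs plus the nine Windows-forbidden characters in one pass
name = re.sub(r'[\\/:*?"<>|\t]', ' ', link[0].get('alt'))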

Five: the img alt value is used as the file name, but some names repeat

path = path + '/' + name + '.jpg'
while(os.path.exists(path)): # if a file with this name already exists
    path = path.split(".")[0] + str(random.randint(2, 17)) + '.' + path.split(".")[1]

Six: image link 404

For example:

if(requests.get(src).status_code == 404):
    print(url + ' this image download link is 404, crawl failed')
    print()
    continue
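One design note: as written, each image is fetched twice, once for the 404 check and once more inside download. A simplified sketch of fetching once and reusing the same response (a hypothetical refactor; it skips the duplicate-name handling for brevity):

response = requests.get(src)
if response.status_code == 404:
    print(url + ' this image download link is 404, crawl failed')
else:
    with open(path + '/' + name + '.jpg', 'wb') as pic:
        for chunk in response.iter_content(128):
            pic.write(chunk)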

Section 6: Complete code

  • Lanzou cloud link: Python crawler, HD images I want them all (desktop wallpaper).zip; after downloading and unzipping it, you have the two Python files