This crawler is for learning purposes only; the author takes no responsibility for how it is used. If you want to repost this article, please message me first!!


Preface

The source code for this article is available on GitHub at github.com/2335119327/… (the repo also contains non-crawler content from this blog that interested readers can check out). It will be updated continuously; Stars are welcome.

Today's crawler is a very simple one; anyone with a little bit of a foundation can follow it. Come on, let's get it done!!



Page analysis


URL pattern for multi-page crawling

One glance at the home page tells you these are high-quality images.

Scroll to the bottom: 162 pages. That's more than enough for me to play with!

OK, without further ado: before we can crawl, we first need to understand the URL.

This is the URL for the first page

The second page

The third page

The pattern needs no explanation: just set the value of p to the current page number. But some readers might say: doesn't the first page have no p=1?

As you can see, if we manually access the first page with p=1, we can also access it successfully

Make a note of that!
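To make the pattern concrete, here is a minimal sketch (not part of the original code, and using only the base URL mentioned in the article) of how each page's URL can be built from the page number:

```python
# Minimal sketch: build the URL for each page of https://bing.ioliu.cn.
# The first page also accepts ?p=1, so one pattern covers every page.
BASE_URL = "https://bing.ioliu.cn"

def page_url(p):
    # e.g. page_url(2) -> "https://bing.ioliu.cn/?p=2"
    return BASE_URL + "/?p=" + str(p)

for p in range(1, 4):
    print(page_url(p))
```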


The image download URL

This article uses Beautiful Soup for data parsing. If you're not familiar with it, you can read my earlier article:

Beautiful Soup for Python crawlers

Open the browser's developer console.

You can see that each picture corresponds to a div whose class is item.

The title

The title sits in the h3 tag inside the div with class description, which is nested under the item div.
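To illustrate that structure, here is a small self-contained sketch that parses a simplified HTML fragment (the fragment is my own approximation of the page's markup, not copied from the site) and pulls out the title and the download link:

```python
from bs4 import BeautifulSoup

# Rough approximation of one gallery item; the real markup has more attributes,
# and "/photo/xxxx?force=download" is just a made-up placeholder href.
html = """
<div class="item">
  <div class="description"><h3>Sample wallpaper title</h3></div>
  <a class="ctrl download" href="/photo/xxxx?force=download"></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for div in soup.find_all(class_="item"):
    title = div.find(class_="description").find("h3").text
    href = div.find(class_="ctrl download")["href"]
    print(title, "https://bing.ioliu.cn" + href)
```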

Download URL

def getUrl(curPage, data, page_path):
    # Parse the page with BeautifulSoup
    data = BeautifulSoup(data, "html.parser")
    div_list = data.find_all(class_="item")
    for div in div_list:
        # Build the full download URL
        img_url = "https://bing.ioliu.cn" + div.find(class_="ctrl download")["href"]
        # Get the title
        title = div.find(class_="description").find("h3").text
        # The title becomes the file name, so strip special characters from it
        title = replaceTitle(title)
        downLoadImg(curPage, title, img_url, page_path)

Downloading the images

def downLoadImg(curPage, title, img_url, page_path):
    print("Crawling page " + str(curPage) + ": " + title)
    # .content is the raw binary byte stream of the response
    img_res = requests.get(url=img_url, headers=headers).content
    # Save as a .jpg image (it could just as well be .png)
    with open(page_path + "/" + title + ".jpg", "wb") as f:
        f.write(img_res)
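The key point is that .content gives the raw bytes of the image, which is exactly what gets written to disk in binary ("wb") mode. For large files, a reasonable alternative (not used in this article, just a sketch) is to stream the download in chunks:

```python
import requests

# Sketch of a chunked download; img_url and headers are assumed to be
# defined as in the article's code.
def download_streamed(img_url, dest_path, headers):
    with requests.get(img_url, headers=headers, stream=True, timeout=30) as resp:
        resp.raise_for_status()  # fail fast on HTTP errors
        with open(dest_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)
```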

The complete code

import requests
from bs4 import BeautifulSoup
import os


headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36 Edg/91.0.864.59"
}

path = "./Bing Gallery"

# Characters that are not allowed in file names
symbol_list = ["\\", "/", "<", ":", "*", "?", ">", "|", "\""]

def replaceTitle(title):
    title = "".join(str(title).split())
    for i in symbol_list:
        if title.find(str(i)) != -1:
            title = title.replace(str(i), "")

    return title

def getUrl(curPage, data, page_path):
    data = BeautifulSoup(data, "html.parser")
    div_list = data.find_all(class_="item")
    for div in div_list:
        img_url = "https://bing.ioliu.cn" + div.find(class_="ctrl download")["href"]
        title = div.find(class_="description").find("h3").text
        title = replaceTitle(title)
        downLoadImg(curPage, title, img_url, page_path)

def downLoadImg(curPage, title, img_url, page_path):
    print("Crawling page " + str(curPage) + ": " + title)
    img_res = requests.get(url=img_url, headers=headers).content
    with open(page_path + "/" + title + ".jpg", "wb") as f:
        f.write(img_res)


if __name__ == '__main__':
    if not os.path.exists(path):
        os.mkdir(path)

    base_url = "https://bing.ioliu.cn"
    for i in range(1, 3):
        page_path = path + "/Page " + str(i) + " Gallery"
        if not os.path.exists(page_path):
            os.mkdir(page_path)
        # Page 1 works without a query string; later pages use ?p=<page number>
        url = base_url if i == 1 else base_url + "/?p=" + str(i)
        response = requests.get(url=url, headers=headers)
        getUrl(i, response.text, page_path)



Crawl results (high-definition images, very pleasant to look at)

Since this was just a test, I only crawled two pages.
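If you want to grab more (or all 162) pages, one possible tweak (a sketch, not from the original article) is to widen the range and pause between requests so you don't hammer the server; path, headers and getUrl are assumed to come from the complete code above:

```python
import time

# Sketch: crawl the first N pages with a short pause between them.
N = 162
for i in range(1, N + 1):
    page_path = path + "/Page " + str(i) + " Gallery"
    if not os.path.exists(page_path):
        os.mkdir(page_path)
    url = "https://bing.ioliu.cn/?p=" + str(i)
    response = requests.get(url=url, headers=headers)
    getUrl(i, response.text, page_path)
    time.sleep(2)  # be polite: pause between pages
```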

==They are all 1920×1080! If you like them, a like, favorite, and follow would be much appreciated. Thanks for the support 😁==



Finally

I am Code Pipi Shrimp, a knowledge-sharing enthusiast. I will keep publishing useful blog posts in the future and look forward to your follow!!

Creating content is not easy. If this post helped you, I hope you can ==like, favorite, and follow==! Thank you for your support, and see you next time!

Shared columns

Big Tech Interview Questions column

Python Crawler Column

The source code for this article is available on GitHub at github.com/2335119327/… (the repo also contains non-crawler content from this blog that interested readers can check out). It will be updated continuously; Stars are welcome.