This post implements a simple crawler that downloads the pictures from a Baidu Tieba (post bar) page. The steps are as follows:

Fetch the HTML text content of the web page.

Analyze how the images are tagged in the HTML (see the short sketch after this list), and use a regular expression to parse out the list of image URL links.

Download the images to a local folder according to that URL list.
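To see what the image tags actually look like (step 2), one quick way is to fetch the page and print a few raw <img> tags before writing the real pattern. This is only an exploratory sketch: the generic <img[^>]*> pattern and the number of tags printed are illustrative choices, and the attributes you see depend on the current Tieba markup.

import re
import requests

# Print the first few <img ...> tags so we can see which attributes to match on
html = requests.get('http://tieba.baidu.com/p/2256306796').text
for tag in re.findall(r'<img[^>]*>', html)[:5]:
    print(tag)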




Code:

import requests

import re

# Get the HTML content of the page at the given URL
def getHtmlContent(url):
    page = requests.get(url)
    return page.text

# Parse all JPG URLs out of the HTML
# In the HTML, a JPG image tag looks like: <img ... src="XXX.jpg" width=... >
def getJPGs(html):
    # Regular expression for JPG image URLs
    jpgReg = re.compile(r'<img.+?src="(.+?\.jpg)" width')
    # Parse out the list of JPG URLs
    jpgs = re.findall(jpgReg, html)
    return jpgs
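# For example, on a tag shaped like (attribute values here are illustrative):
#   <img class="BDE_Image" src="http://example.com/abc.jpg" width="560">
# the pattern above captures 'http://example.com/abc.jpg'.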

# Download the image at imgUrl and save it under fileName
def downloadJPG(imgUrl, fileName):
    # closing() makes sure the response is closed automatically
    from contextlib import closing
    with closing(requests.get(imgUrl, stream=True)) as resp:
        with open(fileName, 'wb') as f:
            for chunk in resp.iter_content(128):
                f.write(chunk)
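# Note on the design: stream=True together with iter_content(128) writes the image
# to disk in 128-byte chunks instead of loading the whole file into memory, and
# closing() guarantees the connection is released when the with block exits.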


# Batch-download the images, saving them to the given directory (default below)
def batchDownloadJPGs(imgUrls, path='C:/Users/Administrator/Desktop/picture/'):
    # Counter used to name the saved images
    count = 1
    for url in imgUrls:
        downloadJPG(url, ''.join([path, '{0}.jpg'.format(count)]))
        print('Downloading... please wait... {0}.jpg'.format(count))
        count = count + 1


# Wrapper: download all images from the web page
def download(url):
    html = getHtmlContent(url)
    jpgs = getJPGs(html)
    batchDownloadJPGs(jpgs)


def main():
    url = 'http://tieba.baidu.com/p/2256306796'
    download(url)
    print('================================')
    print('Download complete! Please check the pictures in the target directory.')


if __name__ == '__main__':
    main()
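One practical caveat: the default save path points at a Windows desktop folder, and the script assumes that folder already exists; otherwise open() raises FileNotFoundError. A minimal safeguard, which is not part of the original script, is to create the directory up front:

import os

# Create the target folder first so open() in downloadJPG cannot fail because
# the directory is missing (path value taken from the script's default).
path = 'C:/Users/Administrator/Desktop/picture/'
os.makedirs(path, exist_ok=True)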