preface

Python can be used to crawl resources such as images, audio, and video from a web page. The usual approach is to request the page's HTML and then extract the desired resource URLs with XPath or regular expressions. Here I built a crawler tool that grabs a page's media files with one click. Note, however, that it only crawls resource files that already exist in the HTML; resources that require a second request (for example, the Kugou Music playback interface) are not covered, because the goal is a general-purpose tool that works across different websites!! 😀 😀 😀

The main focus is image crawling: if you need image material, just enter a URL and grab everything with one click!

For videos, the tool also crawls magnet links, which you can then download with a third-party download tool! 🤗

code

Crawl resource files

One thing to explain first: some image resources are not URL links but `data:image` (Base64) strings, and these need to be decoded and converted before they can be stored.
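In code, the conversion is straightforward: strip the `data:image/...;base64,` prefix, restore any Base64 padding the page stripped, and write the decoded bytes to a file. A minimal sketch (the function and file names here are illustrative, not part of the tool):

```python
import base64

def save_data_image(data_uri, name_prefix='img'):
    # "data:image/png;base64,AAAA..." -> header and Base64 payload
    header, payload = data_uri.split('base64,', 1)
    # "data:image/png;" -> "png"
    img_type = header.split('/')[-1].rstrip(';')
    # restore padding that pages often strip from the payload
    payload += '=' * (-len(payload) % 4)
    img_data = base64.urlsafe_b64decode(payload)
    file_name = name_prefix + '.' + img_type
    with open(file_name, 'wb') as f:
        f.write(img_data)
    return file_name
```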

import base64
# relies on helpers defined elsewhere in the project:
# requestsDataBase, repairUrl, zfjTools, imgType_list, audioType_list, videoType_list

def getResourceUrlList(url, isImage, isAudio, isVideo):
	global imgType_list, audioType_list, videoType_list
	imageUrlList = []
	audioUrlList = []
	videoUrlList = []

	url = url.rstrip().rstrip('/')
	htmlStr = str(requestsDataBase(url))
	# print(htmlStr)

	Wopen = open('reptileHtml.txt', 'w')
	Wopen.write(htmlStr)
	Wopen.close()

	Ropen = open('reptileHtml.txt', 'r')
	imageUrlList = []

	for line in Ropen:
		line = line.replace("'", '"')
		segmenterStr = '"'
		if "'" in line:
			segmenterStr = "'"

		lineList = line.split(segmenterStr)
		for partLine in lineList:
			if isImage == True:
				# search for Base64-embedded images (data:image)
				if 'data:image' in partLine:
					base64List = partLine.split('base64,')
					imgData = base64.urlsafe_b64decode(base64List[-1] + '=' * (4 - len(base64List[-1]) % 4))
					base64ImgType = base64List[0].split('/')[-1].rstrip(';')
					imageName = zfjTools.getTimestamp() + '.' + base64ImgType
					imageUrlList.append(imageName + '$==$' + base64ImgType)

				# search for image URLs
				for imageType in imgType_list:
					if imageType in partLine:
						imgUrl = partLine[:partLine.find(imageType) + len(imageType)].split(segmenterStr)[-1]

						# repair the URL
						imgUrl = repairUrl(imgUrl, url)

						# strip size placeholders such as "_{size}"
						sizeType = '_{' + 'size' + '}'
						if sizeType in imgUrl:
							imgUrl = imgUrl.replace(sizeType, '')

						imgUrl = imgUrl.strip()

						if (imgUrl.startswith('http://') or imgUrl.startswith('https://')) and imgUrl not in imageUrlList:
							imageUrlList.append(imgUrl)
						else:
							imgUrl = ''

			if isAudio == True:
				# find audio
				for audioType in audioType_list:
					if audioType in partLine or audioType.lower() in partLine:
						audioType = audioType.lower() if audioType.lower() in partLine else audioType
						audioUrl = partLine[:partLine.find(audioType) + len(audioType)].split(segmenterStr)[-1]

						# repair the URL
						audioUrl = repairUrl(audioUrl, url)

						if (audioUrl.startswith('http://') or audioUrl.startswith('https://')) and audioUrl not in audioUrlList:
							audioUrlList.append(audioUrl)
						else:
							audioUrl = ''

			if isVideo == True:
				# find video
				for videoType in videoType_list:
					if videoType in partLine or videoType.lower() in partLine:
						videoType = videoType.lower() if videoType.lower() in partLine else videoType
						videoUrl = partLine[:partLine.find(videoType) + len(videoType)].split(segmenterStr)[-1]

						# repair the URL
						videoUrl = repairUrl(videoUrl, url)

						if (videoUrl.startswith('http://') or videoUrl.startswith('https://') or videoUrl.startswith('ed2k://') or videoUrl.startswith('magnet:?') or videoUrl.startswith('ftp://')) and videoUrl not in videoUrlList:
							videoUrlList.append(videoUrl)
						else:
							videoUrl = ''

	return (imageUrlList, audioUrlList, videoUrlList)
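The helper `repairUrl` is not shown in the snippet above; judging from how it is called, its job is to turn the relative or protocol-relative URLs found in the HTML into absolute ones. A plausible sketch using `urllib.parse.urljoin` (this reconstruction is an assumption, not the tool's actual implementation):

```python
from urllib.parse import urljoin

def repairUrl(resource_url, page_url):
    # hypothetical reconstruction of the helper used above
    resource_url = resource_url.strip()
    if resource_url.startswith(('http://', 'https://', 'ftp://', 'ed2k://', 'magnet:?')):
        return resource_url            # already absolute
    if resource_url.startswith('//'):
        return 'http:' + resource_url  # protocol-relative URL
    return urljoin(page_url + '/', resource_url)  # relative path
```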

Crawling user-defined nodes

# configure node crawl
def getNoteInfors(url, fatherNode, childNode):
	url = url.rstrip().rstrip('/')
	htmlStr = requestsDataBase(url)

	Wopen = open('reptileHtml.txt', 'w')
	Wopen.write(htmlStr)
	Wopen.close()

	html_etree = etree.HTML(htmlStr)

	dataArray = []

	if html_etree is not None:
		nodes_list = html_etree.xpath(fatherNode)
		for k_value in nodes_list:
			partValue = k_value.xpath(childNode)
			if len(partValue) > 0:
				dataArray.append(partValue[0])

	return dataArray
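For example, with a (hypothetical) father node of `//ul/li` and a child node of `.//a/text()`, the function collects the first text match under each list item. The same XPath flow, runnable on an inline HTML snippet:

```python
from lxml import etree

html = '''<html><body>
  <ul>
    <li><a href="/a">First</a></li>
    <li><a href="/b">Second</a></li>
  </ul>
</body></html>'''

tree = etree.HTML(html)
results = []
if tree is not None:
    for node in tree.xpath('//ul/li'):      # father node
        part = node.xpath('.//a/text()')    # child node
        if len(part) > 0:
            results.append(part[0])

print(results)  # ['First', 'Second']
```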

software

Download the software from gitee.com/zfj1128/ZFJ…

Instructional videos

Resource crawl: link: pan.baidu.com/s/1xa9ruF_h… password: 1zpg

Node crawl: link: pan.baidu.com/s/1ebWWYtjo… password: cosa

Usage screenshots are as follows:

conclusion

Your valuable opinions and suggestions are welcome!!!! 🤗 🤗 🤗