This is the 16th day of my participation in the August Text Challenge.

Hello, I'm White and White i. If you like my posts, please give me a thumbs-up!

Recently my best friend got me hooked on two hot-blooded anime, “The Valkyrie of the End” and “Asura Of the Fist”, and I stayed up late on weekends to watch them. The episodes were not easy to find, though, so in a fit of frustration I wrote a crawler to fetch them, and now I can watch as much as I like. My best friend was full of admiration, and it reminded me of the heartthrob at my school, who looks just like the hero in the anime…

Python Crawler – VIP animation collection

Results show

Crawl target

Target site: Cherry Blossom Anime (www.imomoe.la)

Tool use

Development tool: PyCharm

Development environment: Python 3.7, Windows 10

Libraries used: requests, lxml, re, tqdm

Key learning content

Using regular expressions, using tqdm, and handling different kinds of audio and video data

Project idea analysis

Search for the anime you want. How the video is parsed depends on the title you pick (two kinds of video are parsed in this article).

On the anime's page, extract the chapter information: the href of each chapter's a tag and the name of each chapter. I used XPath for the extraction (you can try other methods yourself).

```python
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'Referer': 'http://www.imomoe.la/search.asp'
}
url = 'http://www.imomoe.la/view/8024.html'
response = requests.get(url, headers=headers)
# print(response.content.decode('gbk'))
# The page is decoded as GBK before it is parsed
html_data = etree.HTML(response.content.decode('gbk'))
# Chapter names (the text of each <a> tag)
chapter_list = html_data.xpath('//div[@class="movurl"]/ul/li/a/text()')
# Relative URL of the first chapter only (note the [0])
chapter_url_list = html_data.xpath('//div[@class="movurl"]/ul/li/a/@href')[0]
```

The extracted href is a relative path, so you have to join it with the site's domain yourself. The detail (player) page can then be requested with the resulting URL.
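As a small aside, the joining step can also be done with the standard library's `urljoin` instead of string concatenation. This is only a sketch: the href values below are hypothetical stand-ins for whatever the XPath query returns, and only the page URL is taken from this article.

```python
from urllib.parse import urljoin

# Page the chapter links were scraped from (taken from the article)
BASE_URL = "http://www.imomoe.la/view/8024.html"

# Hypothetical relative hrefs, shaped like the values the XPath query returns
hrefs = ["/player/8024-0-0.html", "/player/8024-0-1.html"]

# urljoin resolves each relative path against the page URL
chapter_urls = [urljoin(BASE_URL, href) for href in hrefs]
print(chapter_urls)
```

Compared with `'http://www.imomoe.la' + href`, `urljoin` also copes with hrefs that are already absolute or that lack a leading slash.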

The usual first step is to check whether the play address appears in the static page source.

It clearly does not, so the next step is to use a packet capture tool to look for dynamically loaded data.

It does not show up as dynamically loaded data either, and it is not obvious how the media URL is put together.

So start again from the page source and look for anything related to the video player.

The page itself contains no usable data, but the script tag under the iframe references a JS file, and the resolved URL is on the same domain as the video play address. Opening it shows the video play addresses we were looking for. The implementation is then: extract the script's src from the current page with XPath, join it with the domain to get the JS file's URL, send a request, and extract all the MP4 play addresses with a regular expression.

```python
import re

# Build the absolute URL of the chapter's player page
new_url = 'http://www.imomoe.la' + chapter_url_list
response = requests.get(new_url, headers=headers)
html = etree.HTML(response.content.decode('gbk'))

# The first <script> under the player div points to the JS file with the play addresses
data_url = 'http://www.imomoe.la' + html.xpath('//div[@class="player"]/script[1]/@src')[0]
res = requests.get(data_url, headers=headers).text
# print(res)
# Each play address sits between '$' and '$flv'
play_url_list = re.findall(r'\$(.*?)\$flv', res)
print(play_url_list)
```
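To see what this regular expression captures, here is a small offline sketch. The script excerpt is a made-up sample (the real file's layout may differ), and `extract_play_urls` is a hypothetical helper name, not part of any library.

```python
import re

# Hypothetical excerpt of the player script; the real file may be laid out differently
sample_js = (
    "var VideoListJson=[['src',["
    "'Episode 01$http://example.com/v/01.mp4$flv',"
    "'Episode 02$http://example.com/v/02.mp4$flv']]]"
)


def extract_play_urls(js_text, kind="flv"):
    """Return every string found between '$' and '$<kind>' in js_text."""
    pattern = r"\$(.*?)\$" + re.escape(kind)
    return re.findall(pattern, js_text)


flv_urls = extract_play_urls(sample_js)
other_urls = extract_play_urls(sample_js, kind="bdhd")
print(flv_urls)
print(other_urls)
```

If the script tags its entries with a different suffix than the one the pattern expects, the result is simply an empty list, so an empty result is worth checking for before downloading.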

Save the video data: send a request for each play address and write the response to an MP4 file. The tqdm tool shows the download speed and progress.

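```python
from tqdm import tqdm

for chapter, play_url in tqdm(zip(chapter_list, play_url_list)):
    # Download one episode and save it as <chapter>.mp4
    # (saved to the current directory here; prepend a folder of your choice)
    result = requests.get(play_url, headers=headers).content
    with open(chapter + '.mp4', 'wb') as f:
        f.write(result)
```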

But when I changed the URL to another anime, “Fight the Sky”, the same code returned empty data.

That video is loaded in M3U8 format, and other titles may load their data differently again. Handling M3U8 data is slightly more involved: the M3U8 file contains the link to a nested M3U8 file, so that link has to be joined with the host to get the real playlist. From that playlist, extract the TS segments, download them, and splice them into one video.

```python
m3u8_url_list = re.findall(r'\$(.*?)\$bdhd', res)
for m3u8_url, chapter in zip(m3u8_url_list, chapter_list):
    # The first playlist only points to a nested playlist
    data = requests.get(m3u8_url, headers=headers)
    # print(data.text)
    new_m3u8_url = 'https://cdn.605-zy.com/' + re.findall(r'/(.*?m3u8)', data.text)[0]
    # print(new_m3u8_url)
    # The nested playlist lists the .ts segments
    ts_data = requests.get(new_m3u8_url, headers=headers)
    ts_url_list = re.findall(r'/(.*?ts)', ts_data.text)
    print(chapter)
    for ts_url in tqdm(ts_url_list):
        result = requests.get('https://cdn.605-zy.com/' + ts_url).content
        # Append the segments in order so they form one file
        # (saved to the current directory here; prepend a folder of your choice)
        with open(chapter + '.mp4', 'ab') as f:
            f.write(result)
```
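The nested-playlist step can be tried offline. The sketch below is not the code used above: it swaps the regular expressions for a line filter plus `urljoin`, and the playlist texts, the `cdn.example.com` host, and the `playlist_entries` helper are all hypothetical.

```python
from urllib.parse import urljoin


def playlist_entries(m3u8_text):
    """Return the URI lines of a playlist, skipping blank lines and '#' tags."""
    entries = []
    for line in m3u8_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.append(line)
    return entries


# Hypothetical outer playlist: it only points to a nested playlist
master_url = "https://cdn.example.com/20210801/abc/index.m3u8"
master_text = """#EXTM3U
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=800000,RESOLUTION=1280x720
/20210801/abc/800kb/hls/index.m3u8
"""

# Hypothetical nested playlist: it lists the .ts segments
media_text = """#EXTM3U
#EXT-X-TARGETDURATION:4
#EXTINF:4.0,
/20210801/abc/800kb/hls/seg0000.ts
#EXTINF:4.0,
/20210801/abc/800kb/hls/seg0001.ts
#EXT-X-ENDLIST
"""

# Resolve the nested playlist against the outer one, then each segment against it
media_url = urljoin(master_url, playlist_entries(master_text)[0])
segment_urls = [urljoin(media_url, uri) for uri in playlist_entries(media_text)]
print(media_url)
print(segment_urls)
```

Resolving against the playlist's own URL avoids hard-coding the CDN host, which may matter if different titles are served from different hosts.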

Summary of Project ideas

Get the URL of the anime you want

Extract each chapter's name and its link to the detail page

Get the static JS file referenced by the detail page

Parse the video play address or the M3U8 file

Save the corresponding data

Simple source code sharing

```python
import re

import requests
from lxml import etree
from tqdm import tqdm

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'Referer': 'http://www.imomoe.la/search.asp'
}
url = 'http://www.imomoe.la/view/8024.html'
response = requests.get(url, headers=headers)
# print(response.content.decode('gbk'))
html_data = etree.HTML(response.content.decode('gbk'))
# Chapter names, and the relative URL of the first chapter (note the [0])
chapter_list = html_data.xpath('//div[@class="movurl"]/ul/li/a/text()')
chapter_url_list = html_data.xpath('//div[@class="movurl"]/ul/li/a/@href')[0]
# print(chapter_list)
# print(chapter_url_list)

# Request the player page and locate the JS file that holds the play addresses
new_url = 'http://www.imomoe.la' + chapter_url_list
response = requests.get(new_url, headers=headers)
html = etree.HTML(response.content.decode('gbk'))
data_url = 'http://www.imomoe.la' + html.xpath('//div[@class="player"]/script[1]/@src')[0]
res = requests.get(data_url, headers=headers).text
# print(res)
play_url_list = re.findall(r'\$(.*?)\$flv', res)
print(play_url_list)

for chapter, play_url in tqdm(zip(chapter_list, play_url_list)):
    # Saved to the current directory here; prepend a folder of your choice
    result = requests.get(play_url, headers=headers).content
    with open(chapter + '.mp4', 'wb') as f:
        f.write(result)
```

If you get stuck or have questions while learning Python, feel free to leave a comment.