[Python crawler project in practice] Downloading short videos

This is day 12 of my participation in the August More Text Challenge.

Short videos are everywhere these days: people scroll through them while eating, while resting, and while lying in bed. Today's advanced Python crawler topic is decoding the obfuscated video addresses on Meipai.

Reverse-engineering the short-video JS

Crawl target

Target URL: Meipai Video

Tools

Development environment: Win10, Python 3.7
Development tools: PyCharm, Chrome
Libraries: requests, lxml (XPath), base64

What this post covers:
How the crawler collects and parses the data
JS code debugging techniques
Reverse-engineering the JS parsing code
Converting the JS code to Python

Project approach

Open the site's home page and pick a category you are interested in. From the home page you can collect the hyperlinks that lead to each video's detail page.

On the detail page, find the encoded video playback address.

This data is part of the static HTML of the page and is decoded by JS code.
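As a minimal sketch of this step, the snippet below pulls the encoded string out of a detail page. It assumes the markup matches what the full script in this post queries: a div with id detailVideo that carries a data-video attribute. It uses the standard library's html.parser as a stand-in for the lxml XPath query, and the sample HTML, DataVideoParser, and extract_data_video are made up for illustration.

```python
from html.parser import HTMLParser


class DataVideoParser(HTMLParser):
    """Collect the data-video attribute of <div id="detailVideo">."""

    def __init__(self):
        super().__init__()
        self.data_video = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "div" and attr_map.get("id") == "detailVideo":
            self.data_video = attr_map.get("data-video")


def extract_data_video(html_text):
    """Return the encoded video string, or None if the div is missing."""
    parser = DataVideoParser()
    parser.feed(html_text)
    return parser.data_video


# Hypothetical page fragment; a real page carries a much longer string.
sample_html = '<div id="detailVideo" data-video="e121Ly9tBrI84Rdn"><img alt="demo"></div>'
print(extract_data_video(sample_html))
```

Because the attribute is already in the HTML, no browser automation is needed to read it; only the decoding has to be reproduced.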

Find the corresponding parsing code

Find the address the video plays from

Find the JS file that decodes the video address

Clicking Play triggers this file

You can see that this is Base64-encoded data

Search the JS file for the corresponding keyword

Find how the JS decodes the data

JS functions used in the decoding code

# replace(): replaces some characters in a string with others.
# parseInt(): converts data to the corresponding integer.
# Base64.atob(): decodes a Base64-encoded string.
# substr(): extracts a specified number of characters from a string, starting at a given index.
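For readers porting the JS by hand, here is a rough mapping of those calls to Python. This is a sketch with made-up sample values, not code from the site. The one behavioral difference worth noting is that JS replace() with a string pattern changes only the first match, while Python's str.replace changes every match unless a count is passed.

```python
import base64

text = "abcabc"

# JS: text.replace("b", "") removes only the first "b".
first_only = text.replace("b", "", 1)

# JS: parseInt("121e", 16)
as_int = int("121e", 16)

# JS: Base64 decoding (atob) of "Ly9t"
decoded = base64.b64decode("Ly9t").decode("utf-8")

# JS: text.substr(1, 3) takes 3 characters starting at index 1.
start, length = 1, 3
piece = text[start:start + length]

print(first_only, as_int, decoded, piece)
```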

Convert js code to Python code

import base64


def decode(data):
    def getHex(a):
        # The first 4 characters are a reversed hex header; the rest is the payload.
        return {
            'str': a[4:],
            'hex': ''.join(list(a[:4])[::-1]),
        }

    def getDec(a):
        # The header's decimal digits hold two (position, length) pairs.
        b = str(int(a, 16))
        return {
            'pre': list(b[:2]),
            'tail': list(b[2:]),
        }

    def substr(a, b):
        # Remove b[1] characters from a, starting at index b[0].
        c = a[0: int(b[0])]
        d = a[int(b[0]): int(b[0]) + int(b[1])]
        # count=1 mirrors JS replace(), which replaces only the first match.
        return c + a[int(b[0]):].replace(d, "", 1)

    def getPos(a, b):
        # The tail position is counted from the end of the string.
        b[0] = len(a) - int(b[0]) - int(b[1])
        return b

    b = getHex(data)
    c = getDec(b['hex'])
    d = substr(b['str'], c['pre'])
    return base64.b64decode(substr(d, getPos(d, c['tail'])))


print(decode("e121Ly9tBrI84RdnZpZGVvMTAubWVpdHVkYXRhLmNvbS82MGJjZDcwNTE3NGZieXBueG5udnRwMTA5N19IMjY0XzFfNWY3YThmM2U0MTEwNy5tc2JVjAu3EDQ="))
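To make the scheme easier to follow, here is a step-by-step trace of the same algorithm on the sample string from this post. The names remove_span and trace_decode are illustrative and do not come from the original JS. Slicing is used in place of replace(), on the assumption that only the one inserted chunk at each position is meant to be removed.

```python
import base64

SAMPLE = (
    "e121Ly9tBrI84RdnZpZGVvMTAubWVpdHVkYXRhLmNvbS82MGJjZDcwNTE3NGZieXBueG5udnRwMTA5N19IMjY0XzFfNWY3YThmM2U0MTEw"
    "Ny5tc2JVjAu3EDQ="
)


def remove_span(text, start, length):
    """Drop `length` characters from `text`, beginning at index `start`."""
    return text[:start] + text[start + length:]


def trace_decode(data):
    # 1. The first four characters are a header: a hex number stored reversed.
    header, body = data[:4], data[4:]
    digits = str(int(header[::-1], 16))
    # 2. Its decimal digits give two (position, length) pairs.
    pre_pos, pre_len = int(digits[0]), int(digits[1])
    tail_pos, tail_len = int(digits[2]), int(digits[3])
    # 3. The first junk chunk is located from the start of the body.
    body = remove_span(body, pre_pos, pre_len)
    # 4. The second chunk is located from the end of what remains.
    body = remove_span(body, len(body) - tail_pos - tail_len, tail_len)
    # 5. What is left is ordinary Base64.
    return digits, base64.b64decode(body).decode("utf-8")


digits, url = trace_decode(SAMPLE)
print(digits)
print(url)
```

For this sample the header e121 reverses to 121e, which is 4638 in decimal. So six characters are removed starting at index 4, and eight characters are removed from a chunk that ends three characters before the end of the string.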

Get the final video playback address
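The full script in this post prepends "http:" to the decoded address, which suggests the decoded value is protocol-relative (it starts with //). One hedged alternative is to resolve it against the page URL with urllib.parse.urljoin, so that it inherits the page's scheme. The address below is a made-up placeholder.

```python
from urllib.parse import urljoin

page_url = "https://www.meipai.com"
# Placeholder standing in for a decoded, protocol-relative address.
decoded_address = "//video.example.com/path/clip.mp4"

# A reference starting with // keeps its own host and path
# and takes only the scheme from the base URL.
video_url = urljoin(page_url, decoded_address)
print(video_url)
```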

Full source code

import base64

import requests
from lxml import etree


def decode_mp4(data):
    def getHex(a):
        return {
            'str': a[4:],
            'hex': ''.join(list(a[:4])[::-1]),
        }

    def getDec(a):
        b = str(int(a, 16))
        return {
            'pre': list(b[:2]),
            'tail': list(b[2:]),
        }

    def substr(a, b):
        c = a[0: int(b[0])]
        d = a[int(b[0]): int(b[0]) + int(b[1])]
        # count=1 mirrors JS replace(), which replaces only the first match.
        return c + a[int(b[0]):].replace(d, "", 1)

    def getPos(a, b):
        b[0] = len(a) - int(b[0]) - int(b[1])
        return b

    b = getHex(data)
    c = getDec(b['hex'])
    d = substr(b['str'], c['pre'])
    return base64.b64decode(substr(d, getPos(d, c['tail'])))


def main():
    url = 'https://www.meipai.com'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    }
    # Collect the detail-page links from the home page.
    response = requests.get(url=url, headers=headers)
    html_data = etree.HTML(response.text)
    href_list = html_data.xpath('//div/a/@href')
    # print(href_list)
    for href in href_list:
        res = requests.get('https://www.meipai.com' + href, headers=headers)
        html = etree.HTML(res.text)
        name = html.xpath('//div[@id="detailVideo"]/img/@alt')[0]
        mp4_data = html.xpath('//div[@id="detailVideo"]/@data-video')[0]
        # print(name, mp4_data)
        mp4_url = decode_mp4(mp4_data).decode('utf-8')
        print(mp4_url)
        result = requests.get("http:" + mp4_url)
        # The with block closes the file, so no explicit close() is needed.
        with open(name + ".mp4", 'wb') as f:
            f.write(result.content)


if __name__ == '__main__':
    main()
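One practical caveat with the script above: it uses the video title as the file name, and titles can contain characters that Windows does not allow in file names. A minimal sketch of a sanitizer follows. safe_filename is a hypothetical helper, not part of the original script, and the fallback name is an arbitrary choice.

```python
import re


def safe_filename(title, fallback="video"):
    """Replace characters that are invalid in Windows file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n]', "_", title).strip(" .")
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or fallback


print(safe_filename('my: "first" clip?') + ".mp4")
```

In the script, this would wrap name before ".mp4" is appended when opening the output file.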

I'm "white and white I", a programmer who loves sharing knowledge ❤️

If you are new to programming and find parts of this post unclear, or you want to learn Python, feel free to leave a comment or send me a private message. Thank you very much for your likes, bookmarks, follows, and comments.
