Single short video

Get information about the video

To make the analysis easier to follow, let's pick a video to use as the running example:

AcFun Family Party – Chengdu Station (Today is another LSP day ~~)

Open the link in the browser and capture the network traffic. Under the XHR tab, one of the first requests loads the real playback link of the video:

![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/99f97c7aa88e49c0a7c44358514170e4)

Although it is only an m3u8 file, there is still a way to deal with it. Before doing so, we first need to find out where this file's address comes from, that is, where the link can be found.

After searching through the XHR data in vain, I decided to look at the page source. Searching the source for the m3u8 file's request link turned it up:

![](https://p3-tt-ipv6.byteimg.com/large/pgc-image/62ff6d9c1265454f9af4cbd46f9695ce)

The link is stored in the source code as JSON data:

![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/abbca9a7808a484dbb079e9b74065ea3)

So we format the data to make extraction easier:

![](https://p9-tt-ipv6.byteimg.com/large/pgc-image/6be5795581da4f07bc856aa5f9026e71)

After formatting the data, I found that the value of one of the fields is itself a JSON string, so I formatted this second layer of JSON as well and saw the following information:

![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/75a62f6d9d744490a031cd6974875587)
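The two-layer structure can be tried out offline. The sketch below is only an illustration: the miniature page is made up, and instead of the regular-expression trick used in this article it locates the assignment with a regex and then lets `json.JSONDecoder.raw_decode` read exactly one JSON value. The field names (`title`, `currentVideoInfo`, `ksPlayJson`, `adaptationSet`, `representation`, `qualityLabel`, `url`) are the ones that appear in the article's code; the function name and the sample URL are assumptions.

```python
import json
import re


def extract_representations(html):
    """Return (title, representations) from the JSON embedded in the page."""
    marker = re.search(r"window\.pageInfo = window\.\w+ = ", html)
    if marker is None:
        raise ValueError("no window.pageInfo assignment found")
    # raw_decode parses one JSON value from the start of the string and
    # ignores whatever follows it (here the trailing ";</script>").
    decoder = json.JSONDecoder(strict=False)
    page_info, _ = decoder.raw_decode(html[marker.end():])
    # Second layer: the value of ksPlayJson is itself a JSON string.
    play_info = json.loads(page_info["currentVideoInfo"]["ksPlayJson"], strict=False)
    return page_info["title"], play_info["adaptationSet"][0]["representation"]


# Hypothetical miniature page, built to mimic the two-layer structure.
inner = {"adaptationSet": [{"representation": [
    {"qualityLabel": "1080P", "url": "https://example.invalid/hls/demo.m3u8"},
]}]}
outer = {"title": "demo video", "currentVideoInfo": {"ksPlayJson": json.dumps(inner)}}
SAMPLE_HTML = "<script>window.pageInfo = window.videoInfo = " + json.dumps(outer) + ";</script>"

title, representations = extract_representations(SAMPLE_HTML)
print(title, representations[0]["qualityLabel"])
```

Reading one JSON value this way does not depend on where the first `};` happens to appear in the page, which is the weak spot of a non-greedy regex.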

Even when we are not logged in and the web page itself will not play the higher definitions, the "backend" has already handed us ready-to-play links for them (Bilibili, by contrast, only loads up to the maximum definition available to the current account or to a logged-out visitor). So we can grab the ultra-clear resources for free without logging in ~~

    # Excerpt: requests, re, json and headers come from the full source.
    class m3u8_url():
        def __init__(self, f_url):
            self.url = f_url

        def get_m3u8(self):
            html = requests.get(self.url, headers=headers).text
            # First layer: the JSON object assigned to window.pageInfo
            first_json = json.loads(re.findall('window.pageInfo = window.*? = (.*?)};', html)[0] + '}', strict=False)
            name = first_json['title'].strip().replace("|", '')
            # Second layer: ksPlayJson is a JSON string inside the first layer
            video_info = json.loads(first_json['currentVideoInfo']['ksPlayJson'], strict=False)['adaptationSet'][0]['representation']

So that the definition can be chosen later, I also collect the available definitions:

    # (continues inside get_m3u8)
    Label = {}
    num = 0
    for quality in video_info:  # available definitions
        num += 1
        Label[num] = quality['qualityLabel']
    print(Label)
    choice = int(input("Enter the number of the definition: "))
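The numbering step can be written a little more compactly, with a guard against a number that is not on the menu. This is a minimal sketch under assumptions: `quality_menu` and `pick` are hypothetical helper names, and the sample list only imitates the shape of the `representation` data.

```python
def quality_menu(representations):
    """Number the available definitions from 1, in the order the page lists them."""
    return {number: rep["qualityLabel"]
            for number, rep in enumerate(representations, start=1)}


def pick(representations, choice):
    """Return the representation for a 1-based menu choice, rejecting bad input."""
    if not 1 <= choice <= len(representations):
        raise ValueError("choice must be between 1 and {}".format(len(representations)))
    return representations[choice - 1]


# Hypothetical data in the shape of the 'representation' list.
sample = [
    {"qualityLabel": "1080P", "url": "https://example.invalid/hls/1080.m3u8"},
    {"qualityLabel": "720P", "url": "https://example.invalid/hls/720.m3u8"},
]
print(quality_menu(sample))
print(pick(sample, 2)["url"])
```

Keeping the menu 1-based matches the `video_info[choice - 1]` lookup used in the article's code.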

Download the video from the m3u8 file address

Now that we have the address of the video's m3u8 file, let's solve the small problem left over from the beginning: how do we download the video from an m3u8 file?

First we need an m3u8 file to work with. For convenience, I wrote one by hand as an example:

![](https://p26-tt.byteimg.com/large/pgc-image/878d39cd4dc749baa3fee92cdffcc4ae)

As we know, the media links in an m3u8 file point to segments in .ts format, so we first extract all the .ts links and add the prefix to assemble the complete link of each segment (assume the original prefix of the video is www.acfun.cn/):

    def get_ts_urls():
        with open('123.m3u8', "r") as file:
            lines = file.readlines()
        for line in lines:
            if '.ts' in line:
                # strip the trailing newline before building the link
                print("https://www.acfun.cn/" + line.strip())
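As a variation on the prefixing idea, the standard library's `urllib.parse.urljoin` can resolve each segment line against the address of the playlist itself, so the prefix does not have to be cut out by hand. The sketch below assumes a simple media playlist in which every non-comment line is a segment URI; the playlist text and the `example.invalid` address are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical hand-written playlist, similar in spirit to the example above.
SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
segment_000.ts?token=abc
#EXTINF:8.5,
segment_001.ts?token=abc
#EXT-X-ENDLIST
"""


def segment_urls(playlist_text, playlist_url):
    """Return absolute segment URLs in playlist order."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments, not segments.
        if not line or line.startswith("#"):
            continue
        urls.append(urljoin(playlist_url, line))
    return urls


for url in segment_urls(SAMPLE_PLAYLIST, "https://example.invalid/hls/video.m3u8"):
    print(url)
```

Skipping `#` lines instead of testing for `.ts` also copes with segment names that carry a query string or a different extension.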

With the method above, we can obtain the link of every segment from the m3u8 file. Next, let's complete the download function.

The basic idea is the same as in my previous article: Python crawlers: crawl TS files in the most common way and merge them into MP4 format

    class Download():
        def __init__(self, name, m3u8_url, path):
            '''
            :param name: video name
            :param m3u8_url: address of the m3u8 file
            :param path: download directory
            '''
            # Per-instance list, so segments of one video never leak into the next
            self.urls = []
            self.video_name = name
            self.path = path
            # Everything up to and including 'hls/' is the prefix of each segment
            self.f_url = str(m3u8_url).split('hls/')[0] + 'hls/'
            with open(self.path + '/{}.m3u8'.format(self.video_name), 'wb') as f:
                f.write(requests.get(m3u8_url, headers={'user-agent': 'Chrome/84.0.4147.135'}).content)

        def get_ts_urls(self):
            with open(self.path + '/{}.m3u8'.format(self.video_name), "r") as file:
                lines = file.readlines()
            for line in lines:
                if '.ts' in line:
                    self.urls.append(self.f_url + line.replace('\n', ''))

        def start_download(self):
            self.get_ts_urls()
            for url in tqdm(self.urls, desc="Downloading {}".format(self.video_name)):
                movie = requests.get(url, headers={'user-agent': 'Chrome/84.0.4147.135'})
                # Append each segment to the same output file, in playlist order
                with open(self.path + '/{}.flv'.format(self.video_name), 'ab') as f:
                    f.write(movie.content)
            os.remove(self.path + '/{}.m3u8'.format(self.video_name))

Code comments:

1. So that only the video remains at the end, the m3u8 file of the current video is deleted automatically once the download finishes.
2. line.replace('\n', ''): every line read from the m3u8 file ends with a "\n", which has to be removed before the link is assembled.
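The merge step itself can be tested without touching the network if the request is passed in as a function. The sketch below is an illustration under assumptions: `merge_segments` and `fake_fetch` are hypothetical names, and the fake segments are just the URL text as bytes. In real use, `fetch` could be something like `lambda u: requests.get(u, headers=headers).content`.

```python
import os
import tempfile


def merge_segments(urls, out_path, fetch):
    """Write every segment to out_path in playlist order; return bytes written.

    `fetch` is any callable that maps a URL to the segment's bytes.
    """
    written = 0
    # Open the output once instead of reopening it for every segment.
    with open(out_path, "wb") as out:
        for url in urls:
            data = fetch(url)
            out.write(data)
            written += len(data)
    return written


# Stand-in for the network: every "segment" is just its own URL as bytes.
def fake_fetch(url):
    return url.encode("utf-8")


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "demo.flv")
    size = merge_segments(["seg0.ts", "seg1.ts"], target, fake_fetch)
    print(size, os.path.getsize(target))
```

Opening the file in `"wb"` mode means a rerun starts from an empty file, whereas appending with `"ab"` to a leftover file from an interrupted run would duplicate data.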

Source code and effect

Finally, we can put the code together and run it. Take a look:

    import os
    import re
    import json
    import requests
    from tqdm import tqdm

    path = './'
    headers = {
        'referer': 'https://www.acfun.cn/',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83'
    }


    class m3u8_url():
        def __init__(self, f_url):
            self.url = f_url

        def get_m3u8(self):
            html = requests.get(self.url, headers=headers).text
            first_json = json.loads(re.findall('window.pageInfo = window.videoInfo = (.*?)};', html)[0] + '}', strict=False)
            name = first_json['title'].strip().replace("|", '')
            video_info = json.loads(first_json['currentVideoInfo']['ksPlayJson'], strict=False)['adaptationSet'][0]['representation']
            Label = {}
            num = 0
            for quality in video_info:  # available definitions
                num += 1
                Label[num] = quality['qualityLabel']
            print(Label)
            choice = int(input("Enter the number of the definition: "))
            Download(name + '[{}]'.format(Label[choice]), video_info[choice - 1]['url'], path).start_download()


    class Download():
        def __init__(self, name, m3u8_url, path):
            '''
            :param name: video name
            :param m3u8_url: address of the m3u8 file
            :param path: download directory
            '''
            self.urls = []
            self.video_name = name
            self.path = path
            self.f_url = str(m3u8_url).split('hls/')[0] + 'hls/'
            with open(self.path + '/{}.m3u8'.format(self.video_name), 'wb') as f:
                f.write(requests.get(m3u8_url, headers={'user-agent': 'Chrome/84.0.4147.135'}).content)

        def get_ts_urls(self):
            with open(self.path + '/{}.m3u8'.format(self.video_name), "r") as file:
                lines = file.readlines()
            for line in lines:
                if '.ts' in line:
                    self.urls.append(self.f_url + line.replace('\n', ''))

        def start_download(self):
            self.get_ts_urls()
            for url in tqdm(self.urls, desc="Downloading {}".format(self.video_name)):
                movie = requests.get(url, headers={'user-agent': 'Chrome/84.0.4147.135'})
                with open(self.path + '/{}.flv'.format(self.video_name), 'ab') as f:
                    f.write(movie.content)
            os.remove(self.path + '/{}.m3u8'.format(self.video_name))


    url1 = input("Enter the video address: ")
    m3u8_url(url1).get_m3u8()

Result:

![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/a8aee742ebef4999849344d57d142662)

![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/3aef6921b36d4b31be0bbbbd625cf0d2)

Woohoo, takeoff!

Bangumi series

Get information about the video

Since we are moving on to series, we again need an example to illustrate:

Rent-A-Girlfriend (LSP again ~~)

For this series, we borrow the approach from the single-video analysis and go straight to the page source:

![](https://p9-tt-ipv6.byteimg.com/large/pgc-image/cb221e4ddd9e4186a0d1f8748fa5db00)

Sure enough, the source code contains JSON data similar to that of a single video, so we format it again:

![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/af11af87c9724a66b27c2e347832c864)

The video is stored in exactly the same way, and in the same fields, as a single video. To reduce the amount of code, both cases can be handled by a single class:

    class m3u8_url():
        def __init__(self, f_url, name=""):
            '''
            :param f_url: link to the current video
            :param name: video name (passed in for series episodes)
            '''
            self.url = f_url
            self.name = name

        def get_m3u8(self):
            global flag, qua, rel_path
            html = requests.get(self.url, headers=headers).text
            first_json = json.loads(re.findall('window.pageInfo = window.*? = (.*?)};', html)[0] + '}', strict=False)
            if self.name == '':
                # Single video: use the page title and the default directory
                name = first_json['title'].strip().replace("|", '')
                rel_path = path
            else:
                # Series episode: save into a folder named after the series
                name = self.name
                rel_path = path + first_json['bangumiTitle'].strip()
                os.makedirs(rel_path, exist_ok=True)
            video_info = json.loads(first_json['currentVideoInfo']['ksPlayJson'], strict=False)['adaptationSet'][0]['representation']
            Label = {}
            num = 0
            for quality in video_info:  # available definitions
                num += 1
                Label[num] = quality['qualityLabel']
            if flag:
                # Ask once, then reuse the answer for the remaining episodes
                print(Label)
                choice = int(input("Enter the number of the definition: "))
                flag = False
                qua = choice
                Download(name + '[{}]'.format(Label[choice]), video_info[choice - 1]['url'], rel_path).start_download()
            else:
                Download(name + '[{}]'.format(Label[qua]), video_info[qua - 1]['url'], rel_path).start_download()

Code comments:

1. flag: records whether a definition has already been chosen for this download.
2. qua: stores the chosen definition.
3. rel_path: changes the download location for a series (a folder named after the series).
4. first_json = json.loads(re.findall('window.pageInfo = window.*? = (.*?)};', html)[0] + '}', strict=False): the regular expression that matches the video information is changed so that it matches a single video and a series episode at the same time.
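To see why the relaxed pattern covers both page types, it can be run against two tiny made-up fragments. This is a sketch under assumptions: the dots are escaped here, and `window.bangumiData` is only an assumed variable name for the series page, since the article does not spell it out.

```python
import json
import re

# The article's pattern with the dots escaped; ".*?" absorbs the second variable name.
PAGE_INFO_RE = re.compile(r"window\.pageInfo = window\..*? = (.*?)\};")


def parse_page_info(html):
    match = PAGE_INFO_RE.search(html)
    if match is None:
        raise ValueError("no window.pageInfo assignment found")
    # The pattern stops just before the closing brace, so put it back.
    return json.loads(match.group(1) + "}", strict=False)


# Hypothetical fragments; "window.bangumiData" is an assumed name.
single_page = '<script>window.pageInfo = window.videoInfo = {"title": "clip"};</script>'
series_page = '<script>window.pageInfo = window.bangumiData = {"bangumiTitle": "show"};</script>'

print(parse_page_info(single_page))
print(parse_page_info(series_page))
```

Because the capture is non-greedy, it ends at the first `};` in the page, which works as long as that sequence does not occur inside the JSON itself.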

Now that we know how to download one episode, entering the link of every episode by hand is out of the question!! It is fine for a series with only a few episodes, but what if it looks like this:

![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/4fb3b931242b4481bff01a21ab82f6e9)

Would you...?

Links to the series

Again, let’s start with the web source code:

![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/247d3ae4fbee47e2ab5e50ce3ee267f5)

Although all the information about the series can be found in the source code, not all of it is needed. First let's work out which parts matter. When I click on the second episode, the address in the browser's address bar changes:

www.acfun.cn/bangumi/aa6…

The address breaks down into three parts:

www.acfun.cn/bangumi/aa6… : the home page link of the series.

36188: a string of digits with no obvious purpose. As far as I can tell it does nothing and is the same everywhere. A few examples:

Rent-A-Girlfriend, episode 2, "Ex-Girlfriend and Girlfriend": www.acfun.cn/bangumi/aa6…
Rent-A-Girlfriend, episode 3, "Sea and Girlfriend": www.acfun.cn/bangumi/aa6…
Jinhun Street, episode 2: www.acfun.cn/bangumi/aa5…
…

Likewise, clicking back to the first episode shows that its link can be written the same way:

Ghost Street, episode 1: www.acfun.cn/bangumi/aa5…
Rent-A-Girlfriend, episode 1, "Rent-A-Girlfriend": www.acfun.cn/bangumi/aa6…
…

1740687: the ID of each episode, stored in the source code as the itemId field.

So we can write the code that builds the link of each episode:

    class Pan_drama():
        def __init__(self, f_url):
            '''
            :param f_url: link to the series page (or to one of its episodes)
            '''
            # Length of the last path segment: 7 means a bare series ID
            self.aa = len(str(f_url).split('/')[-1])
            if self.aa == 7:
                self.url = f_url
            elif self.aa > 7:
                # Episode link: keep only the part before the first '_'
                self.url = str(f_url).split('_')[0]

        def get_info(self):
            video_info = {}
            html = requests.get(self.url, headers=headers).text
            all_item = json.loads(re.findall('window.bangumiList = (.*?);', html)[0])['items']
            for item in tqdm(all_item, desc="Preparing the series"):
                video_info[item['episodeName'] + '-' + item['title']] = self.url + '_36188_' + str(item['itemId'])
            for name in video_info.keys():
                m3u8_url(video_info[name], name).get_m3u8()

Code comments:

self.aa: for flexibility, you can pass in the link of any single episode and the whole series is still downloaded.
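The link handling can be isolated into two small pure functions, which makes it easy to check without a request. This is a sketch under assumptions: the series ID `aa0000000` is a placeholder rather than a real series, the helper names are hypothetical, and instead of checking the length of the last path segment it simply cuts at the first underscore, which gives the same home link for both kinds of input.

```python
def bangumi_home(url):
    """Reduce a series or episode link to the series home link."""
    head, _, last = url.rstrip("/").rpartition("/")
    # An episode link looks like <series>_<number>_<itemId>; keep <series>.
    return head + "/" + last.split("_")[0]


def episode_url(home, item_id, middle="36188"):
    """Build an episode link from the home link and the episode's itemId."""
    return "{}_{}_{}".format(home, middle, item_id)


# "aa0000000" is a placeholder ID used only for this illustration.
home = bangumi_home("https://www.acfun.cn/bangumi/aa0000000_36188_1740687")
print(home)
print(episode_url(home, 1740687))
```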

Source code and effect

Full source code:

    import os
    import re
    import json
    import requests
    from tqdm import tqdm

    path = './'
    headers = {
        'referer': 'https://www.acfun.cn/',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83'
    }
    flag = True  # True until a definition has been chosen


    class m3u8_url():
        def __init__(self, f_url, name=""):
            '''
            :param f_url: link to the current video
            :param name: video name (passed in for series episodes)
            '''
            self.url = f_url
            self.name = name

        def get_m3u8(self):
            global flag, qua, rel_path
            html = requests.get(self.url, headers=headers).text
            first_json = json.loads(re.findall('window.pageInfo = window.*? = (.*?)};', html)[0] + '}', strict=False)
            if self.name == '':
                name = first_json['title'].strip().replace("|", '')
                rel_path = path
            else:
                name = self.name
                rel_path = path + first_json['bangumiTitle'].strip()
                os.makedirs(rel_path, exist_ok=True)
            video_info = json.loads(first_json['currentVideoInfo']['ksPlayJson'], strict=False)['adaptationSet'][0]['representation']
            Label = {}
            num = 0
            for quality in video_info:  # available definitions
                num += 1
                Label[num] = quality['qualityLabel']
            if flag:
                print(Label)
                choice = int(input("Enter the number of the definition: "))
                flag = False
                qua = choice
                Download(name + '[{}]'.format(Label[choice]), video_info[choice - 1]['url'], rel_path).start_download()
            else:
                Download(name + '[{}]'.format(Label[qua]), video_info[qua - 1]['url'], rel_path).start_download()


    class Pan_drama():
        def __init__(self, f_url):
            '''
            :param f_url: link to the series page (or to one of its episodes)
            '''
            self.aa = len(str(f_url).split('/')[-1])
            if self.aa == 7:
                self.url = f_url
            elif self.aa > 7:
                self.url = str(f_url).split('_')[0]

        def get_info(self):
            video_info = {}
            html = requests.get(self.url, headers=headers).text
            all_item = json.loads(re.findall('window.bangumiList = (.*?);', html)[0])['items']
            for item in tqdm(all_item, desc="Preparing the series"):
                video_info[item['episodeName'] + '-' + item['title']] = self.url + '_36188_' + str(item['itemId'])
            for name in video_info.keys():
                m3u8_url(video_info[name], name).get_m3u8()


    class Download():
        def __init__(self, name, m3u8_url, path):
            '''
            :param name: video name
            :param m3u8_url: address of the m3u8 file
            :param path: download directory
            '''
            # Per-instance list, so one episode's segments never leak into the next
            self.urls = []
            self.video_name = name
            self.path = path
            self.f_url = str(m3u8_url).split('hls/')[0] + 'hls/'
            with open(self.path + '/{}.m3u8'.format(self.video_name), 'wb') as f:
                f.write(requests.get(m3u8_url, headers={'user-agent': 'Chrome/84.0.4147.135'}).content)

        def get_ts_urls(self):
            with open(self.path + '/{}.m3u8'.format(self.video_name), "r") as file:
                lines = file.readlines()
            for line in lines:
                if '.ts' in line:
                    self.urls.append(self.f_url + line.replace('\n', ''))

        def start_download(self):
            self.get_ts_urls()
            for url in tqdm(self.urls, desc="Downloading {}".format(self.video_name)):
                movie = requests.get(url, headers={'user-agent': 'Chrome/84.0.4147.135'})
                with open(self.path + '/{}.flv'.format(self.video_name), 'ab') as f:
                    f.write(movie.content)
            os.remove(self.path + '/{}.m3u8'.format(self.video_name))


    url1 = input("Enter the address: ")
    if url1.split('/')[3] == 'v':
        m3u8_url(url1).get_m3u8()
    elif url1.split('/')[3] == 'bangumi':
        Pan_drama(url1).get_info()

Example results:

![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/b8c0a92a75104bce819e5fad9ce73071)

![](https://p9-tt-ipv6.byteimg.com/large/pgc-image/ba8d22a9495f4b0e816476005a710371)

(I did not add multithreading, because getting blocked would mean game over. Anyone who needs it can try it themselves ~~~)
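For readers who do want to experiment, one cautious way to add concurrency is a small thread pool whose results come back in playlist order, so the segments can still be appended one after another. This is only a sketch, not part of the original project: `fetch_all_in_order` and `slow_fake_fetch` are hypothetical names, the fake fetch stands in for a real request, and every segment is held in memory until the pool finishes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all_in_order(urls, fetch, max_workers=4):
    """Fetch every URL concurrently and return the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `urls`,
        # regardless of which request finishes first.
        return list(pool.map(fetch, urls))


def slow_fake_fetch(url):
    # Stand-in for a request: later segments "arrive" sooner.
    delay = {"seg0.ts": 0.03, "seg1.ts": 0.02, "seg2.ts": 0.01}[url]
    time.sleep(delay)
    return url.encode("utf-8")


chunks = fetch_all_in_order(["seg0.ts", "seg1.ts", "seg2.ts"], slow_fake_fetch)
print(b"|".join(chunks))
```

Keeping `max_workers` small is in the spirit of the caution above: more parallel requests finish sooner but are also more likely to get the client blocked.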

Complete project code acquisition⬅ here


Python crawler: AcFun! It’s so clear!
