How to write a Python script to crawl Bilibili's short videos

With the National Day holiday drawing to a close, this coder could not settle down to work and spent the week wandering around Bilibili. It is a holiday, after all; everyone deserves a rest.

Meanwhile, my friends kept flooding Moments with travel photos, and by all accounts the Great Wall was packed to overflowing. That got me thinking: with so many people stuck in queues at tourist spots, could I help them pass the time? Given how entertaining short videos are and how many people watch them, I decided to look at Bilibili's short-video feature. I went hunting for an API, half expecting to come up empty, but it turns out the short-video feature does expose an API endpoint. Lucky us!

The Bilibili short-video page is here:

http://vc.bilibili.com/p/eden/rank#/?tab=all

In this experiment, we crawl the top 100 videos on the daily short-video ranking.

So how do we crawl it?

Preparing the experimental environment

  • Chrome (any browser with a developer mode will do)
  • Vim (any editor will do; I happen to prefer the Vim interface)
  • Python 3 development environment
  • Kali Linux (any operating system will do)

Finding && extracting the API

We open developer mode with F12 and, under Network -> Name, find this request:

http://api.vc.bilibili.com/board/v1/ranking/top?page_size=10&next_offset=&tag=%E4%BB%8A%E6%97%A5%E7%83%AD%E9%97%A8&platform=pc

Now look at the Headers tab.

There we can see the Request URL, and as we scroll down to load more videos, the only part of the URL that never changes is:

http://api.vc.bilibili.com/board/v1/ranking/top?

The next_offset parameter keeps changing, and we can guess that it is the offset of the next batch of videos to fetch. All we need to do is pull this parameter out, make next_offset a variable, and request the data, which the endpoint returns as JSON.
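The paging logic can be sketched as a small helper before wiring it into the crawler. build_params is a hypothetical name; the offset formula i * 10 + 1 matches the main loop at the end of this post, and the tag value is the URL-decoded form of %E4%BB%8A%E6%97%A5%E7%83%AD%E9%97%A8 seen in the request above.

```python
def build_params(page, page_size=10):
    """Build the query parameters for one page of the ranking API."""
    return {
        'page_size': page_size,
        # Same offset formula as the main loop later in this post
        'next_offset': str(page * page_size + 1),
        'tag': '今日热门',  # "Today's Trending", URL-decoded from the request
        'platform': 'pc',
    }
```

Passing this dict to requests.get via params= produces exactly the query string captured in the developer tools.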

Code implementation

After a few attempts at writing this, we found that Bilibili does apply some anti-crawler measures: without a browser-like headers field, the downloaded videos come back empty. So we first set a User-Agent header, then define a params dict holding the query parameters, fetch the data with requests.get, and return the response parsed as JSON. The code is as follows:

```python
def get_json(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    }

    params = {
        'page_size': 10,
        'next_offset': str(num),  # num is set in the main loop below
        'tag': '今日热门',         # "Today's Trending"
        'platform': 'pc'
    }

    try:
        html = requests.get(url, params=params, headers=headers)
        return html.json()
    except BaseException:
        print('request error')
```
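To make the structure of the returned JSON concrete, here is a small sketch of pulling out the two fields we use. extract_videos is a hypothetical helper, but the nesting (data -> items -> item -> description / video_playurl) matches what the full script below navigates.

```python
def extract_videos(payload):
    """Return (description, video_playurl) pairs from a ranking response."""
    videos = []
    for info in payload['data']['items']:
        item = info['item']
        videos.append((item['description'], item['video_playurl']))
    return videos
```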

To see clearly how the downloads are going, we also put together a small downloader. The implementation is as follows:

```python
def download(url, path):
    start = time.time()  # start time
    size = 0             # bytes downloaded so far
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    }

    response = requests.get(url, headers=headers, stream=True)  # stream=True is required
    chunk_size = 1024  # bytes per chunk
    content_size = int(response.headers['content-length'])  # total size
    if response.status_code == 200:
        print('[File size]: %0.2f MB' % (content_size / chunk_size / 1024))  # convert to MB
        with open(path, 'wb') as file:
            for data in response.iter_content(chunk_size=chunk_size):
                file.write(data)
                size += len(data)  # update downloaded size
```
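download() keeps a start time and a byte counter but never reports them. If you want a speed readout for the progress display, a minimal sketch could look like this (format_speed is a hypothetical helper, not part of the original code):

```python
import time

def format_speed(size_bytes, start, now=None):
    """Turn a byte counter and start time into a human-readable speed."""
    elapsed = (now if now is not None else time.time()) - start
    if elapsed <= 0:
        elapsed = 1e-9  # guard against division by zero on very fast clocks
    return '%.2f KB/s' % (size_bytes / 1024 / elapsed)
```

Calling it inside the iter_content loop, e.g. print(format_speed(size, start)), would print the running average speed.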

The effect is as follows:

Putting the pieces together, the whole implementation is as follows:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import random
import time


def get_json(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    }

    params = {
        'page_size': 10,
        'next_offset': str(num),  # num is set in the main loop below
        'tag': '今日热门',         # "Today's Trending"
        'platform': 'pc'
    }

    try:
        html = requests.get(url, params=params, headers=headers)
        return html.json()
    except BaseException:
        print('request error')


def download(url, path):
    start = time.time()  # start time
    size = 0             # bytes downloaded so far
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    }

    response = requests.get(url, headers=headers, stream=True)  # stream=True is required
    chunk_size = 1024  # bytes per chunk
    content_size = int(response.headers['content-length'])  # total size
    if response.status_code == 200:
        print('[File size]: %0.2f MB' % (content_size / chunk_size / 1024))  # convert to MB
        with open(path, 'wb') as file:
            for data in response.iter_content(chunk_size=chunk_size):
                file.write(data)
                size += len(data)  # update downloaded size


if __name__ == '__main__':
    for i in range(10):
        url = 'http://api.vc.bilibili.com/board/v1/ranking/top?'
        num = i * 10 + 1
        html = get_json(url)
        infos = html['data']['items']
        for info in infos:
            title = info['item']['description']        # title of the video
            video_url = info['item']['video_playurl']  # download link of the video
            print(title)

            # Some videos do not provide a download link
            try:
                download(video_url, path='%s.mp4' % title)
                print('Downloaded one successfully!')
            except BaseException:
                print('Oops, download failed')

        time.sleep(random.randint(2, 8))  # random wait between pages
```
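One caveat with the script above: the video description goes straight into '%s.mp4' % title, and descriptions can contain characters that are illegal in filenames (a '/' alone would break the open() call). A minimal, hypothetical sanitizer you could apply to title before calling download():

```python
import re

def safe_filename(title, max_len=80):
    """Replace filesystem-unsafe characters and cap the filename length."""
    cleaned = re.sub(r'[\\/:*?"<>|\n\r]', '_', title).strip()
    return cleaned[:max_len] or 'untitled'  # fall back if nothing is left
```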

The crawler in action:

Not a bad result. If you liked this post, don't forget to upvote & star it.

Project link

  • Github