Moment For Technology

How to quickly crawl video information from Bilibili (site B)

Posted on June 24, 2022, 2:02 a.m. by 高靜宜
Category: Back-end Tag: Back-end


❈ Jien-dong Chen, Python Chinese Community columnist. Zhihu: Nonsense. GitHub: github.com/chenjiandon... ❈

I think we are all familiar with Bilibili (site B), and a quick search turns up plenty of Bilibili crawlers. But as the saying goes, what you learn on paper always feels shallow; to truly understand something you have to do it yourself. So I wrote one too. In the end it retrieved a total of 7.6 million records.

Preparation

First, open Bilibili, pick any video on the home page, and click into it. Then, as usual, open the browser's developer tools. The goal this time is to get video information by calling the API that Bilibili provides instead of parsing web pages, because parsing pages is too slow and makes it easy to get your IP blocked.

In the developer tools, select the JS filter and press F5 to refresh the page.

The address of the API shows up in the list of requests.

Copy it and strip the unnecessary parameters to get https://api.bilibili.com/x/web-interface/archive/stat?aid=15906633. Open that URL in a browser and you will get the video's statistics back as JSON data.
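The same request can be made from Python. The following is a minimal sketch, not the author's code: it uses the standard library's urllib rather than a third-party HTTP library, and the helper names `build_stat_url` and `fetch_json`, the timeout, and the browser-like User-Agent header are all assumptions made for illustration. The endpoint may also have changed since the article was written.

```python
import json
import urllib.request

# URL template taken from the article; only the aid value changes per video.
API = "https://api.bilibili.com/x/web-interface/archive/stat?aid={}"


def build_stat_url(aid: int) -> str:
    """Return the stat API URL for one video id (the `aid` query parameter)."""
    return API.format(aid)


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """Download a URL and decode the response body as JSON."""
    # Assumption: a browser-like User-Agent; some APIs reject the default one.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_json(build_stat_url(15906633))` would return the decoded response as a Python dict, provided the endpoint still answers in the same way.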

Writing the code

OK, this is where the code comes in: request the API over and over to fetch the data, and use multiple threads to make the crawler more efficient.

The core code
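As a rough illustration of what the core step could look like, here is a sketch that requests the stat API for one video id and turns the response into a row. It is not the author's code. The field names in `FIELDS`, the presence of a `data` object in the response, and the helper names `http_get_json` and `crawl_one` are assumptions and should be checked against a real response.

```python
import json
import urllib.request
from typing import Callable, Optional

STAT_URL = "https://api.bilibili.com/x/web-interface/archive/stat?aid={}"
# Assumed field names inside the response's "data" object.
FIELDS = ("aid", "view", "danmaku", "reply", "favorite", "coin", "share")


def http_get_json(url: str) -> dict:
    """Fetch a URL with the standard library and decode the JSON body."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def crawl_one(aid: int,
              fetch: Callable[[str], dict] = http_get_json) -> Optional[tuple]:
    """Fetch stats for one video; return a row tuple, or None if unavailable."""
    try:
        payload = fetch(STAT_URL.format(aid))
    except (OSError, ValueError):
        # Network failures (URLError is an OSError) or invalid JSON: skip this id.
        return None
    data = payload.get("data")
    if not isinstance(data, dict):
        # Assumption: deleted or invalid ids come back without a "data" object.
        return None
    return tuple(data.get(name) for name in FIELDS)
```

The `fetch` parameter exists only so the function can be exercised without network access; in a real run the default `http_get_json` would be used.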

Iterative crawl
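One way to iterate over many video ids with multiple threads is a thread pool. The sketch below assumes `concurrent.futures.ThreadPoolExecutor`; the article only says that multiple threads are used, so the pool, the `crawl_range` name, the `worker` callback, and the progress interval are all illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional


def crawl_range(aids: Iterable[int],
                worker: Callable[[int], Optional[tuple]],
                max_workers: int = 8) -> List[tuple]:
    """Run worker(aid) for every id on a thread pool and keep non-empty rows."""
    rows = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() runs the calls concurrently but yields results in input order.
        for done, row in enumerate(pool.map(worker, aids), start=1):
            if row is not None:
                rows.append(row)
            if done % 1000 == 0:
                print(done)  # progress: number of ids processed so far
    return rows
```

Here `worker` stands for whatever function requests the API for a single id and returns a row, or None when the video is unavailable. Threads suit this job because the work is dominated by waiting on the network rather than by computation.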

The main part of the project is only about 20 lines of code, which is pretty concise.

Running it looks roughly like this. The number printed is how many links have been crawled so far. In practice, the information for the whole site can be crawled in a day or two.

What you do with the data after crawling is up to you. I first saved it as CSV files and then loaded everything into a database.
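The CSV-then-database step could be sketched as follows. The article does not say which database was used, so SQLite (via the standard library's sqlite3) is an assumption here, as are the table name `bili_video` and every column name except `v_aid`, which is the only one the article mentions.

```python
import csv
import sqlite3
from typing import Iterable

# Hypothetical column names; only v_aid appears in the article.
COLUMNS = ("v_aid", "v_view", "v_danmaku", "v_reply",
           "v_favorite", "v_coin", "v_share")


def save_csv(path: str, rows: Iterable[tuple]) -> None:
    """Write a header line followed by one CSV line per crawled video."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        writer.writerows(rows)


def load_csv_into_db(path: str, conn: sqlite3.Connection) -> int:
    """Load a CSV written by save_csv into the bili_video table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bili_video ("
        "v_aid INTEGER PRIMARY KEY, v_view INTEGER, v_danmaku INTEGER, "
        "v_reply INTEGER, v_favorite INTEGER, v_coin INTEGER, v_share INTEGER)"
    )
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        rows = [tuple(int(x) for x in row) for row in reader]
    # INSERT OR REPLACE lets the same file be loaded twice without duplicates.
    conn.executemany(
        "INSERT OR REPLACE INTO bili_video VALUES (?, ?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)
```

Writing to CSV first keeps the crawler independent of the database: if a load fails, the files can simply be loaded again without re-crawling.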

The database table


Since I crawled it a few months ago, the data is actually a bit out of date.

Total number of records

Query the top 10 videos with the most views

Query the top 10 videos with the most replies

You can run whatever queries you like! The link to a video is https://www.bilibili.com/video/av + v_aid
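Queries of this kind might look like the sketch below. It assumes a SQLite database with a table named `bili_video` and columns `v_view` and `v_reply`; those names are hypothetical, and only `v_aid` and the video link format come from the article.

```python
import sqlite3
from typing import List, Tuple

VIDEO_URL = "https://www.bilibili.com/video/av{}"
# Hypothetical column names that a ranking is allowed to sort by.
SORTABLE = {"v_view", "v_reply"}


def total_rows(conn: sqlite3.Connection) -> int:
    """Return the total number of crawled videos."""
    return conn.execute("SELECT COUNT(*) FROM bili_video").fetchone()[0]


def top_videos(conn: sqlite3.Connection, column: str,
               limit: int = 10) -> List[Tuple[str, int]]:
    """Return (video link, value) pairs for the highest values of a column."""
    if column not in SORTABLE:
        # Column names cannot be bound as SQL parameters, so whitelist them.
        raise ValueError(f"unsupported column: {column}")
    sql = (f"SELECT v_aid, {column} FROM bili_video "
           f"ORDER BY {column} DESC LIMIT ?")
    return [(VIDEO_URL.format(aid), value)
            for aid, value in conn.execute(sql, (limit,))]
```

With such a table in place, `top_videos(conn, "v_view")` would list the ten most viewed videos and `top_videos(conn, "v_reply")` the ten with the most replies, each with its link built from `v_aid`.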

For the full code, see Bili.py in the chenjiandongx/Bili-Spider repository.







This article is the author's original work; reproduction is prohibited without the author's permission.


