❈ Jien-dong Chen, Python Chinese Community columnist. Zhihu: Nonsense. GitHub: github.com/chenjiandon... ❈
I think we are all familiar with Bilibili ("site B"), and a search turns up plenty of existing Bilibili crawlers. But as the old saying goes, what you learn on paper always feels shallow; to really understand something you have to do it yourself. So I wrote my own. In the end it retrieved about 7.6 million records.
Preparation
First, open Bilibili, pick any video on the home page, and click into it. Then do the usual thing and open the browser's developer tools. The goal this time is to get video information by calling the API that Bilibili itself provides, rather than parsing web pages, which is too slow and makes it easy to get your IP blocked.
Select the JS filter and press F5 to refresh.
This reveals the address of the API.
Copy it and strip the unnecessary parameters to get https://api.bilibili.com/x/web-interface/archive/stat?aid=15906633. Opening this URL in a browser returns the video's data as JSON.
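The same request can be made from Python instead of the browser. The following is only a minimal sketch: the helper names `build_stat_url` and `fetch_stat` and the browser-like `User-Agent` header are assumptions made for illustration, and the code deliberately assumes nothing about which fields the returned JSON contains.

```python
import json
import urllib.request

# URL pattern taken from the address found in the developer tools.
API_URL = "https://api.bilibili.com/x/web-interface/archive/stat?aid={aid}"


def build_stat_url(aid):
    """Return the stat API URL for one video id (aid)."""
    return API_URL.format(aid=int(aid))


def fetch_stat(aid, timeout=5):
    """Request the API for one video and return the decoded JSON."""
    request = urllib.request.Request(
        build_stat_url(aid),
        # Assumption: a browser-like User-Agent; the article does not say
        # which headers, if any, the API requires.
        headers={"User-Agent": "Mozilla/5.0"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (performs a real network request):
#     print(fetch_stat(15906633))
```

Changing only the `aid` value in the URL is what makes it possible to walk through videos one id at a time.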
Writing the code
Now for the code. It requests the API repeatedly, once per video, and uses multiple threads to make the crawler more efficient.
Core code
The core of the project is only about 20 lines of code, which is quite concise.
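The general shape of a multithreaded crawl loop can be sketched as follows. This is not the author's code: `crawl` and `safe_fetch` are hypothetical names, `ThreadPoolExecutor` is just one way to get multiple threads in Python, and the `fetch` argument stands in for whatever function actually requests the API for one video id.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(aids, fetch, max_workers=8):
    """Call fetch(aid) for every aid on a thread pool.

    Returns a dict mapping each aid to its result. Requests that raise
    are skipped, so one bad id does not stop the whole crawl.
    """

    def safe_fetch(aid):
        try:
            return aid, fetch(aid)
        except Exception:
            return aid, None

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs safe_fetch concurrently; the results dict is
        # only written from this (the calling) thread.
        for aid, data in pool.map(safe_fetch, aids):
            if data is not None:
                results[aid] = data
    return results
```

Passing `fetch` in as a parameter keeps the threading logic separate from the HTTP code, so the loop can be tried out with a stub function before pointing it at the real API.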
When it runs, the output looks roughly like this; the number shown is how many links have been crawled so far. In practice, the information for the whole site can be crawled in a day or two.
What you do with the data after the crawl is up to you. I first saved it as CSV files and then consolidated them into a database.
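A minimal sketch of that "CSV first, database second" flow is shown here. Several things are assumptions: SQLite stands in for whichever database was actually used, the table name `bili_video` is made up, and of the column names only `v_aid` appears in this article (`v_view` and `v_reply` are hypothetical).

```python
import csv
import sqlite3

# Only v_aid comes from the article; the other two names are hypothetical.
FIELDS = ["v_aid", "v_view", "v_reply"]


def save_csv(rows, path):
    """Write a list of dicts (one per video) to a CSV file with a header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def load_csv_into_db(path, conn):
    """Load one CSV file into the bili_video table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bili_video ("
        "v_aid INTEGER PRIMARY KEY, v_view INTEGER, v_reply INTEGER)"
    )
    with open(path, newline="", encoding="utf-8") as f:
        rows = [
            (int(r["v_aid"]), int(r["v_view"]), int(r["v_reply"]))
            for r in csv.DictReader(f)
        ]
    # INSERT OR REPLACE lets the same CSV be loaded twice without errors.
    conn.executemany("INSERT OR REPLACE INTO bili_video VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Writing to CSV first means a long crawl only ever appends to flat files, and the database import can be rerun later without crawling again.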
The database table
Since I crawled this a few months ago, the data is somewhat out of date.
Total data
Query the top 10 videos with the most views
Query the top 10 videos with the most replies
You can run whatever queries you like! The link to a video is https://www.bilibili.com/video/av + v_aid.
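The queries mentioned above might look like the following sketch. It uses an in-memory SQLite database with three made-up rows purely for illustration; the table name `bili_video` and the columns `v_view` and `v_reply` are hypothetical, and only `v_aid` and the link prefix come from this article.

```python
import sqlite3

VIDEO_URL_PREFIX = "https://www.bilibili.com/video/av"


def top_videos(conn, column, limit=10):
    """Return (video link, value) pairs for the highest values of column."""
    # Column names cannot be bound as SQL parameters, so whitelist them.
    if column not in ("v_view", "v_reply"):
        raise ValueError("unsupported column: " + column)
    sql = (
        "SELECT v_aid, {col} FROM bili_video "
        "ORDER BY {col} DESC LIMIT ?".format(col=column)
    )
    return [
        (VIDEO_URL_PREFIX + str(aid), value)
        for aid, value in conn.execute(sql, (limit,))
    ]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bili_video ("
    "v_aid INTEGER PRIMARY KEY, v_view INTEGER, v_reply INTEGER)"
)
# Made-up rows for illustration only; these are not real crawl results.
conn.executemany(
    "INSERT INTO bili_video VALUES (?, ?, ?)",
    [(1, 100, 5), (2, 300, 1), (3, 200, 9)],
)

total = conn.execute("SELECT COUNT(*) FROM bili_video").fetchone()[0]
most_viewed = top_videos(conn, "v_view", limit=10)
most_replied = top_videos(conn, "v_reply", limit=10)
```

With the sample rows, `most_viewed` starts with the link for `v_aid` 2 and `most_replied` starts with the link for `v_aid` 3.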
For the full details, see Bili.py in chenjiandongx/Bili-Spider.
This article is the author's original work; reprinting is prohibited without the author's permission.