As everyone knows, bilibili offers no direct way to see who sent a barrage (danmaku), so when we run into an obnoxious one we can be angry but helpless. Yet B station does let you block the user who sent a particular barrage, which suggests that the data interface must carry some user information. I have been learning web crawling recently, so I decided to look for the barrage interface and analyze the data.

Find the interface

The obvious way to find the interface is to open a video and press F12, but after going through the requests twice I was dumbfounded: nothing. Not wanting to waste more time on this, I turned to Baidu and, unsurprisingly, found the following two interfaces. Both return XML.

comment.bilibili.com/{cid}.xml

api.bilibili.com/x/v1/dm/lis…

The cid here is a number unique to each video part: every P (part) of a video has its own cid. To find it, open the video page, press F12, then Ctrl+F and search for "cid"; the eight- or nine-digit number is usually the cid.

I also found an interface that looks up the cid from the aid:
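
As a rough sketch of that manual search, the snippet below pulls candidate cid values out of page source with a regular expression. The `"cid":<digits>` pattern and the sample markup are assumptions for illustration; the real page source may look different.

```python
import re

# Hypothetical fragment of page source; real bilibili markup may differ.
SAMPLE_HTML = '<script>{"aid":1,"cid":12345678,"page":1}</script>'


def find_cids(html: str) -> list[str]:
    """Return every distinct number that follows a "cid" key, in order of appearance."""
    seen = []
    for match in re.finditer(r'"cid"\s*:\s*(\d+)', html):
        cid = match.group(1)
        if cid not in seen:
            seen.append(cid)
    return seen


print(find_cids(SAMPLE_HTML))
```

If several numbers come back, each one would still need to be checked against the barrage interface by hand.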

www.bilibili.com/widget/getP…

Analyze the data

With the barrage data in hand, the next step is to work out what each field in it is for.

Two fields in each barrage's attribute list matter here: one is the timestamp at which it was sent, and the other is some kind of encoding of the sender's UID (the script later in this post reads them at comma-split indices 4 and 6). From what I could find, the second value is the CRC32 checksum of the UID written as a hexadecimal number. You can compute the checksum from a UID, but you cannot simply run it backward to get the UID.
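
To make the two fields concrete, here is a minimal sketch that splits the attribute list of one barrage element and computes the CRC32 of a UID in the same hexadecimal form. The sample element and its values are made up, the field positions follow the indices used by the script later in this post, and hashing the UID's decimal string with standard CRC32 (`zlib.crc32`) is an assumption about how the site derives the value.

```python
import xml.etree.ElementTree as ET
import zlib

# Made-up barrage element for illustration only.
SAMPLE = '<d p="12.5,1,25,16777215,1600000000,0,1a2b3c4d,123">hello</d>'


def parse_barrage(xml_text: str) -> dict:
    """Split the p attribute and pick out the timestamp and sender hash."""
    elem = ET.fromstring(xml_text)
    fields = elem.attrib["p"].split(",")
    return {
        "text": elem.text,
        "timestamp": int(fields[4]),  # send time
        "sender_hash": fields[6],     # hex string derived from the UID
    }


def uid_to_hash(uid: int) -> str:
    """CRC32 of the UID's decimal string as lowercase hex (no zero padding; an assumption)."""
    return format(zlib.crc32(str(uid).encode("ascii")) & 0xFFFFFFFF, "x")


print(parse_barrage(SAMPLE))
print(uid_to_hash(123456789))
```

Going from UID to hash is a single function call; the reverse direction is what the rest of the post is about.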

So it looks like the only way to recover the UID is a rainbow table? In that case, how should this 8-digit hexadecimal number be stored in the database?

The options seem to be VARCHAR and BIGINT. B station has nearly 600 million users, and searching 600 million rows for a string would surely be very slow.

Just as I settled on BIGINT, it occurred to me that an 8-digit hexadecimal number has 2^32 possible values. A signed INT only goes up to 2^31 - 1, but an unsigned INT goes up to 2^32 - 1, which is 0xFFFFFFFF, exactly the range needed.

So I switched to unsigned INT for the hash, with the corresponding UID also stored as unsigned INT, used the CRC32B value as the primary key, built the rainbow table, and stored it on my server.

By a rough calculation, 600 million rows take about 27 GB of space, and my server only has 40 GB.
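
The range arithmetic can be checked directly. The short sketch below only verifies the numbers; it is not tied to any particular database.

```python
HEX_DIGITS = 8

distinct_values = 16 ** HEX_DIGITS   # values an 8-digit hex number can take
signed_int_max = 2 ** 31 - 1         # largest 32-bit signed integer
unsigned_int_max = 2 ** 32 - 1       # largest 32-bit unsigned integer

print(distinct_values == 2 ** 32)      # True
print(unsigned_int_max == 0xFFFFFFFF)  # True
print(0xFFFFFFFF > signed_int_max)     # True: too large for a signed 32-bit INT
```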
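
A scaled-down sketch of such a lookup table is shown below. It uses Python's built-in `sqlite3` with an in-memory database rather than whatever database runs on the server, the table and column names are hypothetical, and it assumes the hash is the CRC32 of the UID's decimal string. It also departs from the design above in one respect: it puts a non-unique index on the CRC column instead of a primary key, because distinct UIDs can share a CRC32 value.

```python
import sqlite3
import zlib


def crc32_of_uid(uid: int) -> int:
    """CRC32 of the UID's decimal string as an unsigned 32-bit integer (assumed scheme)."""
    return zlib.crc32(str(uid).encode("ascii")) & 0xFFFFFFFF


def build_table(conn: sqlite3.Connection, max_uid: int) -> None:
    """Store one (crc, uid) row for every UID from 1 to max_uid."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rainbow (crc INTEGER NOT NULL, uid INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO rainbow (crc, uid) VALUES (?, ?)",
        ((crc32_of_uid(uid), uid) for uid in range(1, max_uid + 1)),
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_rainbow_crc ON rainbow (crc)")
    conn.commit()


def lookup(conn: sqlite3.Connection, hex_hash: str) -> list[int]:
    """Return every stored UID whose CRC32 equals the given hex string."""
    rows = conn.execute(
        "SELECT uid FROM rainbow WHERE crc = ? ORDER BY uid", (int(hex_hash, 16),)
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_table(conn, 10_000)
print(lookup(conn, format(crc32_of_uid(4242), "x")))
```

Because `lookup` returns a list, a hash that maps to more than one UID yields all candidates rather than silently dropping one.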

Make a web page for everyone to use

The next step is a Python script that takes two arguments, the video's cid and a keyword to search for in the barrages, and returns each matching barrage's text, its sender's CRC32B value, and its timestamp.

PHP's exec function then runs the Python script, the user's UID is looked up in the database, and PHP returns the data to the front end as JSON.

Python code (poorly written)

import io
import re
import sys

import requests
from bs4 import BeautifulSoup

# Force UTF-8 output regardless of the console's default encoding.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

cid = sys.argv[1]      # video cid
keyword = sys.argv[2]  # keyword to search for; an empty string matches everything

req = requests.get('https://comment.bilibili.com/' + cid + '.xml')
req.encoding = req.apparent_encoding
barrages = BeautifulSoup(req.text, 'html.parser').find_all(name='d')

result = ""
for d in barrages:
    raw = str(d)                       # the whole <d p="...">text</d> element
    text = re.sub('<(.*?)>', '', raw)  # strip the tags, keep the barrage text
    if keyword in text:
        fields = raw.split(",")
        # Append: sender hash, barrage text, timestamp
        result += fields[6] + "," + text + "," + fields[4] + ","
print(result)
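
The script above joins its output with commas, which becomes ambiguous when the barrage text itself contains a comma. As a hedged alternative, the sketch below does the same filtering on an XML string and emits JSON instead. The sample XML is made up, the function name is hypothetical, and the standard-library XML parser stands in for BeautifulSoup.

```python
import json
import xml.etree.ElementTree as ET

# Made-up response body for illustration only.
SAMPLE_XML = """<i>
<d p="1.0,1,25,16777215,1600000000,0,aaaa1111,1">first, with a comma</d>
<d p="2.0,1,25,16777215,1600000050,0,bbbb2222,2">second one</d>
</i>"""


def search_barrages(xml_text: str, keyword: str = "") -> list[dict]:
    """Return barrages whose text contains keyword (all of them if keyword is empty)."""
    matches = []
    for elem in ET.fromstring(xml_text).iter("d"):
        text = elem.text or ""
        if keyword in text:
            fields = elem.attrib["p"].split(",")
            matches.append(
                {"hash": fields[6], "text": text, "timestamp": int(fields[4])}
            )
    return matches


print(json.dumps(search_barrages(SAMPLE_XML, "comma"), ensure_ascii=False))
```

The PHP side could then decode the output directly instead of splitting on commas.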

Results

The front-end code was thrown together casually, but at least it works.

The UID shows up as NULL here because my server is still writing the rainbow table data into the database, which is expected to take about 4 days.

Today I added a brute-force fallback so that queries no longer come back NULL, though those queries are much slower.
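
A brute-force fallback of this kind might look like the sketch below: it walks UIDs in order and compares each one's CRC32 against the target. The function name and the UID upper bound are hypothetical, and it again assumes the hash is the CRC32 of the UID's decimal string.

```python
import zlib


def brute_force_uids(hex_hash: str, max_uid: int) -> list[int]:
    """Return every UID in 1..max_uid whose CRC32 equals the given hex string."""
    target = int(hex_hash, 16)
    return [
        uid
        for uid in range(1, max_uid + 1)
        if (zlib.crc32(str(uid).encode("ascii")) & 0xFFFFFFFF) == target
    ]


# Build a target from a known UID, then recover it by scanning a small range.
sample_hash = format(zlib.crc32(b"777") & 0xFFFFFFFF, "x")
print(brute_force_uids(sample_hash, 2000))
```

Each query costs one CRC32 computation per candidate UID, which is why this path is so much slower than an indexed lookup.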

Source: the Internet. For learning purposes only; if there is any infringement, please contact us for removal.
