This is the 5th day of my participation in the August More Text Challenge. For details, see: August More Text Challenge

Crawl Target

Website: bilibili

Results Display

Tools Used

Development tool: PyCharm; development environment: Python 3.7, Windows 10; toolkits used: requests, threading, csv

Key Learning Content

  • Anti-crawl measures based on common request headers
  • JSON data processing
  • CSV file processing

Project Idea Analysis

Find the address of the video whose comments you want to capture (I picked a high-quality video from the site). Website: www.bilibili.com/video/BV1po…

The crawler first needs to locate the target address from which to collect the data. You can clearly see that the comment data on the current page keeps changing, so you need to find the corresponding comment interface; this is how dynamic data is located.

In the browser's network capture, do not clear the recorded dynamic data; scrolling down to load new comment data triggers the loading condition and reveals the request we need.

Once the new data has loaded, filter the captured requests (for example under the "All" tab), locate the corresponding web interface, and send network requests to it.

    import requests

    # i is the comment page index; the oid parameter identifies the video
    url = 'https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next={}&type=1&oid=9659814&mode=3&plat=1&_=1627974678383'.format(
        i)

    response = requests.get(url)
    print(response.text)

The data request fails because the request headers do not yet address the site's anti-crawl strategy: add the corresponding UA (User-Agent). The Referer header is mainly an anti-hotlinking measure; without these headers, even a browser-style request cannot get the data.
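A minimal sketch of attaching such headers to the request. The exact User-Agent string is just an example, and the Referer value here is an assumption based on the video page domain:

```python
import requests

# Illustrative anti-crawl headers: any recent browser UA generally works,
# and Referer signals the request came from the bilibili site itself.
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'referer': 'https://www.bilibili.com/',
}

def fetch(url):
    # Send the request with the anti-crawl headers attached
    return requests.get(url, headers=headers)
```

With the headers in place, the same interface URL returns the comment data instead of an empty or rejected response.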

Once accurate data is returned, extract the information you want:

  • Comment content
  • Comment time
  • Comment author
  • Author's gender
  • Author's signature
  • (You can collect data fields according to your own needs)
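For a single reply object from the interface, the extraction of the fields above can be sketched like this. The reply dict below is a hand-made stand-in with the same shape as the real response data, used so the snippet runs without a network request:

```python
# Hand-made stand-in mimicking one entry of response['data']['replies']
reply = {
    'content': {'message': 'Nice video!\nThanks'},
    'ctime': 1627993394,
    'member': {'sex': 'male', 'uname': 'someone', 'sign': 'hello'},
}

# Pull out the five fields listed above
item = {
    'content': reply['content']['message'].replace('\n', ''),  # strip newlines
    'ctime': reply['ctime'],                                   # raw timestamp
    'sex': reply['member']['sex'],
    'uname': reply['member']['uname'],
    'sign': reply['member']['sign'],
}
print(item)
```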

When dealing with the JSON data, the JSONP callback prefix jQuery1720892078778784086_1627994582044 sits in front of the JSON body. It could be extracted with a regular-expression match, but here I chose instead to modify the URL parameters, deleting jQuery1720892078778784086_1627994582044 from the URL. The final URL is:

https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next={}&type=1&oid=376402596&mode=3&plat=1&_=1627993394215
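If you keep the callback parameter instead, the JSONP wrapper can be stripped with a regular expression before parsing, under the assumption that the response body looks like `callbackName({...})`:

```python
import json
import re

# Example JSONP response body: a callback name wrapping the JSON payload
jsonp_text = 'jQuery1720892078778784086_1627994582044({"code": 0, "data": {}})'

# Greedily capture everything between the outer parentheses
match = re.search(r'\((.*)\)', jsonp_text, re.S)
data = json.loads(match.group(1))
print(data['code'])  # -> 0
```

Both approaches yield the same parsed dict; dropping the callback parameter from the URL is simply less work per request.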

After the data is obtained, save it to a CSV file.

    def save_data(item):
        with open('1.csv', 'a', newline='', encoding='utf-8') as f:
            filename = ['content', 'ctime', 'sex', 'uname', 'sign']
            csv_data = csv.DictWriter(f, fieldnames=filename)
            csv_data.writerow(item)
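One caveat, since the file is opened in append mode: `DictWriter.writerow` only writes values, so if you want a header row you have to write it once yourself. A sketch of one way to do that (checking for the file before opening it):

```python
import csv
import os

def save_data(item, path='1.csv'):
    filename = ['content', 'ctime', 'sex', 'uname', 'sign']
    # Only write the header when the file does not exist yet
    new_file = not os.path.exists(path)
    with open(path, 'a', newline='', encoding='utf-8') as f:
        csv_data = csv.DictWriter(f, fieldnames=filename)
        if new_file:
            csv_data.writeheader()
        csv_data.writerow(item)
```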

Simple Source Code Sharing

    import csv
    import requests

    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
    }

    def save_data(item):
        with open('1.csv', 'a', newline='', encoding='gbk') as f:
            filename = ['content', 'ctime', 'sex', 'uname', 'sign']
            csv_data = csv.DictWriter(f, fieldnames=filename)
            csv_data.writerow(item)

    def get_data(i):
        url = 'https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next={}&type=1&oid=376402596&mode=3&plat=1&_=1627993394215'.format(
            i)
        response = requests.get(url, headers=headers).json()
        # print(response.content.decode('utf-8'))
        item = {}
        for data in response['data']['replies']:
            # print(data)
            item['content'] = data['content']['message'].replace('\n', '')
            item['ctime'] = data['ctime']  # keep the raw timestamp; convenient to save
            item['sex'] = data['member']['sex']
            item['uname'] = data['member']['uname']
            item['sign'] = data['member']['sign']
            print(item)
            save_data(item)

    if __name__ == '__main__':
        for i in range(1, 300):
            # Send the network request for page i
            get_data(i)
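The toolkit list above mentions threading, though the source itself fetches pages one by one. A sketch of parallelizing the page requests with threads; `get_data` here is a placeholder for the request-and-parse function defined above, and the lock guards shared state (the real version would also need a lock around the CSV writes):

```python
import threading

results = []
lock = threading.Lock()

def get_data(i):
    # Placeholder for the real request/parse logic; just records the page index
    with lock:
        results.append(i)

# One thread per comment page; a real crawler would cap the thread count
threads = [threading.Thread(target=get_data, args=(i,)) for i in range(1, 6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # -> [1, 2, 3, 4, 5]
```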

I am Bai Youbai, a programmer who loves to share knowledge ❤️

If you have no programming experience and, after reading this blog, find something you don't understand or want to learn Python, you can leave a comment or message me directly.