Python Web crawler - Crawling the Cloud Music Review (4)

“This is the fourth day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”

Let’s review the content of yesterday’s article first, 1. 2. Download the web page. 3. Set the loading speed and locate the target file. First, open a song of netease Cloud

https://music.163.com/#/song?id=25723157
Copy the code

The ID of each song is the number that follows it. So in principle we just need to collect the corresponding song into the play page, we can get the ID number. This makes it easy to loop and crawl through all the reviews for multiple songs.

Def get_comments(url): headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36', 'referer': 'http://music.163.com/' } params = "EuIF/+GM1OWmp2iaIwbVdYDaqODiubPSBToe5EdNp6LHTLf+aID/dWGU6bHWXS0jD9pPa/oY67TOwiicLygJ+BhMkOX/J1tZMhq45dcUIr6fLuoHOECYrOU 6ySwH4CjxxdbW3lpVmksGEdlxbZevVPkTPkwvjNLDZHK238OuNCy0Csma04SXfoVM3iLhaFBT" encSecKey = "db26c32e0cd08a11930639deadefda2783c81034be6445ca8f4fbedd346e1f9567375083aeb1a85e6ad6d9ae4532a49752c2169db8bcc04d38a79f9 bed7facea42ee23f1b33538c34f82741318d9b4b846663b53b0b808dd0499dccfbc6c61fbf180c6fb24b1c2dd3c2c450ce09917d74be9424dab836fd 2e671988ffbc6ae1b" data = { "params": params, "encSecKey": encSecKey } name_id = url.split('=')[1] target_url = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_{}?csrf_token=".format(name_id) res = requests.post(target_url, headers=headers, data=data) return resCopy the code

Params and encSecKey are the data that the server wants, and these are the encrypted content. We get the core.js file, and since the js file is so large, we can search for key code segments based on the keyword encSecKey and the same two parameters from POST can also be used for other songs.

Once the comments are found, we need to extract the key data. As you can see, the data is in JSON format. JSON is a lightweight data interchange format that is often used in network transport.

The json.loads () method is used to restore a string to a Python data structure:

comments_json = json.loads(res.text)
Copy the code

So the “hotComments” key in the dictionary is the value of all the great comments!

def get_hot_comments(res): comments_json = json.loads(res.text) hot_comments = comments_json['hotComments'] with open('hot_comments.txt', 'w', encoding='utf-8') as file: for each in hot_comments: File. The write (each [' user '] [' nickname '] + ': \n\n') file.write(each['content'] + '\n') file.write("---------------------------------------\n")Copy the code

At this point, we can combine the previous parts and get the desired effect!

Other form netease cloud music API interface is as follows: music.163.com/api/v1/reso…

It should be noted that if the crawler frequency is too fast and the number is too much, the server will block IP. Therefore, a more complete crawler project needs to set up proxy and IP pool.

In addition, if you don’t want to be stuck cracking post form parameters, you can try using Python + Selenium +PhantomJs to simulate user actions, click on the page, and then directly parse the page elements. This can be “visible and crawling”, but it is a little less efficient.

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

Python Web crawler – Crawling the Cloud Music Review (4)

Python Web crawler – Crawling the Cloud Music Review (4)

Related Posts

“Depth concept” to understand the MAP evaluation method of multi-label image classification task

Back propagation of notes

In the future, human food supply could be in the hands of artificial intelligence