The text and pictures in this article come from the network, only for learning, exchange, do not have any commercial purposes, copyright belongs to the original author, if you have any questions, please contact us to deal with

The following article is from the author of Tencent Cloud: Python learning tutorial

(Want to learn Python? Python Learning exchange group: 1039649593, to meet your needs, materials have been uploaded to the group file stream, you can download! There is also a huge amount of new 2020Python learning material.)

In the common several music sites, cool dog can be said to be the best to climb it, what bend all have no, also did not encrypt what, so the most suitable for small white entry crawler

This article is aimed at the crawler zero-based white, so I have screenshots of each step and detailed explanation, in fact, I look at the long-winded, in the final analysis is the request of two steps, but also please avoid spraying.

1. Open the official website of Kugou, you can see the search box, and the data we need to crawl is the song list returned by Kugou after searching songs and the song information of each song (lyrics, author, URL, etc.).



Press F12 to enter developer mode and select Network-All.



Enter the search content in the search box, and then you can see that there are many lists on the right side. The list data is actually in this one, which I have marked in red (to find this one, you can click on it according to the name song_search, if necessary, click on it one by one to see if it is the content you are looking for).



Click on this line and switch to Preview above to find json data for search results and Lists are data lists



Click on a song and it contains the song name, author, AlbumID, FileHash, etc



Then we switch to Headers above and see the RequestURL and the arrow below shows the GET request



If you scroll down, you can see the Requset Headers and the request parameters (this is the request parameters). Search terms, number of requests, etc.)



Without further explanation, we’ll construct requests directly from Python’s Requests library. My environment is Python2.7, and python3 is python3

#coding= UTF-8import requests search = 'like you' # search content pagesize = '10' # number of requests URL = 'https://songsearch.kugou.com/song_search_v2?callback=jQuery11240251602301830425_1548735800928&keyword=%s&page=1&pagesiz e=%s&userid=-1&clientver=&platform=WebFilter&tag=em&filter=2&iscorrection=1&privilege_filter=0&_=1548735800930' % (search,pagesize)res = requests. Get (URL) #requests for getCopy the code

The output looks like this, and you can see that the returned JSON content is printed out, which is exactly what you saw in the browser developer tools



Then we get the list, go back to the browser, get the details for each song in the list, select the first one on the left and click on the details page



You can see the jump to the play page, refresh the page, reload again



You can see that the red box on the right is the song information (you may ask me how to know which one contains the song information, of course, is the observation method, write more experience, I really won’t click on it one by one)



The arrows I use are all useful information that generally need to be retrieved. You can see the author, song name, lyrics, album picture, ID and play_url are all inside. If you copy play_url to the address bar and press enter, the song will be played



Then we switch from Preview to Headers from the top, and you can see that it’s pretty much the same as requesting a list of songs, but it’s still a GET request



The hash and album_id parameters are the same as those used in the GET request. These parameters are the same as the hash and album_id parameters in the GET request.



Create a get request based on the RequestURL in developer mode and change the ID and hash value for each song

#coding= UTF-8import requests # Directly in just the first step in the search mode when developers access to the search list of the first id and hash# finally have the coherent code id = '557512' # single idhash = '41 c2e4ab5660eae04021c5893e055f50' # single hash url = 'https://wwwapi.kugou.com/yy/index.php?r=play/getdata&callback=jQuery19107465224671418371_1555932632517&hash=%s&album_id =%s&_=1555932632518' % (hash,id) res = requests.get(url) print res.textCopy the code

You can see that the console printed the single track information, because the JSON data was not converted, the direct output printing now looks a bit messy



Note that the data returned by cool dog is not directly in JSON format. There are some useless strings at both ends, which need to be removed by regular expression. Only the contents inside curly braces {} (including curly braces) are kept



Now that we’re familiar with the above two steps, we’ll put it all together and write a complete Python crawler, enter the search song, and get the search list and the single information

# coding= UTF-8import requestsimPort jsonImport re # Request search list data search = raw_input(' music name :') # console input search keyword pagesize = "10" # Number of requests URL = 'https://songsearch.kugou.com/song_search_v2?callback=jQuery11240251602301830425_1548735800928&keyword=%s&page=1&pagesiz e=%s&userid=-1&clientver=&platform=WebFilter&tag=em&filter=2&iscorrection=1&privilege_filter=0&_=1548735800930' % (search, pagesize)res = requests. Get (URL) # Note that the data returned is not really in JSON format, Res = json.match (".*?({.*}).*", res. Text, loads(re.match(".*?({.*}). Re.s).group(1)) list = res['data']['lists'] MusicList = [] #for loop through the list to get information about each track Item ['AlnbumID'] ['AlnbumID'] url2 url2 = 'https://wwwapi.kugou.com/yy/index.php?r=play/getdata&callback=jQuery191010559973368921649_1548736071852&hash=%s&album_i d=%s&_=1548736071853' % ( item['FileHash'], item['AlbumID']) res2 = requests.get(url2) res2 = json.loads(re.match(".*?({.*}).*", Res2.text).group(1) ['data']# print res2['song_name']+' - '+res2['author_name'] print Dict = {'author': res2['author_name'], 'title': res2['song_name'], 'id': str(res2['album_id']), 'type': 'kugou', 'pic': res2['img'], 'url': res2['play_url'], 'lrc': Res2 ['lyrics']} # Add dictionaries to musicList.append(dict)Copy the code

Finally the console outputs the result