Project description

This project collects NetEase Cloud Music playlist data. For a chosen playlist style, it visits every playlist in that category and records the playlist name, creator, play count, page link, favorites count, share count, comment count, tags, introduction, number of songs, and the names of the listed songs. It then stores all of this information, extracts the top ten playlists by play count, and visualizes the results.

While building this crawler, I discussed page turning with people around me. Some suggested using Selenium to simulate clicks, but by observing the web page I found that I could walk through all the playlist pages without simulating any clicks. Below I will show how the data is crawled.

The code framework

Third-party Library Description

Introduction to some third-party libraries used in the project:

Bs4 "" BS4 stands for Beautiful Soup and provides simple, Python-like functions for navigating, searching, modifying analysis trees, and more. It is a toolbox that automatically converts input documents to Unicode encoding and output documents to UTF-8 encoding by parsing them for Beautiful Soup. The requests library is used for accessing web pages and retrieving web content, and supports HTTP features. The Time library is a time-processing module, and in this project is used to enforce the time interval for web pages. The random library provides random numbers and is used in conjunction with the Time library to produce random enforced access intervals. The library is used when Python accesses Excel. The Library is a NumPy - based tool for reading text files. It can be used to manipulate data quickly and easily. "" pyecharts.Charts is a library for data visualization. It contains a variety of drawing tools. In this case, Bar was used to draw Bar charts. Matplotlib is also a visual library, consisting of various visual classes. Matplotlib. pyplot is a command sublibrary for plotting visual graphs.Copy the code

Content crawl instructions

Crawl links: music.163.com/discover/pl…

Page for details

Observing the web content is the first step of any crawler project. Here I chose a Chinese-style playlist category to observe.

The Chinese-style category is 37 pages long, with 35 playlists per page, for a total of about 1,295 playlists. One style's playlists are representative of the rest. When writing a crawler, to avoid building something one-off, look at a page, find the pattern, and write a structured crawler: when the page content changes, the overall framework does not, and the code keeps running. That is also one test of the code's robustness.

After checking the other categories, you can see that each one has 37 or 38 pages, with 35 playlists per page. So how do you walk through every page? I thought about this for a long time. I did not want to use Selenium to simulate clicks, so I had to look at the page source for clues. As usual, press F12 to open the developer tools:

In the source code, we can see that each page corresponds to a regular link, such as "music.163.com/#/discover/…". By observing the page links, I found that the key to page turning is the "&limit=35&offset=35" part: each page is determined by the offset number at the end of the link, which is always a multiple of 35. The first page is "&limit=35&offset=0", the second page is "&limit=35&offset=35", the third page is "&limit=35&offset=70", and so on. As long as you know how many pages the current category has, you can walk through every page with a for loop.
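A minimal sketch of that rule (the category URL is the one crawled below; the page count of 37 is a placeholder that the crawler later reads from the pager):

```python
base = 'https://music.163.com/discover/playlist/?cat=中国风'
pages = 37  # placeholder: read from the pager in practice

for i in range(pages):
    offset = 35 * i
    print(base + '&limit=35&offset=' + str(offset))
    # page 1 ends with offset=0, page 2 with offset=35, page 3 with offset=70, ...
```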

Now that we know how to turn the page, the key is to get the number of pages in the category. Go to the last page number in the pager (where the arrow points), right-click it in the developer tools, and use Copy > Copy selector to copy the CSS selector statement:

```python
# copied selector: #m-pl-pager > div > a:nth-child(11)
result = bs.select('#m-pl-pager > div > a:nth-child(11)')
```
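The matched element is the last page-number link, so its text is the page count. A self-contained sketch of reading it (assuming the category page renders this pager server-side, which is what the crawler below relies on):

```python
import requests
from bs4 import BeautifulSoup

url = 'https://music.163.com/discover/playlist/?cat=中国风'
headers = {'User-Agent': 'Mozilla/5.0'}  # minimal header for illustration
bs = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')

result = bs.select('#m-pl-pager > div > a:nth-child(11)')
num = int("\n".join(link.text for link in result))  # text of the last page link
print("pages in this category:", num)
```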

Next is crawling the content of each individual playlist. Because we crawl quite a few fields, I will not list them all here; you can compare against the code below for reference, and send me a private message if anything is unclear.

Getting the playlist name

Enter each page to get every playlist on it, then enter each individual playlist. The playlist name, creator, play count, and other data are all stored in the same div of the page:

```
id='content-operation'
```
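As an illustration, the fields in that area can be pulled out with the CSS selectors configured later in siteData (a sketch against a hypothetical playlist id):

```python
import requests
from bs4 import BeautifulSoup

url = 'https://music.163.com/playlist?id=0'  # hypothetical playlist id
headers = {'User-Agent': 'Mozilla/5.0'}
bs = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')

name = "\n".join(t.text for t in bs.select('div.tit h2.f-ff2.f-brk'))  # playlist name
creator = "\n".join(t.text for t in bs.select('span.name a'))          # creator
play = "\n".join(t.text for t in bs.select('strong#play-count'))       # play count
print(name, creator, play)
```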

When selecting each piece of content through a selector, keep in mind that the web version of NetEase Cloud Music does not display all the songs in a playlist, only the first 10. Therefore only 10 song names are obtained per playlist when crawling. If you want more details about each song, you can follow the song's own URL for further content. To build a structured crawler I define a Content class and a Website class; if you don't understand that approach, you can look at my previous post.

Content class and Website class

```python
class Content:
    def __init__(self, url, name, creator, play, collect, transmit, comment,
                 tag, introduce, sing_num, sing_name):
        self.url = url
        self.name = name
        self.creator = creator
        self.play = play
        self.collect = collect
        self.transmit = transmit
        self.comment = comment
        self.tag = tag
        self.introduce = introduce
        self.sing_num = sing_num
        self.sing_name = sing_name

    def print(self):
        print("URL: {}".format(self.url))
        print("NAME: {}".format(self.name))
        print("CREATOR: {}".format(self.creator))
        print("PLAY: {}".format(self.play))
        print("COLLECT: {}".format(self.collect))
        print("TRANSMIT: {}".format(self.transmit))
        print("COMMENT: {}".format(self.comment))
        print("TAG: {}".format(self.tag))
        print("INTRODUCE: {}".format(self.introduce))
        print("SING_NUM: {}".format(self.sing_num))
        print("SING_NAME: {}".format(self.sing_name))


class Website:
    def __init__(self, searchUrl, resultUrl, pUrl, absoluteUrl, nameT, creatorT,
                 playT, collectT, transmitT, commentT, tagT, introduceT,
                 sing_numT, sing_nameT):
        self.searchUrl = searchUrl
        self.resultUrl = resultUrl
        self.pUrl = pUrl
        self.absoluteUrl = absoluteUrl
        self.nameT = nameT
        self.creatorT = creatorT
        self.playT = playT
        self.collectT = collectT
        self.transmitT = transmitT
        self.commentT = commentT
        self.tagT = tagT
        self.introduceT = introduceT
        self.sing_numT = sing_numT
        self.sing_nameT = sing_nameT
```
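Content holds the crawled fields for one playlist, while Website holds a site's configuration: its URLs and the CSS selector ("tag") for each field. A quick hypothetical example of the Content side:

```python
# hypothetical values, just to show the shape of the class
c = Content('https://music.163.com/playlist?id=0', 'demo playlist', 'demo user',
            '100', '10', '1', '5', 'Chinese style', 'an introduction',
            '10', 'song A\nsong B')
c.print()  # prints each field on its own line
```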

The Crawler class

```python
from bs4 import BeautifulSoup
import re
import requests
import time
import random
import xlwt

# request headers
headers_ = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}


class Crawler:
    def getWeb(self, url):
        # request the page and parse it with BeautifulSoup
        try:
            req = requests.get(url, headers=headers_)
            req.encoding = "utf-8"  # avoid garbled characters
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, "html.parser")

    def getContent(self, pageObj, selector):
        # select elements by CSS selector and join their text
        childObj = pageObj.select(selector)
        return "\n".join(line.text for line in childObj)

    def search(self, topic, site):
        # read the page count of the category from the pager
        newUrl = site.searchUrl + topic
        newUrl = requests.utils.quote(newUrl, safe=':/?=&')
        bs = self.getWeb(newUrl)
        result = bs.select('#m-pl-pager > div > a:nth-child(11)')
        num = int("\n".join(link.text for link in result))
        # walk through every page of the category
        for i in range(0, num + 1):
            j = 35 * i
            url = site.searchUrl + topic + '&limit=35&offset=' + str(j)
            url = requests.utils.quote(url, safe=':/?=&')
            bs = self.getWeb(url)
            searchResults = bs.select(site.resultUrl)
            for link in searchResults:
                url = link.attrs['href']  # relative link to one playlist
                if site.absoluteUrl:
                    bs = self.getWeb(url)
                else:
                    bs = self.getWeb(site.pUrl + url)
                if bs is None:
                    print("Something was wrong with that page or URL. Skipping!")
                    return
                data = []
                # only the first 10 songs are listed in the web version
                main = bs.find('ul', {'class': 'f-hide'})
                sing_name = "\n".join(music.text for music in main.find_all('a'))
                url = site.pUrl + url
                data.append(url)
                name = self.getContent(bs, site.nameT)
                data.append(name)
                creator = self.getContent(bs, site.creatorT)
                data.append(creator)
                play = self.getContent(bs, site.playT)
                data.append(play)
                collect = self.getContent(bs, site.collectT)
                data.append(collect)
                transmit = self.getContent(bs, site.transmitT)
                data.append(transmit)
                comment = self.getContent(bs, site.commentT)
                data.append(comment)
                tag = self.getContent(bs, site.tagT)
                data.append(tag)
                introduce = self.getContent(bs, site.introduceT)
                data.append(introduce)
                sing_num = self.getContent(bs, site.sing_numT)
                data.append(sing_num)
                data.append(sing_name)
                datalist.append(data)
                content = Content(url, name, creator, play, collect, transmit,
                                  comment, tag, introduce, sing_num, sing_name)
                content.print()
                time.sleep(random.random() * 3)  # randomized interval between requests

    def saveData(self, datalist, savepath):
        print("Saving to an Excel file!")
        # xlwt.Workbook creates the workbook, add_sheet adds a worksheet
        music = xlwt.Workbook(encoding='utf-8', style_compression=0)
        sheet = music.add_sheet('NetEase Cloud Music', cell_overwrite_ok=True)
        col = ('url', 'playlist name', 'creator', 'play count', 'favorites',
               'shares', 'comments', 'tags', 'introduction', 'number of songs', 'song names')
        for i in range(0, 11):
            sheet.write(0, i, col[i])  # header row
        for i in range(0, len(datalist)):
            print("Writing row {}".format(i + 1))
            data = datalist[i]
            for j in range(0, 11):
                sheet.write(i + 1, j, data[j])
        music.save(savepath)
        print("Data saved successfully!")


crawler = Crawler()
# searchUrl, resultUrl, pUrl, absoluteUrl, nameT, creatorT, playT, collectT,
# transmitT, commentT, tagT, introduceT, sing_numT, sing_nameT
siteData = [['https://music.163.com/discover/playlist/?cat=',
             'a.msk',  # selector for playlist links on a category page
             'https://music.163.com',
             False,
             'div.tit h2.f-ff2.f-brk',
             'span.name a',
             'strong#play-count',
             'a.u-btni.u-btni-fav i',
             'a.u-btni.u-btni-share i',
             '#cnt_comment_count',
             'div.tags.f-cb a i',
             'p#album-desc-more',
             'div.u-title.u-title-1.f-cb span.sub.s-fc3',
             'span.txt a b']]
sites = []
datalist = []
for row in siteData:
    sites.append(Website(row[0], row[1], row[2], row[3], row[4], row[5], row[6],
                         row[7], row[8], row[9], row[10], row[11], row[12], row[13]))

topic = '中国风'  # the Chinese-style category
crawler.search(topic, sites[0])
savepath = 'NetEase Cloud Music.xls'
crawler.saveData(datalist, savepath)
```

Crawl results




Because there is so much data, only part of it is captured here; if you are interested, you can run the code yourself.

Content visualization

Visual code

```python
import pandas as pd
from pyecharts.charts import Pie   # pie / rose charts
from pyecharts.charts import Bar   # bar charts
from pyecharts import options as opts
import matplotlib.pyplot as plt

# read the crawled data and keep the ten playlists with the highest play count
data = pd.read_excel('NetEase Cloud Music.xls')
df = data.sort_values('play count', ascending=False).head(10)
v = df['playlist name'].values.tolist()  # tolist() converts the column to a list
d = df['play count'].values.tolist()

# set the colors
color_series = ['#2C6BA0', '#2B55A1', '#2D3D8E', '#44388E', '#6A368B',
                '#7D3990', '#A63F98', '#C31C88', '#D52178', '#D5225B']

pie1 = Pie()
pie1.set_colors(color_series)
# add the data, set the radius and center; rosetype="area" renders a rose (Nightingale) chart
pie1.add("", [list(z) for z in zip(v, d)],
         radius=["30%", "135%"],
         center=["50%", "65%"],
         rosetype="area")
# global configuration items:
# TitleOpts sets the title; LegendOpts(is_show) toggles the legend; ToolboxOpts() shows the toolbar
pie1.set_global_opts(title_opts=opts.TitleOpts(title='Top 10 playlists by play count'),
                     legend_opts=opts.LegendOpts(is_show=False),
                     toolbox_opts=opts.ToolboxOpts())
# series configuration items:
# LabelOpts controls the labels: is_show shows them, position="inside" puts them in the slices
pie1.set_series_opts(label_opts=opts.LabelOpts(is_show=True, position="inside",
                                               font_size=12,
                                               formatter="{b}:{c}",
                                               font_style="italic",
                                               font_weight="bold",
                                               font_family="Microsoft YaHei"))
pie1.render("E:/rose.html")
print("Rose chart saved successfully!")
print("-----" * 15)

# bar chart: creators of the top-10 playlists against their comment counts
# print(df['creator'].values.tolist())
bar = (Bar()
       .add_xaxis([i for i in df['creator'].values.tolist()])
       .add_yaxis('comments of the top-10 playlists', df['comments'].values.tolist()))
bar.render("E:/bar.html")
print("Bar chart saved successfully!")
```

The word cloud code

```python
import wordcloud
import pandas as pd
import numpy as np

data = pd.read_excel('NetEase Cloud Music.xls')
data = data.sort_values('play count', ascending=False).head(10)
print(data['playlist name'])

# font_path points at a Chinese-capable font (Microsoft YaHei) so CJK characters render
w1 = wordcloud.WordCloud(width=1000, height=700,
                         background_color='white',
                         font_path='msyh.ttc')
txt = "\n".join(i for i in data['playlist name'])
w1.generate(txt)
w1.to_file('E:/wordcloud.png')
```

Rose chart

Bar charts

The word cloud

That's all for this post. Friends who are interested are welcome to come and exchange ideas. Good night, everyone, bye-bye!
