Introduction:

Hello, hello ~ In this era full of generation gaps, one thing seems to bridge them all: the novel. Take "Overlord President" novels, for example: 80-year-olds and 18-year-olds alike love reading them. It is fair to say that novels are spiritual food for many people……

Today, we will use Selenium to crawl novel data from the Hongxiu Tianxiang website and do some simple data visualization and analysis.

Body:

Preliminary Website analysis

Open the webpage as shown in the figure:

We’re going to grab all the data in the novel category: 50 pages, 20 records per page, 1,000 records in total.

First, use the third-party Requests library to request the data:

import requests

url = 'https://www.hongxiu.com/category/f1_f1_f1_f1_f1_f1_0_1'  # first-page URL
headers = {'xxx': 'xxx'}  # request headers, e.g. a User-Agent
res = requests.get(url, headers=headers)
print(res.content.decode('utf-8'))

We found that the response did not contain the novel data of the first page, so it is clearly not in the page source. By checking the Network panel, we found the following request fields:

_csrfToken: btXPBUerIB1DABWiVC7TspEYvekXtzMghhCMdN43
_: 1630664902028

These parameters are generated by JavaScript. To avoid reverse-engineering the encryption, it is faster to crawl the data with Selenium.

Selenium crawls the data

01 Preliminary Test

from selenium import webdriver
import time

url = 'https://y.qq.com/n/ryqq/songDetail/0006wgUu1hHP0N'

driver = webdriver.Chrome()

driver.get(url)

The results work fine, no problems.

02 Novel Data

Specify the novel fields to crawl:

image link, name, author, type, completion status, popularity, introduction

'https://www.hongxiu.com/category/f1_f1_f1_f1_f1_f1_0_1'  # first-page URL
'https://www.hongxiu.com/category/f1_f1_f1_f1_f1_f1_0_2'  # second-page URL
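Since only the trailing number changes between pages, a small helper (`page_url` is a name introduced here for illustration) can build the URL for any page:

```python
def page_url(n: int) -> str:
    # Build the category URL for page n; only the trailing number changes
    return f'https://www.hongxiu.com/category/f1_f1_f1_f1_f1_f1_0_{n}'

print(page_url(1))
```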

03 Data Parsing

Here is the code to parse the page data:

def get_data():
    items = driver.find_elements_by_xpath("//div[@class='right-book-list']/ul/li")
    for item in items:
        dic = {}
        imgLink = item.find_element_by_xpath("./div[1]/a/img").get_attribute('src')
        # 1. image link  2. name  3. type ... and the remaining fields
        dic['img'] = imgLink
        # ......
        ficList.append(dic)

Here are a few things to note:

  1. Be careful with the XPath expressions; small mistakes in the details are easy to make;

  2. Some novel introductions are long and contain newlines; use the string's replace method to replace '\n' with an empty string for easier storage.
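The cleanup in point 2 can be sketched as a tiny helper (`clean_intro` is a hypothetical name, not from the original code):

```python
def clean_intro(text: str) -> str:
    # Replace newlines with nothing and trim the edges so the intro fits in one CSV cell
    return text.replace('\n', '').strip()

print(clean_intro('  A long\nintro\n'))
```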

04 Page turn and crawl

Here is the code for paging data:

try:
    time.sleep(3)
    js = "window.scrollTo(0, 100000)"
    driver.execute_script(js)
    while driver.find_element_by_xpath("//div[@class='lbf-pagination']/ul/li[last()]/a"):
        driver.find_element_by_xpath("//div[@class='lbf-pagination']/ul/li[last()]/a").click()
        time.sleep(3)
        get_data()
        print(count, "*" * 20)
        count += 1
        if count >= 50:
            return None
except Exception as e:
    print(e)

Code description:

  1. Use a try statement to handle exceptions, in case a particular page has a mismatched element or something else goes wrong.

  2. The driver executes JS code to manipulate the scrolling and slide to the bottom of the page.

    js = "window.scrollTo(0, 100000)"
    driver.execute_script(js)

  3. time.sleep(n) is needed because the parsing function (driver element location) runs inside the loop and must wait for the data to load completely.

  4. The while loop keeps locating and clicking the 'next page' button, ensuring that each subsequent page is crawled.

  5. Use an if statement as the exit condition for the while loop, and exit the function with return rather than break.
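Stripped of the Selenium calls, the loop-and-return pattern described in points 4 and 5 looks roughly like this (`fetch_page` is a hypothetical stand-in for the click-and-parse step):

```python
def crawl_pages(fetch_page, max_pages=50):
    # Collect items page by page; exit with return (not break) once max_pages is reached
    results = []
    count = 1
    while True:
        results.extend(fetch_page(count))
        print(count, '*' * 20)
        count += 1
        if count > max_pages:
            return results

# usage with a fake page fetcher that yields one item per page
items = crawl_pages(lambda n: [f'page-{n}'], max_pages=3)
```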

05 Data Saving

import csv

titles = ['imgLink', 'name', 'author', 'types', 'pink', 'popu', 'intro']
with open('hx.csv', mode='w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, titles)
    writer.writeheader()
    writer.writerows(data)
    print('write successful')
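For reference, here is a self-contained round-trip of the same csv.DictWriter pattern on a made-up row (the file name demo.csv is arbitrary):

```python
import csv

titles = ['imgLink', 'name', 'author', 'types', 'pink', 'popu', 'intro']
rows = [
    {'imgLink': 'img1', 'name': 'A', 'author': 'X', 'types': 'romance',
     'pink': 'completed', 'popu': '10', 'intro': 'demo intro'},
]

# Write the header row, then the data rows
with open('demo.csv', mode='w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, titles)
    writer.writeheader()
    writer.writerows(rows)

# Read it back to confirm the structure survived
with open('demo.csv', encoding='utf-8', newline='') as f:
    back = list(csv.DictReader(f))
print(back[0]['name'])
```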

06 Program Operation

The result is as follows, with 1000 entries:

Some points to note when using Selenium to crawl data:

① If the data has not fully loaded, a find_element_by_xpath call will fail to locate the element in the DOM document and WebDriver will throw an error:

selenium.common.exceptions.StaleElementReferenceException:
stale element reference: element is not attached to the page document

This means the referenced element is stale and no longer attached to the current page. Cause: usually, the page has been refreshed or has navigated elsewhere.

Solution: 1. Re-locate the element with find_element or find_elements. 2. Or simply refresh the page with driver.refresh() and wait a few seconds before continuing, e.g. time.sleep(5).

For a solution to this error, see the following blog: www.cnblogs.com/qiu-hua/p/1…
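A third option is to wrap the locator call in a small retry helper (a sketch; in practice `locate` would be something like `lambda: driver.find_element_by_xpath(...)`):

```python
import time

def retry_find(locate, attempts=3, delay=0.1):
    # Re-run the locator callable, sleeping between attempts, so a stale
    # reference raised mid-refresh gets another chance before we give up
    last_exc = None
    for _ in range(attempts):
        try:
            return locate()
        except Exception as exc:
            last_exc = exc
            time.sleep(delay)
    raise last_exc
```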

② When clicking the 'next page' button dynamically, the button must be located precisely. There is also a window-size issue to note: when Selenium opens the browser, the window should be maximized.

This is because an absolutely positioned QR-code widget sits on the right side of the page. If the window is not maximized, this widget covers the 'next page' button and makes it impossible to click.

Data analysis and visualization

Open the file

import pandas as pd
data = pd.read_csv('./hx.csv')
data.head()


Based on the data we collected, the following visualizations can be made:

01 Proportion of different types of novels

types = ['modern romance', 'ancient romance', 'fantasy', 'fantasy romance',
         'sci-fi space', 'xianxia', 'city', 'history', 'science',
         'xianxia romance', 'romantic youth', 'other']
number = [343, 285, 83, 56, 45, 41, 41, 25, 14, 14, 13, 40]
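If the data is already loaded as a DataFrame, these two lists can be derived with value_counts; a sketch on stand-in data, assuming the column is named types as in the CSV header:

```python
import pandas as pd

# Tiny stand-in for the real hx.csv data
data = pd.DataFrame({'types': ['modern romance', 'modern romance', 'fantasy']})
counts = data['types'].value_counts()  # sorted by count, descending
types = list(counts.index)
number = [int(v) for v in counts.values]
print(types, number)
```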

Pyecharts pie chart

from pyecharts import options as opts
from pyecharts.charts import Pie

pie = (
    Pie()
    .add(
        "",
        [list(z) for z in zip(types, number)],
        radius=["40%", "75%"],
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Proportion of novel types"),
        legend_opts=opts.LegendOpts(orient="vertical", pos_top="15%", pos_left="2%"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
)
pie.render('pie.html')

The result is shown in the figure:

As can be seen from the graph, romance novels account for half of all novels.

02 Proportion of finished novels

from pyecharts import options as opts
from pyecharts.charts import Pie

ty = ['completed', 'serializing']
num = [723, 269]

pie = (
    Pie()
    .add("", [list(z) for z in zip(ty, num)])
    .set_global_opts(title_opts=opts.TitleOpts(title="Completion status of novels"))
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
)
pie.render('pie1.html')

The result is shown below:

As can be seen from the chart, more than a quarter of the novels are still being serialized.

03 Word cloud of novel introductions

Generate a .txt file

with open('hx.txt','a',encoding='utf-8') as f:
    for s in data['intro']:
        f.write(s + '\n')

Initialization Settings

import jieba
from PIL import Image
import numpy as np
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = open("./hx.txt", encoding='utf-8').read()
# Remove newlines and full-width spaces
text = text.replace('\n', "").replace("\u3000", "")
# Word segmentation, then rejoin the tokens with spaces
text_cut = jieba.lcut(text)
text = " ".join(text_cut)

Word cloud display

word_list = jieba.cut(text)
space_word_list = ' '.join(word_list)
# print(space_word_list)
# Read the image file used as the word-cloud mask
mask_pic = np.array(Image.open("./xin.png"))
word = WordCloud(
    font_path='C:/Windows/Fonts/simfang.ttf',  # set the font
    mask=mask_pic,             # set the mask shape
    background_color='white',  # set the background color
    max_font_size=150,         # maximum font size
    max_words=2000,            # maximum number of words
    stopwords={' '},           # set stopwords
).generate(space_word_list)
image = word.to_image()
word.to_file('h.png')  # save the image
image.show()

The result is shown in the figure:

Nothing special stands out in this picture. Other, more effective natural-language-processing methods could be considered.

04 Popularity of novels by genre

from pyecharts.charts import Bar
from pyecharts import options as opts

bar = Bar()
# c and numList come from aggregating popularity by type
bar.add_xaxis(list(c['types'].values))
bar.add_yaxis('popularity', numList)
bar.set_global_opts(xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=45)))
bar.render()

The result is shown in the figure:

It can be seen that romance novels have been an eternal topic since ancient times…
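The variables c and numList used in the bar-chart snippet above are not defined there; they plausibly come from a groupby aggregation, sketched here on stand-in data (assuming a numeric popu column):

```python
import pandas as pd

# Stand-in for the real hx.csv data
data = pd.DataFrame({'types': ['fantasy', 'fantasy', 'romance'],
                     'popu': [100, 50, 200]})
# Sum popularity per type; groupby sorts the groups alphabetically
c = data.groupby('types', as_index=False)['popu'].sum()
numList = [int(v) for v in c['popu'].values]
print(list(c['types'].values), numList)
```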

05 Proportion of hot novels by different authors

We looked at the number of novels each author has written and found the following:

data['author'].value_counts()

The top three authors by volume wrote only romance novels, while the next two wrote a variety of genres. Next, we analyze the popularity of the romance authors and of the other authors.

Popularity of romance novelists

from pyecharts import options as opts
from pyecharts.charts import Pie

attr = ['', '', 'Bronze spike']  # the three romance authors; two names are missing in the source
v1 = [1383131, 5107, 4]

pie = (
    Pie()
    .add("", [list(z) for z in zip(attr, v1)])
    .set_global_opts(title_opts=opts.TitleOpts(title="Popularity of romance novelists"))
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
)
pie.render('pie.html')

All three novelists are quite popular; the most popular, of course, is **When line**.

Popularity of two other novelists

from pyecharts.charts import Bar
from pyecharts import options as opts

bar = Bar()
# Specify the x-axis of the bar chart
bar.add_xaxis(['fantasy', 'fantasy romance', 'xianxia'])
# y-axis values for each author
bar.add_yaxis("Tang Jia San Shao", [2315, 279, 192])
bar.add_yaxis("I Eat Tomatoes", [552, 814, 900])
# Specify the chart title
bar.set_global_opts(title_opts=opts.TitleOpts(title="Popular novel ranking"))
# Specify the generated HTML name
bar.render('tw.html')

As the chart shows, Tang Jia San Shao's fantasy novels stand out, while I Eat Tomatoes' three novels are more evenly popular.

Conclusion

We used Selenium to crawl the novel page data of the Hongxiu Tianxiang website. Because the page's JS is encrypted, Selenium was the practical choice. To summarize the points worth noting:

① Selenium data crawling needs to pay attention to the following points:

  • The positioning of various elements needs to be precise;

  • Because Selenium has to load and run JS code, elements must be fully loaded before they can be located, so opened pages need time.sleep(n) waits;

  • Many websites have an absolutely positioned element (often a QR code…) fixed to a position on the screen that does not move when the page scrolls. Maximize the window to prevent such widgets from covering page elements and blocking clicks or other operations.

② Data should be cleaned before visualization, because some of it is not standardized; otherwise you may hit errors like this:

'utf-8' codec can't decode byte 0xcb in 
position 2: invalid continuation byte

The page's encoding is declared in the charset attribute of the meta tag; use the same encoding when opening the file. If the error persists, try changing the encoding to 'gb18030':

<meta charset="UTF-8">
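The fallback described above can be sketched as a small decode helper (`robust_decode` is a name introduced here for illustration):

```python
def robust_decode(raw: bytes) -> str:
    # Try utf-8 first; on failure fall back to gb18030, a superset of GBK
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('gb18030')

print(robust_decode('中文'.encode('gb18030')))
```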

Note: this article is for learning and exchange only. Crawl lightly, so as not to increase the burden on the server.

If you found this helpful, remember to leave a like; message me privately if you want the complete project source!
