This is day five of my August Challenge.

Python crawler: scraping Mr. Lu Xun's "classic quotes" and saving them to an Excel table (with source code)

Preface

Today we'll use Python to crawl Mr. Lu Xun's "classic quotations". Let's jump straight in ~

Demo of the running result (screenshot in the original post)

Development tools

Python version: 3.6.4

Related modules

requests

lxml

pandas

Plus modules that ship with Python (e.g. time)
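The snippets below assume the third-party modules are installed (for example via pip install requests lxml pandas). As a small addition that the original snippets leave implicit, here is a minimal import block, including the item_list that the parsing step appends to:

import time

import requests
import pandas as pd
from lxml import etree

item_list = []  # collects one dict per quotation across all ten pages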

Approach

1. Fetch the data

Fetch the web pages from the "Good Sentence Fan" website:

http://www.shuoshuodaitupian.com/writer/128_1

The requests module fetches the HTML of each page from its URL; the next step is to parse those pages.

Only the last part of the URL changes (the suffix 1-10 covers pages 1 through 10).

# 1. Fetch the data
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/81.0.4044.129 Safari/537.36",
}

for i in range(10):  # pages 1 to 10
    url = "http://www.shuoshuodaitupian.com/writer/128_" + str(i + 1)
    result = requests.get(url, headers=headers).content.decode()
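A hedged aside, not part of the original script: requests can also handle decoding and error checking itself, which is a bit more defensive than calling .content.decode() directly:

# Optional, more defensive variant of the fetch above (same url and headers)
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()                     # raise on HTTP error codes
response.encoding = response.apparent_encoding  # let requests guess the charset
result = response.text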

2. Parse the data

Each sentence's content, source, and score are extracted with XPath expressions (the original post illustrates the page structure with a screenshot).

For each entry, put the three fields into a dictionary and append that dictionary to a list.

Source code:

# 2. Parse the data (this block runs inside the page loop above)
html = etree.HTML(result)
div_list = html.xpath('//div[@class="item statistic_item"]')
div_list = div_list[1:-1]  # the first and last divs are not quotation entries

for div in div_list:
    # Iterate over each entry on the page
    item = {}

    # Note the "./": search downward from the current node only
    item['content'] = div.xpath('./a/text()')[0]
    item['source'] = div.xpath('./div[@class="author_zuopin"]/text()')[0]
    item['score'] = div.xpath('.//a[@class="infobox zan like "]/span/text()')[0]

    item_list.append(item)

print("Crawling to page {}".format(i + 1))
time.sleep(0.1) Save data: Once the retrieved data is in a list, the Pandas module converts the data type to a DataFrame, which can be easily saved to an Excel file. To prevent Garbled Chinese characters, pay attention to the encoding format.Copy the code

3. Save the data

Once the scraped data is collected in the list, the pandas module converts it into a DataFrame, which can easily be saved to an Excel or CSV file. To prevent garbled Chinese characters, pay attention to the encoding format.

# 3. Save the data as a CSV file
df = pd.DataFrame(item_list)
df.to_csv('Classic Quotes of Lu Xun.csv', encoding='utf_8_sig')  # utf_8_sig keeps Chinese characters intact in Excel
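Since the title promises an Excel table, note that pandas can also write a real .xlsx file directly; this sketch assumes an Excel engine such as openpyxl is installed, which is not among the modules listed above:

# Optional: write a genuine Excel workbook instead of a CSV
df.to_excel('Classic Quotes of Lu Xun.xlsx', index=False)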

The results, sorted by score, are shown in the screenshot in the original post.
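A minimal sketch of that sorting step (my addition, assuming the scraped score values are plain digit strings):

# Cast the scraped score strings to integers, then sort in descending order
df['score'] = df['score'].astype(int)
df = df.sort_values(by='score', ascending=False)
df.to_csv('Classic Quotes of Lu Xun (sorted).csv', encoding='utf_8_sig')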

If you want to generate several such files, the same pattern works: append each dictionary to a list with a for loop, then export the list as a DataFrame.

That's all for this article; thank you for reading. In the next installment of this Python data analysis series, I'll share a crawler project that analyzes short-comment data.

To thank the readers, I’d like to share some of my recent programming gems to give back to each and every one of you.

The goodies mainly include:

① 2,000+ Python e-books (mainstream and classic titles)

② The Python Standard Library (Chinese edition)

③ Project source code (forty or fifty interesting, classic practice projects with source code)

④ Videos on Python basics, crawlers, web development, and big data analysis (suitable for beginners)

⑤Python Learning Roadmap

That's all ~ see my profile for the complete source code.

Previous posts

Generating "fake" data with Python