After learning the basic syntax, I saw someone online using Python to scrape a novel, so I copied the code and tried it out.

1. Prepare the environment: install BeautifulSoup4 and lxml

& C:/Python39/python.exe -m pip install --user BeautifulSoup4

& C:/Python39/python.exe -m pip install --user lxml

2. Name each downloaded file by a zero-padded sequence number, so the files sort correctly and no illegal characters end up in a file name, and wait 1 second between requests
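The naming idea in this step can be sketched on its own. The helper name `chapter_filename` and the set of stripped characters are assumptions for illustration; the script itself simply uses the zero-padded counter as the file name.

```python
import re

# Characters that Windows does not allow in file names
# (assumption: the files are saved on Windows).
ILLEGAL_CHARS = r'[\\/:*?"<>|]'


def chapter_filename(index, title=''):
    """Build a sortable, file-system-safe name such as '0007.txt'.

    `index` is zero-padded to four digits so files sort in chapter order.
    `title` is optional; illegal characters are stripped before it is appended.
    """
    number = str(index).zfill(4)
    safe_title = re.sub(ILLEGAL_CHARS, '', title).strip()
    if safe_title:
        return '{}_{}.txt'.format(number, safe_title)
    return number + '.txt'


print(chapter_filename(7))
print(chapter_filename(12, 'Chapter 12: What?'))
```

Zero-padding matters because plain numbers sort as text: without it, 10.txt would be listed before 2.txt.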

import os

import requests

import time

from bs4 import BeautifulSoup

# Declare the request headers

headers = {

'User-Agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'

}

# Create a folder to hold the text of the novel

if not os.path.exists('./novel'):
    os.mkdir('./novel')

path = 'http://www.biquw.com/book/416/'

# Visit the site and get the page data

response = requests.get(path)

response.encoding = response.apparent_encoding

print(response.text)

As the printed page source shows, the chapter links are stored in a tags. The parent of each a tag is an li, the parent of the li is a ul, and the ul sits inside a div. To get all the chapter data on the page, first locate that div. The div has a class attribute, so we can select it by class; see the code for details.
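The tag hierarchy described here can be tried offline on a small hand-written HTML fragment. The fragment below is an invented stand-in for the real page (only the `book_list` class name comes from the script), and it uses Python's built-in `html.parser` so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Invented sample markup that mimics the div > ul > li > a structure.
sample_html = """
<div class="book_list">
  <ul>
    <li><a href="1.html">Chapter 1</a></li>
    <li><a href="2.html">Chapter 2</a></li>
  </ul>
</div>
<div class="footer"><a href="/about.html">About</a></div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Select the div by its class first, then collect only the links inside it.
chapter_links = soup.find('div', class_='book_list').find_all('a')

# Pair each chapter title with its link target.
chapters = [(a.text, a['href']) for a in chapter_links]
print(chapters)
```

Selecting the div first keeps unrelated links, such as the footer link in the sample, out of the result.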

# lxml: an HTML parsing library that converts HTML into objects Python can work with

soup = BeautifulSoup(response.text, 'lxml')

book_list = soup.find('div', class_='book_list').find_all('a')

# The soup object collects the matching links and returns a list we can loop over

count = 1

for book in book_list:
    book_name = book.text
    # After getting the list data, get the link to the chapter detail page
    book_url = book['href']
    book_info_html = requests.get(path + book_url, headers=headers)
    book_info_html.encoding = book_info_html.apparent_encoding
    soup_part = BeautifulSoup(book_info_html.text, 'lxml')
    info = soup_part.find('div', id='htmlContent')
    # Use the zero-padded counter as the file name
    name = str(count)
    # print(info.text)
    with open('./novel/' + name.zfill(4) + '.txt', 'a', encoding='utf-8') as f:
        f.write(info.text)
    # Show which chapter was just saved
    print('{} '.format(book_name))
    count += 1
    # Wait 1 second before the next request
    time.sleep(1)
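The loop builds each chapter address with `path + book_url`, which works only when every `href` is a bare file name. A slightly more forgiving variant, sketched here with the standard library's `urllib.parse.urljoin`, also handles root-relative and absolute links. The example hrefs are made up, and which form the site actually uses is not verified here.

```python
from urllib.parse import urljoin

path = 'http://www.biquw.com/book/416/'

# Hypothetical href values in the three forms a page might use.
hrefs = [
    '1001.html',                                 # bare file name
    '/book/416/1002.html',                       # root-relative path
    'http://www.biquw.com/book/416/1003.html',   # absolute URL
]

# urljoin resolves each href against the index page URL.
chapter_urls = [urljoin(path, href) for href in hrefs]
for url in chapter_urls:
    print(url)
```

In the script, this would mean replacing `path + book_url` with `urljoin(path, book_url)`.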
