Preface

To make it easier to slack off at work, today I wrote a little program that crawls novels from the Biquge site. Of course, the real purpose is just to learn Python and to share it.

First, import the relevant modules



```python
import os
import requests
from bs4 import BeautifulSoup
```
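If these libraries are not installed yet, something like `pip install requests beautifulsoup4 lxml` should cover them (lxml is the parser used later in this article).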
    



Second, send a request to the website and obtain the website data

The last number in the website link is the ID of a book, and each ID corresponds to one novel. Let’s take the novel with ID 1 as an example.
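As a small illustration of that URL pattern, here is a hypothetical helper (the function name is mine, not from the original code) that builds the chapter-list URL for a given book ID:

```python
# hypothetical helper: build the chapter-list URL for a given book ID
def book_index_url(book_id: int) -> str:
    return 'http://www.biquw.com/book/{}/'.format(book_id)

print(book_index_url(1))  # http://www.biquw.com/book/1/
```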

After opening the page, we see a chapter list, so the first step is to crawl the chapter names from that list.

```python
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
}

# create a folder to store the novel
os.mkdir('./novel/')

# visit the website and get the page data
response = requests.get('http://www.biquw.com/book/1/').text
print(response)
```

You may have noticed a problem here: the site displays fine when visited normally in a browser, so why is the data that comes back garbled?

This is because the encoding of the HTML page does not match the decoding applied to the data we fetch from Python. Python decodes as UTF-8 by default, but the page may be encoded in GBK or GB2312, so we need to make the code switch to the page’s own encoding.
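Before rewriting the request, here is a quick sketch of how to see the mismatch for yourself, using two attributes that requests exposes:

```python
# compare the encoding requests assumed with the one detected from the page bytes
r = requests.get('http://www.biquw.com/book/1/')
print(r.encoding)           # encoding taken from the HTTP response headers
print(r.apparent_encoding)  # encoding guessed from the content itself (e.g. GBK/GB2312)
```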

Rewrite the request code:

```python
response = requests.get('http://www.biquw.com/book/1/')
response.encoding = response.apparent_encoding
print(response.text)
```

The Chinese text returned this way displays correctly.

Third, extract the data after obtaining the page data

Once we have the page data decoded correctly, the next step is static page analysis: pulling the data we want (the chapter list) out of the full page data.

1. Open the browser first

2. Press F12 to bring up developer tools

3. Select the element selector

4. Select the data we want on the page and locate the element

5. Observe which tags the data lives in (see the sketch after this list)
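To make the structure concrete, here is a minimal sketch using made-up markup in the shape described in the next paragraph (a `div` with class `book_list` containing `ul > li > a`); the real page is parsed the same way:

```python
from bs4 import BeautifulSoup

# made-up sample markup in the same shape as the chapter list on the real page
sample_html = """
<div class="book_list">
  <ul>
    <li><a href="1001.html">Chapter 1</a></li>
    <li><a href="1002.html">Chapter 2</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample_html, 'lxml')
for a in soup.find('div', class_='book_list').find_all('a'):
    print(a.text, a['href'])
```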

As shown in the figure above, the data lives in the `a` tags. The parent of `a` is `li`, the parent of `li` is `ul`, and the `ul` sits inside a `div`. So to get the chapter data for the whole page, we first need to get that `div`, and we can locate it by its class attribute. See the code for details:

```python
# the HTML parsing library turns the HTML into Python objects
soup = BeautifulSoup(response.text, 'lxml')
book_list = soup.find('div', class_='book_list').find_all('a')
for book in book_list:
    book_name = book.text
    # each a tag's href holds the link to the chapter's detail page
    book_url = book['href']
```

Fourth, after obtaining the link to the detail page, make a second request to the detail page and obtain the article data



```python
book_info_html = requests.get('http://www.biquw.com/book/1/' + book_url, headers=headers)
book_info_html.encoding = book_info_html.apparent_encoding
soup = BeautifulSoup(book_info_html.text, 'lxml')
```
    


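A side note on the URL concatenation above: it works here because the hrefs on the chapter list appear to be relative file names. A slightly more general alternative is the standard library’s `urljoin`:

```python
from urllib.parse import urljoin

# builds the same URL for relative hrefs, and also copes with absolute ones
chapter_url = urljoin('http://www.biquw.com/book/1/', book_url)
```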

Fifth, static page analysis of the novel detail page



```python
info = soup.find('div', id='htmlContent')
print(info.text)
```
    



Sixth, data download

```python
with open('./novel/' + book_name + '.txt', 'a', encoding="utf-8") as f:
    f.write(info.text)
```

Finally, let’s take a look at the code
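Putting the snippets above together, a minimal end-to-end sketch looks roughly like this (the URL, selectors, and folder name are taken from the snippets above; error handling and polite request delays are left out):

```python
import os

import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
}
base_url = 'http://www.biquw.com/book/1/'

# create the output folder (raises an error if it already exists)
os.mkdir('./novel/')

# fetch the chapter list and fix the encoding
response = requests.get(base_url, headers=headers)
response.encoding = response.apparent_encoding

# parse the chapter list: div.book_list > ul > li > a
soup = BeautifulSoup(response.text, 'lxml')
book_list = soup.find('div', class_='book_list').find_all('a')

for book in book_list:
    book_name = book.text
    book_url = book['href']

    # second request: the chapter detail page
    book_info_html = requests.get(base_url + book_url, headers=headers)
    book_info_html.encoding = book_info_html.apparent_encoding
    info_soup = BeautifulSoup(book_info_html.text, 'lxml')

    # the chapter text lives in div#htmlContent
    info = info_soup.find('div', id='htmlContent')

    # append the chapter text to a per-chapter file
    with open('./novel/' + book_name + '.txt', 'a', encoding='utf-8') as f:
        f.write(info.text)
```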

Captured data

That’s it for this article on crawling novels with a Python crawler.

Scan the QR code below and add the teacher on WeChat, or search for the teacher’s WeChat ID: XTUOL1988.

You are invited to free tutorials on Python web development, Python crawlers, Python data analysis, and artificial intelligence, taught from zero basics all the way to hands-on enterprise projects!