Knowledge is like scraps of cloth; remember to “sew” them together, and you can make a beautiful showing.


1. Beautiful Soup

1. Introduction to Beautiful Soup

Beautiful Soup is a third-party Python library that helps us pull data out of web pages. It has the following characteristics:

  • 1. Beautiful Soup can extract data from HTML or XML documents; it supports simple processing, traversing and searching the document tree, modifying page elements, and more, so a crawler program can be completed with very little code.
  • 2. Beautiful Soup rarely requires you to worry about encoding. In general, it converts input documents to Unicode and outputs documents in UTF-8 encoding.

2. Beautiful Soup installation

Windows command line:

pip install beautifulsoup4
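If you want to confirm the installation worked, a minimal sketch like the one below can be run; it uses Python's built-in html.parser so nothing beyond beautifulsoup4 itself is assumed:

from bs4 import BeautifulSoup

# parse a tiny HTML fragment and print the text inside the <p> tag
soup = BeautifulSoup('<p>Hello, Beautiful Soup</p>', 'html.parser')
print(soup.p.text)   # expected output: Hello, Beautiful Soup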

3. Beautiful Soup basics

You can refer to the documentation to learn more (it is the Chinese version):

http://beautifulsoup.readthedocs.io/zh_CN/latest/#id8

For this crawler task, it can be completed as long as you understand the following basics: 1. The four object types:

  • Tag
  • NavigableString
  • BeautifulSoup
  • Comment

2. Traversing the document tree: find, find_all, find_next, and children. 3. A little HTML and CSS knowledge (you can do without it and just pick it up as you go); see the short sketch below.
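To make these objects and methods concrete, here is a minimal sketch run on a made-up HTML fragment (not taken from the target site):

from bs4 import BeautifulSoup

html_doc = '<html><body><p class="title">The Journey of Flower</p><b><!--a comment--></b></body></html>'
soup = BeautifulSoup(html_doc, 'html.parser')   # BeautifulSoup object for the whole document

tag = soup.p                         # Tag object
print(tag['class'])                  # ['title']
print(tag.string)                    # NavigableString: 'The Journey of Flower'
print(type(soup.b.string))           # <class 'bs4.element.Comment'>

# traversing the document tree: find, find_all, children
print(soup.find('p'))                # first matching tag
print(soup.find_all('p'))            # list of all matching tags
for child in soup.body.children:     # direct child nodes of <body>
    print(child)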

2. Crawl The Journey of Flower

1. Analysis of crawler ideas

The website we crawl this time is 136 Bookstore.

First, open the table of contents page of The Journey of Flower.

Our goal is to find the URL of each chapter in the table of contents, crawl the body text of each chapter, and save it to a local file.

2. Analysis of web page structure

First of all, in the top left corner of the table of contents page there are a few words that will boost your sense of accomplishment once the crawler succeeds:

Scrolling down, we find the latest-chapters section first and then the full table of contents. The object of our analysis is the full table of contents of the book. Clicking on any entry shows the body text of that chapter.

Press F12 to open the developer tools (Inspect Element). You can see that the front-end content of the page is all contained here.

Our goal is to find the link address of every chapter and crawl the body text at each address.

Patient readers can dig through the markup and find the corresponding chapter entries by hand. An easier way is to click the arrow-shaped picker button in the upper left corner of the inspector, then click an element on the page to jump straight to its position in the markup.

In this way, we can see that the link to each chapter is stored in a regular pattern inside <li> tags, and these <li> tags sit inside <div id="book_detail" class="box1">.
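As a rough illustration, the fragment below is a simplified stand-in for the structure just described (the chapter titles and href values are made up); it shows how the <li> links inside that <div> can be pulled out:

from bs4 import BeautifulSoup

# simplified stand-in for the table-of-contents markup described above
html_doc = '''
<div id="book_detail" class="box1">
  <ol>
    <li><a href="/huaqiangu/chapter-1/">Chapter 1</a></li>
    <li><a href="/huaqiangu/chapter-2/">Chapter 2</a></li>
  </ol>
</div>
'''

soup = BeautifulSoup(html_doc, 'html.parser')
box = soup.find('div', id = 'book_detail', class_ = 'box1')
# each <li> holds one chapter link; the href attribute is the chapter URL
for li in box.ol.find_all('li'):
    print(li.text, li.a.get('href'))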

I keep saying “our goal” to remind you that thinking first matters. A crawler is not something you write by charging in blindly, so diving in headfirst without analysis is not advisable.

3. Single-chapter crawler

We’ve just analyzed the structure of the web page. We can directly open the corresponding chapter link address in the browser and extract the text content.

Everything we want to crawl is contained within this <div> (the one with id="content").

The code is organized as follows:

from urllib import request
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # url of chapter 8
    url = 'http://www.136book.com/huaqiangu/ebxeew/'
    head = {}
    # use a browser User-Agent
    head['User-Agent'] = 'Mozilla/5.0 (Linux; Android 4.4.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'
    # create the request object
    req = request.Request(url, headers = head)
    response = request.urlopen(req)
    html = response.read()
    # create a BeautifulSoup object
    soup = BeautifulSoup(html, 'lxml')
    # find the div that holds the chapter body
    soup_text = soup.find('div', id = 'content')
    # print the body text
    print(soup_text.text)

The running results are as follows:

In this way, a single chapter content crawl is done.
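One note on the parser: the example above builds the soup with the 'lxml' parser, which depends on the third-party lxml library (it can usually be installed with pip install lxml). If lxml is not available, Python's built-in parser should work as a drop-in substitute, e.g. soup = BeautifulSoup(html, 'html.parser').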

4. The complete novel crawler

For a single chapter we can open the chapter address directly and parse the text, but for the whole book we cannot run the crawler on each chapter page by hand, one at a time; that would be no faster than copy and paste.

The idea is to crawl the link address of every chapter from the table of contents page, and then crawl the body text of the page behind each link. This adds one more parsing step compared with the single-chapter crawler, and that step uses Beautiful Soup to traverse the document tree.

1. Parse the directory page

During the analysis we already saw the structure of the table of contents page: all the content sits inside <div id="book_detail" class="box1">.

There are two identical <div id="book_detail" class="box1"> elements on the page.

The first <div> contains the most recently updated chapters, and the second <div> contains the full table of contents.

Notice that we want to crawl the contents of the second <div>.
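As a side note, here is a small, self-contained sketch (on made-up markup) of two equivalent ways to reach that second <div>: the find(...).find_next('div') chain used in the code below, and indexing into find_all:

from bs4 import BeautifulSoup

# made-up markup with two divs sharing the same id and class, like the real page
html_doc = '''
<div id="book_detail" class="box1"><ol><li>latest chapters...</li></ol></div>
<div id="book_detail" class="box1"><ol><li>full table of contents...</li></ol></div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# find() returns only the first match, so find_next('div') steps on to the second one
second_div = soup.find('div', id = 'book_detail', class_ = 'box1').find_next('div')
# find_all() returns both divs in document order, so index 1 is also the second one
also_second = soup.find_all('div', id = 'book_detail', class_ = 'box1')[1]
print(second_div is also_second)   # True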

The code is organized as follows:

from urllib import request
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # table of contents page
    url = 'http://www.136book.com/huaqiangu/'
    head = {}
    head['User-Agent'] = 'Mozilla/5.0 (Linux; Android 4.4.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'
    req = request.Request(url, headers = head)
    response = request.urlopen(req)
    html = response.read()
    # create a BeautifulSoup object
    soup = BeautifulSoup(html, 'lxml')
    # locate the second div with id 'book_detail' and class 'box1'
    soup_texts = soup.find('div', id = 'book_detail', class_ = 'box1').find_next('div')
    # traverse the children of <ol> and print each chapter title with its link address
    for link in soup_texts.ol.children:
        if link != '\n':
            print(link.text + ': ', link.a.get('href'))

The execution result is shown as follows:

2. Crawl the entire collection

Loop over the parsed links, request each one as the url, crawl out the body text, and write it to the local file F:/huaqiangu.txt. The code is organized as follows:

from urllib import request
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # table of contents page
    url = 'http://www.136book.com/huaqiangu/'
    head = {}
    head['User-Agent'] = 'Mozilla/5.0 (Linux; Android 4.4.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'
    req = request.Request(url, headers = head)
    response = request.urlopen(req)
    html = response.read()
    soup = BeautifulSoup(html, 'lxml')
    soup_texts = soup.find('div', id = 'book_detail', class_ = 'box1').find_next('div')
    # open the output file
    f = open('F:/huaqiangu.txt', 'w')
    # loop over every chapter link in the table of contents
    for link in soup_texts.ol.children:
        if link != '\n':
            download_url = link.a.get('href')
            download_req = request.Request(download_url, headers = head)
            download_response = request.urlopen(download_req)
            download_html = download_response.read()
            download_soup = BeautifulSoup(download_html, 'lxml')
            # the chapter body sits in the div with id 'content'
            download_soup_texts = download_soup.find('div', id = 'content')
            download_soup_texts = download_soup_texts.text
            f.write(download_soup_texts)
            f.write('\n\n')
    f.close()

The run finishes with [Finished in 32.3s].

Open the F: drive and check the Journey of Flower text file.

The crawler succeeded. Get your tissues ready, and go feel the heart-wrenching love between Zun Shang and Little Bone.

5. To summarize

There are many possible improvements to the code. For example, the JS code containing advertisements in the body text could be removed, a crawler progress display could be added, and so on. Implementing these features requires regular expressions and the os module; they are not elaborated here, so you can continue improving the code yourself.
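As a rough sketch of the first two ideas (the regular expression pattern and the helper name below are illustrative assumptions, not part of the original code):

import os
import re
from bs4 import BeautifulSoup

def clean_chapter(download_html):
    # build the soup, then strip <script> tags so the advertisement JS never reaches the text
    soup = BeautifulSoup(download_html, 'lxml')
    for script in soup.find_all('script'):
        script.decompose()
    text = soup.find('div', id = 'content').text
    # drop any leftover line that mentions the site name (pattern is illustrative)
    return re.sub(r'.*136book.*', '', text)

# a progress display can be as simple as printing each chapter title inside the download loop:
#     print('downloading: ' + link.text)

# the os module can check whether the output file already exists before writing
out_path = 'F:/huaqiangu.txt'
if os.path.exists(out_path):
    print(out_path + ' already exists and will be overwritten')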


Typing words is hard work, writing code is taxing, and weaving text and code together is harder still. If you think this article is worth anything, please don’t be stingy with your appreciation.


