We’ve seen how regular expressions can be used to extract information, but a regex that is written even slightly wrong may not return the result you want. A web page, however, has a definite structure and hierarchy, and many nodes carry an id or class attribute to distinguish them — so why not use that structure and those attributes to do the extraction?
In this section, we’ll take a look at Beautiful Soup, a powerful parsing tool that uses the structure and attributes of a web page to extract data. With it, we no longer need to write complex regular expressions; a few simple statements are enough to pull an element out of a page.
Without further ado, let’s take a look at the power of Beautiful Soup.
1. Introduction
In a nutshell, Beautiful Soup is an HTML/XML parsing library for Python that makes it easy to extract data from web pages. Here’s the official description:
Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you need to scrape, and because it is so simple, very little code is needed to write a complete application.
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You don’t need to worry about encoding unless the document doesn’t specify one, in which case you only need to state the original encoding.
Beautiful Soup can work with parsers such as lxml and html5lib, giving you the flexibility to choose different parsing strategies or trade flexibility for speed.
In short, it saves a lot of tedious extraction work and improves parsing efficiency.
2. Preparation
Before you start, make sure Beautiful Soup and lxml are properly installed; refer to Chapter 1 if they are not.
3. The parser
Beautiful Soup actually relies on a parser to do the parsing. In addition to the HTML parser in the Python standard library, it supports several third-party parsers (such as lxml). Table 4-3 lists the parsers supported by Beautiful Soup.
Table 4-3 Parsers supported by Beautiful Soup
As the comparison shows, the lxml parser can parse both HTML and XML, and it is fast and highly fault-tolerant, so it is recommended.
To use lxml, simply pass 'lxml' as the second argument when initializing Beautiful Soup:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.p.string)
Later, usage instances of Beautiful Soup are also demonstrated using the same parser.
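As a quick aside, different parsers can repair broken markup differently. The following minimal sketch (the fragment and variable names are our own, not from the text) parses the same unclosed list with the standard library parser and with lxml:

```python
from bs4 import BeautifulSoup

# A deliberately sloppy fragment: the <li> tags are never closed.
fragment = '<ul><li>Foo<li>Bar</ul>'

soup_std = BeautifulSoup(fragment, 'html.parser')
soup_lxml = BeautifulSoup(fragment, 'lxml')

# Both parsers still expose two <li> nodes, though the repaired
# trees they build may differ in shape.
print(len(soup_std.find_all('li')))
print(len(soup_lxml.find_all('li')))
```

The exact repaired tree differs between parsers (lxml also wraps the fragment in html and body nodes, while html.parser leaves it bare), which is why the parser choice matters for inconsistent markup.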
4. Basic usage
Let’s look at the basic usage of Beautiful Soup with an example:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">... """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)
The running results are as follows:
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title" name="dromouse">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
The Dormouse's story
Here we declare a variable html holding an HTML string. Note that it is not a complete HTML document: neither the body nor the html node is closed. We then pass this string as the first argument to the BeautifulSoup constructor, whose second argument is the parser type (here lxml). This initializes the BeautifulSoup object, which we assign to the variable soup.
Next, you can parse the string of HTML code by calling the soup’s various methods and properties.
First we call the prettify() method, which outputs the parsed string in a standard indented format. Note that the output includes the body and html nodes, which means Beautiful Soup automatically corrects the non-standard HTML string. This correction is not done by prettify(); it happens when the BeautifulSoup object is initialized.
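To see that the repair really happens at initialization time, here is a minimal sketch (the fragment is our own) that prints the tree without ever calling prettify():

```python
from bs4 import BeautifulSoup

# Neither <body> nor <html> is closed in the input.
soup = BeautifulSoup('<html><body><p>Hi</p>', 'lxml')

# The closing tags are already present in the tree even though
# prettify() was never called: the fix happens on initialization.
print(str(soup))
print(soup.p.string)  # Hi
```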
We then call soup.title.string, which prints the text content of the HTML title node. In other words, soup.title selects the title node, and calling its string attribute retrieves the text inside. So text extraction can be done simply by chaining a few attribute accesses.
5. Node selector
Calling the node name directly selects that node element, and calling the string attribute then gets the text inside the node; this style of selection is very fast. If the structure of the document is clear and regular, this approach is a handy way to parse it.
Select elements
Here’s another example to illustrate how to select an element:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">... """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)
The running results are as follows:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
Here we again use the same HTML code. First we print the result of selecting the title node: the output is the title node together with its text content. Next we print its type, bs4.element.Tag, an important data structure in Beautiful Soup: every selection produces a Tag object. A Tag has attributes such as string, which we can call to get the node's text content, so the next line of output is that text.
Next, we try to select the head node; the result is the node plus everything inside it. Finally, we select the p node. Notice that the result is only the content of the first p node; the later p nodes are not selected. In other words, when multiple nodes match, this style of selection returns only the first match and ignores the rest.
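The first-match behavior is easy to verify with a tiny document of our own (a sketch, not the book's example):

```python
from bs4 import BeautifulSoup

html = '<p>first</p><p>second</p><p>third</p>'
soup = BeautifulSoup(html, 'lxml')

# Attribute-style access returns only the first matching node...
print(soup.p.string)            # first
# ...whereas find_all() (covered later) returns every match.
print(len(soup.find_all('p')))  # 3
```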
Extracting information
How do you get the value of a node's attributes? How do you get the node's name? Let's go through the standard ways of extracting this information.
(1) Get the name
You can use the name attribute to get the name of a node. For example, select the title node and call its name attribute:
print(soup.title.name)
The running results are as follows:
title
(2) Obtain attributes
Each node may have multiple attributes, such as id and class. After selecting this node element, you can call attrs to retrieve all attributes:
print(soup.p.attrs)
print(soup.p.attrs['name'])
The running results are as follows:
{'class': ['title'], 'name': 'dromouse'}
dromouse
As you can see, attrs returns a dictionary combining all the attributes and values of the selected node. To get a particular attribute, you simply index the dictionary with the attribute name; for example, to get the name attribute, use attrs['name'].
This is a bit cumbersome, so there is an easier way: skip attrs and index the node element directly with the attribute name in square brackets. Here's an example:
print(soup.p['name'])
print(soup.p['class'])
The running results are as follows:
dromouse
['title']
The important thing to note here is that some lookups return a string and some return a list of strings. An attribute like name has a unique value, so a single string comes back; a node element may carry several classes, so class comes back as a list. In practice, check the type of the returned value.
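A small sketch of our own illustrates the type difference and one way to normalize it:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title story" name="intro">Hi</p>', 'lxml')

# class is multi-valued, so it comes back as a list of strings;
# name is single-valued, so it comes back as a plain string.
print(soup.p['class'])   # ['title', 'story']
print(soup.p['name'])    # intro

# A type check keeps downstream code safe.
value = soup.p['class']
if isinstance(value, list):
    value = ' '.join(value)
print(value)             # title story
```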
(3) Access to content
We can use the string attribute to get the text content of a node element, for example the text of the first p node:
print(soup.p.string)
The running results are as follows:
The Dormouse's story
Notice again that the p node selected here is the first p node, so the text obtained is the text inside that first p node.
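One caveat worth knowing (a minimal example of our own): string only works when a node has a single child. With mixed content it returns None, while get_text() still collects everything:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'lxml')

# <p> has two children (a text node plus <b>), so .string gives up.
print(soup.p.string)      # None
print(soup.b.string)      # world
print(soup.p.get_text())  # Hello world
```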
Nested selection
From the example above we know that every selection result is of type bs4.element.Tag, and a Tag object can itself be the starting point for the next selection. For example, having selected the head node element, we can continue to call title on it to select the title node inside:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)
The running results are as follows:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
The first line of output is the title node element, obtained by calling title on the result of head. We then print its type: still bs4.element.Tag. In other words, every selection from a Tag yields another Tag, so selections can be nested step by step.
Finally, it prints its string property, which is the text content of the node.
Associated selection
When making a selection, sometimes the desired node element cannot be reached in one step; you need to select one node element first and then, starting from it, select its child nodes, parent node, sibling nodes, and so on. This section introduces how to select these related node elements.
(1) Child node and descendant node
After the node element is selected, if you want to get its immediate children, you can call the contents property as shown in the following example:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)
The running results are as follows:
['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']
As you can see, the result is a list. The p node contains both text and nodes, which are eventually returned as a list.
Note that each element in the list is a direct child of the p node. For example, the first a node contains a span node, which counts as a grandchild, yet the span is not listed separately in the result. So the contents property returns a list of direct children only.
Similarly, we can call the children property to get the result:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
print(i, child)
The running results are as follows:
<list_iterator object at 0x1064f7dd8>
0
Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4
and
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6
and they lived at the bottom of a well.
It’s the same HTML text; this time the children property is called to make the selection, and the return value is an iterator. Next we use a for loop with enumerate to print each child in turn.
To get all the descendants, call the descendants property:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
print(i, child)
The running results are as follows:
<generator object descendants at 0x10650e678>
0
Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
3 <span>Elsie</span>
4 Elsie
5
6
7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9
and
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12
and they lived at the bottom of a well.
The result is still a generator. Walking through the output, you can see that it now includes the span node: descendants recursively queries all children to obtain every descendant node.
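The difference between children and descendants can be sketched with a tiny tree of our own:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p><span>Hi</span></p></div>', 'lxml')
div = soup.div

# children yields only the direct child <p>; descendants also
# walks into <span> and its text node.
print(len(list(div.children)))     # 1
print(len(list(div.descendants)))  # 3
```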
(2) Parent node and ancestor node
To get the parent of a node element, call the parent property:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)
The running results are as follows:
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
Here we select the parent element of the first a node. Clearly its parent is the p node, and the output is that p node together with its contents.
Note that what is returned is only the direct parent of the a node; it does not go on to find the ancestors of that parent. To get all ancestor nodes, call the parents property:
html = """
<html>
    <body>
        <p class="story">
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))
The running results are as follows:
<class 'generator'>
[(0, <p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>), (1, <body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body>), (2, <html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body></html>), (3, <html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body></html>)]
As you can see, the return value is a generator. Here we enumerate it into a list and print the index and content of each item; the elements of the list are the ancestor nodes of the a node.
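A common use of parents is to collect the names of every enclosing node. Here is a minimal sketch with a hypothetical fragment of our own:

```python
from bs4 import BeautifulSoup

html = '<div class="outer"><div class="inner"><a href="#">link</a></div></div>'
soup = BeautifulSoup(html, 'lxml')

# Walk up from the <a> node and record each ancestor's name;
# the root BeautifulSoup object reports its name as '[document]'.
names = [node.name for node in soup.a.parents]
print(names)
```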
(3) Sibling nodes
What if you want to obtain the sibling nodes, that is, the nodes at the same level? Here is an example:
html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            Hello
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling', soup.a.next_sibling)
print('Prev Sibling', soup.a.previous_sibling)
print('Next Siblings', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))
The running results are as follows:
Next Sibling
Hello
Prev Sibling
Once upon a time there were three little sisters; and their names were
Next Siblings [(0, '\n Hello\n '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n and\n '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n and they lived at the bottom of a well.\n ')]
Prev Siblings [(0, '\n Once upon a time there were three little sisters; and their names were\n ')]
Four properties are called here: next_sibling and previous_sibling get the next and previous sibling of a node, while next_siblings and previous_siblings return generators over all following and all preceding siblings, respectively.
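Be aware that a sibling is not necessarily a tag: whitespace and text between tags count too. A minimal sketch of our own:

```python
from bs4 import BeautifulSoup

html = '<p><a id="one">One</a> and <a id="two">Two</a></p>'
soup = BeautifulSoup(html, 'lxml')

first = soup.find(id='one')
# next_sibling is the raw text node between the two links,
# not the second <a> itself.
print(repr(first.next_sibling))               # ' and '
print(first.next_sibling.next_sibling['id'])  # two
```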
(4) Extracting information
If you want to get some of their information, such as text, attributes, etc., use the same method, as shown in the following example:
html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling:')
print(type(soup.a.next_sibling))
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print('Parent:')
print(type(soup.a.parents))
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])
The running results are as follows:
Next Sibling:
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
Parent:
<class 'generator'>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Bob</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
['story']
If the result is a single node, you can call string, attrs, and other attributes directly to get the text and attributes. If the result is a generator of multiple nodes, you can turn it into a list, fetch an element, and then call string, attrs, and other attributes to get the text and attributes of the corresponding node.
6. Method selector
The previous selection methods are all based on attributes, which is very fast, but can be cumbersome and inflexible for more complex selections. Fortunately, Beautiful Soup also provides some query methods, such as find_all() and find(), that you can call and pass in the appropriate parameters to query flexibly.
find_all()
find_all(), as the name implies, queries for all elements matching the given criteria. You can pass it attributes or text and get back every element that fits, which is quite powerful.
Its API is as follows:
find_all(name , attrs , recursive , text , **kwargs)
(1)name
We can query elements by node names as shown in the following example:
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))
The running results are as follows:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>
Here we call find_all(), passing the name argument with the value ul, meaning we want to query all ul nodes. The return value is a list of length 2, and each element is of type bs4.element.Tag.
Because the results are Tag objects, queries can be nested: after selecting the ul nodes, we continue to query the li nodes inside each one:
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
The running results are as follows:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
The result is a list, and each element in the list is still a Tag.
Next, we can iterate over each li and get its text:
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)
The running results are as follows:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar
(2)attrs
In addition to querying by node name, we can also pass in some attributes, as shown in the following example:
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))
The running results are as follows:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
The attrs argument is passed as a dictionary. For example, to query nodes whose id is list-1, pass attrs={'id': 'list-1'}; the result is a list of all nodes whose id is list-1. In the example above only one element qualifies, so the list has length 1.
For common attributes such as id and class, we can skip attrs altogether. For example, to query the node whose id is list-1, we can simply pass the id parameter. Using the same text, let's run the query another way:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))
The running results are as follows:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
Here, passing id='list-1' queries the node whose id is list-1. For class, since class is a Python keyword, a trailing underscore is added: class_='element'. The result in both cases is still a list of Tag objects.
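The name, attribute, and keyword filters can also be combined in a single find_all() call. A minimal sketch with a fragment of our own:

```python
from bs4 import BeautifulSoup

html = '<ul id="menu"><li class="item">A</li><li class="item active">B</li></ul>'
soup = BeautifulSoup(html, 'lxml')

# Match only <li> nodes whose class list contains 'active'.
print(soup.find_all('li', class_='active'))
```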
(3)text
The text argument can be used to match the text of a node, either as a string or as a regular expression object, as shown in the following example:
import re
html = '''
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))
The running results are as follows:
['Hello, this is a link', 'Hello, this is a link, too']
There are two a nodes here, each containing text. The text argument passed to find_all() is a regular expression object, and the method returns a list of the text of every node whose text matches the expression.
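A plain string matches only text that is exactly equal, while a compiled pattern matches any text containing it. A small sketch of our own:

```python
import re
from bs4 import BeautifulSoup

html = '<a>Hello, this is a link</a><a>Goodbye</a>'
soup = BeautifulSoup(html, 'lxml')

# Exact match: only the node whose entire text is 'Goodbye'.
print(soup.find_all(text='Goodbye'))
# Substring match via regex: any text containing 'link'.
print(soup.find_all(text=re.compile('link')))
```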
find()
In addition to the find_all() method, there is also the find() method, except that the latter returns a single element, the first matched element, while the former returns a list of all matched elements. The following is an example:
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))
print(type(soup.find(name='ul')))
print(soup.find(class_='list'))
The running results are as follows:
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
The result is no longer a list, but the first matching node element, again of type Tag.
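One practical difference worth remembering: when nothing matches, find() returns None while find_all() returns an empty list, so guard accordingly (our own sketch):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>Foo</li></ul>', 'lxml')

# No <table> exists anywhere in this document.
print(soup.find('table'))      # None
print(soup.find_all('table'))  # []

# Guard before dereferencing a possibly-missing node.
table = soup.find('table')
if table is not None:
    print(table.get_text())
```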
In addition, there are a number of query methods that use exactly the same usage as the find_all() and find() methods described earlier, but with different query scopes, which are briefly explained here.
- find_parents() and find_parent(): the former returns all ancestor nodes, the latter returns the direct parent node.
- find_next_siblings() and find_next_sibling(): the former returns all following siblings, the latter returns the first following sibling.
- find_previous_siblings() and find_previous_sibling(): the former returns all preceding siblings, the latter returns the first preceding sibling.
- find_all_next() and find_next(): the former returns all qualifying nodes after the node, the latter returns the first such node.
- find_all_previous() and find_previous(): the former returns all qualifying nodes before the node, the latter returns the first such node.
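Two of these methods in a minimal sketch (the fragment is our own); note that find_next_sibling() skips bare text nodes and returns the next matching tag:

```python
from bs4 import BeautifulSoup

html = '<p>intro</p><a id="x">X</a>text<a id="y">Y</a>'
soup = BeautifulSoup(html, 'lxml')

first_a = soup.find(id='x')
# The next *tag* sibling that is an <a>, skipping the bare text node.
print(first_a.find_next_sibling('a')['id'])  # y
# Searches backwards through the whole document for a <p>.
print(first_a.find_previous('p').string)     # intro
```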
7. CSS selector
Beautiful Soup also offers another kind of selector: CSS selectors. If you're familiar with web development, you already know CSS selectors; if not, you can refer to www.w3school.com.cn/cssref/css_… for an introduction.
To use CSS selectors, simply call the select() method and pass in the corresponding CSS selector, as shown in the following example:
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))
The running results are as follows:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
Here we use CSS selectors three times, and each result is a list of the nodes matching the selector. For example, select('ul li') selects all li nodes under every ul node, so the result is a list of all li nodes.
The last statement prints the type of an element in the list: as you can see, it is still Tag.
Nested selection
The select() method also supports nested selection. For example, select all ul nodes first, then iterate over each ul node and select its li nodes:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
The running results are as follows:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
You can see the list of li nodes printed for each ul node.
Retrieve attributes
We know that these nodes are of type Tag, so attributes can be retrieved in the same way as before. Using the same HTML text, let's get the id attribute of each ul node:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
The running results are as follows:
list-1
list-1
list-2
list-2
As you can see, both indexing the node with brackets and going through the attrs attribute retrieve the attribute value successfully.
Get the text
To get text, of course, you can also use the string property described earlier. There is also a method called get_text(), as shown in the following example:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print('Get Text:', li.get_text())
    print('String:', li.string)
The running results are as follows:
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
Get Text: Jay
String: Jay
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
As you can see, the effect is exactly the same.
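Before wrapping up, note that get_text() also accepts a separator and a strip flag, which often give cleaner output than the default. A small sketch of our own:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>world</b>!</p>', 'lxml')

# Default: text nodes are concatenated as-is.
print(soup.p.get_text())                           # Hello world!
# Custom separator plus stripping of each text fragment.
print(soup.p.get_text(separator='|', strip=True))  # Hello|world|!
```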
That covers the basic usage of Beautiful Soup. To briefly summarize:
- The lxml parsing library is recommended; use html.parser if necessary.
- Selecting by node name is fast, but its filtering ability is weak.
- Use the find() or find_all() query methods to match a single result or multiple results.
- If you are familiar with CSS selectors, you can use the select() method.
This article originally appeared on Cui Qingcai's personal blog: Python3 Web Crawler Development in Action.