Use Beautiful Soup

We’ve covered regular expressions in previous articles, but regular expressions are brittle: one mistake in the pattern and the result is not what you want. Anyone familiar with front-end development knows that a web page has a particular structure and hierarchy, and many nodes are distinguished by id and class attributes. So we can use the structure and attributes of the page to extract data instead.

Beautiful Soup is a Python library for extracting data from HTML and XML documents. Together with your favorite parser, it helps you quickly navigate and search an entire document tree.

Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8, so you usually don’t have to worry about encodings.

Only when a document does not specify its encoding (or specifies it incorrectly) do you need to state the original encoding yourself.
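For instance, here is a minimal sketch (the GBK-encoded fragment is made up for illustration) of stating the original encoding with the from_encoding argument:

from bs4 import BeautifulSoup

# GBK-encoded bytes with no charset declaration; we state the encoding explicitly
markup_bytes = '<p>你好</p>'.encode('gbk')
soup = BeautifulSoup(markup_bytes, 'lxml', from_encoding='gbk')
print(soup.original_encoding)  # gbk
print(soup.p.string)           # 你好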

Preparation

Before you start, make sure you have Beautiful Soup and lxml installed. If not, install them as follows.

pip install beautifulsoup4
pip install lxml

The parser

Beautiful Soup relies on a parser to do its work. In addition to the HTML parser in the Python standard library, it supports several third-party parsers (such as lxml).

Here’s a quick look at the parsers supported by Beautiful Soup.

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, 'html.parser') | Built into Python, moderate speed | Poor fault tolerance in Python versions before 2.7.3 and 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, 'lxml') | Very fast, strong fault tolerance | Requires the lxml C library |
| lxml XML parser | BeautifulSoup(markup, 'xml') | Very fast, the only parser that supports XML | Requires the lxml C library |
| html5lib | BeautifulSoup(markup, 'html5lib') | Best fault tolerance, parses documents the way a browser does, produces valid HTML5 | Very slow |

As the table shows, the lxml parser can parse both HTML and XML documents, and it is fast and fault tolerant, so it is the recommended choice.

To use lxml, pass 'lxml' as the second argument when initializing BeautifulSoup.

The specific code is as follows:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello world</p>', 'lxml')
print(soup.p)

Basic usage

Let’s take a look at the basic usage of Beautiful Soup with an example:

html_doc = """ The Dormouse's story  

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, 'lxml') print(soup.prettify()) print(soup.title.string) Copy the code

Let’s briefly walk through the code above.

If you look closely, you’ll notice that html_doc is a string of HTML in which the html and body tags are not closed.

Next, html_doc is passed to BeautifulSoup with 'lxml' specified as the parser. This creates a BeautifulSoup object, which is assigned to soup.

You can then call the soup’s various methods and properties to parse the HTML code.

First, the prettify() method is called. It outputs the parsed document as a neatly indented string. Note that the output contains the body and html nodes, which means BeautifulSoup automatically corrects non-standard HTML.

The correction is not performed by prettify() itself; it happens when the BeautifulSoup object is created.
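Here is a small sketch of that auto-correction, using a made-up unclosed fragment:

from bs4 import BeautifulSoup

# The fragment is missing its closing tags; the tree is completed as soon as
# the soup is built, before prettify() is ever called
soup = BeautifulSoup('<html><body><p>Hello', 'lxml')
print(soup)  # <html><body><p>Hello</p></body></html>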

Then soup.title.string is called, which prints the text content of the title node.

Node selector

Select elements

Here’s another example to illustrate how to select an element:

html_doc = """ The Dormouse's story  

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, 'lxml') print(soup.title) print(type(soup.title)) print(soup.title.string) print(soup.head) print(soup.p) print(type(soup.p)) Copy the code

Running results:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<class 'bs4.element.Tag'>

A quick note on the code above, which uses the same HTML as before. It first prints the result of selecting the title node, and you will find that the result is of type Tag. A Tag has many useful methods and attributes; for example, its string attribute outputs the text content of the title node.

The rest of the code selects other nodes and prints each node together with everything inside it.

Finally, note that when several nodes share a name, this selection style only matches the first one; soup.p, for example, returns only the first p node.
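Here is a quick sketch of that first-match behavior, with a made-up two-paragraph fragment:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>first</p><p>second</p>', 'lxml')
print(soup.p)              # <p>first</p> -- only the first match
print(soup.find_all('p'))  # [<p>first</p>, <p>second</p>] -- all matches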

Extracting node information

We know from the code above that we can use the string attribute to get text content. But sometimes we also need the value of a node’s attributes, or the node’s name.

(1) Get the name

You can use the name attribute to get the name of the node.

The specific code is as follows:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
print(tag.name)

Run the code above and you will see that it prints the name of the b node.

(2) Get attributes

Each node may have multiple attributes, such as id and class. After selecting a node element, you can call attrs to get all of its attributes.

Specific code examples are as follows:

html_doc = """ The Dormouse's story  

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, 'lxml') print(soup.p.attrs) print(soup.p.attrs['name']) Copy the code

Running results:

{'class': ['title'], 'name': 'Dormouse'}
Dormouse

You can see from the results that attrs returns the attributes as a dictionary.

Notice that the class attribute is stored in a list. Why is that?

The reason: the class attribute can take multiple values, so Beautiful Soup keeps it as a list.
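A minimal sketch, with a made-up tag, shows why: a single tag can legally carry several classes at once.

from bs4 import BeautifulSoup

# class carries two values, so Beautiful Soup returns them as a list
soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
print(soup.p['class'])  # ['body', 'strikeout']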

(3) Get the content

You can use the string attribute to get the text contained in a node element; for example, the following gets the text of the first p node.

print(soup.p.string)

Get child nodes

Getting child nodes can also be thought of as nested selection. A node may contain other nodes, and BeautifulSoup provides many attributes for accessing and traversing its children.

For example, we can get the head element of the HTML, and then go on to get the title element inside it.

html_doc = """ The Dormouse's story  

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, 'lxml') print(soup.head.title) print(soup.head.title.string) Copy the code

Associated selection

When making a selection, sometimes you cannot reach the node you want in one step. You first select a node, and then select its children, parents, or siblings relative to that node.

(1) Select child nodes and descendant nodes

After selecting a node element, you can call the contents property to retrieve its immediate children.

Specific code examples are as follows:

html_doc = """ The Dormouse's story  

The Dormouse's story

...

"""
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, 'lxml') print(soup.p.contents) Copy the code
html_doc = """ The Dormouse's story  

The Dormouse's story

...

"""
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, 'lxml') print(soup.p.contents) Copy the code

If you look closely at the two snippets above, the difference is easy to spot.

In the first one the p node sits on a single line, while in the second it contains a line break before its closing tag. Run both and you will see that the direct children are returned in a list, and that the second version’s list also contains a newline character.

The same functionality can also be obtained by calling the children attribute.

html_doc = """ The Dormouse's story  

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, 'lxml') print(soup.p.children) print(list(soup.p.children)) for i in soup.p.children: print(i) Copy the code

The code above gets the children by calling the children property, which returns a generator; you can convert it to a list or print each child with a for loop.

If you want to get descendant nodes, you can call the descendants property to get output.

Specific code examples are as follows:

html_doc = """ The Dormouse's story  

Once upon a time there were three little sisters; and their names were ElsieElsie, Lacie and Tillie; and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, 'lxml') print(soup.p.descendants) for child in soup.p.descendants: print(child) Copy the code

The result is again a generator. Iterate over it and you will see that it yields not only the direct children but also the nodes nested inside them, such as the span inside the first a node, along with their text.

(2) Parent and ancestor nodes

You can call the parent property if you want to get the parent of a node.

Specific code examples are as follows:

html_doc = """ The Dormouse's story  

Once upon a time there were three little sisters; and their names were Elsie

Lacie

...

"""
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, 'lxml') print(soup.a.parent) Copy the code

Run the code above and you will see that the result is the direct parent of the first a node. The parent property does not reach further up to ancestor nodes.

Call the parents property if you want to get all the ancestor nodes.

Specific code examples are as follows:

html_doc = """ The Dormouse's story  

Once upon a time there were three little sisters; and their names were Elsie

Lacie

...

"""
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, 'lxml') print(soup.a.parents) for i, parent in enumerate(soup.a.parents): print(i, parent) Copy the code

parents returns, once again, a generator, so you can iterate over all the ancestors in a loop.

Run the code above and you will see that the output includes both the body node and the html node.

(3) Sibling nodes

The two examples above showed how to get parent and child nodes. What if you need sibling nodes? You can use four properties: next_sibling, previous_sibling, next_siblings, and previous_siblings.

Specific code examples are as follows:

html_doc = """ The Dormouse's story  

Once upon a time there were three little sisters; and their names were Elsiehello Lacie and Tillie; and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, 'lxml') print(soup.a.next_sibling) print(list(soup.a.next_siblings)) print(soup.a.previous_sibling) print(list(soup.a.previous_siblings)) Copy the code

As the code shows, next_sibling and previous_sibling get the next and the previous sibling of a node, respectively.

next_siblings and previous_siblings get all the following and all the preceding siblings, and as usual they return generators.

Method selector

Everything so far selects nodes through properties, which is fast, but for more complex selections this style becomes tedious. Beautiful Soup therefore provides query methods such as find_all() and find(); call them with the appropriate arguments to search the document.

  • find_all()

Its API is as follows:

find_all(name, attrs, recursive, text, **kwargs)

(1) name

The name parameter selects nodes by their node name.

Specific code examples are as follows:

html_doc = """ The Dormouse's story  

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, 'lxml') print(soup.find_all('a')) print(len(soup.find_all('a'))) Copy the code

The code above calls the find_all() method, passing 'a' as the name argument.

Run it and you will see that it returns all the a nodes we want, as a list of length 3.

html_doc = """ The Dormouse's story  

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, 'lxml') print(soup.find_all('a')) for a in soup.find_all('a') :print(a.find_all('span')) print(a.string) Copy the code

The snippet above modifies the previous example so that each a node contains a span node.

Run it and you will see that you can search for the span nodes inside each a node, and also get each a node’s text content.

(2) attrs

In addition to querying by node name, you can also query by attribute.

Specific code examples are as follows:

html_doc = """ The Dormouse's story  

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, 'lxml') print(soup.find_all(attrs={'id': 'link1'})) print(soup.find_all(attrs={'name': 'Dormouse'})) Copy the code

In this case, the attrs parameter is passed in as a dictionary type.

For common attributes such as class, you can also pass them directly as keyword arguments. Using the same html_doc as above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all(class_='sister'))

The important thing to note here is that class is a Python reserved word, so you must append an underscore and write class_.

The id attribute can be queried the same way. Using the same html_doc, the specific code example is as follows:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all(id='link2'))
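The signature shown earlier also includes a text parameter, which the examples so far have not used. As a brief sketch (reusing the html_doc from the examples above), text matches against the text content of nodes and accepts a string or a compiled regular expression; note that it returns the matching strings themselves rather than the enclosing tags. Newer Beautiful Soup versions also accept the same argument under the name string:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
# Returns the matching text strings, not the tags that contain them
print(soup.find_all(text=re.compile('sisters')))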
  • find()

In addition to find_all(), there is also the find() method. The difference: find_all() returns all matching elements as a list, while find() returns only the first match.

Specific code examples are as follows:

html_doc = """ The Dormouse's story  

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

...

"""
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, 'lxml') print(soup.find(name='a')) print(type(soup.find(name='a'))) Copy the code

Run the code above and you will see that find() returns the first a node element, of type Tag.

find() is used in the same way as find_all().

There are other method selectors, which are briefly described here.

find_parents() and find_parent(): the former returns all ancestor nodes, the latter returns the direct parent node.

find_next_siblings() and find_next_sibling(): the former returns all the following siblings, the latter returns the first following sibling.

find_previous_siblings() and find_previous_sibling(): the former returns all the preceding siblings, the latter returns the first preceding sibling.
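Here is a small sketch of two of these methods, reusing the html_doc from the find_all() examples:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
first_a = soup.find('a')
print(first_a.find_parent('p'))        # the p node that directly contains the first a
print(first_a.find_next_sibling('a'))  # the next a node at the same level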

CSS selectors

Beautiful Soup also provides another kind of selector: CSS selectors. Anyone familiar with front-end development will know these well.

To use CSS selectors, call the select() method and pass in a CSS selector string, such as a class, an id, or a node name.

Specific code examples are as follows:

html_doc = """ 
      

Hello World

  • Foo
  • Bar
  • Jay
  • Foo
  • Bar
  • Jay
"""
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc, 'lxml') print(soup.select('.panel .panel-heading')) Get the node whose class is panel-heading print(soup.select('ul li')) Get the li node under ul print(soup.select('#list-2 li')) Select * from li where id = list-2 print(soup.select('ul')) Get all ul nodes print(type(soup.select('ul') [0])) Copy the code

Run the code above and compare each printed result with the selector that produced it.

The last print statement outputs the type of an element in the returned list, and you will notice that it is still Tag.

Nested selection

The select() method also supports nested selection: for example, first select all ul nodes, then iterate over them and select the li nodes inside each one.

As with the HTML text above, the code looks like this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

Run it and it prints, for each ul node, the list of li nodes it contains.

Retrieve attributes

As the examples above show, every selected node is of type Tag, so you can get attributes just as before. Using the same HTML text, let’s get the id attribute of each ul node.

Specific code examples are as follows:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

As the code shows, you can either index the node with the attribute name in square brackets or go through the attrs dictionary; both return the attribute value.

Get the text

To get text, you can also call the get_text() method in addition to the string property described earlier.

Using the same HTML text, the code example is as follows:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
for li in soup.select('li'):
    print('String:', li.string)
    print('get text:', li.get_text())

Summary

That covers just about everything about Beautiful Soup.

When writing crawlers, find_all() and find() methods are generally used to obtain specified nodes.

You can also use the select() method if you are familiar with CSS selectors.

Hands-on practice

Preface

If you’ve read this far, congratulations: you’ve done what many people can’t, because few readers make it through all of the material above.

For the hands-on part, we will crawl the danmaku (bullet comments) of a Bilibili video.

Why this task? Simple: it is a good fit for the Beautiful Soup skills we just learned.

Preparation

As the saying goes, a workman who wants to do his job well must first sharpen his tools; the same goes for writing crawlers.

First, install the two required libraries: requests and beautifulsoup4.

pip install requests
pip install beautifulsoup4

About Bilibili’s danmaku restrictions

In the past, Bilibili’s danmaku could be grabbed quickly through packet capture. Bilibili has since added restrictions, so that route no longer works directly. Don’t worry, though: the old API endpoint below still returns the danmaku.

What to crawl

On the last day of 2020, Guo Jingming and Yu Zheng publicly apologized to Zhuang Yu and Qiong Yao, respectively, for past plagiarism, and the apologies went viral on Weibo.

So today I quietly opened Bilibili to see how the UP hosts (uploaders) analyzed the apologies, and what the audience said about those analyses, by crawling the videos’ danmaku.

The video link is as follows:

https://www.bilibili.com/video/BV1XK411M7ka?from=search&seid=17596321343783034307

Requirements analysis

Getting the danmaku API endpoint

Through packet capture, we need to obtain the video’s oid value.

This is the old API endpoint I use to fetch danmaku; I’ll share it with you here:

https://api.bilibili.com/x/v1/dm/list.so?oid=276746872

You can get the danmaku for any video by changing the value of the oid parameter.

Open the link above in your browser and you will see the danmaku XML.
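How do you find the oid for another video? In my captures, the oid in the danmaku URL corresponds to the video’s cid, and one way to look it up is Bilibili’s pagelist endpoint (a hedged sketch; this endpoint and its response fields are outside Beautiful Soup’s scope and may change on Bilibili’s side):

import requests

# Look up the cid (used as the oid in the danmaku URL) for a video by its BV id
resp = requests.get('https://api.bilibili.com/x/player/pagelist',
                    params={'bvid': 'BV1XK411M7ka'})
print(resp.json()['data'][0]['cid'])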

Crawling the danmaku

Since this article is about Beautiful Soup, we will naturally use Beautiful Soup to parse the response and extract the text. There are many ways to fetch danmaku; below is one of them.

Function implementation

As before, we make a request to the endpoint above, then use Beautiful Soup to extract the text and save it to a .txt file.

Specific code

import requests
from bs4 import BeautifulSoup


class DanMu(object):
    def __init__(self):
        self.headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
        }
        self.url = 'https://api.bilibili.com/x/v1/dm/list.so?oid=276746872'

    # Fetch the danmaku XML
    def get_html(self):
        response = requests.get(self.url, headers=self.headers)
        html = response.content.decode('utf-8')
        return html

    # Parse the XML and save each danmaku line to a text file
    def get_info(self):
        html = self.get_html()
        soup = BeautifulSoup(html, 'lxml')
        with open('barrage.txt', 'w', encoding='utf-8') as file:
            for d in soup.find_all(name='d'):  # each <d> element holds one danmaku
                danmu = d.string
                if danmu:  # skip empty elements
                    file.write(danmu)
                    file.write('\n')


if __name__ == '__main__':
    danmu = DanMu()
    danmu.get_info()

The code above captured only about 3,000 danmaku, although the video has more than 6,000; this is one of Bilibili’s anti-crawling measures. I will analyze it when I write about countering anti-crawling.

Final words

That’s the end of this post. If you’ve read this far, I hope it has been helpful to you; that is why I share this knowledge.

Nothing is achieved overnight; so it is with life, and so it is with learning!

The way ahead is so long without ending, yet high and low I’ll search with my will unbending.

I am shujun, a person who concentrates on learning. The more you know, the more you don’t know, and I’ll see you next time!