BeautifulSoup:

Brief introduction:

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands the user the data to extract, and because it is simple, a complete application takes very little code. Beautiful Soup automatically converts the input document to Unicode and the output document to UTF-8. You don't need to think about encodings unless the document does not specify one and Beautiful Soup cannot detect it automatically; in that case, you only need to state the original encoding. Beautiful Soup works with parsers such as lxml and html5lib, in addition to Python's built-in html.parser, so users can choose between different parsing strategies or trade flexibility for speed.

Beautiful Soup is a Python library whose primary function is to extract data from web pages. It is a parsing library rather than a complete crawler framework, but it is a convenient building block when writing crawlers.

find_all:
  1. When extracting tags, the first parameter is the name of the tag. If you also want to filter by tag attributes, you can pass the attribute name and its value to this method as keyword arguments. Alternatively, use the attrs parameter, which takes a dictionary containing all the attributes and their corresponding values.
  2. Sometimes you don't want to extract every matching tag. In that case, use the limit parameter to cap how many tags are returned.
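The two points above can be sketched as follows. This is a minimal illustration, assuming the bs4 package is installed and using Python's built-in html.parser; the HTML snippet and variable names are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, invented for illustration.
html = """
<ul>
  <li class="item" id="first"><a href="/a">A</a></li>
  <li class="item"><a href="/b">B</a></li>
  <li class="other"><a href="/c">C</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# The first positional argument is the tag name.
all_items = soup.find_all("li")

# Keyword argument: attribute name and value.
# "class" is a reserved word in Python, so bs4 spells it class_.
by_keyword = soup.find_all("li", class_="item")

# The same filter expressed through the attrs dictionary.
by_attrs = soup.find_all("li", attrs={"class": "item"})

# limit stops the search after the given number of matches.
first_two = soup.find_all("a", limit=2)

print(len(all_items), len(by_keyword), len(by_attrs), len(first_two))  # 3 2 2 2
```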
find and find_all:
  1. find: Returns the first tag that meets the conditions. In other words, only one element is returned (or None if nothing matches).
  2. find_all: Returns all tags that meet the conditions. In other words, multiple tags are returned, in the form of a list.
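The difference in return values is easiest to see side by side. A small sketch, assuming bs4 is installed; the markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration.
html = "<div><p>one</p><p>two</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")      # a single Tag
all_p = soup.find_all("p")    # a list-like ResultSet of Tags

# When nothing matches, find returns None and find_all returns an empty list.
missing_one = soup.find("table")
missing_all = soup.find_all("table")

print(first_p.get_text())             # one
print([p.get_text() for p in all_p])  # ['one', 'two']
print(missing_one, len(missing_all))  # None 0
```

Because find can return None, it is usually worth checking the result before accessing attributes on it.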
Filter criteria for find and find_all:
  1. Keyword arguments: Use the attribute name as the name of the keyword argument and the attribute value as its value.
  2. attrs parameter: Put the attribute conditions into a dictionary and pass it as the attrs parameter.
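The attrs dictionary is mainly useful when an attribute name cannot be written as a Python keyword argument, for instance names containing a hyphen, or the attribute called name, which would clash with the tag-name parameter. A hedged sketch, assuming bs4 is installed; the form markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical form, invented for illustration.
html = """
<form>
  <input name="username" data-kind="text">
  <input name="password" data-kind="secret">
  <button id="submit" class="btn primary">Go</button>
</form>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword arguments work for ordinary attribute names.
by_id = soup.find("button", id="submit")
# class_ matches if any one of the tag's classes equals the value.
by_class = soup.find_all("button", class_="primary")

# data-kind is not a valid Python identifier, so it goes in attrs.
by_data = soup.find_all("input", attrs={"data-kind": "secret"})
# The "name" attribute would collide with the tag-name parameter, so use attrs.
by_name = soup.find_all("input", attrs={"name": "username"})

print(by_id["id"], len(by_class))  # submit 1
print(by_data[0]["name"])          # password
print(by_name[0]["data-kind"])     # text
```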
Get the attributes of a tag:
  1. Get by subscript: Index the tag with the attribute name.

     href = a['href']
  2. Get from the attrs attribute:

     href = a.attrs['href']
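A runnable version of the two access styles might look like the following. It assumes bs4 is installed; the link markup is invented, and the use of get for optional attributes is an addition not covered in the notes above.

```python
from bs4 import BeautifulSoup

# Hypothetical link, invented for illustration.
soup = BeautifulSoup('<a href="/home" class="nav main">Home</a>', "html.parser")
a = soup.find("a")

href = a["href"]             # subscript access; raises KeyError if the attribute is missing
same_href = a.attrs["href"]  # attrs is a plain dictionary of all attributes
title = a.get("title")       # get returns None instead of raising
classes = a["class"]         # class is multi-valued, so bs4 returns a list

print(href, same_href)  # /home /home
print(title)            # None
print(classes)          # ['nav', 'main']
```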
string, strings, the stripped_strings attribute, and the get_text method:
  1. string: Gets the non-tag string directly under a tag. It returns a single string. If the tag has more than one child (for example, several strings or nested tags), it returns None.
  2. strings: Gets all descendant non-tag strings under a tag. It returns a generator.
  3. stripped_strings: Gets all descendant non-tag strings under a tag, with leading and trailing whitespace removed and whitespace-only strings skipped. It returns a generator.
  4. get_text: Gets all descendant non-tag strings under a tag, returned not as a list but as one regular string.
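The four accessors above can be compared on one small tree. A minimal sketch, assuming bs4 is installed; the markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration.
html = "<div><p>Hello</p><p>  World  </p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")
p = soup.find("p")

single = p.string                       # the tag has exactly one string child
ambiguous = div.string                  # the tag has two children, so this is None
all_strings = list(div.strings)         # generator of every descendant string
stripped = list(div.stripped_strings)   # same, with surrounding whitespace removed
text = div.get_text()                   # everything joined into one str

print(repr(single))     # 'Hello'
print(ambiguous)        # None
print(all_strings)      # ['Hello', '  World  ']
print(stripped)         # ['Hello', 'World']
print(repr(text))       # 'Hello  World  '
```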
CSS selector:
  1. Select by tag name. Example code is as follows:

     p{
         background-color: pink;
     }
  2. Select by class name. Put a dot (.) before the class name. Example code is as follows:

     .line{
         background-color: pink;
     }
  3. Select by id. Put a # before the id. Example code is as follows:

     #box{
         background-color: pink;
     }
  4. Find descendant elements. Separate the ancestor and the descendant with a space. Example code is as follows:

     #box p{
         background-color: pink;
     }
  5. Find direct child elements. Put a > between the parent and the child. Example code is as follows:

     #box > p{
         background-color: pink;
     }
  6. Select by attribute. Write the tag name first, then the attribute name and its value in square brackets. Example code is as follows:

     input[name='username']{
         background-color: pink;
     }
  7. When selecting by class name or id, you can also filter by tag name. Put the tag name directly before the class or id selector. Example code is as follows:

    div#line{
        background-color: pink;
    }
    div.line{
        background-color: pink;
    }
Using CSS selectors in BeautifulSoup:

In BeautifulSoup, to use a CSS selector, you should use the soup.select() method. You should pass a CSS selector string to the select method.
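As a rough sketch, each of the selector forms listed earlier can be passed to select unchanged. This assumes bs4 is installed together with its soupsieve dependency (which provides the selector support in recent versions); the markup is invented, and select_one, which returns only the first match, is an addition not mentioned above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration.
html = """
<div id="box">
  <p class="line">direct child</p>
  <section><p>nested</p></section>
  <input name="username">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("p")                          # tag name
by_class = soup.select(".line")                    # class name
by_id = soup.select("#box")                        # id
descendants = soup.select("#box p")                # any depth below #box
children = soup.select("#box > p")                 # direct children only
by_attr = soup.select("input[name='username']")    # attribute value
tag_and_class = soup.select("p.line")              # tag name plus class

# select always returns a list; select_one returns the first match or None.
first_child = soup.select_one("#box > p")

print(len(by_tag), len(descendants), len(children))  # 2 2 1
print(first_child.get_text())                        # direct child
```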

Four common objects:
  1. Tag: Every tag in BeautifulSoup is of type Tag, and the BeautifulSoup object itself is essentially a Tag as well. Methods such as find and find_all are therefore defined on Tag, not on BeautifulSoup.
  2. NavigableString: Inherits from Python's str and can be used just like a str.
  3. BeautifulSoup: Inherits from Tag. It is used to build the BeautifulSoup tree. Lookup methods such as find and select actually come from Tag.
  4. Comment: Inherits from NavigableString.
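The inheritance relationships above can be checked with isinstance. A small sketch, assuming bs4 is installed; the markup is invented for the example.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup with one text node and one comment.
soup = BeautifulSoup("<p>text<!-- note --></p>", "html.parser")
p = soup.p
text, comment = p.contents  # a NavigableString followed by a Comment

print(isinstance(soup, Tag))                # True: BeautifulSoup inherits from Tag
print(isinstance(p, Tag))                   # True
print(isinstance(text, NavigableString))    # True
print(isinstance(text, str))                # True: NavigableString inherits from str
print(isinstance(comment, Comment))         # True
print(isinstance(comment, NavigableString)) # True: Comment inherits from NavigableString
```

Since a Comment is also a NavigableString, code that needs to skip comments has to test for Comment explicitly.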
contents and children:

Both return the direct children of a tag, including strings. The difference between them is that contents returns a list, while children returns an iterator.
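A short sketch of the difference, assuming bs4 is installed; the list markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tags with a bare string between them.
soup = BeautifulSoup("<ul><li>a</li>text<li>b</li></ul>", "html.parser")
ul = soup.ul

contents = ul.contents          # a real list: supports len() and indexing
child_list = list(ul.children)  # children must be iterated (or materialised) first

print(len(contents))                  # 3
print(str(contents[1]))               # text
print([str(c) for c in child_list])   # ['<li>a</li>', 'text', '<li>b</li>']
```

Use contents when you need indexing or a length, and children when a single pass over the direct children is enough.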