This is the 28th day of my participation in the August Challenge

Life is short: let's learn Python together

The crawler protocol

The robots.txt file declares whether a site allows its data to be crawled.
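Before crawling, you can check a site's rules programmatically with the standard library's `urllib.robotparser`. This is a minimal sketch: the rules here are supplied inline for illustration; against a real site you would call `rp.set_url('https://example.com/robots.txt')` followed by `rp.read()`.

```python
from urllib.robotparser import RobotFileParser

# Example rules, supplied inline instead of fetched from a real site
rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Check whether a given URL may be crawled by any user agent
print(rp.can_fetch('*', 'https://example.com/news/'))      # allowed
print(rp.can_fetch('*', 'https://example.com/private/x'))  # disallowed
```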

Bs4 module

  • Installation

pip3 install beautifulsoup4

  • Module introduction

BeautifulSoup is a parsing library for extracting data from HTML and XML documents
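In one sentence, BeautifulSoup turns an HTML string into a searchable tree. A tiny sketch (the HTML snippet is made up for illustration; the built-in 'html.parser' is used so no extra module is required):

```python
from bs4 import BeautifulSoup

html = '<div><h3>Title</h3><p class="desc">Summary text</p></div>'
soup = BeautifulSoup(html, 'html.parser')  # built-in parser, no lxml needed

print(soup.h3.text)             # the text inside the h3 tag
print(soup.find('p')['class'])  # class is multi-valued, so it is a list
```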

Crawling Autohome news with bs4 – basic usage

import requests
from bs4 import BeautifulSoup

# HTML in text format
response = requests.get('https://www.autohome.com.cn/news/1/#liststart')
# print(response.text)

# Parse the HTML with bs4. First argument: the HTML text. Second argument: which parser
# to use: the built-in 'html.parser' (no third-party module needed) or 'lxml' (pip install lxml)
soup = BeautifulSoup(response.text, 'lxml')

# Find the div whose class is article-wrapper
div1 = soup.find(class_='article-wrapper')
# print(div1)

# Find the div whose id is auto-channel-lazyload-article
div2 = soup.find(id='auto-channel-lazyload-article')
# print(div2)

# Find the ul tag whose class is article
ul = soup.find(class_='article')
# Continue to find all li tags under the ul
li_list = ul.find_all(name='li')
for li in li_list:
    # Find what we need under each li
    title = li.find(name='h3')
    # Skip ads: ad items have no h3 title
    if title:
        title = title.text
        url='https:'+li.find('a').attrs.get('href')
        desc=li.find('p').text
        img='https:'+li.find(name='img').get('src')
        print('News title: %s News URL: %s News summary: %s News image: %s' % (title, url, desc, img))

Bs4 use

Traverse the document tree

Selecting directly by tag name is fast, but if multiple identical tags exist, only the first one is returned

  • Prepare an HTML document
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my_p" class="title">hello<b id="bbb" class="boldest">The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
  • Fault tolerance: even non-standard HTML can be parsed
# The body tag of the HTML document above is not closed
soup=BeautifulSoup(html_doc,'lxml')
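A quick check of this fault tolerance: the `<b>` tag below is never closed, yet the parser still builds a usable tree (the snippet is made up for illustration; the built-in 'html.parser' is used here so the example needs no extra install):

```python
from bs4 import BeautifulSoup

# The <b> tag is never closed, but parsing still succeeds
broken = '<p class="title"><b>The Dormouse\'s story</p>'
soup = BeautifulSoup(broken, 'html.parser')

print(soup.b.text)      # the text is still reachable
print(soup.p['class'])  # attributes survive too
```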
  • Traversing the document tree

Get a tag and its contents

head = soup.head
print(head)

Get the tag name

print(head.name)

Get the tag's attributes; class can have multiple values, so it comes back as a list

p = soup.body.p  # get the p tag under the body tag
print(p.attrs)  # all attributes of the p tag: id, class, etc.
print(p.attrs.get('class'))  # the result is a list
print(p.get('class'))  # method 2, the result is a list
print(p['class'])  # method 3, the result is a list

Nested selection

a=soup.body.a
print(a.get('id'))
# Child nodes and descendant nodes
print(soup.p.contents)  # all children of p, as a list
print(soup.p.children)  # an iterator over all children of p
print(list(soup.p.children))  # convert the iterator to a list
# Parent nodes and ancestor nodes
print(soup.a.parent)  # the parent node of the a tag (there is only one)
print(soup.p.parent)  # the parent node of the p tag
print(soup.a.parents)  # a generator over all ancestors of the a tag: parent, parent's parent, ...
print(list(soup.a.parents))
print(len(list(soup.a.parents)))
# Sibling nodes
print(soup.a.next_sibling)  # the next sibling
print(soup.a.previous_sibling)  # the previous sibling

print(list(soup.a.next_siblings))  # all following siblings => a generator
print(list(soup.a.previous_siblings))  # all preceding siblings => a generator

Search document tree

Find and find_all

find(): returns only the first match

find_all(): returns all matches
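The difference on a tiny document (the snippet is made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>a</li><li>b</li><li>c</li></ul>', 'html.parser')

first = soup.find('li')       # only the first match, a single Tag
all_li = soup.find_all('li')  # every match, as a list

print(first.text)                  # text of the first li
print([li.text for li in all_li])  # text of every li
```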

The five kinds of filters

  • String filter: the filter condition is a plain string
a = soup.find(name='a')
print(a) # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
res = soup.find(id='my_p')
res=soup.find(class_='story')
res=soup.find(href='http://example.com/elsie')

res=soup.find(attrs={'id':'my_p'})
res=soup.find(attrs={'class':'story'})
  • Regular expression filter
import re
re_b=re.compile('^b')
res=soup.find(name=re_b)
res=soup.find_all(name=re_b)
res=soup.find_all(id=re.compile('^l'))
  • List filter
# Get all tags whose name is body or b
res = soup.find_all(name=['body', 'b'])
# Get all tags whose class is sister or title
res = soup.find_all(class_=['sister', 'title'])
  • True and False
# Get all tags that have a name (i.e. all tags)
res = soup.find_all(name=True)
# Get all tags that have an id
res = soup.find_all(id=True)
# Get all tags that have no id
res = soup.find_all(id=False)
res = soup.find_all(href=True)
  • Method filter (just for understanding)
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))
  • Other parameters: limit (cap the number of results) and recursive (search all descendants, or direct children only)
res = soup.find_all(name=True, limit=1)
print(res)
res = soup.body.find_all(name='b', recursive=False)  # search only direct children of body
res = soup.body.find_all(name='p', recursive=False)
res = soup.body.find_all(name='b', recursive=True)  # the default: search all descendants
print(res)

CSS selectors

For more selectors, see www.w3school.com.cn/cssref/css_…

res = soup.select('#my_p')
ret = soup.select('body p')  # descendants (children and grandchildren)
ret = soup.select('body>p')  # direct children only
ret = soup.select('body>p')[0].text
res = soup.select('#my_p')[0].attrs  # get the attributes
res = soup.select('#my_p')[0].get_text()  # get the content
res = soup.select('#my_p')[0].text  # get the content
res = soup.select('#my_p')[0].strings  # a generator over all descendants' text; .string returns text only when the tag has a single text child, otherwise None
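The `.string` vs `.strings` distinction is easy to trip over, so here is a small sketch (the HTML and the `box` id are made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<div id="box"><p>hello</p><p>world</p></div>'
soup = BeautifulSoup(html, 'html.parser')

div = soup.select('#box')[0]
print(div.string)                  # None: div has more than one child
print(list(div.strings))           # all descendant text, via a generator
print(soup.select('p')[0].string)  # a single text child, so the text itself
```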

Summary

  • The lxml parsing library is recommended

  • There are three kinds of selectors: tag selectors, find and find_all, and CSS selectors

    Tag-name selection has weak filtering ability

    find and find_all cover simple queries, matching one result or many

    select is recommended if you are familiar with CSS selectors

  • Remember the common accessors: attrs for attributes, and get_text()/text for content

Finally

This article was first published on the WeChat official account Program Yuan Xiaozhuang, and simultaneously on Juejin (Nuggets).

Writing is not easy; please credit the source when reprinting. Passing friends, please leave a cute little thumbs-up before you go (╹▽╹)