Don't take the title too seriously; I just thought it made a fun ad. The mind map above is yours to take; it covers more than I can absorb anyway.

Article directory

    • Preface
      • Welcome to our circle
    • HTML basics
      • What is HTML?
      • View the HTML code of the web page
      • What the hell are we looking at?
        • Tags and elements
        • Basic HTML structure
        • HTML attributes
    • Crawling web page text
      • A quick recap
      • BeautifulSoup
      • Parsing web page data
        • res = BeautifulSoup('data to parse', 'parser')
        • Extracting the data
        • The Tag object
    • Review

Preface

Previous Review: I will Learn Python secretly (Day 7)

In the last article we met the crawler for the first time. After some close contact, it turned out not to be that hard to learn: we grabbed a small picture, and crawling a web page was only a bit harder; before we knew it we'd pulled down someone's entire page source.

So what now? When we look at the results of other people's crawlers, there's none of that messy source code. That's where web page parsing comes in: today we'll parse a web page and then extract the useful text from it.

Note that this won't teach you front-end development; we just need a basic understanding of web page structure so our crawler can locate what it wants.

Note: the dark-background screenshots in this article are from "Wind Change Programming".

A quick aside (if you're a beginner, take a look at the following section):

Welcome to our circle

I created a Python learning Q&A group; interested friends can find out more here: what is this group

There are more than 200 friends in the group!!

Direct portal to the group: Portal


This series assumes you have some basic knowledge of C or C++, because I started Python after learning a little C++ (thanks to Qi Feng for the support). It also assumes you can search for things yourself, for example when learning a new module, and that you have your own editor and interpreter; the previous article made a recommendation, so just go take a look. As for the table of contents for this series: honestly, my preference is for the two Primer Plus books, so I'll stick with their structure. This series will also focus on cultivating your ability to figure things out on your own. After all, I can't cover every knowledge point, so being able to solve your own problems is especially important. So please treat the holes I've deliberately left in these articles as exercises, not as traps.

HTML basics

What is HTML?

HTML (HyperText Markup Language) is the language used to describe web pages. The name itself doesn't matter much here; what matters is the premise that nobody reading this is put off by an unfamiliar language.

View the HTML code of the web page

Let's casually open a web page; say, the one from Day 7: blog.csdn.net/qq_43762191…

Right-click in a blank area of the page and choose "View page source". The code is too long for me to copy over, so here's the URL: view-source:blog.csdn.net/qq_43762191… If you can't find the source that way, you can type the address in directly.

There's another way: right-click -> Inspect, which shows the source code on the same screen as the page. Which one you use is personal preference.


What the hell are we looking at?

Tags and elements

Okay, let's look at the HTML document first. The words wrapped in angle brackets <> are called tags.

Tags usually come in pairs: a start tag first, such as <html>; then the matching end tag, such as </html>.

However, some tags appear alone, such as the <meta charset="utf-8"> on the fifth line of the HTML code (it declares the page encoding as UTF-8). Just know they exist; most of the time you'll be dealing with tags that come in pairs.



In fact, the start tag, the end tag, and everything in between together make up an element.
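To make this concrete, here is a tiny sketch using Python's built-in html.parser module on a made-up snippet (not the real page). It fires one event for the start tag, one for the content, and one for the end tag, and together those three pieces are one element:

```python
from html.parser import HTMLParser  # Python's built-in HTML parser

# A made-up snippet: <title> is the start tag, </title> the end tag,
# and everything from <title> to </title> is one element.
class TagLogger(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print('start tag:', tag)
    def handle_endtag(self, tag):
        print('end tag:  ', tag)
    def handle_data(self, data):
        print('content:  ', data)

TagLogger().feed('<title>My page</title>')
# start tag: title
# content:   My page
# end tag:   title
```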

The following table lists a few common elements:

Now go back and look at the page source; you could probably make decent guesses even without this table. Much clearer now, isn't it?

Basic HTML structure

And if that’s not enough, here’s another one:

HTML attributes

You'll also notice other things, like class, appearing over and over inside the body. These are called attributes. Let's take a look:

So far we've covered the components of HTML: tags, elements, structure (head and body), and attributes. Now keep looking at the page source until it no longer feels intimidating, then let's move on.
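A small sketch of attributes, again with the built-in html.parser and a made-up snippet (the repeated class attributes mimic what you see in the real page source). The parser hands attributes over as (name, value) pairs attached to the start tag:

```python
from html.parser import HTMLParser

# A made-up snippet with class attributes, like the ones repeated in the page source
class AttrLogger(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(tag, dict(attrs))  # attributes arrive as (name, value) pairs

AttrLogger().feed('<div class="article"><p class="intro">First paragraph</p></div>')
# div {'class': 'article'}
# p {'class': 'intro'}
```

Later, this is exactly what lets us say "find me the element whose class is X".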


Crawling web page text

A quick recap

Last time we left off with a small regret: all we could do was crawl a page's raw source code. But that was actually just the first step; now we'll parse that code and extract the content we want.

The code looked like this:

import requests  # Call the requests library
res = requests.get('https://blog.csdn.net/qq_43762191')  # Crawl my own page; res is a Response object
html = res.text  # The content of res as a string
print('Response status code:', res.status_code)  # Check that the request got a normal response
print(html)  # Print the source code of the web page

BeautifulSoup

Next up is a module called BeautifulSoup, literally "beautiful soup".

A package named "beautifulsoup" would not install from inside my PyCharm. I'm not sure whether that's because I'm on Python 3.9 or because it simply isn't distributed under that name, much like the word cloud situation earlier. What does install is beautifulsoup4 (`pip install beautifulsoup4`); whatever you install, what matters is that you end up with the bs4 package.
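As a quick sanity check after installing (this assumes you used `pip install beautifulsoup4`): the PyPI package is named beautifulsoup4, but the package you import is bs4:

```python
# The PyPI package is beautifulsoup4, but the importable package is bs4
import bs4
from bs4 import BeautifulSoup  # the class we'll use from here on

print(bs4.__version__)  # the exact version depends on your install
```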

Parsing web page data

res = BeautifulSoup('data to parse', 'parser')

The first argument is the text to be parsed. Note that it must be a string.

The second argument identifies the parser; we'll use Python's built-in one, html.parser. (It's not the only parser, but it's the simplest to start with.)

Ok, let’s see:

import requests
from bs4 import BeautifulSoup
res = requests.get('https://blog.csdn.net/qq_43762191')
soup = BeautifulSoup(res.text, 'html.parser')
print(type(soup))  # Check the type of soup
wt = open('test4.txt', 'w', encoding='utf-8')
wt.write(res.text)
wt.close()

Ok, go and try it. You'll find that CSDN doesn't let you get away with it that easily. No matter; let's switch to another address: mp.weixin.qq.com/s?__biz=MzI…

That article has its highlights, by the way; crawl it down and read it for yourself. After reading it I spent a whole afternoon studying it too, purely in the interest of safety, of course. I'm a good person.

The printed <class 'bs4.BeautifulSoup'> tells us that soup is a BeautifulSoup object.


Extracting the data

This step breaks down into two parts: the find() and find_all() methods, and Tag objects.

import requests
from bs4 import BeautifulSoup
# Note: I only seem to be able to crawl some sites; it wouldn't even fetch my blog homepage
res = requests.get('https://mp.weixin.qq.com/s?__biz=MzI4NDY5Mjc1Mg==&mid=2247489783&idx=1&sn=09d76423b700620f80c9da9e4d8a8536&chksm=ebf6c088dc81499e3d5a0febeb67fec27ba52f233b6a0e6fda37221a2c497dee82f2de29e567&scene=21#wechat_redirect')
soup = BeautifulSoup(res.text, 'html.parser')
item = soup.find('div')
#item = soup.find_all('div')
print(type(item))  # Check the type of item
wt = open('test6.txt', 'w', encoding='utf-8')
wt.write(str(item))
wt.close()
print(item)

It just proves the old saying: what you learn on paper always feels shallow; you have to practice it to truly understand. I almost believed the nonsense in the diagram above.

(In fact, if you go back and study it carefully, you will find something new.)

About the arguments in the parentheses: the tag and the attribute can be used separately or together, depending on what we want to extract from the page.

If one of them alone can locate the target precisely, use just that one for the search. If you need both the tag and the attribute to pin down exactly what you're looking for, use both together.
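A minimal sketch of these combinations on a made-up snippet (not the live page; the class names here are invented for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<div class="book">Novel A</div>
<div class="book">Novel B</div>
<div class="note">Just a note</div>
'''
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('div')                     # tag only: the first <div>
books = soup.find_all('div', class_='book')  # tag + attribute: both books
notes = soup.find_all(class_='note')         # attribute only

print(first.text)    # Novel A
print(len(books))    # 2
print(notes[0].text) # Just a note
```

Note that find() returns the first match while find_all() returns a list of all matches, which is why we looped over the result earlier.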


The Tag object

Feeling like victory is in sight, only to run it and get what looks like nothing? Don't worry, keep reading; you'd hate to miss what comes next:

We've been at this for most of the day and got back yet another pile of source code, seemingly for nothing. But has nothing really changed? Don't trust your eyes; they can deceive you. Print out the type of each return value instead: the types have actually been moving in our favor the whole time.

We can see their data type is <class 'bs4.element.Tag'>: they are Tag objects.

Remember when we said extracting data would involve Tag objects and the like? Now it's time for the final push. Let's go!

First, Tag objects can themselves be searched with find() and find_all().

import requests
from bs4 import BeautifulSoup
# Note: I only seem to be able to crawl some sites; it wouldn't even fetch my blog homepage
res = requests.get('https://mp.weixin.qq.com/s?__biz=MzI4NDY5Mjc1Mg==&mid=2247489783&idx=1&sn=09d76423b700620f80c9da9e4d8a8536&chksm=ebf6c088dc81499e3d5a0febeb67fec27ba52f233b6a0e6fda37221a2c497dee82f2de29e567&scene=21#wechat_redirect')
html = res.text  # The content of res as a string
soup = BeautifulSoup(html, 'html.parser')  # Parse the page into a BeautifulSoup object
items = soup.find_all('div')  # Extract the data we want by locating the tag
for item in items:
    kind = item.find(class_='rich_media_area_primary')  # Inside each element, match the attribute class_='rich_media_area_primary'
    print(kind, '\n')   # Print the extracted data
    print(type(kind))   # Print the extracted data's type

We use Tag.text to extract the text inside a Tag object, and Tag['href'] to extract a URL from it.

import requests  # Call the requests library
from bs4 import BeautifulSoup  # Call the BeautifulSoup library
res = requests.get('https://mp.weixin.qq.com/s?__biz=MzI4NDY5Mjc1Mg==&mid=2247489783&idx=1&sn=09d76423b700620f80c9da9e4d8a8536&chksm=ebf6c088dc81499e3d5a0febeb67fec27ba52f233b6a0e6fda37221a2c497dee82f2de29e567&scene=21#wechat_redirect')
html = res.text  # The content of res as a string
soup = BeautifulSoup(html, 'html.parser')  # Parse the page into a BeautifulSoup object
items = soup.find_all(class_='rich_media')  # Extract the elements that match the class='rich_media' attribute
for item in items:  # Walk through the list of matches
    kind = item.find('h2')  # Inside each element, match the <h2> tag
    title = item.find(class_='profile_container')  # ...and match the attribute class_='profile_container'
    print(kind.text, '\n', title.text)  # Print the extracted text
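A minimal sketch of both accessors on a made-up link element (not taken from the real page), so you can see .text and ['href'] in isolation:

```python
from bs4 import BeautifulSoup

# A made-up <a> element to demonstrate .text and ['href']
html = '<a href="https://blog.csdn.net/qq_43762191" class="link">my blog</a>'
a = BeautifulSoup(html, 'html.parser').find('a')

print(a.text)     # the text inside the tag: my blog
print(a['href'])  # the value of the href attribute
```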

Review

To put it simply: from fetching data with the requests library, through parsing it with BeautifulSoup, to extracting pieces from it, we walked through a series of type conversions of the objects we were manipulating.
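That chain of type conversions can be sketched like this (using a made-up string in place of the network request, so no crawling is needed; the requests step is shown only as a comment):

```python
from bs4 import BeautifulSoup

# Step 1 (network, shown for context): res = requests.get(url) -> Response
# Step 2: res.text -> str; here we use a made-up string instead
html = '<html><body><h2>Title</h2></body></html>'
print(type(html))       # <class 'str'>

soup = BeautifulSoup(html, 'html.parser')  # str -> BeautifulSoup
print(type(soup))       # <class 'bs4.BeautifulSoup'>

tag = soup.find('h2')   # BeautifulSoup -> Tag
print(type(tag))        # <class 'bs4.element.Tag'>

text = tag.text         # Tag -> str again
print(type(text), text) # <class 'str'> Title
```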

See the picture below: