Day 1 of the Python crawler course

Preface 🌞

Long time no see; I've missed you all. Games keep getting more fun and videos keep getting better looking, and I've finally admitted that I can't tear myself away from them.

So I'm going to document my Python crawler journey here. I hope that while you read along for fun, you'll also point out where I go wrong and cheer me on 😘

1. Download Baidu's HTML file with urllib

# Download Baidu's HTML page with the urllib module
import urllib.request

def load_baidu():
  url = "http://www.baidu.com"
  # Request the page over the network
  response = urllib.request.urlopen(url)
  # read() returns bytes; decode them into text
  data = response.read().decode('utf-8')

  # Write the response data to a file
  with open("baidu.html", "w", encoding="utf-8") as f:
    f.write(data)

load_baidu()

If you ran this code, you should now have a baidu.html file in the same directory 👈 take a look

If the file is there, you've taken a big step, but let's not get too excited just yet. After all, as the old saying goes, take too big a step and you might pull something 🥚

So we should think about what each line of code does. Copy-pasting alone is not how a good engineer works 💪
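To make "what each line does" concrete, here is a minimal sketch that splits the download into its steps and reports what each one produced. The function name `inspect_page` and the fallback to utf-8 when the server declares no charset are my own assumptions, not part of the original code.

```python
import urllib.request

def inspect_page(url):
    """Fetch url and report what each step of the download produced."""
    # urlopen sends the request and returns a response object
    with urllib.request.urlopen(url) as response:
        # read() returns the body as bytes, not text
        raw = response.read()
        # Charset declared in the Content-Type header; utf-8 is an assumed fallback
        charset = response.headers.get_content_charset() or "utf-8"
    # decode() turns the bytes into a str that can be written as text
    text = raw.decode(charset)
    return {"bytes": len(raw), "charset": charset, "text": text}
```

Calling `inspect_page("http://www.baidu.com")` should return the byte count, the charset and the decoded HTML, so you can check each step before writing anything to disk.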

2. Request HTTPS

If you were paying attention, you may have noticed that we requested Baidu over HTTP, not HTTPS

The following error may occur when we use https://www.baidu.com

<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)>

This means Python could not verify the server's SSL certificate, because it cannot find the issuer's certificate locally
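If you want to see this failure in your own program instead of a traceback, the sketch below catches it. The helper name `try_open` and the 10-second timeout are assumptions for illustration. `urllib.error.URLError` wraps the underlying cause in its `reason` attribute, which for this error should be an `ssl.SSLCertVerificationError`.

```python
import ssl
import urllib.error
import urllib.request

def try_open(url, timeout=10):
    """Return (ok, detail): the HTTP status on success, or why the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return True, response.status
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it has to be caught first
        return False, f"HTTP error {e.code}"
    except urllib.error.URLError as e:
        # e.reason holds the underlying cause of the failure
        if isinstance(e.reason, ssl.SSLCertVerificationError):
            return False, "certificate verification failed"
        return False, f"other failure: {e.reason}"
```

On a machine that shows the error above, `try_open("https://www.baidu.com")` should return `(False, "certificate verification failed")`.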

Now let’s change the code

# Access Baidu over HTTPS with the urllib module
import urllib.request
import ssl

# Create an SSL context that skips certificate verification
context = ssl._create_unverified_context()

def load_baidu():
  url = "https://www.baidu.com"

  # Request the page over the network
  response = urllib.request.urlopen(url, context=context)
  data = response.read().decode('utf-8')

  # Write the response data to a file
  with open("baidu.html", "w", encoding="utf-8") as f:
    f.write(data)
  # print(response.headers)

load_baidu()
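It helps to know what the unverified context gives up. The sketch below compares it with the context from the public `ssl.create_default_context()`; the `describe` helper is a name I made up for illustration.

```python
import ssl

# The private helper used above: no certificate or hostname checks
unverified = ssl._create_unverified_context()

# The public default: verifies the certificate chain and the hostname
verified = ssl.create_default_context()

def describe(context):
    """Summarise whether a context checks certificates and hostnames."""
    return {
        "verifies_certificate": context.verify_mode == ssl.CERT_REQUIRED,
        "checks_hostname": context.check_hostname,
    }

print(describe(unverified))  # {'verifies_certificate': False, 'checks_hostname': False}
print(describe(verified))    # {'verifies_certificate': True, 'checks_hostname': True}
```

Skipping verification is fine for a practice crawler, but the connection is no longer protected against impersonation. If the real problem is a missing CA bundle on your machine, passing one through the `cafile` parameter of `ssl.create_default_context()` is probably the safer fix.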

3. Add request header information

The request succeeded, so we open baidu.html right away. Huh? Nothing! 😨 Opening the developer tools shows that baidu.html contains the following:

<html>
<head>
	<script>
		location.replace(location.href.replace("https://", "http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0; url=http://www.baidu.com/"></noscript>
</body>
</html>

The reason is that Baidu inspects whoever (me) is requesting the page. Because my request headers carry no browser information, it refuses to serve me over HTTPS and redirects me to HTTP instead
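You can check what urllib announces itself as when you set no headers. This small sketch reads the defaults from a freshly built opener. Whether this value is exactly what Baidu looks at is my assumption, but it clearly does not look like a browser.

```python
import urllib.request

# An opener built with default settings carries the headers urllib adds itself
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# Prints something like {'User-agent': 'Python-urllib/3.x'};
# the version suffix depends on your Python
print(default_headers)
```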

Now that we know why, let's add the request header and be a crawler with a respectable face.

# Configure headers with the urllib module and access Baidu over HTTPS

import urllib.request
import ssl

# Create an SSL context that skips certificate verification
context = ssl._create_unverified_context()

def load_baidu():
  url = "https://www.baidu.com"
  header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
  }
  # Create a Request object so that the headers can be attached to it
  request = urllib.request.Request(url, headers=header)
  # Request the page over the network
  response = urllib.request.urlopen(request, context=context)
  data = response.read().decode('utf-8')

  # Write the response data to a file
  with open("baidu.html", "w", encoding="utf-8") as f:
    f.write(data)
  # print(response.headers)

load_baidu()

Yay ✌. We've successfully fetched Baidu's HTML file again. To sum up, here is what we learned today:

Use the urllib.request module to request a page and write it to disk
Use urllib with an SSL context to access HTTPS
Add headers to a request with a Request object

Does it feel like it ended just as you were getting started? If you're new to this, get familiar with the API behind each line of code: what parameters does it take, and what does each one do 🤔? I'm sure your skills will grow day by day until the kid next door has to sit up and take notice. What are you waiting for? Start typing!
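As one example of that kind of exploring, here is a small sketch that pokes at a `Request` object without sending anything. The `Accept-Language` header and its value are hypothetical additions for illustration. One thing worth noticing is that `Request` stores header names with only the first letter capitalized.

```python
import urllib.request

url = "https://www.baidu.com"
user_agent = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/78.0.3904.97 Safari/537.36")

# Headers can be passed in the constructor...
request = urllib.request.Request(url, headers={"User-Agent": user_agent})
# ...or added later (this extra header is a made-up example)
request.add_header("Accept-Language", "zh-CN,zh;q=0.9")

print(request.get_method())              # GET, because no data was supplied
print(request.header_items())            # every header that will be sent
print(request.get_header("User-agent"))  # the User-Agent string set above
print(request.get_header("User-Agent"))  # None: the key was stored as 'User-agent'
```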

Follow me so you don't lose track. More Python crawler tutorials are on the way, and we'll move forward together [sly smile] Let's go! 😍
