The crawler sets the `user-agent` request header. Adding this header to a request disguises it, to some extent, as a browser, so the server will not immediately identify it as a spider. As far as I know, many readers copy a `user-agent` string straight off the web and paste it into their code. There is nothing wrong with that, and it works; but if the target site's anti-crawling measures are a little stronger, a fixed request header can become a problem. That is why we need to set up random request headers. Here I share the three methods I generally use to set a random request header. If you learn something, please like and comment!

The ideas in brief:

  • To achieve randomness, we can largely rely on Python's built-in `random` library: call `random.choice(user_agent_list)` to pick one entry at random from a list of user-agents. This is my first method.
  • Python is a language with a huge ecosystem of third-party packages, so naturally there is a package that generates random request headers: the third-party library `fake-useragent`.
  • Since other people can write a third-party library, we can implement the same functionality ourselves. In most cases my code directly calls my own `GetUserAgentCS` class to get a random request header. I will also show how to write this as a small importable library.
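The first idea above, picking from a hand-maintained list with `random.choice`, can be sketched like this (the short list and the `random_headers` helper name are illustrative, not from the original code):

```python
import random

# A short illustrative list; in practice you would keep many more entries.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
    "Mozilla/5.0 (X11; NetBSD) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36",
]

def random_headers():
    # Pick a different user-agent for each request.
    return {"user-agent": random.choice(USER_AGENTS)}
```

Passing `random_headers()` as the `headers=` argument of each request gives every request a (potentially) different browser identity.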

Writing your own third-party library:

  • I don’t know what your code style is, procedural or object-oriented. For one-off code, simple procedural code is fine. But if you think the code can be reused in many places, wrap it in a class; then, from any other file, you can import that file and call the class directly. My implementation of a random-request-header library looks like this:

    ```python
    import random
    import csv


    class GetUserAgentCS(object):
        """Read user-agent strings from a local CSV file and return one at random."""

        def __init__(self):
            with open('D://pyth//scrapy project //setting// userAgent.csv', 'r') as fr:
                fr_csv = csv.reader(fr)
                # Column 1 of each row holds the user-agent string.
                self.user_agent = [str(i[1]) for i in fr_csv]

        def get_user(self):
            return random.choice(self.user_agent)
    ```

The userAgent.csv file looks like this:

```
1,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36, Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
2,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36, Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.17 Safari/537.36"
3,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36, Mozilla/5.0 (X11; NetBSD) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
4,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36, Mozilla/5.0 (X11; CrOS i686 3912.101.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
5,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36, Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
... (about 100 rows in total)
```
The code is very simple: it just reads a local CSV file and lets `random` do the rest. Now someone may ask, “How did you get this file?” It is very simple, and there is a method for it, which I will talk about in the next module. For now, you just need to write the `GetUserAgentCS` class: simply copy the code above and save it as get_useragent.py, **then put the file in your crawler folder** and call it like this:
```python
from get_useragent import GetUserAgentCS

def get_headers():
    # Wrapped in a helper function (name is mine) so the original
    # `return headers` line has something to return from.
    headers = {}
    ua = GetUserAgentCS().get_user()
    headers['user-agent'] = ua
    return headers
```

If your call to `GetUserAgentCS` fails, or a red wavy line appears under the import, you have not set the current working environment. You just need to set it (mark your crawler folder):

![python three ways to let you crawl to the data points minutes](https://p6-tt.byteimg.com/origin/pgc-image/10dfef7aa4fe4607ae69aecd1a12a72a?from=pc)

You just need to click Sources Root!
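Outside PyCharm you can get the same effect as Sources Root programmatically, by putting the folder that contains get_useragent.py on `sys.path`. This is a minimal sketch; the path below is a placeholder you must replace:

```python
import sys

# Placeholder path: point this at the folder that holds get_useragent.py.
CRAWLER_DIR = "path/to/your/crawler/folder"
if CRAWLER_DIR not in sys.path:
    sys.path.append(CRAWLER_DIR)
# After this, `from get_useragent import GetUserAgentCS` can resolve.
```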

Using the third-party fake-useragent library:

  • This is a third-party library that others have written; you install it and then call its API, and it can fetch all kinds of request headers. Its only drawback is that it is not stable: sometimes a network fluctuation makes the lookup fail, which is not very comfortable when it is used in Scrapy. So on top of this package I wrote my own package, shown above. As for where the data in my request-header file comes from: while fake-useragent is running properly, I keep changing the user-agent, keep requesting httpbin.org/user-agent, and keep saving the returned data to a local file.
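That harvesting loop can be sketched roughly as follows. The function names and the two-column CSV layout are my own illustration (the layout is chosen to match what the `GetUserAgentCS` class above reads back with `i[1]`); the network part requires `requests` and `fake-useragent` to be installed:

```python
import csv


def save_agents(agents, path="userAgent.csv"):
    """Write (index, user-agent) rows: the same two-column shape that
    GetUserAgentCS reads back with i[1]."""
    rows = [(i + 1, a) for i, a in enumerate(agents)]
    with open(path, "w", newline="") as fw:
        csv.writer(fw).writerows(rows)
    return rows


def harvest_agents(n):
    """Collect n user-agents, confirming via httpbin.org/user-agent that the
    server really received each one. Needs requests and fake-useragent."""
    import requests
    from fake_useragent import UserAgent

    ua, agents = UserAgent(), []
    while len(agents) < n:
        agent = ua.random
        try:
            r = requests.get("https://httpbin.org/user-agent",
                             headers={"User-Agent": agent}, timeout=5)
            if r.json().get("user-agent") == agent:
                agents.append(agent)
        except requests.RequestException:
            continue  # network fluctuation: skip this attempt and retry
    return agents
```

A run such as `save_agents(harvest_agents(100))` would produce a local CSV of roughly the shape shown earlier.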

Let’s talk about how to use this package.

Installation

```shell
pip install fake-useragent
```

You can run `pip list` to check whether the installation succeeded.

Usage

```python
import requests  # added: needed for the request below
from fake_useragent import UserAgent

url = 'https://httpbin.org/user-agent'  # placeholder URL; use your own target
headers = {'User-Agent': str(UserAgent().random)}
r = requests.get(url, headers=headers)
```
  • `UserAgent().random` returns a request header from a random browser
  • `UserAgent().chrome` returns a Google Chrome request header
  • `UserAgent().firefox` returns a Firefox request header

In most cases, `random` can be used directly, which is the easy option.

If you are interested, you can click on the link below

docs.qq.com/doc/DTGpFa2…