Hello, I’m Yue Chuang.

I will pick an App as the data source, use Fiddler to capture packets and analyze the data-request interface, write the crawler logic in Python, and finally save the data to MongoDB.


1. Preparation

Before following this tutorial, make sure you have completed the following configuration. Here is my operating environment for this article:

  1. Install Python 3.6+ (I’m using Python 3.7 here) and make sure it works;
  2. Install the Sublime editor for Python (skip this step if you already have PyCharm installed);
  3. Install the common crawler libraries;
  4. Install the packet capture tool Fiddler;
  5. Install the Nox (Night God) emulator;
  6. Install the Douguo Food (bean fruit) App through the emulator’s app store; turn the proxy off for this step, and turn it on only when needed later;
  7. Course materials (including the APP)

1.1 Installing Python

For the Python environment setup, here is a brief walkthrough:

  1. Download method

Go to www.python.org

As shown in figure:

  1. Select the Downloads option above
  2. Select your system in the pop-up option box (note: if you click the gray button on the right directly, you will download the 32-bit version)

Enter the download page, as shown in the figure below:

  1. Download link for the 64-bit installer
  2. Download link for the 32-bit installer

Select your corresponding file to download.

Installation Precautions

(Image from the Internet)

Custom installation: options such as the install location can be chosen to make Python fit your own habits better.

Default installation: all the way to Next, easier and faster installation.

However, you need to check Add Python 3.6 to Path for both the default and the custom installation.

Special note: remember to check where the arrow points in the picture. Otherwise, you’ll have to manually configure environment variables.

Q: How do you configure environment variables?

A: Control panel – System and Security – System – Advanced System Settings – Environment Variables – System Variables – double-click PATH – add the Python path in the blank space of the Edit Environment Variables window.

Verification

After installing Python, press the Windows key, type cmd, and open the command line. Then enter python. If the Python version information is displayed, the installation was successful.

1.2 The Sublime editor for Python

Website: www.sublimetext.com/

This editor is selected for the following reasons:

  1. No deep programming background needed; it’s quick to get started
  2. Fast start-up speed
  3. The key reason — free

Q&A

When the shortcut Ctrl+B does not run your code, try Ctrl+Shift+P and select Build With: Python in the window that pops up.

Or select the Build With option from the Tool option above and select Python in the window that pops up.

1.3 Common crawler library installation

Python installs libraries with pip install <package>.

Just run the following command:

```shell
pip3 install -r https://www.aiyc.top/usr/uploads/2020/05/4258804948.txt
```
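Once the requirements file has been installed, a quick sanity check can confirm the libraries import. This is a sketch: `requests` and `pymongo` are assumed examples of the libraries it contains, since the file’s contents aren’t listed here.

```python
import importlib

# "requests" and "pymongo" are assumed examples; the actual requirements file
# from the article may list different libraries.
results = {}
for name in ["requests", "pymongo"]:
    try:
        importlib.import_module(name)
        results[name] = "OK"
    except ImportError:
        results[name] = "missing - run: pip3 install " + name
    print(name, "->", results[name])
```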

1.4 Installing the Packet capture tool Fiddler

1.4.1 Fiddler Packet capture software

1.4.1.1 preface

Fiddler is a very useful HTTP packet capture tool written in C#. Instead of analyzing pages in the browser’s developer tools, you can debug page parameters in the more professional and capable Fiddler.

Fiddler is a web debugging proxy platform that monitors and modifies web data flows. It is a free tool that helps developers and testers view the traffic passing through it.

As the figure above (Figure 1) shows, Fiddler sits between the user (client) and the web server, acting as a proxy middleman: it captures the requests made by the client and forwards them; the server receives the forwarded request and sends a response, which Fiddler captures in turn and forwards back to the client. Because Fiddler sits in the middle, it can capture all the data passing through and modify it. This is exactly how we will capture the App’s packets.
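To make the middleman idea concrete, here is a minimal sketch (not part of the article’s code) of how a Python client would route its traffic through a running Fiddler instance; the 127.0.0.1:8888 host and port are assumptions matching Fiddler’s default setup.

```python
# Build the proxies mapping that requests/urllib-style HTTP clients accept.
# Host and port are assumptions for a default local Fiddler instance.
def fiddler_proxies(host="127.0.0.1", port=8888):
    proxy = f"http://{host}:{port}"
    # Both HTTP and HTTPS traffic are tunneled through the same Fiddler port.
    return {"http": proxy, "https": proxy}

proxies = fiddler_proxies()
print(proxies)
# Usage (requires Fiddler running):
# requests.get("http://example.com", proxies=proxies)
```

Passing this mapping via `proxies=` to a client such as requests (with Fiddler running) makes every request appear in Fiddler’s session list.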

1.4.1.2 Powerful functions
  1. Fiddler is a powerful tool that works with Internet Explorer, Chrome, Safari, Firefox, and Opera. For these browsers, Fiddler can show you how the data is passed in the presentation layer: in what format, in what language, with what transport mode, caching mechanism, connection mode, and so on.
  2. It can be used on mobile devices such as iPhone, iPad and so on.

To understand the advantages and disadvantages of Fiddler:

1. The advantages

  1. Its advantage is that you can view the Web data flow between all browsers, client applications or servers;
  2. Any request and response can be modified manually or automatically;
  3. Can decrypt HTTPS data streams for viewing and modification;
  4. Modify the data requested or even automatically redirect the request to modify the data returned by the server.

2. The disadvantages

  1. Fiddler supports only HTTP, HTTPS, FTP, and WebSocket data streams.
  2. It cannot monitor or modify other data (i.e., data types it does not support), such as SMTP, POP3, etc.
  3. Fiddler cannot handle requests and responses that exceed 2GB of data.

1.4.2 Downloading Fiddler

Next, let’s take a look at downloading and installing the Fiddler software. We can directly access its download interface through the address: www.telerik.com/fiddler, as shown below (Figure 2) :

After clicking, the following interface (Figure 3) will appear, asking us to fill in some information. If you don’t know what to enter, copy what I filled in in my screenshot, using your own email:

Click on it and it will automatically start downloading.

1.4.3 Fiddler installation

Once the download is complete, you’re ready to install. Installation is also pretty easy: basically just keep clicking Next. Note that Fiddler does not create a shortcut once installed, so you may want to customize the installation path to make the program easier to find and run later.

To make it easier for students to install, I recorded a video of the Fiddler installation. You can watch it by clicking on the link below:

Video link: www.aiyc.top/archives/44…

For the first startup, the following information may appear (Figure 4). Click no.

The normal operating interface is as follows (Figure 5) :


At this point, you’ve successfully installed Fiddler.

1.5 Installing the Nox emulator

Installing the Nox (Night God) emulator is easy. To keep this article short, please refer to this link if you need instructions: www.aiyc.top/archives/49…

1.6 Installing the Douguo Food App

For detailed installation instructions, please refer to the following link: www.aiyc.top/archives/50…

1.7 Course Materials (including APP)

Click on the link to: gitee.com/aiycgit/Stu…

2. Start Fiddler and configure the simulator agent

Start Fiddler and set up the phone proxy in the simulator. Watch the video below for details.

Once configured, the phone’s traffic goes through Fiddler, which is the middleman we described.

2.1 Fiddler Supplement (Fiddler Configuration and Basic Operations)

2.1.1 Basic Information

Once launched, Fiddler automatically starts working, and we can open a browser and click on a few pages to see that Fiddler grabs a bunch of packets. (Figure 6-1)

2.1.2 Basic Software Interface

Let’s take a closer look at Fiddler’s interface.

The names of each part of Fiddler are already in the figure above (Figure 6), with our toolbar at the top, our session list on the left, command line tools on the bottom left, HTTP Request on the right (sorry for the extra “S” on request), and HTTP Response on the bottom right.

  • Capturing in the lower left corner is Fiddler’s capture switch.
  • Session list: all captured data is displayed in the session list.
  • Command-line tools: we can do some filtering and so on; typing help into the command line pops up a page with more information on the available commands.
  • HTTP Request: the HTTP request information, including the request method, HTTP protocol, request headers (User-Agent) and so on;
  • HTTP Response: the information the server sends back;

Let’s move on to this picture (Figure 6-3) :

  1. [#] Request order
  2. [result] Indicates the HTTP status code
  3. [protocol] Request protocol
  4. [Host] Specifies the domain name of the requested address
  5. [URL] Uniform resource locator, the address of the request
  6. [Body] The size of the response body, in bytes
  7. [Caching] Cache control
  8. [Content-Type] The response content type
  9. [Process] The name of the Process that makes the request
  10. [Comments] Indicates the Comments added by the user
  11. [Custom] User-defined value set by the user

2.1.3 Fiddler simulation scenario

Various scenarios can be simulated in the Rules menu:

  1. Login verification prior to request;
  2. Compress web pages to test performance;
  3. Simulate requests from different clients;
  4. Simulation of different network speeds, test the fault tolerance of the page;
  5. Disable cache to debug server static resources.
  6. Hide Image Requests: filters out image requests;
  7. Hide CONNECTs: filters out the TCP CONNECT tunnels (the three-way handshake and four-way teardown traffic we don’t need to see); I usually turn this on;
  8. Apply GZIP Encoding: Enables page compression, which means the server does not transfer a lot of data at once. Similar to how you compress files on a computer to save space, the purpose is to compress pages for better performance.

2.1.4 Viewing the Requested content

The Inspectors on the right are the same as those in the browser’s developer tools. (Figure 7)

2.1.5 We cannot view the information on the HTTPS website

You can see the yellow entries in the figure below (Figure 8), but don’t worry.

2.2 Settings

The basic interface information covered above is not hard. So what is the hard part?

A: The hard part is getting the settings right.

So let’s set up our Fiddler so that it can grab packets of our browser data, and we’ll talk about App data later.

2.2.1 Capturing the browser’s packets (the HTTPS tab)

Next, I continue to set the Options in Tools, we can see that there are many tabs (Figure 8-2) :

We want to capture not only HTTP packets but also HTTPS data, and a mobile App likewise uses both HTTP and HTTPS. So we need Fiddler to decrypt HTTPS traffic, which also solves the problem mentioned above of not being able to capture HTTPS.

So Fiddler uses its own CA certificate to decrypt HTTPS: it pretends to be the HTTPS server in front of the browser, and pretends to be the browser in front of the real HTTPS server. If this is not clear, see Figure 1 above.

We first click Tools in the toolbar, and then click Options, as shown below (Figure 9) :

Next click on the HTTPS TAB and check Decrypt HTTPS as shown in figure 10:

Select yes for all, and then visit the HTTPS website again to capture packets. The tip in Figure 10 above is that we need to install a CA certificate, but the certificate is Fiddler’s own certificate, so we can choose yes. In this case Fiddler is acting as a man-in-the-middle attack.

I mentioned it above, but just to make it clear, LET me mention it again: As you can see from the figure above (Figure 11), the request from our client is captured by our Fiddler, decrypted by Fiddler’s certificate, encrypted and forwarded to the server. The server returns the requested data (encrypted). Fiddler receives the data, decrypts it, and sends it back to the client (plaintext). That’s what Fiddler does in the middle.

I can now see that there is also an option under Decrypt HTTPS, as shown in Figure 12:

This lets you select the object to grab:

  • ...from all processes: capture all processes;
  • ...from browsers only: capture only browser processes. (To keep things simple for beginners and avoid capturing unrelated requests, we select this one first.)
  • ...from non-browsers only: capture everything except browser processes;
  • ...from remote clients only: capture only remote clients. (Select this option when capturing App data.)

The final version we set in the HTTPS TAB looks like this (Figure 13) :

2.2.2 Connections TAB

Next, let’s set up the Connections TAB. After clicking on the Connections TAB, the interface looks like this (Figure 14) :

“Fiddler listens on port” is where you set the port Fiddler listens on. I usually set it to 8889, but the default (8888) is fine. If you don’t check Allow remote computers to connect, your phone won’t be able to connect to Fiddler, which means the capture tool won’t be able to capture your phone’s packets.

When you click (Allow Remote Computers to connect), the following prompt box will pop up (Figure 15) :

Click OK, and then click “OK”. The operations are as follows (Figure 16) :

At this point, it’s set up, and we’re ready to grab the browser data.

2.2.3 Capturing App packets with Fiddler

First, go to Tools-> Options ->Connections->Allow Remote Computers to connect

After that, open your phone’s settings and set the phone’s proxy to the computer’s IP and Fiddler’s port. The steps differ between phones, so search for instructions for your model. Suppose our computer’s address is 192.168.1.1: after the proxy is configured on the phone, open the phone’s browser and enter 192.168.1.1:8888 (use the port you configured in Fiddler, e.g. 8889). The following page is displayed.

Click FiddlerRoot Certificate at the bottom to download and install the certificate.

Finally, browse a mobile App (for example Baidu Takeout), and you can capture its packets on the computer.

2.3 Simple Fiddler operation

2.3.1 Clearing information in the Session list

There are two methods. The first is the operation method shown in the following figure: First select All (Control + A), right click, and choose Remove >>> “All Sessions”. For specific operations, see the following figure (Figure 17) :

The second method, directly up the GIF (Figure 18) :

2.3.2 Setting the Browser Proxy

We need to configure our browser to use a proxy to access the Internet. How do we set this up?

Here we need a Chrome plug-in called Proxy SwitchyOmega. Normally you would add the extension from the Chrome Web Store (chrome.google.com/webstore/ca…) by searching for Proxy SwitchyOmega, but since many readers cannot access it, here are alternative download links:

Operation video: www.aiyc.top/archives/45…

Recommended download address: www.aiyc.top/archives/45…

Download at github.com/FelisCatus/…

Download address 2:proxy-switchyomega.com/download/

Choose any of the download links. Installation is also very simple: just drag and drop the file onto the browser. After a successful installation, the icon generally appears in the following position (Figure 19):


Click on the plugin in the red box in the image above and you will see something like the one below (Figure 20) :

Click on “Options” and the following page will appear (Figure 21) :

Click “New Scene mode” and you will see the following interface (Figure 22) :

Please refer to the following picture to fill in the configuration information, and then click “Create” (Figure 23) :

After clicking Create, you will see the following interface (Figure 24) :

Select HTTP as the proxy protocol, enter 127.0.0.1 as the proxy server, and set the proxy port to the one you configured in Fiddler, 8889 in our case.

Be sure to click “Apply Options” when you’re done, otherwise you won’t be able to save this scenario. Then we can change the browser to Fiddler’s scenario mode, which means we can grab data with Fiddler. The operation is as follows (Figure 26) :

That’s all you need to do, but it’s important to note that you need to restart Fiddler after you’ve configured it or you’ll see that it’s not connected to the Internet (figure 27) :

After restarting the application, we can clear the list of Fiddler sessions and go to: www.aiyc.top/

From the above operations, we can see that we have successfully captured the data, so how to look at the information inside?

We can directly click on the corresponding data and operate as shown in the following GIF (Figure 29) :

Above, we choose the Inspectors and click Headers. We usually use Raw to check. We can also check the data returned by our website, and the operations are as follows (Figure 30) :

Later I will show you how to configure the phone so we can capture App packets; that needs some setup too. But before covering the setup, let’s look at why it is needed. First clear the session list, request the Baidu website, and check the returned data. The operation is shown in the GIF below (Figure 31):

As you can see, the data returned to me is garbled, as shown in figure 32:

So what’s the solution?

Method 1: Click the hint in the position as shown in the following figure (Figure 33) to solve the problem:

However, in method 1, every time we encounter garbled characters, we have to click again, which is obviously not what we want, we can set one more step, as shown in the following figure (Figure 34) :

That should solve the problem.
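Under the hood, the garbled bytes are just gzip-compressed text, so both fixes above simply ask Fiddler to decompress the body. A small sketch of the same idea in Python (the payload is a made-up stand-in for a real response):

```python
import gzip

# Simulate what Fiddler's decode step does: the "garbled" bytes are
# gzip-compressed text, and decompressing restores the readable response.
compressed = gzip.compress('{"state":"success"}'.encode("utf-8"))
plain = gzip.decompress(compressed).decode("utf-8")
print(plain)
```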

2.3.3 Sending custom Requests

In the Composer TAB on the right, we can customize the request, either by writing one manually or by dragging one from the session on the left.

We just need to write a simple URL, and of course we can also customize things such as the User-Agent.
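The same idea can be sketched in Python with the standard library: build a request with a custom User-Agent without sending it. The URL and agent string below are placeholders.

```python
from urllib.request import Request

# Build (but do not send) a request with a custom User-Agent, mirroring what
# Fiddler's Composer tab lets you do interactively.
req = Request("http://example.com/", headers={"User-Agent": "my-custom-agent/1.0"})
print(req.get_method(), req.full_url)
# urllib capitalizes stored header keys, hence "User-agent" here.
print(req.get_header("User-agent"))
```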

3. Capturing the Douguo Food App

3.1 Goals

  1. Analyze the Douguo Food data packets
  2. Fetch data with a Python thread pool (multi-threading)
  3. Hide the crawler behind proxy IPs
  4. Save the data to MongoDB

Open Fiddler, then open Douguo Food. We need to capture the recipe categories of Douguo Food.

As you operate the App, scrolling or swiping on the right, the capture on the left keeps collecting packets; you can of course choose to capture other data. Here we navigate: Recipes -> Vegetables -> Potatoes -> Most Learned;

The next step is to analyze the captured data packets, to see if there is any data interface I can request and need, and to confirm whether the server will return the corresponding data to us.

3.2 Data Packet Analysis

Through the operations above we captured a lot of packets. Next we analyze them to lock onto the ones we need. These packets actually follow some rules; for Douguo Food, for example, its data requests must go to its main host (its primary domain name);

So we can use Find: press Control + F, or choose Edit -> Find Sessions from the menu bar. After that, the interface looks like this:

  1. Find: the search text
  2. Options:
    1. Search: where to search [Requests and responses, Requests only, Responses only, URLs only];
    2. Examine: which parts to examine [headers and bodies, headers only, bodies only];
    3. Match case: case sensitive;
    4. Regular Expression: treat the search text as a regular expression;
    5. Search binaries: search binary bodies as well;
    6. Decode compressed content: decode compressed content before searching;
    7. Search only selected sessions: search only the selected sessions;
    8. Select matches: select the matching sessions in the list;
    9. Unmark old results: clear the marks from the previous search (e.g. on a second search, remove the previously applied color);
    10. Result Highlight: the highlight color for matches.

Searching for api.douguo.net, we can see that all packets containing api.douguo.net turn yellow.

Then I click into one of the highlighted results and find the following:

But it turns out that this packet is just one of the list packets; it is not the “Most Learned” data we are looking for.

I’ll continue the analysis to see which ones match our data package:

I look at the second page, because we scrolled (swiped the screen) earlier:

So we know that the App returns 20 recipes per page.
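Based on that observation, a crawler pages through results in steps of 20. A minimal sketch (the helper and its name are illustrative; the App’s actual paging parameter is not shown here):

```python
# The capture showed 20 recipes per page, so successive requests step the
# offset in units of 20.
PAGE_SIZE = 20

def page_offsets(pages):
    """Return the starting offset of each page: 0, 20, 40, ..."""
    return [page * PAGE_SIZE for page in range(pages)]

print(page_offsets(3))  # -> [0, 20, 40]
```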

Note: If you find the corresponding JSON data above and it doesn’t look normal, use the browser or another tool to format the JSON data.

Online JSON formatting tool: c.runoob.com/front-end/5…
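If you prefer to stay offline, Python’s json module can do the same formatting; the payload below is a trimmed stand-in for the real response:

```python
import json

# Pretty-print a compact JSON string locally instead of using the online tool.
raw = '{"state":"success","result":{"cs":[{"name":"potato","cs":[]}]}}'
pretty = json.dumps(json.loads(raw), indent=2, ensure_ascii=False)
print(pretty)
```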

The effect is as follows:

Original:

After formatting:

Finally we found the data interface.

I extracted the data; you can read it directly via this link: gitee.com/aiycgit/Stu…

3.3 Write crawler code

3.3.1 Construct the request header

Here we turn the headers Fiddler captured into a dictionary, using a regular-expression find-and-replace.

Find:

```
(.*?): (.*)
```

Replace with:

```
"$1":"$2",
```

The find-and-replace shortcut in Sublime is Control + H; other editors have their own equivalents.

Replace All:

```python
"client":"4",
"version":"6962.2",
"device":"MI 5",
"sdk":"22,5.1.1",
"channel":"qqkp",
"resolution":"1920*1080",
"display-resolution":"1920*1080",
"dpi":"2.0",
"android-id":"1f0189836f997940",
"pseudo-id":"9836f9979401f018",
"brand":"Xiaomi",
"scale":"2.0",
"timezone":"28800",
"language":"zh",
"cns":"2",
"carrier":"CMCC",
"imsi":"460071940762499",
"User-Agent":"Mozilla/5.0 (Linux; Android 5.1.1; MI 5 Build/NRD90M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/74.0.3729.136 Mobile Safari/537.36",
"act-code":"1588254561",
"act-timestamp":"1588254561",
"uuid":"0c576456-a996-45c5-b351-f57431558133",
"battery-level":"0.97",
"battery-state":"3",
"mac":"F0:18:98:36:F9:97",
"imei":"355757944762497",
"terms-accepted":"1",
"newbie":"0",
"reach":"10000",
"Content-Type":"application/x-www-form-urlencoded; charset=utf-8",
"Accept-Encoding":"gzip, deflate",
"Connection":"Keep-Alive",
"Cookie":"duid=64268418",
"Host":"api.douguo.net",
"Content-Length":"179",
```
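The same find-and-replace can be done programmatically with Python’s re module; the sample header lines below are a small excerpt of the captured request:

```python
import re

# Same pattern as the Sublime find/replace: (.*?): (.*)  ->  "$1":"$2",
raw = """client: 4
version: 6962.2
Host: api.douguo.net"""

converted = re.sub(r"(.*?): (.*)", r'"\1":"\2",', raw)
print(converted)
```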

As for which headers can be optimized away, that is not something we can decide by intuition; we need to capture more packets and test to reach a conclusion.

And, of course, constantly testing, commenting and uncommenting, and then requesting to see if the data is returned properly.

After a lot of testing I found that the IMEI cannot be removed. So what is this IMEI?

Open the settings of our Nox emulator, and we can find the following IMEI setting:

This IMEI is generated randomly, and we need to keep it;

We can comment out the MAC address, and likewise the android-id and pseudo-id;

We also comment out the IMSI and the Cookie: if you always carry the same Cookie, the server can easily spot you, and once the Cookie expires (becomes invalid) the request fails anyway. Content-Length can be commented out as well. With that, a forged headers is done. I have tested the fields commented out above; if you find others, comment them out yourself or leave a comment below.

3.3.2 Construct Data for all columns (index_data)

Copy it and construct it into a dictionary.

3.3.2 Analyze all columns & compile crawlers

Next we write the request. Note that the proxy should be added only after the plain request works, that is, after the test succeeds; otherwise you won’t know whether a failure came from your proxy or from the request itself.

To capture all the data under Vegetables, I analyze the home page, and in doing so I find this API:

```
http://api.douguo.net/recipe/flatcatalogs
```

Why this?

Let’s take a look at its JSON:

From the above, we can conclude that this is the home page data we need.

The next step is to write the code. Finally, we write the following code:

```python
"""
project = 'Code', file_name = 'spider_Dougu.py', author = 'AI Yue Chuang'
time = '2020/5/21 11:13', product_name = PyCharm
Code is far away from bugs with the God animal protecting.
I love animals. They taste delicious.
"""
# Import request libraries
import json
import requests
from urllib import parse
from requests.exceptions import RequestException

# Analysis notes:
# - request type: POST;
# - what changes: the URL and the POST data;
# - what stays the same: the headers.

HEADERS = {
    "client": "4",
    "version": "6962.2",
    "device": "MI 5",
    "sdk": "22,5.1.1",
    "channel": "qqkp",
    "resolution": "1920*1080",
    "display-resolution": "1920*1080",
    "dpi": "2.0",
    # "android-id": "1f0189836f997940",
    # "pseudo-id": "9836f9979401f018",
    "brand": "Xiaomi",
    "scale": "2.0",
    "timezone": "28800",
    "language": "zh",
    "cns": "2",
    "carrier": "CMCC",
    # "imsi": "460071940762499",
    "User-Agent": "Mozilla/5.0 (Linux; Android 5.1.1; MI 5 Build/NRD90M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/74.0.3729.136 Mobile Safari/537.36",
    "act-code": "1588254561",
    "act-timestamp": "1588254561",
    "uuid": "0c576456-a996-45c5-b351-f57431558133",
    "battery-level": "0.97",
    "battery-state": "3",
    # "mac": "F0:18:98:36:F9:97",
    "imei": "355757944762497",
    "terms-accepted": "1",
    "newbie": "0",
    "reach": "10000",
    "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "Keep-Alive",
    # "Cookie": "duid=64268418",
    "Host": "api.douguo.net",
    # "Content-Length": "179",
}

# print(parse.unquote('%E5%9C%9F%E8%B1%86'))


class DouGuo_Crawler:
    def __init__(self):
        self.headers = HEADERS
        # self.url = DETAIL_API_URL
        self.index_data = {
            "client": "4",
            # "_session": "1590028187496355757944762497",
            # "v": "1590023768",  # timestamp: the time you requested the data
            "_vs": "2305",
            "sign_ran": "94b96c996a4a222540e435ee06ef133a",
            "code": "b366f93caa7bac04",
        }

    def request(self, url, data=None):
        try:
            response = requests.post(url=url, headers=self.headers, data=data)
            if response.status_code == 200:
                return response
            return None
        except RequestException:
            return None

    def main(self):
        url = 'http://api.douguo.net/recipe/flatcatalogs'
        html = self.request(url=url, data=self.index_data)
        print(html.text)


if __name__ == '__main__':
    douguo = DouGuo_Crawler().main()
```

Running result:

```
{"state":"success","result":{"nv":"1590045372","cs":[{"name":"\u70ed\u95e8","id":"1","ju":"recipes:\/\/www.douguo.com\/search?key=\u70ed\u95e8&_vs=400","cs":[{"name":"\u5bb6\u5e38\u83dc","id":"18","ju":"recipes:\/\/www.douguo.com\/search?key=\u5bb6\u5e38\u83dc&_vs=400","cs":[{"name":"\u7ea2\u70e7\u8089","id":"5879","ju":"recipes:\/\/www.douguo.com\/search?key=\u7ea2\u70e7\u8089&_vs=400","cs":[],"image_url":""},{"name":"\u53ef\u4e50\u9e21\u7fc5","id":"5880","ju":"recipes:\/\/www.douguo.com\/search?key=\u53ef\u4e50\u9e21\u7fc5&_vs=400","cs":[],"image_url":""},{"name":"\u7cd6\u918b\u6392\u9aa8","id":"5881","ju":"recipes:\/\/www.douguo.com\/search?key=\u7cd6\u918b\u6392\u9aa8&_vs=400","cs":[],"image_url":""},{"name":"\u9c7c\u9999\u8089\u4e1d","id":"5882","ju":"recipes:\/\/www.douguo.com\/search?key=\u9c7c\u9999\u8089\u4e1d&_vs=400","cs":[],"image_url":""},{"name":"\u5bab\u4fdd\u9e21\u4e01","id":"5883","ju":"recipes:\/\/www.douguo.com\/search?key=\u5bab\u4fdd\u9e21\u4e01&_vs=400","cs":[],"image_url":""},{"name":"\u7ea2\u70e7\u6392\u9aa8","id":"5884","ju":"recipes:\/\/www.douguo.com\/search?key=\u7ea2\u70e7\u6392\u9aa8&_vs=400","cs":[],"image_url":""},{"name":"\u54b8","id":"478","ju":"recipes:\/\/www.douguo.com\/search?key=\u54b8&_vs=400","cs":[],"image_url":""},{"name":"\u81ed\u5473","id":"483","ju":"recipes:\/\/www.douguo.com\/search?key=\u81ed\u5473&_vs=400","cs":[],"image_url":""},{"name":"\u82e6","id":"479","ju":"recipes:\/\/www.douguo.com\/search?key=\u82e6&_vs=400","cs":[],"image_url":""},{"name":"\u9c9c","id":"5878","ju":"recipes:\/\/www.douguo.com\/user?id=23162809&tab=1","cs":[],"image_url":""}],"image_url":""}],"image_url":""}],"ads":[{"dsp":
{"id":"ad5671","pid":"6080244867596723","ch":1,"url":"","i":"","cap":"\u5e7f\u544a","position":"1recipecategory","query":"","client_ip":"","req_min_i":60,"channel":"","media_type":0,"request_count":1,"max_impression_count":0,"name":"\u840c\u840c\u54d2\u63a8\u5e7f\u5458","logo":"http:\/\/i1.douguo.net\/upload\/\/photo\/2\/1\/1\/70_215540a4a78ff0584e3f41d2de1a7191.jpg","user":{"nick":"\u840c\u840c\u54d2\u63a8\u5e7f\u5458","user_photo":"http:\/\/i1.douguo.net\/upload\/\/photo\/2\/1\/1\/70_215540a4a78ff0584e3f41d2de1a7191.jpg","lvl":7,"user_id":23010754},"show":0,"ximage":"","canclose":0},"cid":"18"}}}]
```

The output is long, so part of it is omitted here. Format the result as JSON to inspect the specific content.

3.3.3 Parse the Json captured by all columns

Next, we parse the captured JSON data. Before parsing, we analyze it: all the data we want is actually inside the cs field of result, as shown in the figure below:

After folding, we can see that there are 14 columns, and the corresponding Json is also 14, as shown in the figure below:

This is the data we are going to iterate over. Let’s look at the structure again:

result -> cs -> cs -> cs

```python
import json

def parse_flatcatalogs(self, html):
    """
    :param html: the text to parse
    :return:
    """
    json_content = json.loads(html)['result']['cs']
    for index_items in json_content:
        for index_item in index_items['cs']:
            for item in index_item['cs']:
                print(item)
```

The running results are as follows:

```
C:\Users\clela\AppData\Local\Programs\Python\Python37\python.exe "C:/Code/Pycharm_daima/App/Spider_Douguo.py"
{'name': 'Braised pork in brown sauce', 'id': '5915', 'ju': 'recipes://www.douguo.com/search?key=braise in soy sauce meat&_vs=400', 'cs': [], 'image_url': ''}
{'name': 'Coke Chicken Wings', 'id': '5916', 'ju': 'recipes://www.douguo.com/search?key=coke chicken wings&_vs=400', 'cs': [], 'image_url': ''}
{'name': 'Sweet and Sour Pork ribs', 'id': '5917', 'ju': 'recipes://www.douguo.com/search?key=sweet and sour spareribs&_vs=400', 'cs': [], 'image_url': ''}
{'name': 'Shredded pork with Fish aroma', 'id': '5918', 'ju': 'recipes://www.douguo.com/search?key=braised pork&_vs=400', 'cs': [], 'image_url': ''}
{'name': 'Kung Pao Chicken', 'id': '5919', 'ju': 'recipes://www.douguo.com/search?key=kung pao chicken&_vs=400', 'cs': [], 'image_url': ''}
{'name': 'Braised spareribs', 'id': '5920', 'ju': 'recipes://www.douguo.com/search?key=braised spareribs&_vs=400', 'cs': [], 'image_url': ''}
{'name': 'taste', 'id': '482', 'ju': 'recipes://www.douguo.com/search?key=odor&_vs=400', 'cs': [], 'image_url': ''}
{'name': 'bitter', 'id': '479', 'ju': 'recipes://www.douguo.com/search?key=bitter&_vs=400', 'cs': [], 'image_url': ''}
{'name': 'fresh', 'id': '5878', 'ju': 'recipes://www.douguo.com/user?id=23162809&tab=1', 'cs': [], 'image_url': ''}
```

The output result is too much, so some of the results are omitted here.

3.3.4 Write Data for the details page

So what are we going to do?

We want to get the details of every column, but we will start with the first one, potatoes. I copy the captured potato packet to analyze it:

```
client=4&_session=1590129789130355757944762497&keyword=%E5%9C%9F%E8%B1%86&order=0&_vs=11102&type=0&auto_play_mode=2&sign_ran=c7d28e4120793b7687e6f13b0e0df1d8&code=9a4596275dd6d8c1
```

Why keep only the request body here? The request headers have already been analyzed and constructed, the URL needs no further analysis, and the request method is known to be POST, so the body is what’s left to analyze. There are two ways to decode it; the first is Python code, as follows:

```python
from urllib import parse

str_txt = 'client=4&_session=1590129789130355757944762497&keyword=%E5%9C%9F%E8%B1%86&order=0&_vs=11102&type=0&auto_play_mode=2&sign_ran=c7d28e4120793b7687e6f13b0e0df1d8&code=9a4596275dd6d8c1'

result = parse.unquote(str_txt)
print(result)
```

Running result:

```
client=4&_session=1590129789130355757944762497&keyword=土豆&order=0&_vs=11102&type=0&auto_play_mode=2&sign_ran=c7d28e4120793b7687e6f13b0e0df1d8&code=9a4596275dd6d8c1
```

Method 2: Using Fiddler

3.3.5 Why not construct the URL (API)?

Now, some of you might ask: why analyze only the request body and not the API itself?

Because in the Douguo App, as in many other apps, different requests share the same endpoint: once you have found the interface, changing the data you send changes the data it sends back. You can verify this by picking another category and capturing packets. For example, I chose Seafood -> Shrimp:

Then open a recipe and tap "Learn to make" for more:

Finally, scroll the screen while capturing packets. The captured result is as follows:

Format all results:

That is why we only need to look at the request body. Next, I'll turn the decoded body into a dictionary; the idea was shown above, so I won't belabor it here.
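As a quick sketch, rather than reading the decoded string by eye, `urllib.parse.parse_qsl` can decode the percent-encoding and split the key=value pairs into a dictionary in one step (the body below is the one captured above):

```python
from urllib import parse

# The POST body captured above (keyword is the percent-encoded Chinese for "potato")
body = ('client=4&_session=1590129789130355757944762497'
        '&keyword=%E5%9C%9F%E8%B1%86&order=0&_vs=11102&type=0'
        '&auto_play_mode=2&sign_ran=c7d28e4120793b7687e6f13b0e0df1d8'
        '&code=9a4596275dd6d8c1')

# parse_qsl decodes the percent-encoding and splits the pairs in one step
data = dict(parse.parse_qsl(body))
print(data['keyword'])  # -> 土豆 (potato)
```

This gives us the dictionary directly, instead of decoding with `unquote` and then retyping the pairs by hand.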

3.3.6 Construct data into a dictionary

The constructed result is as follows:

"client":"4".# "_session":"1590129789130355757944762497",
"keyword":"Potato"."order":"0"."_vs":"11102"."type":"0"."auto_play_mode":"2"."sign_ran":"c7d28e4120793b7687e6f13b0e0df1d8"."code":"9a4596275dd6d8c1".Copy the code

And write code like this:

def __init__(self):
	self.headers = HEADERS
	# self.url = DETAIL_API_URL
	self.index_data = {
		"client": "4",
		# "_session": "1590028187496355757944762497",
		# "v": "1590023768",  # timestamp, i.e. the time you requested the data
		"_vs": "2305",
		"sign_ran": "94b96c996a4a222540e435ee06ef133a",
		"code": "b366f93caa7bac04",
	}
	self.detail_data = {
		"client": "4",
		# "_session": "1590129789130355757944762497",
		"keyword": "Potato",
		"order": "0",
		"_vs": "11102",
		"type": "0",
		"auto_play_mode": "2",
		"sign_ran": "c7d28e4120793b7687e6f13b0e0df1d8",
		"code": "9a4596275dd6d8c1",
	}

We need to make keyword in self.detail_data variable. Of course, if the data above can be optimized further, you are welcome to click "Read the original article" and leave me a message; App crawlers for Douyin and others will follow later.

So, change it to the following code:

self.detail_data = {
	"client": "4",
	# "_session": "1590129789130355757944762497",
	"keyword": None,
	"order": "0",
	"_vs": "11102",
	"type": "0",
	"auto_play_mode": "2",
	"sign_ran": "c7d28e4120793b7687e6f13b0e0df1d8",
	"code": "9a4596275dd6d8c1",
}

Next, modify the function: parse_flatcatalogs

def parse_flatcatalogs(self, html):
	"""
	:param html: text to parse
	:return:
	"""
	json_content = json.loads(html)['result']['cs']
	for index_items in json_content:
		for index_item in index_items['cs']:
			for item in index_item['cs']:
				self.detail_data['keyword'] = item['name']
				print(self.detail_data)
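To see why three nested loops are needed, here is a minimal made-up response with the same result -> cs -> cs -> cs nesting as the flatcatalogs data (the category names are invented for the demo):

```python
import json

# Invented sample with the same nesting as the real flatcatalogs response
html = json.dumps({
    "result": {"cs": [
        {"name": "Popular", "cs": [
            {"name": "Home dishes", "cs": [
                {"name": "Braised pork"},
                {"name": "Kung Pao Chicken"},
            ]},
        ]},
    ]}
})

# The innermost 'cs' entries are the dish names we want as search keywords
keywords = []
for index_items in json.loads(html)['result']['cs']:
    for index_item in index_items['cs']:
        for item in index_item['cs']:
            keywords.append(item['name'])
print(keywords)  # -> ['Braised pork', 'Kung Pao Chicken']
```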

The result is as follows (it is long, so part of it is omitted here):

{'client': '4', 'keyword': 'Braised pork in brown sauce', 'order': '0', '_vs': '11102', 'type': '0', 'auto_play_mode': '2', 'sign_ran': 'c7d28e4120793b7687e6f13b0e0df1d8', 'code': '9a4596275dd6d8c1'}
{'client': '4', 'keyword': 'Coke Chicken Wings', 'order': '0', '_vs': '11102', 'type': '0', 'auto_play_mode': '2', 'sign_ran': 'c7d28e4120793b7687e6f13b0e0df1d8', 'code': '9a4596275dd6d8c1'}
{'client': '4', 'keyword': 'Sweet and Sour Pork ribs', 'order': '0', '_vs': '11102', 'type': '0', 'auto_play_mode': '2', 'sign_ran': 'c7d28e4120793b7687e6f13b0e0df1d8', 'code': '9a4596275dd6d8c1'}
{'client': '4', 'keyword': 'Shredded pork with fish aroma', 'order': '0', '_vs': '11102', 'type': '0', 'auto_play_mode': '2', 'sign_ran': 'c7d28e4120793b7687e6f13b0e0df1d8', 'code': '9a4596275dd6d8c1'}
{'client': '4', 'keyword': 'Kung Pao Chicken', 'order': '0', '_vs': '11102', 'type': '0', 'auto_play_mode': '2', 'sign_ran': 'c7d28e4120793b7687e6f13b0e0df1d8', 'code': '9a4596275dd6d8c1'}

3.3.7 Construct queue & Submit data to queue

We have successfully constructed the data. Next we want to use multithreading, which requires a queue, so we import the following:

from multiprocessing import Queue

# Create queue
queue_list = Queue()

With the queue created, we can submit each constructed self.detail_data to queue_list with put():

def parse_flatcatalogs(self, html):
	"""
	:param html: text to parse
	:return:
	"""
	json_content = json.loads(html)['result']['cs']
	for index_items in json_content:
		for index_item in index_items['cs']:
			for item in index_item['cs']:
				self.detail_data['keyword'] = item['name']
				# print(self.detail_data)
				# Put a copy, so mutating 'keyword' on the next loop iteration
				# cannot affect items already queued
				queue_list.put(dict(self.detail_data))

	print(queue_list.qsize())  # Check how many items are in the queue
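Multithreaded consumption is implemented later; as a hedged sketch, worker threads could drain this queue like so (the keywords here are made up, and the real crawler would call the_recipe() instead of appending to a list):

```python
import threading
from multiprocessing import Queue
from queue import Empty

queue_list = Queue()
for kw in ['potato', 'tomato', 'eggplant']:  # made-up keywords for the demo
    queue_list.put({'keyword': kw})

results = []

def worker():
    while True:
        try:
            # Give up once the queue stays empty for a second
            data = queue_list.get(timeout=1)
        except Empty:
            break
        results.append(data['keyword'])

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # -> ['eggplant', 'potato', 'tomato']
```

Using get(timeout=...) with the Empty exception avoids the race where one thread sees a non-empty queue just before another thread takes the last item.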

3.3.8 Obtaining the Recipe List

Previously, we obtained all the ingredient categories. The next step is to obtain the recipes for each ingredient (the bulk of the work is already done). Select the recipe-search API, as shown below:

recipe_api = 'http://api.douguo.net/recipe/v2/search/0/20'  # Test with the first 20 items
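A side note on the URL shape: the trailing 0/20 looks like offset/page-size (this is an assumption from the pattern; the tutorial itself only fetches the first 20 items). Under that assumption, paging through more results would just mean generating successive offsets:

```python
def recipe_page_urls(pages=3, page_size=20):
    # Assumption: /search/{offset}/{page_size} - not confirmed by the tutorial
    base = 'http://api.douguo.net/recipe/v2/search/{offset}/{size}'
    return [base.format(offset=i * page_size, size=page_size) for i in range(pages)]

for url in recipe_page_urls():
    print(url)  # .../search/0/20, then .../search/20/20, .../search/40/20
```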

Write a function:

def the_recipe(self, data):
	"""
	Fetch recipe data
	:return:
	"""
	print(f"Currently processing ingredient: {data['keyword']}")
	recipe_api = 'http://api.douguo.net/recipe/v2/search/0/20'  # Test with the first 20 items
	response = self.request(url=recipe_api, data=data).text
	# return response
	print(response)

Modify the main function to test the modifications:

def main(self):
	# url = 'http://api.douguo.net/recipe/v2/search/0/20'
	url = 'http://api.douguo.net/recipe/flatcatalogs'
	html = self.request(url=url, data=self.index_data).text
	# print(html)
	self.parse_flatcatalogs(html)
	# print(queue_list.get())
	self.the_recipe(data=queue_list.get())

Test run result (truncated, since there is a lot of output):

Currently processing ingredient: Coke chicken wings
{"state":"success","result":{"sts":["\u53ef\u4e50\u9e21\u7fc5"],"hidden_sorting_tags":0,"list":[{"ju":"recipes:\/\/www.douguo.com\/details?id=206667","type":13,"r":{"stc":0,"sti":0,"an":"stta\u5c0f\u94ed","id":206667,"cookstory":"","n":"\u53ef\u4e50\u9e21\u7fc5[\u7b80\u5355\u5230\u6ca1\u4e0b\u8fc7\u53a8\u4e5f\u4f1a\u505a]","img":"https:\/\/cp1.douguo.com\/upload\/caiku\/8\/f\/c\/300_8f6d0f8b7195c0397c92fe39643761fc.jpg","dc":16948,"fc":34462,"ecs":1,"hq":0,"a":{"id":2042182,"n":"stta\u5c0f\u94ed","v":1,"verified_image":"","progress_image":"","p":"http:\/\/tx1.douguo.net\/upload\/photo\/4\/0\/3\/70_u49617908834149171436.jpg","lvl":6,"is_prime":false,"lv":0},"p":"https:\/\/cp1.douguo.com\/upload\/caiku\/8\/f\/c\/600_8f6d0f8b7195c0397c92fe39643761fc.jpg","p_gif":"https:\/\/cp1.douguo.com\/upload\/caiku\/8\/f\/c\/600_8f6d0f8b7195c0397c92fe39643761fc.jpg","cook_difficulty":"\u5207\u58a9(\u521d\u7ea7)","cook_time":"10~30\u5206\u949f","tags":[{"t":"\u665a\u9910"},{"t":"\u5348\u9910"},{"t":"\u7092\u9505"},{"t":"\u9505\u5177"},{"t":"\u70ed\u83dc"},{"t":"\u54b8\u751c"}, ...

Of course, we can pretty-print this data as JSON to make it easier to read, as follows:

For comparison, here is the corresponding most-made recipe, Coke Chicken Wings, in the App:

The data capture looks correct. We can click through to check the details; since the first recipe has no cook story, I open the second one:

Let's open the third one to verify as well, though its contents are not shown here:

Since the third had nothing either, we click the fourth; the fourth has nothing, but the fifth does. The comparison is as follows:

So our request works. Next we parse the returned data; the fields we need are as follows:

def the_recipe(self, data):
	"""
	Fetch recipe data
	:return:
	"""
	print(f"Currently processing ingredient: {data['keyword']}")
	recipe_api = 'http://api.douguo.net/recipe/v2/search/0/20'  # Test with the first 20 items
	response = self.request(url=recipe_api, data=data).text
	# print(response)
	recipe_list = json.loads(response)['result']['list']
	# print(type(recipe_list))
	for item in recipe_list:
		print(item)

Part of the output (there is a lot of data, so some is omitted):

Currently processing ingredient: three fresh
{... 'p': 'https://cp1.douguo.com/upload/caiku/5/0/4/600_50d4ffc003fd4b43449cb0d8ed902174.jpg', 'p_gif': 'https://cp1.douguo.com/upload/caiku/5/0/4/600_50d4ffc003fd4b43449cb0d8ed902174.jpg', 'cook_difficulty': 'Cut pier (primary)', 'cook_time': '10 to 30 minutes', 'tags': [{'t': 'Northeastern cuisine'}, {'t': 'blast'}, {'t': 'fire'}, {'t': 'salt'}, {'t': 'goes with rice'}, {'t': 'Fast dish'}, {'t': 'hot'}, {'t': 'vegetable'}, {'t': 'POTS'}, {'t': 'frying pan'}, {'t': 'lunch'}, {'t': 'dinner'}, {'t': 'Chinese food'}], 'vc': '15733890', 'recommend_label': '1,495 people have done it recently.', 'display_ingredient': 1, 'major': [{'note': 'moderation', 'title': 'eggplant'}, {'note': 'one', 'title': 'potato'}, {'note': 'One of each', 'title': 'Green and red pepper'}, {'note': 'moderation', 'title': 'Onion, ginger, garlic'}], 'au': 'recipes://www.douguo.com/details?id=975713', 'pw': 799, 'ph': 450, 'rate': 4.7, 'recommendation_tag': '1,495 people have done it.'}}
{'ju': 'recipes://www.douguo.com/details?id=1396536', 'type': 13, 'r': {'stc': 0, 'sti': 0, 'an': 'Light years away - Love is like the first sight', 'id': 1396536, 'cookstory': 'This dish is my specialty. When I first learned it, my husband tried to eat it every day. I made it several times and finally succeeded. ...', 'n': 'Three fresh dishes can make a restaurant taste at home', 'img': 'https://cp1.douguo.com/upload/caiku/7/1/5/300_71c899c8795448414aaf426aa910a335.jpg', 'dc': 784, 'fc': 84217, 'ecs': 0, 'hq': 0, 'a': {'id': 10146188, 'n': 'Light years away - Love is like the first sight', 'v': 1, 'verified_image': '', 'progress_image': '', 'p': 'http://tx1.douguo.net/upload/photo/1/0/e/70_u3574750219030220248.jpg', 'lvl': 7, 'is_prime': False, 'lv': 0}, 'p': 'https://cp1.douguo.com/upload/caiku/7/1/5/600_71c899c8795448414aaf426aa910a335.jpg', 'p_gif': 'https://cp1.douguo.com/upload/caiku/7/1/5/600_71c899c8795448414aaf426aa910a335.jpg', 'cook_difficulty': 'Cut pier (primary)', 'cook_time': '10 to 30 minutes', 'tags': [{'t': 'salty-fresh'}, {'t': 'salt'}, {'t': 'fire'}, {'t': 'Northeastern cuisine'}, {'t': 'Chinese food'}], 'vc': '2967774', 'recommend_label': '784 people have done it recently.', 'display_ingredient': 1, 'major': [{'note': 'one', 'title': 'Long eggplant'}, {'note': 'one', 'title': 'potato'}, {'note': 'one', 'title': 'green pepper'}, {'note': 'a third', 'title': 'pepper'}, {'note': 'some', 'title': 'garlic'}, {'note': 'two', 'title': 'scallions'}, {'note': 'moderation', 'title': 'Peanut oil'}, {'note': 'a spoonful', 'title': 'light soy sauce'}, {'note': '1/3 spoon', 'title': 'Braised soy sauce'}, {'note': 'two tablespoons', 'title': 'starch'}, {'note': 'half spoon', 'title': 'vinegar'}, {'note': 'a spoonful', 'title': 'sugar'}, {'note': 'moderation', 'title': 'salt'}, {'note': 'half bowl', 'title': 'water'}], 'au': 'recipes://www.douguo.com/details?id=1396536', 'pw': 800, 'ph': 800, 'rate': 4.8, 'recommendation_tag': '784 people did it.'}}

However, the list also contains advertisements, which we will filter out next.
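A tiny sketch of how a type check separates recipes from ads (the items below are made up; only the type field and the 'r' payload matter):

```python
# Made-up search results: type 13 entries are real recipes, others are ads
items = [
    {'type': 13, 'r': {'n': 'Coke Chicken Wings'}},
    {'type': 2, 'banner': 'some advertisement'},
    {'type': 13, 'r': {'n': 'Braised pork'}},
]

recipes = [item['r']['n'] for item in items if item['type'] == 13]
print(recipes)  # -> ['Coke Chicken Wings', 'Braised pork']
```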

Next, collect the data we want into a dictionary. By inspection, entries whose type is 13 are the recipes we need, so we extract the required fields from those:

So the code looks like this:

def the_recipe(self, data):
	"""
	Fetch recipe data
	:return:
	"""
	print(f"Currently processing ingredient: {data['keyword']}")
	recipe_api = 'http://api.douguo.net/recipe/v2/search/0/20'  # Test with the first 20 items
	response = self.request(url=recipe_api, data=data).text
	# print(response)
	recipe_list = json.loads(response)['result']['list']
	# print(type(recipe_list))
	for item in recipe_list:
		# recipe_info: recipe information
		recipe_info = {}
		recipe_info['ingredients'] = data['keyword']
		if item['type'] == 13:
			recipe_info['user_name'] = item['r']['an']  # user name
			recipe_info['recipe_id'] = item['r']['id']
			recipe_info['cookstory'] = item['r']['cookstory'].strip()
			recipe_info['recipe_name'] = item['r']['n']
			recipe_info['major'] = item['r']['major']
			print(recipe_info)
		else:
			continue

Note: from my earlier analysis, the id extracted above is not a user id but the id of the recipe itself. Why keep it? Because we need it to request another API.

The data we have captured so far contains no cooking steps; you can compare for yourself.

So where do the cooking steps come from?

In other words, there is a request we have not captured yet. When we open a recipe's steps in the App and capture again, we will see the packet that carries the cooking method. That is why we keep this id for now.

3.3.9 Obtaining the Cooking Method

Next, open Fiddler again and re-capture packets while the App stays on the recipe's detail page. We captured the following packet:

This is the packet we need:

3.3.9.1 Construct the cooking method URL

Now let's turn this request into crawler code. First, copy the API:

http://api.douguo.net/recipe/detail/1091474

1091474 is the id we saved above. Next we construct the URL:

detaile_cookstory_url = 'http://api.douguo.net/recipe/detail/{id}'.format(id=recipe_info['recipe_id'])

3.3.9.2 Construct data for cooking methods

Next we construct the request data: copy the captured body and decode it:

# Before decoding
client=4&_session=1590191363409355757944762497&author_id=0&_vs=11101&_ext=%7B%22query%22%3A%7B%22kw%22%3A%22%E7%BA%A2%E7%83%A7%E8%82%89%22%2C%22src%22%3A%2211101%22%2C%22idx%22%3A%221%22%2C%22type%22%3A%2213%22%2C%22id%22%3A%221091474%22%7D%7D&is_new_user=1&sign_ran=fa68eee8c3458c196bd65c15c4b06c3b&code=e5cf0cdca6a10f39

# After decoding
client=4&_session=1590191363409355757944762497&author_id=0&_vs=11101&_ext={"query":{"kw":"Braised pork in brown sauce","src":"11101","idx":"1","type":"13","id":"1091474"}}&is_new_user=1&sign_ran=fa68eee8c3458c196bd65c15c4b06c3b&code=e5cf0cdca6a10f39

Optimization:

client=4&_session=1590191363409355757944762497&author_id=0&_vs=11101&_ext={"query":{"kw":"Braised pork in brown sauce","src":"11101","idx":"1","type":"13","id":"1091474"}}

We also construct it into a dictionary:

"client":"4"."_session":"1590191363409355757944762497"."author_id":"0"."_vs":"11101"."_ext":"{"query": {"kw":"Braised Pork","src":"11101","idx":"1","type":"13","id":"1091474"}}".Copy the code

Notice that the _ext value above, "{"query":{"kw":"braised pork","src":"11101","idx":"1","type":"13","id":"1091474"}}", displays abnormally when pasted in:

The syntax is invalid because the value itself contains double quotation marks, so wrap it in single quotation marks instead:

'{"query":{"kw":"braised pork","src":"11101","idx":"1","type":"13","id":"1091474"}}'

Finally our cookstory_data looks like this:

cookstory_data = {
	"client": "4",
	"_session": "1590191363409355757944762497",
	"author_id": "0",
	"_vs": "11101",
	"_ext": '{"query":{"kw":"braised pork","src":"11101","idx":"1","type":"13","id":"1091474"}}',
}

The _session field is not needed (it can be commented out), and the id and kw values change constantly, while the remaining fields stay the same, as you can confirm by capturing packets a few more times. To satisfy the lazier beginners, I randomly pick another recipe and capture the packet again.

3.3.9.3 Comparing which fields in the data stay constant

Let’s look at the data again:

# Before decoding:
client=4&_session=1590361734034355757944762497&author_id=0&_vs=11101&_ext=%7B%22query%22%3A%7B%22kw%22%3A%22%E7%BA%B8%E6%9D%AF%E8%9B%8B%E7%B3%95%22%2C%22src%22%3A%2211101%22%2C%22idx%22%3A%221%22%2C%22type%22%3A%2213%22%2C%22id%22%3A%221204922%22%7D%7D&is_new_user=1&sign_ran=a866cb568112ed9fd859fc2689fa0aaa&code=3ca60fe2ceb34663

# After decoding:
client=4&_session=1590361734034355757944762497&author_id=0&_vs=11101&_ext={"query":{"kw":"Cupcakes","src":"11101","idx":"1","type":"13","id":"1204922"}}&is_new_user=1&sign_ran=a866cb568112ed9fd859fc2689fa0aaa&code=3ca60fe2ceb34663

Compare with the braised pork data above:

# Braised pork
client=4&_session=1590191363409355757944762497&author_id=0&_vs=11101&_ext={"query":{"kw":"Braised pork in brown sauce","src":"11101","idx":"1","type":"13","id":"1091474"}}&is_new_user=1&sign_ran=fa68eee8c3458c196bd65c15c4b06c3b&code=e5cf0cdca6a10f39

# Cupcakes
client=4&_session=1590361734034355757944762497&author_id=0&_vs=11101&_ext={"query":{"kw":"Cupcakes","src":"11101","idx":"1","type":"13","id":"1204922"}}&is_new_user=1&sign_ran=a866cb568112ed9fd859fc2689fa0aaa&code=3ca60fe2ceb34663

As you can see, apart from kw and id, the fields src, type, and so on are unchanged.

Some students who compare on their own may find that the idx value sometimes differs. Let me answer that up front:

A: It may have changed after an App update. Once the packet is captured and rebuilt in Python, some parameters can be commented out and the data can still be requested, so you don't need to care whether idx changes. At most, requesting with 1 gives result A and with 2 gives result B. (In my tests, the data still comes back correctly without modifying idx.)

3.3.9.4 Writing the code

def the_recipe(self, data):
	"""
	Fetch recipe data
	:return:
	"""
	print(f"Currently processing ingredient: {data['keyword']}")
	recipe_api = 'http://api.douguo.net/recipe/v2/search/0/20'  # Test with the first 20 items
	response = self.request(url=recipe_api, data=data).text
	# print(response)
	recipe_list = json.loads(response)['result']['list']
	# print(type(recipe_list))
	for item in recipe_list:
		# recipe_info: recipe information
		recipe_info = {}
		recipe_info['ingredients'] = data['keyword']
		if item['type'] == 13:
			recipe_info['user_name'] = item['r']['an']  # user name
			recipe_info['recipe_id'] = item['r']['id']
			recipe_info['cookstory'] = item['r']['cookstory'].strip()
			recipe_info['recipe_name'] = item['r']['n']
			recipe_info['major'] = item['r']['major']
			# print(recipe_info)
			detaile_cookstory_url = 'http://api.douguo.net/recipe/detail/{id}'.format(id=recipe_info['recipe_id'])
			# print(detaile_cookstory_url)
			cookstory_data = {
				"client": "4",
				# "_session": "1590191363409355757944762497",
				"author_id": "0",
				"_vs": "11101",
				"_ext": '{"query":{"kw":"%s","src":"11101","idx":"1","type":"13","id":"%s"}}' % (str(data['keyword']), str(recipe_info['recipe_id'])),
			}
			recipe_detail_resopnse = self.request(url=detaile_cookstory_url, data=cookstory_data).text
			# print(recipe_detail_resopnse)
			detaile_response_dict = json.loads(recipe_detail_resopnse)
			recipe_info['tips'] = detaile_response_dict['result']['recipe']['tips']
			recipe_info['cook_step'] = detaile_response_dict['result']['recipe']['cookstep']
			logger.info(f'current recipe_info: >>> {recipe_info}')
			# logger.info('current data')
		else:
			continue

The run result is as follows (part of it omitted):

Currently processing ingredient: Coke chicken wings
2020-05-25 16:45:07,613 - __main__ - INFO - current recipe_info: >>> {'ingredients': 'Coke Chicken Wings', 'user_name': 'stta small inscription', 'recipe_id': 206667, 'cookstory': '', 'recipe_name': 'Coke chicken wings [so simple that you can make them without ever cooking]', 'major': [{'note': '10', 'title': 'chicken wings'}, {'note': 'one piece', 'title': 'ginger'}, {'note': 'two', 'title': 'scallions'}, {'note': 'One can', 'title': 'coke'}, {'note': 'Three tablespoons', 'title': 'Soy sauce (extremely fresh)'}], 'tips': 'Chicken wings scored with a knife or fork take on the flavor more easily; marinate in advance if you have time ...', 'cook_step': [{'position': '1', 'content': 'Put some oil in the bottom of the pan and saute the ginger and scallions.', 'thumb': 'https://cp1.douguo.com/upload/caiku/f/6/1/140_f6ebfcafae6e4aa5b7f3e367f26b04b1.jpg', 'image_width': 640, 'image_height': 480, 'image': 'https://cp1.douguo.com/upload/caiku/f/6/1/800_f6ebfcafae6e4aa5b7f3e367f26b04b1.jpg'}, {'position': '2', 'content': 'Stir fry the chicken wings until pale, add the coke and soy sauce, and cover the pot.', 'thumb': 'https://cp1.douguo.com/upload/caiku/d/a/4/140_da0d38fa2c885b8e829d17b1dcbbbd34.jpg', 'image_width': 640, 'image_height': 480, 'image': 'https://cp1.douguo.com/upload/caiku/d/a/4/800_da0d38fa2c885b8e829d17b1dcbbbd34.jpg'}, {'position': '3', 'content': 'Cook over medium heat until the sauce is reduced.', 'thumb': 'https://cp1.douguo.com/upload/caiku/1/b/3/140_1b122a5b97b867d9119caf2274e413c3.jpg', 'image_width': 640, 'image_height': 480, 'image': 'https://cp1.douguo.com/upload/caiku/1/b/3/800_1b122a5b97b867d9119caf2274e413c3.jpg'}]}
2020-05-25 16:45:12,295 - __main__ - INFO - current recipe_info: >>> {'ingredients': 'Coke Chicken Wings', 'user_name': "Leaf's Love and Kitchen", 'recipe_id': 201871, 'recipe_name': 'Coke Chicken Wings (oil free, with ginger)', ...}
2020-05-25 16:45:13,519 - __main__ - INFO - current recipe_info: >>> {'ingredients': 'Coke Chicken Wings', 'user_name': 'Light picker', 'recipe_id': 1070655, 'recipe_name': 'Coke Chicken Wings for the rice cooker', ...}

We can convert the output data into JSON for easy reading, as follows:

json_recipe_info = json.dumps(recipe_info)
logger.info(f'current recipe_info:>>>{json_recipe_info}')
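As a side note, `json.dumps` can also pretty-print. A minimal sketch (the `recipe` dict below is a made-up stand-in for the real `recipe_info`):

```python
import json

# A made-up stand-in for the real recipe_info dict
recipe = {
    'recipe_name': 'Coke Chicken Wings for rice Cooker',
    'major': [{'note': '1 vial (500ML)', 'title': 'coke'}],
}

# ensure_ascii=False keeps non-ASCII text readable instead of escaping it;
# indent=4 spreads nested structures across lines
pretty = json.dumps(recipe, ensure_ascii=False, indent=4)
print(pretty)
```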

3.3.10 Writing of data import logic

So far, three pieces are still missing: saving the data to the database, hiding our IP, and multithreading. For the database part we first need to import this library:

import pymongo

Rather than mixing the storage code into the spider, we will put it in a separate file. Here I create a file called db.py in the project directory.

If you are not familiar with MongoDB in Python, check out the link below: www.aiyc.top/archives/51…

The database code is as follows:

""" project = 'Code', file_name = 'db.py', author = 'AI Charm 'time = '2020/5/26 7:25', product_name = PyCharm, AI Yue Chuang code is far away from bugs with the God animal protecting I love animals. They taste delicious. """
import pymongo
from pymongo.collection import Collection

class Connect_mongo(object) :
	def __init__(self) :
		self.client = pymongo.MongoClient(host='localhost', port=27017)
		self.db = self.client['dou_guo_meishi_app'] # Specify database
	
	def insert_item(self, item) :
		db_collection = Collection(self.db, 'dou_guo_mei_shi_item') # Create table (collection)
		# db_collection = self.db.dou_guo_mei_shi_item
		db_collection.insert_one(item)
		
mongo_info = Connect_mongo()
Copy the code
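A design note: creating `mongo_info` at module level means every importer shares one `MongoClient`, which is fine because pymongo clients are thread-safe. If insert volume ever becomes a bottleneck, batching writes (e.g. with `insert_many`) cuts round-trips. The sketch below is hypothetical, not part of the original db.py, and uses a plain callback in place of a live collection so it runs without MongoDB:

```python
class BatchWriter:
    """Buffer items and flush them in batches via a callback.

    In real use the callback could be
    lambda docs: db_collection.insert_many(docs).
    """
    def __init__(self, flush_fn, batch_size=50):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.buffer = []

    def add(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []

# Demo: a plain list stands in for the collection
stored = []
writer = BatchWriter(stored.extend, batch_size=2)
writer.add({'recipe_name': 'a'})
writer.add({'recipe_name': 'b'})  # reaches batch_size, triggers a flush
writer.add({'recipe_name': 'c'})
writer.flush()                    # flush the remainder
print(len(stored))  # 3
```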

We also add the following code to spider_douguo.py:

from db import mongo_info

from db import mongo_info

def the_recipe(self, data):
    """Fetch recipe data and store it in MongoDB."""
    # print(f"{data['keyword']}")
    recipe_api = 'http://api.douguo.net/recipe/v2/search/0/20'  # Test with the first 20 records
    response = self.request(url=recipe_api, data=data).text
    # print(response)
    recipe_list = json.loads(response)['result']['list']
    # print(type(recipe_list))
    for item in recipe_list:
        # recipe_info: recipe information
        recipe_info = {}
        recipe_info['ingredients'] = data['keyword']
        if item['type'] == 13:
            recipe_info['user_name'] = item['r']['an']  # User name
            recipe_info['recipe_id'] = item['r']['id']
            recipe_info['cookstory'] = item['r']['cookstory'].strip()
            recipe_info['recipe_name'] = item['r']['n']
            recipe_info['major'] = item['r']['major']
            # print(recipe_info)
            detail_cookstory_url = 'http://api.douguo.net/recipe/detail/{id}'.format(id=recipe_info['recipe_id'])
            # print(detail_cookstory_url)
            cookstory_data = {
                "client": "4",
                # "_session": "1590191363409355757944762497",
                "author_id": "0",
                "_vs": "11101",
                "_ext": '{"query":{"kw":"%s","src":"11101","idx":"1","type":"13","id":"%s"}}' % (
                    str(data['keyword']), str(recipe_info['recipe_id'])),
            }
            recipe_detail_response = self.request(url=detail_cookstory_url, data=cookstory_data).text
            # print(recipe_detail_response)
            detail_response_dict = json.loads(recipe_detail_response)
            recipe_info['tips'] = detail_response_dict['result']['recipe']['tips']
            recipe_info['cook_step'] = detail_response_dict['result']['recipe']['cookstep']
            # json_recipe_info = json.dumps(recipe_info)
            # logger.info(f'current recipe_info:>>>{json_recipe_info}')
            logger.info(f"current stored recipe is:>>>{recipe_info['recipe_name']}")
            mongo_info.insert_item(recipe_info)
        else:
            continue
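The parsing above chains several `[...]` lookups, so a single missing key raises `KeyError` and aborts the run. A defensive alternative is to walk the structure with `dict.get`; `safe_get` below is a helper I am introducing for illustration, not part of the original code:

```python
def safe_get(d, *keys, default=None):
    """Walk nested dicts, returning `default` when any key is missing."""
    for key in keys:
        if not isinstance(d, dict) or key not in d:
            return default
        d = d[key]
    return d

# A stripped-down stand-in for the detail API response
detail = {'result': {'recipe': {'tips': "Don't add salt."}}}
print(safe_get(detail, 'result', 'recipe', 'tips'))      # Don't add salt.
print(safe_get(detail, 'result', 'recipe', 'cookstep'))  # None
```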

After running the code, we can look at the data in the database:

3.3.11 Multithreaded logic writing

Here I’m only fetching the first 20 recipes per ingredient; you can build your own URLs to page through more. Next we write the multithreading, which needs the following import:

from concurrent.futures import ThreadPoolExecutor
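On the "build your own URL" point above: the search request used `.../search/0/20`, which looks like `/<offset>/<page size>`. That reading is my inference rather than documented behavior, but under that assumption page URLs can be generated like this:

```python
# Hypothetical pagination, assuming the path is .../search/<offset>/<page size>
base = 'http://api.douguo.net/recipe/v2/search/{offset}/{size}'
page_size = 20

# First three pages: offsets 0, 20, 40
urls = [base.format(offset=offset, size=page_size)
        for offset in range(0, 60, page_size)]
for u in urls:
    print(u)
```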

The code for the main function is as follows:

def main(self):
    # url = 'http://api.douguo.net/recipe/v2/search/0/20'
    url = 'http://api.douguo.net/recipe/flatcatalogs'
    html = self.request(url=url, data=self.index_data).text
    # print(html)
    self.parse_flatcatalogs(html)
    # print(queue_list.qsize())
    # for _ in range(queue_list.qsize()):
    #     data = queue_list.get()
    with ThreadPoolExecutor(max_workers=20) as thread:
        while queue_list.qsize() > 0:
            thread.submit(self.the_recipe, queue_list.get())
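One caveat about `while queue_list.qsize() > 0`: per the `queue` docs, `qsize()` is only approximate under concurrency, and a plain `get()` blocks forever if another consumer drains the queue first. In the code above only the main thread consumes the queue, so it works, but if you later add consumer threads a safer drain pattern uses `get_nowait()` and stops on `queue.Empty`. A sketch with a stand-in `queue.Queue` and fake keyword payloads:

```python
import queue
from concurrent.futures import ThreadPoolExecutor

q = queue.Queue()
for keyword in ['pork', 'tofu', 'eggplant']:  # fake payloads
    q.put({'keyword': keyword})

processed = []

def worker(data):
    processed.append(data['keyword'])

with ThreadPoolExecutor(max_workers=2) as pool:
    while True:
        try:
            data = q.get_nowait()  # never blocks
        except queue.Empty:
            break                  # queue fully drained
        pool.submit(worker, data)

# the `with` block waits for all submitted tasks to finish
print(sorted(processed))  # ['eggplant', 'pork', 'tofu']
```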

After running, you can see the captured data:

3.3.12 IP Hiding

I won’t repeat the purpose and value of hiding your IP here. I create a test file to check whether our proxy works. I’m using the paid [Abuyun proxy]: www.abuyun.com/. If you are not sure how to use a proxy, you can read this link first: Mp.weixin.qq.com/s/eAsFO7u0z…

Using a paid proxy

Above, we used only one proxy, while crawlers often need many. How do you get them? There are two main approaches:

  1. Collect multiple free IPs;
  2. Use a paid IP proxy service.

Free IPs are often unreliable, so you could build an IP proxy pool, but for a beginner the cost of maintaining one is too high. If you only crawl casually, consider paying: a few dollars buys a few hours of dynamic IPs, which in most cases is enough to crawl a website.

Here I recommend the paid service “Abuyun proxy”: it works well and isn’t expensive. Rather than struggle to build an IP proxy pool, you might as well spend a few dollars and be done with it.

For a first try, you can buy the one-hour dynamic trial, then click to generate the tunnel proxy credentials and add them to the code.

Copy the credentials into the official Requests sample code and run it to see the proxy IPs at work:

www.abuyun.com/http-proxy/…

import requests

# Target page to test
targetUrl = "http://icanhazip.com"

def get_proxies():
    # Proxy server
    proxyHost = "http-dyn.abuyun.com"
    proxyPort = "9020"
    # Proxy tunnel authentication information
    proxyUser = "H8147158822SW5CD"
    proxyPass = "CBE9D1D21DC94189"
    proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host": proxyHost,
        "port": proxyPort,
        "user": proxyUser,
        "pass": proxyPass,
    }
    proxies = {
        "http": proxyMeta,
        "https": proxyMeta,
    }
    for i in range(1, 6):
        resp = requests.get(targetUrl, proxies=proxies)
        # print(resp.status_code)
        print('%s request IP: %s' % (i, resp.text))

get_proxies()
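A quick offline sanity check on the tunnel URL built above: `urllib.parse.urlparse` confirms the credentials and host land in the right fields. The values are the sample credentials from the snippet, not live ones:

```python
from urllib.parse import urlparse

# Same format string as in get_proxies(), with the sample credentials
proxy_meta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
    "host": "http-dyn.abuyun.com",
    "port": "9020",
    "user": "H8147158822SW5CD",
    "pass": "CBE9D1D21DC94189",
}

parsed = urlparse(proxy_meta)
print(parsed.hostname)  # http-dyn.abuyun.com
print(parsed.port)      # 9020
print(parsed.username)  # H8147158822SW5CD
```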

You can see that each request uses a different IP. Isn’t that simple? It’s much easier than maintaining an IP proxy pool.

1 request IP: 125.117.134.158
2 request IP: 49.71.117.45
3 request IP: 112.244.117.94
4 request IP: 122.239.164.35
5 request IP: 125.106.147.24

[Finished in 2.8s]

Of course, you can also set it like this:

""" project = 'Code', file_name = 'lcharm ', author = 'AI Yue Chong' time = '2020/5/26 10:36', product_name = PyCharm, AI Yue Chuang code is far away from bugs with the God animal protecting I love animals. They taste delicious. """
import requests
# 112.48.28.233
url = 'http://icanhazip.com'
# Below try:...... except:...... It's an error-proof mechanism
# username: password@Proxy server IP address :port
proxy = {'http':'http://H8147158822SW5CD:[email protected]:9020'}
try:
    response = requests.get(url, proxies = proxy)
    print(response.status_code)
    if response.status_code == 200:
        print(response.text)
except requests.ConnectionError as e:
    If there is an error, output error information
    print(e.args)
Copy the code

Above, you saw how to set the proxy IP in Requests.

Without a proxy

import requests

url = 'http://icanhazip.com'
# The try: ... except: ... below is an error-handling mechanism
try:
    response = requests.get(url)  # Do not use a proxy
    print(response.status_code)
    if response.status_code == 200:
        print(response.text)
except requests.ConnectionError as e:
    # If there is an error, output the error information
    print(e.args)

Output results:

200
112.48.28.233

[Finished in 1.0s]

Using a proxy

import requests

# 112.48.28.233
url = 'http://icanhazip.com'
# The try: ... except: ... below is an error-handling mechanism
# Format: username:password@proxy-server-IP:port
proxy = {'http': 'http://H8147158822SW5CD:[email protected]:9020'}
try:
    response = requests.get(url, proxies=proxy)
    print(response.status_code)
    if response.status_code == 200:
        print(response.text)
except requests.ConnectionError as e:
    # If there is an error, output the error information
    print(e.args)

Output results:

200
221.227.240.111

Now that we have successfully used the proxy, we will write the crawler code.

Writing the proxy into the crawler

Because my plan only allows five requests per second, I have to shrink the thread pool accordingly, otherwise the proxy will return errors. (If your budget allows, feel free to buy a higher-concurrency tier.)

Here I changed max_workers to 2.
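Shrinking the pool is one way to stay under the 5-requests-per-second cap; another is an explicit rate limiter that spaces calls out regardless of the number of workers. This is a hypothetical sketch (`RateLimiter` is not part of the original code); each worker would call `limiter.wait()` just before `self.request(...)`:

```python
import threading
import time

class RateLimiter:
    """Allow at most `rate` calls per second across all threads."""
    def __init__(self, rate):
        self.interval = 1.0 / rate
        self.lock = threading.Lock()
        self.next_slot = 0.0  # earliest monotonic time for the next call

    def wait(self):
        with self.lock:
            now = time.monotonic()
            if now < self.next_slot:
                time.sleep(self.next_slot - now)  # hold until our slot opens
            self.next_slot = max(now, self.next_slot) + self.interval

limiter = RateLimiter(rate=5)  # e.g. Abuyun's 5 requests/second cap
start = time.monotonic()
for _ in range(3):
    limiter.wait()
    # self.request(...) would go here
elapsed = time.monotonic() - start
# 3 calls at 5/s span at least 2 intervals of 0.2 s each
print('3 calls took %.2fs' % elapsed)
```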