—  Illustrations by Ash Thorp & Maciej Kuciara —

HDMI, JUST WANT AND JUST DO

Blog address: zhihu.com/people/hdmi-blog


I don't know if you've ever run into a "visited too frequently" message that makes you wait a while or enter a captcha before you can continue browsing, but if you crawl long enough, you will. It happens because the site we want to crawl has taken measures against crawlers: when a single IP requests too many pages in a short time, the server simply refuses to serve it. A block triggered by request frequency is hard to lift cleanly, so instead of exposing our real IP we disguise it when requesting pages. That is what today's topic, using proxy IPs, is about.

There are many proxy IPs available on the Internet, some free and some paid, such as the Xici proxy (xicidaili.com). Free proxies cost nothing, but few of them actually work and they are unstable; paid proxies tend to be better. Today, though, I will only crawl free proxies, check whether each one is usable, and put the working IPs into MongoDB so they are easy to pull out next time.

Running platform: Windows

Python version: Python3.6

IDE: Sublime Text

Others: Chrome

The brief process is as follows:

Step 1: Understand how proxies are used with Requests

Step 2: Crawl the IPs and ports from the proxy web page

Step 3: Check whether each crawled IP is usable

Step 4: Store the usable proxies in MongoDB

Step 5: Randomly select an IP from the stored usable IPs, re-test it, and return it if the test succeeds

For Requests, setting up a proxy is simple: just pass in the proxies parameter.

Note, however, that I installed Fiddler on my local PC and used it (together with the Chrome extension SwitchyOmega) to run an HTTP proxy service on port 8888, so the proxy address is 127.0.0.1:8888. As long as we set this proxy, the local IP seen by the server is switched to the IP of the server the proxy software is connected to.

    import requests

    proxy = '127.0.0.1:8888'
    proxies = {
        'http': 'http://' + proxy,
        'https': 'http://' + proxy
    }

    try:
        response = requests.get('http://httpbin.org/get', proxies=proxies)
        print(response.text)
    except requests.exceptions.ConnectionError as e:
        print('Error', e.args)

Here I used http://httpbin.org/get as the test site. Visiting it returns information about the request we sent, and the origin field is the client IP, so we can tell from the response whether the proxy took effect. The result is as follows:

    {
        "args": {},
        "headers": {
            "Accept": "*/*",
            "Accept-Encoding": "gzip, deflate",
            "Connection": "close",
            "Host": "httpbin.org",
            "User-Agent": "python-requests/2.18.4"
        },
        "origin": "xx.xxx.xxx.xxx",
        "url": "http://httpbin.org/get"
    }
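Besides eyeballing the JSON, we can check the origin field programmatically. A minimal sketch, assuming the local Fiddler proxy from the example above is still running on 127.0.0.1:8888:

    import requests

    proxy = '127.0.0.1:8888'  # assumed local proxy from the example above
    proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}

    response = requests.get('http://httpbin.org/get', proxies=proxies, timeout=10)
    origin = response.json().get('origin', '')
    # If origin differs from our real public IP, the proxy took effect
    print('Exit IP seen by the server:', origin)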

The next step is to crawl the proxy IPs. First, open the page in Chrome and inspect it to locate the elements that hold the IP and port.

As you can see, the proxy IPs and their related information sit in a table, so it is easy to extract them with BeautifulSoup. Note, however, that there may be duplicate IPs, especially when we crawl several proxy pages and store them in the same array, so we use a set to remove the duplicates.

The crawled IP addresses are stored in an array and then tested one by one.
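As a rough sketch of that extraction step (the full version appears in the complete listing below), assuming the proxy table keeps the IP in the second column and the port in the third:

    from bs4 import BeautifulSoup

    def extract_proxies(html):
        """Pull 'ip:port' strings out of the proxy table and deduplicate them."""
        soup = BeautifulSoup(html, 'lxml')
        ip_list = []
        for tr in soup.find_all('tr')[1:]:          # skip the header row
            tds = tr.find_all('td')
            if len(tds) >= 3:
                ip_list.append(tds[1].text.strip() + ':' + tds[2].text.strip())
        return list(set(ip_list))                   # a set removes duplicate IPs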

Using the Requests proxy method described above, I used http://httpbin.org/ip as the test site, since it returns the requesting IP directly, and stored each IP that passed the test in the MongoDB database.

Connect to the database, specify the database and collection, and then insert the data.
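A minimal sketch of that step with pymongo, using the same PROXY database and proxies collection as the complete listing below (insert_one is the non-deprecated counterpart of the older insert call used there; the helper name write_proxy is just for illustration):

    import pymongo

    def write_proxy(proxy_doc):
        """Insert one tested proxy, e.g. {'ip': '1.2.3.4:8080'}, into MongoDB."""
        client = pymongo.MongoClient(host='localhost', port=27017)
        db = client.PROXY                  # database
        collection = db.proxies            # collection
        result = collection.insert_one(proxy_doc)
        print('Stored to MongoDB, _id =', result.inserted_id)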

Finally, run it and look at the results.

After it had run for a while, it was rare to see three tests pass in a row, so I quickly took a screenshot when it happened. The fact is, these are free proxies after all: very few of them work and their lifetimes are short. Still, if you crawl a large enough number you can find usable ones, and since we are only using this for practice, that is good enough. Now let's take a look at what is stored in the database.

There are not many IPs in the database right now, because I only crawled a few pages, only a few of the IPs were valid, and I did not let the crawl run for long, but the ones that passed have been saved. Now let's see how to pick one out at random.

I was worried that an IP might become invalid after sitting in the database for a while, so I test it again before handing it out: if the test succeeds the IP is returned, and if it fails the IP is removed from the database.

This way, when we need to use the proxy, we can pull it out of the database at any time.
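For example, once the helpers from the complete listing below are defined, a crawler could consume a stored proxy like this (example.com is just a stand-in for whatever page you actually want to crawl):

    proxy = get_random_ip()  # re-tested 'ip:port' string pulled from MongoDB
    proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
    response = requests.get('http://example.com', headers=headers, proxies=proxies, timeout=10)
    print(response.status_code)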

The complete code is as follows:

    import random
    import requests
    import time
    import pymongo
    from bs4 import BeautifulSoup


    # URL of the proxy site to crawl
    url_ip = "http://www.xicidaili.com/nt/"

    # set the wait time
    set_timeout = 5

    # number of proxy pages to crawl; 2 means crawl 2 pages of IPs
    num = 2

    # number of times the proxy is used
    count_time = 5

    # construct headers
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}

    # URL used to test the IPs
    url_for_test = 'http://httpbin.org/ip'


    def scrawl_xici_ip(num):
        '''
        Crawl the proxy IPs
        '''
        ip_list = []
        for num_page in range(1, num + 1):
            url = url_ip + str(num_page)
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                content = response.text
                soup = BeautifulSoup(content, 'lxml')
                trs = soup.find_all('tr')
                for i in range(1, len(trs)):
                    tr = trs[i]
                    tds = tr.find_all('td')
                    ip_item = tds[1].text + ':' + tds[2].text
                    # print(ip_item)
                    ip_list.append(ip_item)
                    ip_set = set(ip_list)  # remove duplicate IPs
                    ip_list = list(ip_set)
            time.sleep(count_time)  # wait 5 seconds
        return ip_list


    def ip_test(url_for_test, ip_info):
        '''
        Test the crawled IPs; the ones that pass are stored in MongoDB
        '''
        for ip_for_test in ip_info:
            # set the proxy
            proxies = {
                'http': 'http://' + ip_for_test,
                'https': 'http://' + ip_for_test,
            }
            print(proxies)
            try:
                response = requests.get(url_for_test, headers=headers, proxies=proxies, timeout=10)
                if response.status_code == 200:
                    ip = {'ip': ip_for_test}
                    print(response.text)
                    print('Test passed')
                    write_to_MongoDB(ip)
            except Exception as e:
                print(e)
                continue


    def write_to_MongoDB(proxies):
        '''
        Save the IPs that pass the test to MongoDB
        '''
        client = pymongo.MongoClient(host='localhost', port=27017)
        db = client.PROXY
        collection = db.proxies
        result = collection.insert(proxies)
        print(result)
        print('Stored to MongoDB successfully')


    def get_random_ip():
        '''
        Pick out a random IP from the database
        '''
        client = pymongo.MongoClient(host='localhost', port=27017)
        db = client.PROXY
        collection = db.proxies
        items = collection.find()
        length = items.count()
        ind = random.randint(0, length - 1)
        useful_proxy = items[ind]['ip'].replace('\n', '')
        proxy = {
            'http': 'http://' + useful_proxy,
            'https': 'http://' + useful_proxy,
        }
        response = requests.get(url_for_test, headers=headers, proxies=proxy, timeout=10)
        if response.status_code == 200:
            return useful_proxy
        else:
            print('{} is invalid'.format(useful_proxy))
            collection.remove({'ip': useful_proxy})
            print('Removed from MongoDB')
            return get_random_ip()


    def main():
        ip_info = scrawl_xici_ip(num)
        ip_test(url_for_test, ip_info)
        finally_ip = get_random_ip()
        print(finally_ip)


    if __name__ == '__main__':
        main()

