—  Illustrations by Ash Thorp & Maciej Kuciara —

HDMI, JUST WANT AND JUST DO

Blog address: zhihu.com/people/hdmi-blog


I don't know if you've ever run into a "visited too frequently" message that makes you wait a while or enter a captcha before you can continue browsing, but if you crawl long enough, you will. It happens because the site we want to crawl has taken measures against crawlers: when a single IP requests too many pages in a short time, the server simply refuses to serve it. A block triggered by request frequency is hard to lift cleanly, so instead of exposing our real IP we disguise it when requesting pages. That is what today's topic, using proxy IPs, is about.

There are many proxy IPs available on the Internet, some free and some paid, such as the Xici proxy (xicidaili.com). Free proxies cost nothing, but few of them actually work and they are unstable; paid proxies tend to be better. Today, though, I will only crawl free proxies, check whether each one is usable, and put the working IPs into MongoDB so they are easy to pull out next time.

Running platform: Windows

Python version: Python3.6

IDE: Sublime Text

Others: Chrome

The brief process is as follows:

Step 1: Understand how proxies are used with Requests

Step 2: Crawl the IPs and ports from the proxy web page

Step 3: Check whether each crawled IP is usable

Step 4: Store the usable proxies in MongoDB

Step 5: Randomly select an IP from the stored usable IPs, re-test it, and return it if the test succeeds

For Requests, setting up a proxy is simple: just pass in the proxies parameter.

Note, however, that I installed Fiddler on my local PC and used it (together with the Chrome extension SwitchyOmega) to run an HTTP proxy service on port 8888, so the proxy address is 127.0.0.1:8888. As long as we set this proxy, the local IP seen by the server is switched to the IP of the server the proxy software is connected to.

    import requests

    proxy = '127.0.0.1:8888'
    proxies = {
        'http': 'http://' + proxy,
        'https': 'http://' + proxy
    }

    try:
        response = requests.get('http://httpbin.org/get', proxies=proxies)
        print(response.text)
    except requests.exceptions.ConnectionError as e:
        print('Error', e.args)

Here I used http://httpbin.org/get as the test site. Visiting it returns information about the request we sent, and the origin field is the client IP, so we can tell from the response whether the proxy took effect. The result is as follows:

    {
        "args": {},
        "headers": {
            "Accept": "*/*",
            "Accept-Encoding": "gzip, deflate",
            "Connection": "close",
            "Host": "httpbin.org",
            "User-Agent": "python-requests/2.18.4"
        },
        "origin": "xx.xxx.xxx.xxx",
        "url": "http://httpbin.org/get"
    }
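Besides eyeballing the JSON, we can check the origin field programmatically. A minimal sketch, assuming the local Fiddler proxy from the example above is still running on 127.0.0.1:8888:

    import requests

    proxy = '127.0.0.1:8888'  # assumed local proxy from the example above
    proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}

    response = requests.get('http://httpbin.org/get', proxies=proxies, timeout=10)
    origin = response.json().get('origin', '')
    # If origin differs from our real public IP, the proxy took effect
    print('Exit IP seen by the server:', origin)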

The next step is to crawl the proxy IPs. First, open the page in Chrome and inspect it to locate the elements that hold the IP and port.

As you can see, the proxy IPs and their related information sit in a table, so it is easy to extract them with BeautifulSoup. Note, however, that there may be duplicate IPs, especially when we crawl several proxy pages and store them in the same array, so we use a set to remove the duplicates.

The crawled IP addresses are stored in an array and then tested one by one.
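As a rough sketch of that extraction step (the full version appears in the complete listing below), assuming the proxy table keeps the IP in the second column and the port in the third:

    from bs4 import BeautifulSoup

    def extract_proxies(html):
        """Pull 'ip:port' strings out of the proxy table and deduplicate them."""
        soup = BeautifulSoup(html, 'lxml')
        ip_list = []
        for tr in soup.find_all('tr')[1:]:          # skip the header row
            tds = tr.find_all('td')
            if len(tds) >= 3:
                ip_list.append(tds[1].text.strip() + ':' + tds[2].text.strip())
        return list(set(ip_list))                   # a set removes duplicate IPs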

Using the Requests proxy method described above, I used http://httpbin.org/ip as the test site, since it returns the requesting IP directly, and stored each IP that passed the test in the MongoDB database.

Connect to the database, specify the database and collection, and then insert the data.
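A minimal sketch of that step with pymongo, using the same PROXY database and proxies collection as the complete listing below (insert_one is the non-deprecated counterpart of the older insert call used there; the helper name write_proxy is just for illustration):

    import pymongo

    def write_proxy(proxy_doc):
        """Insert one tested proxy, e.g. {'ip': '1.2.3.4:8080'}, into MongoDB."""
        client = pymongo.MongoClient(host='localhost', port=27017)
        db = client.PROXY                  # database
        collection = db.proxies            # collection
        result = collection.insert_one(proxy_doc)
        print('Stored to MongoDB, _id =', result.inserted_id)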

Finally, run it and look at the results.

After it had run for a while, it was rare to see three tests pass in a row, so I quickly took a screenshot when it happened. The fact is, these are free proxies after all: very few of them work and their lifetimes are short. Still, if you crawl a large enough number you can find usable ones, and since we are only using this for practice, that is good enough. Now let's take a look at what is stored in the database.

There are not many IPs in the database right now, because I only crawled a few pages, only a few of the IPs were valid, and I did not let the crawl run for long, but the ones that passed have been saved. Now let's see how to pick one out at random.

I was worried that an IP might become invalid after sitting in the database for a while, so I test it again before handing it out: if the test succeeds the IP is returned, and if it fails the IP is removed from the database.

This way, when we need to use the proxy, we can pull it out of the database at any time.
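For example, once the helpers from the complete listing below are defined, a crawler could consume a stored proxy like this (example.com is just a stand-in for whatever page you actually want to crawl):

    proxy = get_random_ip()  # re-tested 'ip:port' string pulled from MongoDB
    proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
    response = requests.get('http://example.com', headers=headers, proxies=proxies, timeout=10)
    print(response.status_code)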

The complete code is as follows:

    import random
    import requests
    import time
    import pymongo
    from bs4 import BeautifulSoup


    # URL of the proxy site to crawl
    url_ip = "http://www.xicidaili.com/nt/"

    # set the wait time
    set_timeout = 5

    # number of proxy pages to crawl; 2 means crawl 2 pages of IPs
    num = 2

    # number of times the proxy is used
    count_time = 5

    # construct headers
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}

    # URL used to test the IPs
    url_for_test = 'http://httpbin.org/ip'


    def scrawl_xici_ip(num):
        '''
        Crawl the proxy IPs
        '''
        ip_list = []
        for num_page in range(1, num + 1):
            url = url_ip + str(num_page)
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                content = response.text
                soup = BeautifulSoup(content, 'lxml')
                trs = soup.find_all('tr')
                for i in range(1, len(trs)):
                    tr = trs[i]
                    tds = tr.find_all('td')
                    ip_item = tds[1].text + ':' + tds[2].text
                    # print(ip_item)
                    ip_list.append(ip_item)
                    ip_set = set(ip_list)  # remove duplicate IPs
                    ip_list = list(ip_set)
            time.sleep(count_time)  # wait 5 seconds
        return ip_list


    def ip_test(url_for_test, ip_info):
        '''
        Test the crawled IPs; the ones that pass are stored in MongoDB
        '''
        for ip_for_test in ip_info:
            # set the proxy
            proxies = {
                'http': 'http://' + ip_for_test,
                'https': 'http://' + ip_for_test,
            }
            print(proxies)
            try:
                response = requests.get(url_for_test, headers=headers, proxies=proxies, timeout=10)
                if response.status_code == 200:
                    ip = {'ip': ip_for_test}
                    print(response.text)
                    print('Test passed')
                    write_to_MongoDB(ip)
            except Exception as e:
                print(e)
                continue


    def write_to_MongoDB(proxies):
        '''
        Save the IPs that pass the test to MongoDB
        '''
        client = pymongo.MongoClient(host='localhost', port=27017)
        db = client.PROXY
        collection = db.proxies
        result = collection.insert(proxies)
        print(result)
        print('Stored to MongoDB successfully')


    def get_random_ip():
        '''
        Pick out a random IP from the database
        '''
        client = pymongo.MongoClient(host='localhost', port=27017)
        db = client.PROXY
        collection = db.proxies
        items = collection.find()
        length = items.count()
        ind = random.randint(0, length - 1)
        useful_proxy = items[ind]['ip'].replace('\n', '')
        proxy = {
            'http': 'http://' + useful_proxy,
            'https': 'http://' + useful_proxy,
        }
        response = requests.get(url_for_test, headers=headers, proxies=proxy, timeout=10)
        if response.status_code == 200:
            return useful_proxy
        else:
            print('{} is invalid'.format(useful_proxy))
            collection.remove({'ip': useful_proxy})
            print('Removed from MongoDB')
            return get_random_ip()


    def main():
        ip_info = scrawl_xici_ip(num)
        ip_test(url_for_test, ip_info)
        finally_ip = get_random_ip()
        print(finally_ip)


    if __name__ == '__main__':
        main()

