When crawling certain sites, we often send requests through a proxy IP to keep the crawler from being blocked. Proxy IP addresses are usually obtained from well-known domestic proxy providers (such as Tomato Acceleration, www.fanqieip.net/). These proxies… Below is the principle behind proxy IPs.

1 Proxy Type

There are four types of proxies: transparent proxy, anonymous proxy, distorting proxy, and high-anonymity (elite) proxy. In terms of security, the order of the four proxy types is high-anonymity > distorting > anonymous > transparent.

2 Proxy Principle

The proxy type depends on the configuration of the proxy server. Different configurations form different proxy types. REMOTE_ADDR, HTTP_VIA, and HTTP_X_FORWARDED_FOR are the decisive factors in the configuration.

1) REMOTE_ADDR

REMOTE_ADDR represents the client’s IP address. Its value is not supplied by the client; the server sets it from the address of the connection it receives.

If you use a browser to access a web site directly, the web server for that site (Nginx, Apache, etc.) sets REMOTE_ADDR to the client’s IP address.

If we set up a proxy for our browser, our request to the target site goes through the proxy server, and the proxy server forwards the request to the target site. The site’s web server then sets REMOTE_ADDR to the IP of the proxy server.
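To illustrate that REMOTE_ADDR is set on the server side, here is a minimal sketch using Python’s WSGI interface. The app name show_remote_addr, the helper call_app, and the example addresses are assumptions made for this illustration; a real server such as Nginx or Apache fills in the value from the TCP connection.

```python
from wsgiref.util import setup_testing_defaults


def show_remote_addr(environ, start_response):
    """WSGI app that echoes the address the web server recorded for the peer."""
    # REMOTE_ADDR is filled in by the server from the TCP connection,
    # not from anything the client wrote into the request.
    text = "REMOTE_ADDR = " + environ.get("REMOTE_ADDR", "unknown")
    body = text.encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_app(remote_addr):
    """Invoke the app with a hand-built environ, standing in for a real server."""
    environ = {"REMOTE_ADDR": remote_addr}
    setup_testing_defaults(environ)  # fills in the other required WSGI keys

    def start_response(status, headers):
        pass  # the status and headers are not needed for this demonstration

    return b"".join(show_remote_addr(environ, start_response)).decode("utf-8")


# Direct visit: the server records the client's own address (example value).
print(call_app("198.51.100.7"))
# Visit through a proxy: the server records the proxy's address instead.
print(call_app("203.0.113.10"))
```

Here the same app reports a different address depending only on who opened the connection, which is why a proxy changes what the target site sees.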

2) X-Forwarded-For (XFF)

X-Forwarded-For is an HTTP extension header that carries the real IP address of the client that originated the HTTP request. When a client uses a proxy, the web server sees only the proxy’s address and does not know the client’s real IP address. To make up for this, proxy servers usually add an X-Forwarded-For header containing the client’s IP address.

The format of the X-Forwarded-For header is as follows:

X-Forwarded-For: client, proxy1, proxy2

Here client is the IP address of the client, proxy1 is the IP of the first proxy (the device farthest from the server), and proxy2 is the IP of the second proxy. As the format shows, a request can pass through multiple layers of proxies between the client and the server.

If an HTTP request passes through three proxies, Proxy1, Proxy2 and Proxy3, with IP addresses IP1, IP2 and IP3 respectively, and the user’s real IP address is IP0, then by the XFF convention the server receives the following:

X-Forwarded-For: IP0, IP1, IP2

Proxy3 connects directly to the server and appends IP2 to XFF to indicate that it is forwarding the request on behalf of Proxy2. IP3 does not appear in the list; the server obtains it from the remote address (REMOTE_ADDR). An HTTP connection runs on top of a TCP connection, and the HTTP protocol itself has no concept of an IP address. The remote address comes from the TCP connection and is the IP address of the device that established that connection with the server, which in this case is IP3.
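The append-as-you-forward behaviour described above can be sketched as a small simulation. The function names forward_chain and parse_xff are hypothetical, and the IP0 to IP3 strings are the placeholders from the example, not real addresses.

```python
def forward_chain(client_ip, proxy_ips):
    """Simulate a request passing through the given proxies in order.

    Returns (xff_header_value, remote_addr_seen_by_server).
    """
    xff_entries = []
    peer_ip = client_ip  # address of whoever opened the current TCP hop
    for proxy_ip in proxy_ips:
        # Each proxy appends the address it received the request from.
        xff_entries.append(peer_ip)
        # The proxy then opens the next hop, so it becomes the new peer.
        peer_ip = proxy_ip
    return ", ".join(xff_entries), peer_ip


def parse_xff(header_value):
    """Split an X-Forwarded-For value into a list of addresses."""
    return [part.strip() for part in header_value.split(",") if part.strip()]


xff, remote_addr = forward_chain("IP0", ["IP1", "IP2", "IP3"])
print("X-Forwarded-For:", xff)      # X-Forwarded-For: IP0, IP1, IP2
print("REMOTE_ADDR:", remote_addr)  # REMOTE_ADDR: IP3
print(parse_xff(xff))               # ['IP0', 'IP1', 'IP2']
```

The last proxy never writes its own address into the header; the server learns it only from the connection, which matches the IP3 case above.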

3) HTTP_VIA

Via is a header in the HTTP protocol that records the proxies and gateways an HTTP request passes through. Each proxy server on the path adds one entry for itself: a request that passes through one proxy carries one entry, and a request that passes through two proxies carries two.
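As a rough sketch of how the Via value grows hop by hop, the snippet below appends one entry per proxy in the usual "protocol-version received-by" form. The helper append_via and the host names proxy1.example and proxy2.example are made up for the illustration.

```python
def append_via(existing_via, received_by, protocol_version="1.1"):
    """Return the Via value after one more proxy has handled the request."""
    entry = f"{protocol_version} {received_by}"
    if existing_via:
        # Later proxies add their entry after the ones already present.
        return f"{existing_via}, {entry}"
    return entry


via = None  # the client itself sends no Via header
for name in ["proxy1.example", "proxy2.example"]:
    via = append_via(via, name)

print("Via:", via)  # Via: 1.1 proxy1.example, 1.1 proxy2.example
```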

3 Differences Between Proxy Types

1) Transparent Proxy

The proxy server configuration is as follows:

REMOTE_ADDR = Proxy IP

HTTP_VIA = Proxy IP

HTTP_X_FORWARDED_FOR = Your IP

A transparent proxy “hides” the client’s IP address only at the connection level: the server sees the proxy’s IP in REMOTE_ADDR, but it can still read the client’s IP address from HTTP_X_FORWARDED_FOR.

2) Anonymous Proxy

The proxy server configuration is as follows:

REMOTE_ADDR = Proxy IP

HTTP_VIA = Proxy IP

HTTP_X_FORWARDED_FOR = Proxy IP

An anonymous proxy hides the client’s IP address. With an anonymous proxy, the server can tell that the client is using a proxy, but it cannot learn the client’s real IP address.

3) Distorting Proxy

The proxy server configuration is as follows:

REMOTE_ADDR = Proxy IP

HTTP_VIA = Proxy IP

HTTP_X_FORWARDED_FOR = Random IP address

The principle is similar to that of an anonymous proxy, but the disguise goes further. If the client uses a distorting proxy, the server still knows that the client is using a proxy, but it receives a fake client IP address.

4) Elite Proxy or High Anonymity Proxy

The proxy server configuration is as follows:

REMOTE_ADDR = Proxy IP

HTTP_VIA = not determined

HTTP_X_FORWARDED_FOR = not determined

A high-anonymity proxy not only keeps the server from telling whether the client is using a proxy, but also ensures that the server cannot obtain the client’s real IP address.
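The four configurations above can be turned into a simple check. This is only a sketch under stated assumptions: the function classify_proxy is hypothetical, it assumes the checker already knows the client’s real IP (as a proxy-testing page that you visit both directly and through the proxy would), and it treats “not determined” as the header being absent.

```python
def classify_proxy(real_ip, remote_addr, via=None, xff=None):
    """Guess the proxy type from the values the target server received.

    real_ip is the client's true address, known only to whoever runs the check.
    via and xff are the header values, or None when the header is absent.
    """
    if remote_addr == real_ip:
        return "no proxy"
    if via is None and xff is None:
        # Nothing in the request reveals that a proxy was involved.
        return "elite"
    forwarded = [p.strip() for p in (xff or "").split(",") if p.strip()]
    if real_ip in forwarded:
        return "transparent"  # the real address leaks through XFF
    if forwarded and any(ip != remote_addr for ip in forwarded):
        return "distorting"   # XFF carries an address that is not the proxy's
    return "anonymous"        # proxy is visible, the real address is not


REAL, PROXY = "198.51.100.7", "203.0.113.10"  # example addresses only
print(classify_proxy(REAL, PROXY, via=PROXY, xff=REAL))          # transparent
print(classify_proxy(REAL, PROXY, via=PROXY, xff=PROXY))         # anonymous
print(classify_proxy(REAL, PROXY, via=PROXY, xff="192.0.2.55"))  # distorting
print(classify_proxy(REAL, PROXY))                               # elite
```

A real target site does not know the client’s true IP in advance, so in practice it can only observe whether the proxy-related headers are present.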

4 Selection of Proxies

An ordinary anonymous proxy hides the client’s real IP, but it modifies the request, so the server may conclude that we are using a proxy. The visited website does not learn the client’s IP address, yet it can still tell that a proxy is in use, and some pages with IP detection may still be able to find the client’s IP address.

A high-anonymity proxy does not change the client’s request, so to the server it looks as if a real client browser is accessing it. In this case, the client’s real IP is hidden and the server does not suspect that we are using a proxy.

Therefore, when a crawler needs a proxy IP, it should use an ordinary anonymous proxy or a high-anonymity proxy. In addition, Tomato Acceleration’s HTTPS proxies are recommended, so that the proxy server cannot read the transmitted data.
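For a crawler written in Python, a proxy can be configured with the standard library alone. The following is a minimal sketch: build_proxy_opener is a hypothetical helper, and the proxy address is a placeholder from a documentation-only range that must be replaced with one supplied by your provider.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Hypothetical proxy address; replace it with a real one from your provider.
PROXY_URL = "http://203.0.113.10:8080"
opener = build_proxy_opener(PROXY_URL)

# No request is sent by this sketch. To fetch a page through the proxy:
#     with opener.open("https://example.com/", timeout=10) as response:
#         html = response.read()
```

Keeping the proxy in a dedicated opener, rather than installing it globally, makes it easy to switch to another proxy IP when one gets blocked.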