• Detecting Bots in Apache & Nginx Logs
  • By Mark Litwintschik
  • The Nuggets translation Project
  • Translator: luoyaqifei
  • Proofreader: Forezp, 1992chenlu


With browser plugins that block JavaScript-based tracking now boasting nine-figure user counts, web server traffic logs can be a great place to get a sense of how many people are visiting your site. But as anyone who has monitored web traffic logs for any length of time knows, there are hordes of bots crawling websites, and distinguishing bot-generated traffic from human-generated traffic in those logs is a challenge.

In this blog post, I’ll walk through the steps I used to create a bot detection script based on IPv4 ownership and user agent strings.

The code used in this article is in this snippet.

IP address ownership databases

First, I’ll install Python and a few dependencies. The following commands were run on a fresh Ubuntu 14.04.3 LTS installation.

$ sudo apt-get update
$ sudo apt-get install \
    python-dev \
    python-pip \
    python-virtualenv

Next I’ll create a Python virtual environment and activate it. This avoids the permission issues that are common when installing libraries with pip system-wide.

$ virtualenv findbots
$ source findbots/bin/activate

MaxMind offers free databases of country- and city-level registration data for IPv4 addresses. Alongside these data sets they publish a Python library called “GeoIP2” which memory-maps the database files and performs very fast lookups via a C-based Python extension.

The following commands install their package and then download and decompress the city-level data set.

$ pip install geoip2
$ curl -O http://geolite.maxmind.com/download/geoip/database/GeoLite2-City.mmdb.gz
$ gunzip GeoLite2-City.mmdb.gz
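As a quick sanity check, a single lookup against the freshly downloaded database looks something like this (a minimal sketch; the 8.8.8.8 address is just an arbitrary example, and city.name can come back as None for addresses without city-level data):

import geoip2.database

reader = geoip2.database.Reader('GeoLite2-City.mmdb')

# Look up an arbitrary address; this raises geoip2.errors.AddressNotFoundError
# if the address isn't present in the database.
response = reader.city('8.8.8.8')
print response.country.iso_code, response.city.name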

I had a look through some web traffic logs and picked out the traffic that happened to request "robots.txt". From that list I singled out a few of the most frequently seen IP addresses and found that many of them belonged to hosting and cloud providers. I wondered whether it would be possible to assemble a list, however incomplete, of the IPv4 addresses owned by these providers.

Google has a DNS-based mechanism for publishing the list of IP addresses used by their cloud offerings. This initial query gives you a list of hosts to look up next.

$ dig -t txt _cloud-netblocks.googleusercontent.com | grep spf
_cloud-netblocks.googleusercontent.com. 5 IN TXT "v=spf1 include:_cloud-netblocks1.googleusercontent.com include:_cloud-netblocks2.googleusercontent.com include:_cloud-netblocks3.googleusercontent.com include:_cloud-netblocks4.googleusercontent.com include:_cloud-netblocks5.googleusercontent.com ?all"

As shown above, _cloud-netblocks[1-5].googleusercontent.com will hold SPF records containing their active IPv4 and IPv6 CIDR ranges. Looking up all five as follows should give you an up-to-date list.

$ dig -t txt _cloud-netblocks1.googleusercontent.com | grep spf
_cloud-netblocks1.googleusercontent.com. 5 IN TXT "v=spf1 ip4:8.34.208.0/20 ip4:8.35.192.0/21 ip4:8.35.200.0/23 ip4:108.59.80.0/20 ip4:108.170.192.0/20 ip4:108.170.208.0/21 ip4:108.170.216.0/22 ip4:108.170.220.0/23 ip4:108.170.222.0/24 ?all"
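If you’d rather collect these netblocks programmatically than copy them out of dig output, a sketch along the following lines works, assuming the dnspython package is installed (pip install dnspython); it only pulls the ip4: entries.

import dns.resolver


def txt_records(hostname):
    # Return every TXT string published for a hostname.
    return [txt
            for answer in dns.resolver.query(hostname, 'TXT')
            for txt in answer.strings]


# The top-level record lists the _cloud-netblocks[1-5] hosts to query next.
spf = txt_records('_cloud-netblocks.googleusercontent.com')[0]
includes = [token.split(':', 1)[1]
            for token in spf.split()
            if token.startswith('include:')]

cidrs = [token.split(':', 1)[1]
         for host in includes
         for record in txt_records(host)
         for token in record.split()
         if token.startswith('ip4:')]

print cidrs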

Last March I tried to collect the WHOIS details of the entire IPv4 address space in a Hadoop-based MapReduce job and published a blog post about it. The job ran for almost two hours before terminating prematurely, leaving me with a sizeable, if incomplete, data set of 235,532 WHOIS records. Although the data set is a year old, much of it should still be relevant.

$ ls -l
-rw-rw-r-- 1 mark mark  5946203 Mar 31  2016 part-00001
-rw-rw-r-- 1 mark mark  5887326 Mar 31  2016 part-00002
...
-rw-rw-r-- 1 mark mark  6187219 Mar 31  2016 part-00154
-rw-rw-r-- 1 mark mark  5961162 Mar 31  2016 part-00155

When I examined the IP addresses of bots requesting "robots.txt", six companies kept coming up besides Google: Amazon, Baidu, Digital Ocean, Hetzner, Linode and New Dream Network. I ran the following commands to try to pull out their IPv4 WHOIS records.

$ grep -i 'amazon'            part-00* > amzn
$ grep -i 'baidu'             part-00* > baidu
$ grep -i 'digital ocean'     part-00* > digital_ocean
$ grep -i 'hetzner'           part-00* > hetzner
$ grep -i 'linode'            part-00* > linode
$ grep -i 'new dream network' part-00* > dream

From the above six files I needed to parse out double-encoded JSON strings; each line is prefixed with the name of the file it was grepped from. I used the following IPython code to extract the distinct CIDR blocks:

import json


def parse_cidrs(filename):
    lines = open(filename, 'r+b').read().split('\n')

    recs = []

    for line in lines:
        try:
            # Strip the "<filename>:" prefix that grep added, then decode the
            # double-encoded JSON payload.
            recs.append(
                json.loads(
                    json.loads(':'.join(line.split('\t')[0].split(':')[1:]))))
        except ValueError:
            continue

    return set([str(rec.get('network', {}).get('cidr', None))
                for rec in recs])


for _name in ['amzn', 'baidu', 'digital_ocean', 'hetzner', 'linode', 'dream']:
    print _name, parse_cidrs(_name)

Here is an example of a cleaned WHOIS record, with the contact information removed.

{
    "asn": "38365",
    "asn_cidr": "182.61.0.0/18",
    "asn_country_code": "CN",
    "asn_date": "2010-02-25",
    "asn_registry": "apnic",
    "entities": [
        "IRT-CNNIC-CN",
        "SD753-AP"
    ],
    "network": {
        "cidr": "182.61.0.0/16",
        "country": "CN",
        "end_address": "182.61.255.255",
        "events": [
            {
                "action": "last changed",
                "actor": null,
                "timestamp": "2014-09-28T05:44:22Z"
            }
        ],
        "handle": "182.61.0.0-182.61.255.255",
        "ip_version": "v4",
        "links": [
            "http://rdap.apnic.net/ip/182.0.0.0/8",
            "http://rdap.apnic.net/ip/182.61.0.0/16"
        ],
        "name": "Baidu",
        "parent_handle": "182.0.0.0-182.255.255.255",
        "raw": null,
        "remarks": [
            {
                "description": "Beijing Baidu Netcom Science and Technology Co., Ltd...",
                "links": null,
                "title": "description"
            }
        ],
        "start_address": "182.61.0.0",
        "status": null,
        "type": "ALLOCATED PORTABLE"
    },
    "query": "182.61.48.129",
    "raw": null
}

These seven companies are far from a comprehensive list of bot origins. I found a lot of crawler traffic coming from residential IP addresses in Ukraine and China that was hard to attribute, as well as distributed crawler fleets connecting from all over the world. To be honest, building a truly comprehensive list of bot IP addresses would mean looking at the ordering of HTTP headers, checking TCP/IP behaviour and searching for forged IP registrations (see page 28), and even then it would be a game of cat and mouse.

Installing the libraries

For this project I’ll use a few well-written libraries. Apache Log Parser can parse traffic logs generated by both Apache and Nginx; it supports extracting over 30 different types of information from log files and I’ve found it to be fairly resilient and reliable. Python User Agents parses user agent strings and performs some basic classification of the agent. Colorama helps create highlighted ANSI output. netaddr is a mature, well-maintained library for manipulating network addresses.

$ pip install -e git+https://github.com/rory/apache-log-parser.git#egg=apache-log-parser \
              -e git+https://github.com/selwin/python-user-agents.git#egg=python-user-agents \
              colorama \
              netaddr
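To get a feel for what the two parsing libraries return, here is a minimal sketch that runs a single, made-up combined-format log line through both of them (the log line and IP address are invented for illustration):

import apache_log_parser
from user_agents import parse

line_parser = apache_log_parser.make_parser(
    "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"")

log_line = ('1.2.3.4 - - [10/Mar/2017:13:37:00 +0000] '
            '"GET /robots.txt HTTP/1.1" 200 123 "-" "Wget/1.17.1 (linux-gnu)"')

req = line_parser(log_line)                       # dict of parsed fields
agent = parse(req['request_header_user_agent'])   # user agent object

print req['remote_host'], req['request_url'], agent.is_bot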

The bot monitoring script

What follows is monitor.py. The script reads web traffic logs piped in via STDIN, which means you can tail a log on a remote server over SSH and run the script locally.

I start by importing two libraries from the Python standard library and the five external libraries installed via pip.

import sys
from urlparse import urlparse

import apache_log_parser
from colorama import Back, Style
import geoip2.database
from netaddr import IPNetwork, IPAddress
from user_agents import parse

Next I set up MaxMind’s GeoIP2 library to use the city-level "GeoLite2-City.mmdb" database.

I also set up apache_log_parser to handle the log format my web server stores. Your format may differ, so it’s worth spending some time comparing your web server’s traffic log configuration with the library’s format documentation.

Finally, I have a dictionary of the CIDR blocks I found belonging to those seven companies. Baidu isn’t a hosting or cloud provider per se, but it runs a lot of crawlers that can’t be identified by their user agent alone.

reader = geoip2.database.Reader('GeoLite2-City.mmdb')

_format = "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""
line_parser = apache_log_parser.make_parser(_format)

CIDRS = {
    'Amazon': ['107.20.0.0/14', '122.248.192.0/19', '122.248.224.0/19',
               '172.96.96.0/20', '174.129.0.0/16', '175.41.128.0/19',
               '175.41.160.0/19', '175.41.192.0/19', '175.41.224.0/19',
               '176.32.120.0/22', '176.32.72.0/21', '176.34.0.0/16',
               '176.34.144.0/21', '176.34.224.0/21', '184.169.128.0/17',
               '184.72.0.0/15', '185.48.120.0/26', '207.171.160.0/19',
               '213.71.132.192/28', '216.182.224.0/20', '23.20.0.0/14',
               '46.137.0.0/17', '46.137.128.0/18', '46.51.128.0/18',
               '46.51.192.0/20', '50.112.0.0/16', '50.16.0.0/14',
               '52.0.0.0/11', '52.192.0.0/11', '52.192.0.0/15',
               '52.196.0.0/14', '52.208.0.0/13', '52.220.0.0/15',
               '52.28.0.0/16', '52.32.0.0/11', '52.48.0.0/14',
               '52.64.0.0/12', '52.67.0.0/16', '52.68.0.0/15',
               '52.79.0.0/16', '52.80.0.0/14', '52.84.0.0/14',
               '52.88.0.0/13', '54.144.0.0/12', '54.160.0.0/12',
               '54.176.0.0/12', '54.184.0.0/14', '54.188.0.0/14',
               '54.192.0.0/16', '54.193.0.0/16', '54.194.0.0/15',
               '54.196.0.0/15', '54.198.0.0/16', '54.199.0.0/16',
               '54.200.0.0/14', '54.204.0.0/15', '54.206.0.0/16',
               '54.207.0.0/16', '54.208.0.0/15', '54.210.0.0/15',
               '54.212.0.0/15', '54.214.0.0/16', '54.215.0.0/16',
               '54.216.0.0/15', '54.218.0.0/16', '54.219.0.0/16',
               '54.220.0.0/16', '54.221.0.0/16', '54.224.0.0/12',
               '54.228.0.0/15', '54.230.0.0/15', '54.232.0.0/16',
               '54.234.0.0/15', '54.236.0.0/15', '54.238.0.0/16',
               '54.239.0.0/17', '54.240.0.0/12', '54.242.0.0/15',
               '54.244.0.0/16', '54.245.0.0/16', '54.247.0.0/16',
               '54.248.0.0/15', '54.250.0.0/16', '54.251.0.0/16',
               '54.252.0.0/16', '54.253.0.0/16', '54.254.0.0/16',
               '54.255.0.0/16', '54.64.0.0/13', '54.72.0.0/13',
               '54.80.0.0/12', '54.72.0.0/15', '54.79.0.0/16',
               '54.88.0.0/16', '54.93.0.0/16', '54.94.0.0/16',
               '63.173.96.0/24', '72.21.192.0/19', '75.101.128.0/17',
               '79.125.64.0/18', '96.127.0.0/17'],
    'Baidu': ['180.76.0.0/16', '119.63.192.0/21', '106.12.0.0/15',
              '182.61.0.0/16'],
    'DO': ['104.131.0.0/16', '104.236.0.0/16', '107.170.0.0/16',
           '128.199.0.0/16', '138.197.0.0/16', '138.68.0.0/16',
           '139.59.0.0/16', '146.185.128.0/21', '159.203.0.0/16',
           '162.243.0.0/16', '178.62.0.0/17', '178.62.128.0/17',
           '188.166.0.0/16', '188.166.0.0/17', '188.226.128.0/18',
           '188.226.192.0/18', '45.55.0.0/16', '46.101.0.0/17',
           '46.101.128.0/17', '82.196.8.0/21', '95.85.0.0/21',
           '95.85.32.0/21'],
    'Dream': ['173.236.128.0/17', '205.196.208.0/20', '208.113.128.0/17',
              '208.97.128.0/18', '67.205.0.0/18'],
    'Google': ['104.154.0.0/15', '104.196.0.0/14', '107.167.160.0/19',
               '107.178.192.0/18', '108.170.192.0/20', '108.170.208.0/21',
               '108.170.216.0/22', '108.170.220.0/23', '108.170.222.0/24',
               '108.59.80.0/20', '130.211.128.0/17', '130.211.16.0/20',
               '130.211.32.0/19', '130.211.4.0/22', '130.211.64.0/18',
               '130.211.8.0/21', '146.148.16.0/20', '146.148.2.0/23',
               '146.148.32.0/19', '146.148.4.0/22', '146.148.64.0/18',
               '146.148.8.0/21', '162.216.148.0/22', '162.222.176.0/21',
               '173.255.112.0/20', '192.158.28.0/22', '199.192.112.0/22',
               '199.223.232.0/22', '199.223.236.0/23', '208.68.108.0/23',
               '23.236.48.0/20', '23.251.128.0/19', '35.184.0.0/14',
               '35.188.0.0/15', '35.190.0.0/17', '35.190.128.0/18',
               '35.190.192.0/19', '35.190.224.0/20', '8.34.208.0/20',
               '8.35.192.0/21', '8.35.200.0/23'],
    'Hetzner': ['129.232.128.0/17', '129.232.156.128/28', '136.243.0.0/16',
                '138.201.0.0/16', '144.76.0.0/16', '148.251.0.0/16',
                '176.9.12.192/28', '176.9.168.0/29', '176.9.24.0/27',
                '176.9.72.128/27', '178.63.0.0/16', '178.63.120.64/27',
                '178.63.156.0/28', '178.63.216.0/29', '178.63.216.128/29',
                '178.63.48.0/26', '188.40.0.0/16', '188.40.108.64/26',
                '188.40.132.128/26', '188.40.144.0/24', '188.40.48.0/26',
                '188.40.48.128/26', '188.40.72.0/26', '196.40.108.64/29',
                '213.133.96.0/20', '213.239.192.0/18', '41.203.0.128/27',
                '41.72.144.192/29', '46.4.0.128/28', '46.4.192.192/29',
                '46.4.84.128/27', '46.4.84.64/27', '5.9.144.0/27',
                '5.9.192.128/27', '5.9.240.192/27', '5.9.252.64/28',
                '78.46.0.0/15', '78.46.24.192/29', '78.46.64.0/19',
                '85.10.192.0/20', '85.10.228.128/29', '88.198.0.0/16',
                '88.198.0.0/20'],
    'Linode': ['104.200.16.0/20', '109.237.24.0/22', '139.162.0.0/16',
               '172.104.0.0/15', '173.255.192.0/18', '178.79.128.0/21',
               '198.58.96.0/19', '23.92.16.0/20', '45.33.0.0/17',
               '45.56.64.0/18', '45.79.0.0/16', '50.116.0.0/18',
               '80.85.84.0/23', '96.126.96.0/19'],
}

I created a utility function that takes an IPv4 address and a list of CIDR blocks and tells me whether the address falls within any of them.

def in_block(ip, block):
    _ip = IPAddress(ip)
    return any([True
                for cidr in block
                if _ip in IPNetwork(cidr)])
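For example, with the CIDRS dictionary above, a couple of quick checks in an interactive session would look like this (the addresses are purely illustrative):

print in_block('8.34.208.10', CIDRS['Google'])   # True: falls inside 8.34.208.0/20
print in_block('192.0.2.1', CIDRS['Google'])     # False: documentation range, not listed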

The following function takes the request (req) and user agent objects and tries to determine whether the traffic came from a bot. The user agent object is built with the Python User Agents library, which has tests of its own for deciding whether a user agent string belongs to a known bot. I’ve extended those tests with some tokens I’ve seen that slip through the library’s classification. I also iterate over the CIDR blocks to see whether the remote host’s IPv4 address falls inside any of them.

def bot_test(req, agent):
    ua_tokens = ['daum/', # Daum Communications Corp.
                 'gigablastopensource',
                 'go-http-client',
                 'http://',
                 'httpclient',
                 'https://',
                 'libwww-perl',
                 'phantomjs',
                 'proxy',
                 'python',
                 'sitesucker',
                 'wada.vn',
                 'webindex',
                 'wget']

    is_bot = agent.is_bot or \
             any([True
                  for cidr in CIDRS.values()
                  if in_block(req['remote_host'], cidr)]) or \
             any([True
                  for token in ua_tokens
                  if token in agent.ua_string.lower()])

    return is_bot
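As a quick illustration of how it behaves, here is a hand-built request dictionary and user agent run through the function (the values are made up; only the 'remote_host' key is consulted for the CIDR check):

req = {'remote_host': '104.131.0.25'}      # inside Digital Ocean's 104.131.0.0/16
agent = parse('Wget/1.17.1 (linux-gnu)')   # also matches the 'wget' token

print bot_test(req, agent)                 # True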

Below is the main part of the script. Web traffic logs are read line by line from standard input. Each line is parsed into request, user agent and requested URI objects, which make the data much easier to work with than ad-hoc string parsing.

I try to look up the city and country associated with each IPv4 address using MaxMind’s library. If the lookup fails for any reason, both values are simply set to None.

After the bot test I’m ready to print the output. If the request looks like it was sent by a bot, the line is highlighted with a red background.

if __name__ == '__main__':
    while True:
        try:
            line = sys.stdin.readline()
        except KeyboardInterrupt:
            break

        if not line:
            break

        req = line_parser(line)
        agent = parse(req['request_header_user_agent'])
        uri = urlparse(req['request_url'])

        try:
            response = reader.city(req['remote_host'])
            country, city = response.country.iso_code, response.city.name
        except:
            country, city = None, None

        is_bot = bot_test(req, agent)

        agent_str = ', '.join([item
                               for item in agent.browser[0:3] +
                                           agent.device[0:3] +
                                           agent.os[0:3]
                               if item is not None and type(item) is not tuple and len(item.strip()) and item != 'Other'])

        ip_owner_str = ' '.join([network + ' IP'
                                  for network, cidr in CIDRS.iteritems()
                                  if in_block(req['remote_host'], cidr)])

        print Back.RED + 'b' if is_bot else 'h', \
              country, \
              city, \
              uri.path, \
              agent_str, \
              ip_owner_str, \
              Style.RESET_ALL

The bot detection script in action

Here is an example where I pipe the last hundred lines of a web traffic log, plus anything appended to it, from a remote server into the monitoring script.

$ ssh server \
    'tail -n100 -f access.log' \
    | python monitor.py

Requests that appear to come from a bot are highlighted with a red background and prefixed with a "b". Traffic that doesn’t look automated is prefixed with an "h" for human. Below is sample output from the script, minus the ANSI background colours.

...
b US Indianapolis /robots.txt Python Requests 2.2 Linux 3.2.0
h DE Hamburg /tensorflow-vizdoom-bots.html Firefox 45.0 Windows 7
h DE Hamburg /theme/css/style.css Firefox 45.0 Windows 7
h DE Hamburg /theme/css/syntax.css Firefox 45.0 Windows 7
h DE Hamburg /theme/images/mark.jpg Firefox 45.0 Windows 7
b US Indianapolis /feeds/all.atom.xml rogerbot 1.0 Spider Spider Desktop
b US Mountain View /billion-nyc-taxi-kdb.html  Google IP
h CH Zurich /billion-nyc-taxi-rides-s3-vs-hdfs.html Chrome 56.0.2924 Windows 7
h IE Dublin /tensorflow-vizdoom-bots.html Chrome 56.0.2924 Mac OS X 10.12.0
h IE Dublin /theme/css/style.css Chrome 56.0.2924 Mac OS X 10.12.0
h IE Dublin /theme/css/syntax.css Chrome 56.0.2924 Mac OS X 10.12.0
h IE Dublin /theme/images/mark.jpg Chrome 56.0.2924 Mac OS X 10.12.0
b SG Singapore /./theme/images/mark.jpg Slack-ImgProxy Spider Spider Desktop Amazon IP