Technical column: github.com/yongxinz/te…

You are also welcome to follow my WeChat official account, AlwaysBeta, where more great content awaits.

At work I often run into a requirement: look up the location information for an IP address based on IP address segments. Putting the lookup in a relational database causes heavy IO and the speed falls short, so that is clearly not a good fit.

So what better options are there? I have made a few attempts, detailed below.

Build an index file

I came across the IP2Region project on GitHub. The author generates a single file containing a two-level index to achieve fast lookups, fast enough to be millisecond level. But if you want to update the address segments or the location information, you have to regenerate the whole file every time, which is not very convenient.

That said, I still recommend taking a look at the project; its indexing idea is worth learning. The author's open source repository only contains the query code, not the code that generates the index file, so I wrote a generator based on the schematic diagram, as follows:

# -*- coding:utf-8 -*-


import time
import socket
import struct

IP_REGION_FILE = './data/ip_to_region.db'

SUPER_BLOCK_LENGTH = 8
INDEX_BLOCK_LENGTH = 12
HEADER_INDEX_LENGTH = 8192


def generate_db_file():
    pointer = SUPER_BLOCK_LENGTH + HEADER_INDEX_LENGTH

    region, index = '', ''

    # File format
    # 1.0.0.0|1.0.0.255|Australia|0|0|0|0
    # 1.0.1.0|1.0.3.255|China|0|Fujian|Fuzhou|China Telecom
    with open('./ip.merge.txt', 'r') as f:
        for line in f.readlines():
            item = line.strip().split('|')
            print item[0], item[1], item[2], item[3], item[4], item[5], item[6]
            start_ip = struct.pack('I', struct.unpack('!L', socket.inet_aton(item[0]))[0])
            end_ip = struct.pack('I', struct.unpack('!L', socket.inet_aton(item[1]))[0])
            region_item = '|'.join([item[2], item[3], item[4], item[5], item[6]])
            region += region_item

            # Pack the region record length into the high 8 bits and its file offset into the low 24 bits
            ptr = struct.pack('I', int(bin(len(region_item))[2:].zfill(8) + bin(pointer)[2:].zfill(24), 2))
            index += start_ip + end_ip + ptr
            pointer += len(region_item)

    # Super block: the start and end offsets of the index area (the end points at the last 12-byte index block)
    index_start_ptr = pointer
    index_end_ptr = pointer + len(index) - 12
    super_block = struct.pack('I', index_start_ptr) + struct.pack('I', index_end_ptr)

    n = 0
    header_index = ''
    # Header index: for every 8184 bytes of index (682 blocks), record the first start IP and that chunk's file offset
    for index_block in range(pointer, index_end_ptr, 8184):
        header_index_block_ip = index[n * 8184:n * 8184 + 4]
        header_index_block_ptr = index_block
        header_index += header_index_block_ip + struct.pack('I', header_index_block_ptr)

        n += 1

    header_index += index[len(index) - 12: len(index) - 8] + struct.pack('I', index_end_ptr)

    with open(IP_REGION_FILE, 'wb') as f:
        f.write(super_block)
        f.write(header_index)
        f.seek(SUPER_BLOCK_LENGTH + HEADER_INDEX_LENGTH, 0)
        f.write(region)
        f.write(index)


if __name__ == '__main__':
    start_time = time.time()
    generate_db_file()

    print 'cost time: ', time.time() - start_time
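
The author's repository already provides the query code, so I won't repeat it here, but to show how the generated file is meant to be read, here is a minimal lookup sketch of my own (not the IP2Region implementation): it reads the super block, skips the header index, and binary-searches the 12-byte index blocks directly. The file name search_db.py is just for illustration.

# search_db.py
# -*- coding:utf-8 -*-

import sys
import socket
import struct

IP_REGION_FILE = './data/ip_to_region.db'


def search(ip_str):
    ip = struct.unpack('!L', socket.inet_aton(ip_str))[0]
    with open(IP_REGION_FILE, 'rb') as f:
        data = f.read()

    # Super block: start and end offsets of the index area
    index_start, index_end = struct.unpack('II', data[:8])

    # Binary search over the 12-byte index blocks
    low, high = 0, (index_end - index_start) // 12
    while low <= high:
        mid = (low + high) // 2
        offset = index_start + mid * 12
        start_ip, end_ip, ptr = struct.unpack('III', data[offset:offset + 12])
        if ip < start_ip:
            high = mid - 1
        elif ip > end_ip:
            low = mid + 1
        else:
            # High 8 bits: region length, low 24 bits: region offset
            length, pos = ptr >> 24, ptr & 0x00FFFFFF
            return data[pos:pos + length]
    return None


if __name__ == '__main__':
    print search(sys.argv[1])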

Use a Redis cache

Currently, there are two ways to cache IP and location information:

The first is to convert the start IP, the end IP, and every IP in between into integers, then store each converted IP as a key and the location information as the value in Redis as plain strings.

The second is to add the start and end IP addresses to a sorted set, ip2cityid, with the city ID as the member and the converted IP as the score, and then add the city ID and location information to a hash, cityid2city, with the city ID as the key and the location information as the value.

The first way needs little further introduction: it is simple and crude, and I really do not recommend it. Queries are certainly fast, millisecond level, but the drawbacks are just as obvious: in a test with only 1,000 records, caching took a long time, about 20 minutes, and used a lot of space, nearly 1 GB.
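
To make that blow-up concrete, here is a rough sketch of the first approach (my own illustration, not code from any project; the ip:<integer> key format and the cache_all_ips.py name are made up):

# cache_all_ips.py
# -*- coding:utf-8 -*-

import json
import socket
import struct
from redis import Redis

conn = Redis(host='127.0.0.1', port=6379, db=10)


def cache_segment(start_ip, end_ip, info):
    start = struct.unpack('!L', socket.inet_aton(start_ip))[0]
    end = struct.unpack('!L', socket.inet_aton(end_ip))[0]
    value = json.dumps(info)
    # One key per IP address in the segment -- this is why load time and memory explode
    for ip in xrange(start, end + 1):
        conn.set('ip:%d' % ip, value)


def query(ip_str):
    ip = struct.unpack('!L', socket.inet_aton(ip_str))[0]
    return conn.get('ip:%d' % ip)


if __name__ == '__main__':
    cache_segment('1.0.1.0', '1.0.3.255', ['China', '0', 'Fujian', 'Fuzhou'])
    print query('1.0.2.15')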

Here is the second way; let's look directly at the code:

# generate_to_redis.py
# -*- coding:utf-8 -*-

import time
import json
from redis import Redis


def ip_to_num(x):
    return sum([256 ** j * int(i) for j, i in enumerate(x.split('.')[::-1])])


# connect Redis
conn = Redis(host='127.0.0.1', port=6379, db=10)

start_time = time.time()

# File format
# 1.0.0.0|1.0.0.255|Australia|0|0|0|0
# 1.0.1.0|1.0.3.255|China|0|Fujian|Fuzhou|China Telecom
with open('./ip.merge.txt', 'r') as f:
    i = 1
    for line in f.readlines():
        item = line.strip().split('|')
        # Add the start IP and end IP to the sorted set ip2cityid:
        # the members are the city ID and the city ID + '#', and the scores are the integer values of the IPs
        conn.zadd('ip2cityid', str(i), ip_to_num(item[0]), str(i) + '#', ip_to_num(item[1]) + 1)
        # Add the city info to the hash cityid2city: the key is the city ID and the value is the city info as JSON
        conn.hset('cityid2city', str(i), json.dumps([item[2], item[3], item[4], item[5]]))

        i += 1

end_time = time.time()

print 'start_time: ' + str(start_time) + ', end_time: ' + str(end_time) + ', cost time: ' + str(end_time - start_time)
# test.py
# -*- coding:utf-8 -*-

import sys
import time
import json
import socket
import struct
from redis import Redis

# connect Redis
conn = Redis(host='127.0.0.1', port=6379, db=10)

# Convert the IP to an integer
ip = struct.unpack('!L', socket.inet_aton(sys.argv[1]))[0]

start_time = time.time()
# Scan the sorted set from the largest score down, taking the first member whose score is less than or equal to the input IP
cityid = conn.zrevrangebyscore('ip2cityid', ip, 0, start=0, num=1)
# If cityid is empty, or the matched member ends with '#', no address segment was found
if not cityid or cityid[0].endswith('#'):
    print 'no city info...'
else:
    # Get the city information from the hash by city ID
    ret = json.loads(conn.hget('cityid2city', cityid[0]))
    print ret[0], ret[1], ret[2]

end_time = time.time()
print 'start_time: ' + str(start_time) + ', end_time: ' + str(end_time) + ', cost time: ' + str(end_time - start_time)
# python generate_to_redis.py 
start_time: 1554300310.31, end_time: 1554300425.65, cost time: 115.333260059
# python test.py 1.0.16.0
Japan 0 0
start_time: 1555081532.44, end_time: 1555081532.45, cost time: 0.000912189483643

The test data is about 500,000 records: caching takes less than 2 minutes, memory usage is about 182 MB, and queries return in milliseconds. Clearly, this approach is worth trying.

The time complexity of ZREVRANGEBYSCORE is O(log(N)+M), where N is the cardinality of the sorted set and M is the size of the result set. The larger N is, the slower the query; exactly how much data can still be queried efficiently would need to be verified, but even with two members per segment (about a million members here) log(N) is only around 20 comparisons, so I don't think this is something to worry about.
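
If the endswith('#') check in test.py looks opaque, a toy example may make it clearer (my own illustration on a separate demo key, using the same legacy redis-py zadd call style as above, where member/score pairs are passed positionally):

# demo_gap.py
# -*- coding:utf-8 -*-

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379, db=10)

# Two segments with a gap between them:
#   segment 1: 1.0.0.0 - 1.0.0.255 -> member '1' (score 16777216), '1#' (score 16777471 + 1)
#   segment 2: 1.0.2.0 - 1.0.3.255 -> member '2' (score 16777728), '2#' (score 16778239 + 1)
conn.zadd('demo_ip2cityid', '1', 16777216, '1#', 16777472)
conn.zadd('demo_ip2cityid', '2', 16777728, '2#', 16778240)

# 1.0.0.10 (16777226): the highest score <= 16777226 belongs to '1' -> inside segment 1
print conn.zrevrangebyscore('demo_ip2cityid', 16777226, 0, start=0, num=1)

# 1.0.1.10 (16777482): the highest score <= 16777482 belongs to '1#' -> falls in the gap, no match
print conn.zrevrangebyscore('demo_ip2cityid', 16777482, 0, start=0, num=1)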

That's all.

Please give it a Star on GitHub: My tech blog