Simple ip2Region implementation

In the application of mobile Internet, it is often necessary to do some statistical analysis of user side information according to the user’s location information. To get the user’s location information, there are generally two methods: GPS location information and user IP address. Since every phone doesn’t always have GPS on, and sometimes precise location (down to city level) isn’t necessary, it’s a good idea to start by analyzing a user’s location based on IP address. To do this, you need a mapping library between IP and location and rely on it to start an IP to location service. This paper starts with the requirements and analyzes the design of the mapping database and how IP can be quickly converted into geographical location in combination with the IP2Region with 8.4K stars on Github.

introduce

IP location services are very common, and many companies offer similar paid services, such as Ali, Autonavi, Baidu, etc. Of course, there are also open free services, such as GeoIP, Pure IP, etc. These services are either parsed through an HTML page or requested through an interface, but either way they can’t do without an HTTP request, not to mention the fact that most services restrict QPS. The following table enumerates some common ways to get an IP address.

Open API services	way	limit	The sample
Taobao IP address library	interface	QPS per user should be less than 1	`Curl - d "IP = 218.97.9.25 & the accessKey = alibaba - inc" http://ip.taobao.com/outGetIpInfo`
Scott map	interface	There are 100,000 access limits per user per day and 30 million access limits for enterprise developers	`The curl "https://restapi.amap.com/v3/ip?ip=218.97.9.25&key=f4cf14aca974dfbb0501c582ce3fce77"`
GeoIP	HTML parsing	`Curl -d "IP =218.97.9.25&submit= submit" https://www.geoip.com`
Pure IP	HTML parsing	`The curl http://www.cz88.net/ip/?ip=218.97.9.25`

In daily work, it is usually necessary to convert the IP in a large number of user request logs into user location information for subsequent analysis. The key to this is a large amount of data and fast processing. Obviously, we can’t meet our daily needs by requesting API public services every time.

Brute force generates IP database

For daily needs, a simple and crude approach is to obtain the location information corresponding to all public IP addresses through the API in advance. According to the TIPS below, we can estimate that it will take 10 years to traverse the 330 million domestic IP addresses by accessing the Taobao IP address database. For autonavi enterprise users, it takes about 11 days to traverse domestic IP addresses. That feels like an acceptable 11 days.

IP addresses are currently referred to as IPv4, which uses 32-bit (4-byte) addresses, so the address space is approximately 4.29 billion 232=42949672962^32=4294967296232= 4294967296296.

However, some addresses are reserved for special purposes, such as private networks (about 18 million) and multicast addresses (about 270 million), which reduce the number of addresses available on the Internet. According to wiki, there are 330 million IPv4 users in China, compared with 1.54 billion in the United States.

Here we agreed location information data format: national | | | | area provinces city ISP, if there is no corresponding interface returned in the field of information, the corresponding field in the filling 0. So we can obtain the following file data by sequential request API service (address in ascending order) :

0.0.0.0 | | | 0 0 0 | | network IP network IP 0.0.0.1 | | | 0 0 0 | | internal network IP network IP... 1.0.15.255 | | 0 | China telecom guangdong | guangzhou |... 255.255.255.255 | | | 0 0 0 | | network IP network IPCopy the code

Once you have this file, you can read it into memory, save it with a dictionary, the key is the IP address, and the value is the location information. A program can return location information in O(1) time, but the size of the program or file can be roughly calculated. Suppose we use utf-8 for storage, a record is shortest 0.0.0.0 | | | | 0 0 0 0 | 0, 17 bytes, the size of the IP library files for 17 * 4294967296 = 73014444032 B = 71303 MB = 71 gb. This size is unacceptable for any program.

Space optimization

IP library file optimization

From the above file data, it is found that a large number of adjacent IP addresses have the same location information (customers will try to connect together when applying for a segment of IP addresses), so we can combine such records into one record. File data in ascending order:

0.0.0.0 | 0.255.255.255 | | | 0 0 0 | | internal network IP network IP... 1.0.8.0 | 1.0.15.255 | | 0 | | | in guangzhou, guangdong province China telecom... "| 255.255.255.255 | | | 0 0 0 | | internal network IP network IPCopy the code

The latest ip.merge. TXT file in the ip2Region library contains 658207 records, and the file size is 39 MB.

IP Address optimization

From the above file data, it is found that a large number of IP addresses are stored as strings, whereas IPv4 uses 32-bit addresses. The shortest string 0.0.0.0 is 7 bytes, and the longest string 111.111.111.111 is 15 bytes. If converted to an integer, they both take up 4 bytes. 0.0.0.0 is int(0), and 111.111.111.111 is int(1869573999).

Location information optimization

From the above file data found the location of the same information is corresponding to different IP section (the customer may in different period of time to apply for IP section), so there are still a lot of location information in the IP library file, we can only keep a place in the memory information, and use a pointer or file offset + data length to obtain the corresponding location information.

The IP address library is optimized

Based on the above optimization, we can generate the final IP library: ip2region.db, which is only 8.1m.

Structure of IP library

The structure of IP library file ip2Region. db is divided into four parts: super block, header index area, data area, and index area. The details are shown in the figure below:

Super block

The first index pointer points to the beginning position of the index block, which is the first index block of the first partition, and the last index pointer points to the end position of the index block -12, which is the head address of the last index block of the last index partition. In this way, 8 bytes of the super block can be directly read during the query, and the address range of the index block can be quickly obtained.

The header index area

The header index is a secondary index to the index block for b+tree searches. The total length of the index partition divided by the length of the index partition 12 x (1024 x 8/12-1) is the actual number of header indexes. The area is 2048 x 8 bytes in size and consists of 2048 8 bytes header index blocks. Header The first four bytes of an index block store the starting IP value of the first index block in each index partition, and the last four bytes point to the address of that index block. The header index area is defined as close to 16K, because the entire header index area can be read through the disk four times, and then queried in the memory. The query result can determine that the IP address is in an index partition of the index area, and then read the 8K index partition into the memory twice according to the address, and then query in the memory. This reduces the number of disk reads.

Data area

Save the data, data format is as follows: China | | | | in shenzhen city, guangdong province in southern China, Dr Peng said countries and regions, provinces, cities, operators

The index area

The index area is composed of index blocks. Each index block occupies 12 bytes, including the start IP address, end IP address, and data information. The first three bytes of the data information store the data address, and the last byte stores the data length. Each index block corresponds to a record in ip.merge. TXT, which represents the index of an IP segment. When the specified IP is between the start IP address and the end IP address of an index block, the index is matched. By using the data address and data length in the index block, the corresponding location information data can be read from ip2region.db.

IP library generation

The Github repository of ip2Region provides the generation process of ip2Region. db, which is written in JAVA. The class diagram is as follows:

Through familiar with generating ip2Region. db source code, a brief description of its generation process is as follows:

Use RandomAccessFile to reserve an 8-bytes super block and a 2048*8 bytes header index in the file
Scan the ip.merge. TXT file and do the following for each record:

Based on the start IP, end IP, and data of each record, an index block is generated. The first four bytes store the start IP, the middle four bytes store the end IP, and the last four bytes store the address of the data that has been calculated (written via RandomAccessFile, here maintaining a dictionary of location information to the location of the file, Ensure that the same location information is written only once. And temporarily store the index block in the Linked list indexPool. This step determines all the location information of the data area. 3. Scan all records in ip.merge. TXT and write all index blocks in indexPool to the end of the data area. In this process, int(1024*8/12-1)= 681 index blocks are formed into an index partition, and the start IP and address information (header block) of the first index block in each index partition is recorded and stored in the Linked list of headerPool. The start and end positions of the index area are also recorded. 4. Adjust RandomAccessFile to point to the beginning of the file, and write the starting position of the index area to the first four bytes of the super block, and the end position to the last four bytes of the super block. 5. Continue to write the header block in headerPool to the header area. 6. Adjust RandomAccessFile to point to the end of the file and write the timestamp and copyright information.

Tip: The ip2Region repository also uses global_region.csv data. This file has 5 columns (line number,, region,, zip code), which correspond to the specific information of the region, and can be filled into the information of each location in the data area.

Quick search

Ip2region provides three query algorithms. The worst query time is ms level. Binary search, b+tree search, binary search. Time consuming increases in turn. The search structure diagram is as follows:

Binary search

The super block is used to get the start and end positions of the index area, and each index block is 12 bytes with increasing IP addresses, so binary search can be used to quickly get the location information. The steps are as follows:

Convert the IP value to an integer using the ip2long method
Read the super block to obtain the start and end positions of the index range. Subtract +1 from the two to obtain the total number of index blocks
Using dichotomy to solve directly, comparing the size of the start IP, the end IP and the current IP in the index block, we can find the corresponding index block of the IP, and get the data address and data length according to the next four bytes of the index block, so as to get the location information.

B + tree search

The b+tree search uses the header index area. The first step is to use binary search in the header index area. After locating an index partition, use binary search in the corresponding index partition. Compared to binary search, it is faster because the number of disk reads is much lower than binary search. The steps are as follows:

Convert an IP value to an integer using ip2Long
Use dichotomy to search in the header index area and compare the corresponding header index block and its corresponding index partition.
Read the corresponding index partition, and then through dichotomy to the corresponding index block, so as to obtain the location information.

Binary search based on memory

This method is similar to the binary search method, except that the former reads all ip2Region. db into memory, while the latter reads ip2region.db file continuously.

conclusion

The ip2Region library solves only a very common IP location problem, but makes the service small and fast (and of course offers a multi-language client), thus getting a star of 8.4K on Github.

Small memory footprint

The location information of adjacent IP addresses is the same. The IP address segment is used to ensure that the adjacent IP addresses correspond to the same location information to prevent the location information from being stored repeatedly
IP is converted to INT, just as the string 111.111.111.111 is converted to INT (1869573999), shrinking from 15 bytes to 4 bytes
Different IP segments also have the same location information. The pointer is used to point to the specific location information. Ensure that the location information is saved only once (the full scan is stored in the dictionary).

Fast search

IP order, using binary search to reduce the time complexity to O(logN)
The second level index header uses the index area to reduce the frequency of disk read and write. First determine the index partition, then determine the index location from the index partition, and then determine the location information data.

Multilingual client support

Support for Java, C#, PHP, C, Python, nodeJS, PHP extensions (php5 and php7), Golang, rust, Lua, Lua_c, nginx.

reference

Ip2region Database file structure and principle
Ip2region source
Wikipedia for ipv4
IPv4 address assignment list of countries
Amap API
Baidu Map API

If this article has helped you, or you are interested in technical articles, you can pay attention to the wechat public number: Technical tea Party, can receive the relevant technical articles in the first time, thank you!

This article was automatically published by ArtiPub, an article publishing platform