In a distributed system, some scenarios require globally unique IDs tied to the business, such as payment serial numbers and order numbers. A globally unique ID is also required once a database has been split into multiple tables, because the auto-increment counter of each table no longer guarantees uniqueness across all of them.

Requirements for distributed ID generation:

Before comparing the schemes, let's first clarify the properties a business ID should have. With these in mind, the approaches below are easier to understand.

1. Globally unique:

This is the basic requirement: no ID may be duplicated.

2. Numeric and trending upward

Each new ID should be larger than the previous one. This requirement comes from the MySQL storage engine: InnoDB stores rows in primary key order, so increasing keys are appended at the end of the index, which keeps write performance high.

3. Short length

Short IDs improve query efficiency. This also follows from MySQL database conventions, and matters most when the ID is the primary key.

4. Information security

If IDs are generated consecutively, they leak business information and can even be guessed, so they need to be irregular.

5. High availability and low latency

ID generation must be fast and withstand high concurrency, with latency low enough that it does not become a service bottleneck.

Database auto-increment ID

Advantages:

1. Simple to code; it relies only on the database.
2. Numeric IDs sort naturally, which helps with pagination and with results that need sorting.
3. The IDs are somewhat readable from a business point of view.

Defects:

1. It does not work with sharded tables: each table increments its own ID independently, so IDs are not unique across tables.
2. The number of records in the database can be inferred from the ID, which is unsuitable for data-sensitive scenarios.
3. When inserting into parent and child tables with the ID as the association, the parent table's max(ID) has to be read first, and in multi-threaded scenarios several threads may read the same max(ID). (Using the ID as the field that links parent and child tables is not recommended; use a field with business meaning instead.)

Applicable scenarios:

Suitable for small applications with no table sharding and low concurrency.

Optimization scheme

1. To remove the single point of the master database, use several master databases. Each master gets a different starting value, and all share a step size equal to the number of masters. For example, with three masters: Master1 generates 1, 4, 7, 10; Master2 generates 2, 5, 8, 11; Master3 generates 3, 6, 9, 12. This produces IDs that are unique across the cluster and greatly reduces the ID-generation load on each database.
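The start-and-step arithmetic can be checked with a few lines of Python. This is only a sketch of the numbering rule, not of a database setup; the function name `master_ids` and the choice of three masters are assumptions made for illustration.

```python
from itertools import islice


def master_ids(start, step):
    """Yield the IDs one master would hand out: start, start + step, ..."""
    current = start
    while True:
        yield current
        current += step


# Assumption: three masters, so the shared step size is 3 and the
# starting values are 1, 2 and 3.
master_count = 3
sequences = {
    f"Master{n}": list(islice(master_ids(n, master_count), 4))
    for n in range(1, master_count + 1)
}

# Flatten all generated IDs to confirm that no two masters collide.
all_ids = [value for ids in sequences.values() for value in ids]
```

Because every master starts at a different remainder modulo the step size, the sequences never overlap as long as the number of masters does not change.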

Auto-increment primary keys across multiple database instances

Above we briefly described auto-increment primary keys and discussed single-machine deployment. To make ID generation more efficient, we can scale out horizontally and spread the load of the single database across several instances. How is this done? On top of auto_increment, give each instance its own starting value and a shared step size, so that the IDs generated by the databases trend upward and never repeat.

Advantages:

1. Solves the single point problem of ID generation and balances the load.

Disadvantages:

1. The step size must be fixed in advance, which makes later expansion difficult, and the load on each individual database is still too high for high concurrency.

Applicable scenarios:

1. The data volume is small and the database will not need to be expanded.

Sequences

Introduction

This feature works at the database level and lets several tables in a single database share one series of sequence numbers. It covers the case where the tables are in the same database, but tables in different databases cannot share a sequence. Not all databases have sequences. For example, Oracle, DB2, and PostgreSQL have them, MySQL and Sybase do not, and SQL Server has supported them only since SQL Server 2012.

UUID

Introduction

A common approach. A UUID is 128 bits long and can be generated either by the database or by application code. It is, for practical purposes, globally unique.
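As a small illustration of generation in application code, Python's standard `uuid` module can produce a random UUID. The sketch below only shows the two usual representations and their sizes; the variable names are illustrative.

```python
import uuid

# uuid4() builds a random (version 4) UUID in application code.
new_id = uuid.uuid4()

as_text = str(new_id)    # canonical text form: 32 hex digits and 4 hyphens
as_bytes = new_id.bytes  # the same value as 16 raw bytes (128 bits)
```

The text form is 36 characters long while the binary form is 16 bytes, which is worth keeping in mind when choosing a column type.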

Advantages:

1. Simple to code.
2. Globally unique, so data migration, merging data from different systems, and database changes are easy to handle.

Defects:

1. UUIDs are unordered, so an increasing trend cannot be guaranteed.
2. UUIDs are usually stored as strings, which lowers query efficiency.
3. They take a lot of storage space. With massive data volumes, storage efficiency has to be considered.
4. More data is transmitted.
5. They are unreadable.

Applicable scenarios:

1. Generating tokens and similar values, which should be hard to recognize, show no readable order, and be long enough.
2. Scenarios that do not require purely numeric IDs, increasing order, or readability.

Solution

1. To address the unreadability of UUIDs, convert the UUID to an Int64.
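The text does not say how the conversion is done. One possible reading, shown here purely as an assumption, is to keep the first 8 bytes of the UUID and interpret them as a signed 64-bit integer. The helper name `uuid_to_int64` is hypothetical.

```python
import uuid


def uuid_to_int64(value):
    """Fold a UUID into a signed 64-bit integer by keeping its first 8 bytes.

    Assumption: this is one possible reading of "UUID to Int64". The other
    8 bytes are discarded, so the result is less unique than the full UUID.
    """
    return int.from_bytes(value.bytes[:8], byteorder="big", signed=True)


# A fixed sample value makes the conversion easy to follow by hand.
sample = uuid.UUID("12345678-1234-5678-1234-567812345678")
as_int64 = uuid_to_int64(sample)

# A freshly generated UUID always lands inside the signed 64-bit range.
random_int64 = uuid_to_int64(uuid.uuid4())
```

Since only 64 of the 128 bits survive, collisions become more likely than with the full UUID, so this trade-off should be weighed against the uniqueness requirement.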

GUID

GUID is Microsoft’s implementation of the UUID standard. UUID has other implementations as well, not only GUIDs. Its advantages and disadvantages are the same as those of UUID.

COMB

Introduction

COMB is a design idea specific to databases. It can be understood as an improved GUID that combines a GUID with the system time to get better indexing and retrieval performance. Databases have no COMB type; it was devised by Jimmy Nilsson in his article "The Cost of GUIDs as Primary Keys". The basic idea is as follows. UniqueIdentifier data is irregular, which makes indexing inefficient and hurts system performance. So keep the first 10 bytes of the UniqueIdentifier and use the last 6 bytes to represent the time (DateTime) at which the GUID was generated. This combines time information with the UniqueIdentifier, adding order while retaining uniqueness, and thereby improves indexing efficiency.

Advantages:

1. It solves the problem of unordered UUIDs. The COMB (combined GUID/timestamp) algorithm keeps 10 bytes of the GUID and uses the other 6 bytes to represent the time (DateTime) at which the GUID was generated.
2. Performance is better than with UUID.
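The 10 + 6 byte layout can be sketched in Python as follows. This is not Nilsson's original implementation: the time encoding (milliseconds since the Unix epoch, big-endian) and the function name `new_comb` are assumptions chosen to keep the example short.

```python
import time
import uuid


def new_comb(timestamp_ms=None):
    """Build a COMB-style value: 10 random GUID bytes + 6 timestamp bytes.

    Assumption: the time is encoded as milliseconds since the Unix epoch,
    big-endian, which simplifies the DateTime encoding described above.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    random_part = uuid.uuid4().bytes[:10]        # first 10 bytes of a GUID
    time_part = timestamp_ms.to_bytes(6, "big")  # last 6 bytes hold the time
    return uuid.UUID(bytes=random_part + time_part)


# Two values created one millisecond apart (hypothetical timestamps).
earlier = new_comb(timestamp_ms=1_700_000_000_000)
later = new_comb(timestamp_ms=1_700_000_000_001)
```

Comparing only the last 6 bytes orders the two values by creation time, while the first 10 bytes keep them distinct even within the same millisecond.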

Twitter’s Snowflake algorithm

Introduction

Snowflake is Twitter’s open-source distributed ID generation algorithm; its result is a 64-bit long ID. The idea is to use 41 bits for the timestamp in milliseconds, 10 bits for the machine ID (5 bits for the data center and 5 bits for the machine), and 12 bits for the sequence number within a millisecond (so each node can generate 4096 IDs per millisecond), plus one sign bit that is always 0. The Snowflake algorithm can be adapted to the needs of a project: estimate the future number of data centers, the number of machines per data center, and the peak concurrency within a single millisecond, then adjust the number of bits for each field accordingly.
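The bit layout described above can be written down as plain bit shifts. The sketch below only packs and unpacks the fields. The names `compose` and `decompose` are illustrative, and the timestamp is assumed to be milliseconds counted from whatever epoch the project chooses.

```python
TIMESTAMP_BITS = 41
DATACENTER_BITS = 5
WORKER_BITS = 5
SEQUENCE_BITS = 12

# Each field sits to the left of all the fields that follow it.
WORKER_SHIFT = SEQUENCE_BITS
DATACENTER_SHIFT = WORKER_SHIFT + WORKER_BITS
TIMESTAMP_SHIFT = DATACENTER_SHIFT + DATACENTER_BITS


def compose(timestamp_ms, datacenter_id, worker_id, sequence):
    """Pack the four fields into one non-negative 63-bit integer."""
    fields = (
        (timestamp_ms, TIMESTAMP_BITS),
        (datacenter_id, DATACENTER_BITS),
        (worker_id, WORKER_BITS),
        (sequence, SEQUENCE_BITS),
    )
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{value} does not fit in {bits} bits")
    return (
        (timestamp_ms << TIMESTAMP_SHIFT)
        | (datacenter_id << DATACENTER_SHIFT)
        | (worker_id << WORKER_SHIFT)
        | sequence
    )


def decompose(snowflake_id):
    """Split an ID back into (timestamp_ms, datacenter_id, worker_id, sequence)."""
    return (
        snowflake_id >> TIMESTAMP_SHIFT,
        (snowflake_id >> DATACENTER_SHIFT) & ((1 << DATACENTER_BITS) - 1),
        (snowflake_id >> WORKER_SHIFT) & ((1 << WORKER_BITS) - 1),
        snowflake_id & ((1 << SEQUENCE_BITS) - 1),
    )


example_id = compose(timestamp_ms=1, datacenter_id=1, worker_id=1, sequence=1)
largest_id = compose((1 << 41) - 1, 31, 31, 4095)
```

The four fields add up to 63 bits, so the largest possible value still fits in a signed 64-bit integer with the sign bit at 0. Changing the constants is how the bit allocation would be adapted to a different number of data centers or machines.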

Advantages:

1. It does not depend on a database, is flexible and convenient, and performs better than a database.
2. The timestamp is in the high bits, the machine ID is fixed in the middle, and the incrementing sequence is in the low bits, so the ID as a whole trends upward.
3. It can generate millions of distinct IDs per second, so performance is good.
4. The bit allocation can be adjusted flexibly to match the node layout of the business.

Disadvantages:

1. IDs are increasing on a single machine, but in a distributed environment the clocks of different machines cannot be perfectly synchronized, so IDs are sometimes not globally increasing. If a clock is rolled back, duplicate IDs can be generated. Implementations of this approach therefore throw an exception and refuse to generate IDs when the clock moves backwards, which may make the service unavailable.
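The clock-rollback guard can be sketched as a small generator class. This is a minimal illustration rather than Twitter's implementation: the class and exception names are hypothetical, timestamps are taken as milliseconds since the Unix epoch, and the clock is injectable so that a rollback can be simulated.

```python
import threading
import time


class ClockMovedBackwards(Exception):
    """Raised when the clock reports a time earlier than the last ID issued."""


class SnowflakeGenerator:
    def __init__(self, datacenter_id, worker_id, clock=None):
        if not (0 <= datacenter_id < 32 and 0 <= worker_id < 32):
            raise ValueError("datacenter_id and worker_id must fit in 5 bits")
        # 12 sequence bits on the right, then 5 worker bits, then 5 data center bits.
        self.node_bits = (datacenter_id << 17) | (worker_id << 12)
        self.clock = clock or (lambda: int(time.time() * 1000))
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                # Refuse to issue IDs rather than risk a duplicate.
                raise ClockMovedBackwards(f"{self.last_ms - now} ms behind")
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # 4096 IDs used in this millisecond: wait for the next one.
                    while now <= self.last_ms:
                        now = self.clock()
            else:
                self.sequence = 0
            self.last_ms = now
            return (now << 22) | self.node_bits | self.sequence


# Usage with a scripted clock (hypothetical values) that ends by moving backwards.
ticks = iter([1000, 1000, 1001, 999])
generator = SnowflakeGenerator(datacenter_id=1, worker_id=2, clock=lambda: next(ticks))
issued = [generator.next_id() for _ in range(3)]
try:
    generator.next_id()
    rollback_detected = False
except ClockMovedBackwards:
    rollback_detected = True
```

Raising an exception keeps IDs unique at the cost of availability, which is exactly the trade-off described above. Waiting until the clock catches up is a possible alternative for small rollbacks.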

Applicable scenarios:

1. The obvious drawback of the Snowflake algorithm is its dependence on the clock. Generating distributed IDs this way is feasible as long as the machine clock is never rolled back, and small-scale systems can certainly use it.

Redis-generated IDs

Introduction

When generating IDs with a database is not fast enough, we can try Redis. Because Redis executes commands on a single thread, it can be used to generate globally unique IDs with its atomic operations INCR and INCRBY. A Redis cluster can be used for higher throughput. Suppose a cluster has five Redis instances: initialize them with the values 1, 2, 3, 4, and 5 respectively and use a step size of 5, so that the first instance generates 1, 6, 11, ..., the second 2, 7, 12, ..., and so on.

Requests can be load-balanced to any instance, but once this layout is fixed it is hard to change later. In practice, 3 to 5 servers are usually enough, and each of them returns different IDs. The step size and the initial values must be decided in advance. Using a Redis cluster also addresses the single point of failure.

Redis is also well suited to generating daily serial numbers that start from 0. For example, order number = date + that day’s incrementing number: create one key per day in Redis and increment it with INCR.
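The daily order number can be sketched without a running Redis server by using an in-memory stand-in. The class `FakeCounterStore`, the key format, and the six-digit padding are all assumptions for illustration. With a real client such as redis-py, the `incr` call would be sent to Redis as an INCR command.

```python
from datetime import date


class FakeCounterStore:
    """In-memory stand-in for Redis; incr mimics the INCR command."""

    def __init__(self):
        self.values = {}

    def incr(self, key):
        # Like INCR, a missing key counts as 0 before the increment.
        self.values[key] = self.values.get(key, 0) + 1
        return self.values[key]


def next_order_number(store, day):
    """Return the date followed by that day's serial number."""
    key = f"order:seq:{day:%Y%m%d}"  # hypothetical key naming, one key per day
    # INCR returns 1 on the first call, so subtract 1 to start from 0.
    serial = store.incr(key) - 1
    return f"{day:%Y%m%d}{serial:06d}"


store = FakeCounterStore()
first = next_order_number(store, date(2024, 1, 31))
second = next_order_number(store, date(2024, 1, 31))
next_day = next_order_number(store, date(2024, 2, 1))
```

Because each day uses its own key, the serial number restarts automatically when the date changes. In a real deployment the old keys could be given an expiry so they do not accumulate.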

Advantages:

1. IDs increase sequentially and are readable.
2. It meets moderate performance requirements.

Disadvantages:

1. The strong dependence on Redis may introduce a single point problem.
2. It consumes bandwidth, and network latency and similar issues have to be taken into account.

Applicable scenarios:

1. Small, lightweight businesses whose performance requirements are not too high. The approach also places demands on Redis operations: watch out for network problems and for load on a single point. A distributed deployment raises even more issues, so this approach is rarely used in general.

In fact, the reliability of the Redis scheme needs further study. It depends on the network, so latency, failures, or downtime may make the service unavailable, and this risk has to be considered in the system design.

Meituan’s Leaf scheme

The distributed ID schemes above each solve certain problems, but all have obvious defects. Meituan therefore optimized the database scheme and proposed a variant named Leaf-segment.

In the original scheme, the database has to be read every time an ID is needed, which easily puts pressure on the database under high concurrency and large data volumes. Can we fetch a batch of IDs at once, so that the database does not have to be accessed so often?

Leaf implementation scheme

The Leaf-segment solution is to fetch a whole ID segment each time. Only after the segment is used up does the service go back to the database for a new one, which greatly reduces the pressure on the database. How is that done? Very simply, we design a table as follows:
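+-------------+--------------+------+-----+-------------------+-----------------------------+
| Field       | Type         | Null | Key | Default           | Extra                       |
+-------------+--------------+------+-----+-------------------+-----------------------------+
| biz_tag     | varchar(128) | NO   | PRI |                   |                             |
| max_id      | bigint(20)   | NO   |     | 1                 |                             |
| step        | int(11)      | NO   |     | NULL              |                             |
| desc        | varchar(256) | YES  |     | NULL              |                             |
| update_time | timestamp    | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+-------------+--------------+------+-----+-------------------+-----------------------------+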
biz_tag distinguishes business lines. max_id is the largest ID of the segments allocated to that biz_tag so far, step is the length of each allocated segment, and desc and update_time hold the business description and the time the segment was last updated. Previously the database had to be accessed for every ID. Now we only need to set step to a reasonably large value, such as 1000, and the database is accessed only after 1000 IDs have been used up. The whole process of obtaining distributed IDs can then be designed as follows:

1. The user service needs a user ID when registering a user, so it calls the interface of the ID generation service (a separate application).
2. The ID generation service queries the database for the row with biz_tag user_tag. At this point max_id = 0 and step = 1000.
3. The ID generation service returns max_id and step to the user service and updates max_id to max_id + step, that is, 1000.
4. The user service receives max_id = 0 and step = 1000.
5. The user service may use the IDs in the interval [max_id + 1, max_id + step], that is, [1, 1000].
6. The user service saves this interval in the JVM.
7. When the user service needs an ID, it takes IDs in sequence from [1, 1000], for example with the getAndIncrement method of AtomicLong.
8. When the interval is used up, the user service calls the ID generation service again, receives max_id = 1000, and can use the interval [max_id + 1, max_id + step], that is, [1001, 2000].

The starting point of max_id and the step size can both be customized, which makes the scheme flexible and easy to expand. It also relieves the pressure on the database: because the ID segment is held in the JVM, performance is very good, and availability is acceptable too, since the cached segment lets the system hold out for a while even if the database goes down.
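The segment flow can be tried out with an in-memory SQLite database standing in for MySQL. This is a sketch under several assumptions: the table name `leaf_alloc`, the helper names, and the omission of the desc and update_time columns are all choices made for brevity, and the client is single-threaded where the text uses AtomicLong in the JVM.

```python
import sqlite3


def create_table(conn):
    # Only the three columns the allocation logic needs are created here.
    conn.execute(
        "CREATE TABLE leaf_alloc ("
        " biz_tag TEXT PRIMARY KEY,"
        " max_id INTEGER NOT NULL,"
        " step INTEGER NOT NULL)"
    )


def fetch_segment(conn, biz_tag):
    """Claim the next segment and return (first_id, last_id), both inclusive."""
    with conn:  # the UPDATE opens a transaction that is committed on exit
        conn.execute(
            "UPDATE leaf_alloc SET max_id = max_id + step WHERE biz_tag = ?",
            (biz_tag,),
        )
        max_id, step = conn.execute(
            "SELECT max_id, step FROM leaf_alloc WHERE biz_tag = ?", (biz_tag,)
        ).fetchone()
    return max_id - step + 1, max_id


class SegmentIdClient:
    """Caches one segment in memory and hands out IDs from it (not thread-safe)."""

    def __init__(self, conn, biz_tag):
        self.conn = conn
        self.biz_tag = biz_tag
        self.ids = iter(())  # empty until the first segment is fetched

    def next_id(self):
        value = next(self.ids, None)
        if value is None:
            # Segment used up (or never loaded): go back to the database.
            first, last = fetch_segment(self.conn, self.biz_tag)
            self.ids = iter(range(first, last + 1))
            value = next(self.ids)
        return value


conn = sqlite3.connect(":memory:")
create_table(conn)
conn.execute("INSERT INTO leaf_alloc VALUES ('user_tag', 0, 1000)")
conn.commit()

client = SegmentIdClient(conn, "user_tag")
ids = [client.next_id() for _ in range(1001)]
stored_max_id = conn.execute("SELECT max_id FROM leaf_alloc").fetchone()[0]
```

Handing out 1001 IDs touches the database only twice: once for [1, 1000] and once for [1001, 2000]. Updating max_id before reading it, inside one transaction, is what keeps two services from being given the same segment.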

Advantages:

1. It expands flexibly and performs well enough to support most business scenarios.
2. IDs trend upward, which meets the storage and query performance requirements of the database.
3. High availability. Even if the ID generation server is unavailable, services can keep running for a short period, which buys time for troubleshooting.
4. The value of max_id can be customized, which facilitates business migration and horizontal scaling.

Disadvantages:

1. The IDs are not random enough. Fully sequential IDs may cause security problems.
2. A DB outage may still make the whole system unavailable, because a cached segment lasts only for a limited time.
3. In a distributed environment, several nodes may compete to allocate ID segments at the same time, which can lead to concurrency problems and duplicate IDs.

These disadvantages deserve attention, and the Meituan technical team came up with a clever solution: the double buffer. As mentioned above, several nodes may request an ID segment at the same time, and it is better to avoid that. Leaf-segment therefore changed from holding one ID segment to holding two. When one segment is used up, the service does not need to fetch a new segment at that moment but switches to the second one, which effectively resolves the conflict. Concretely, when 10% of the current segment has been consumed, the service checks whether the next segment is ready and fetches it if it is not. After the switch, when 10% of that segment has been consumed, the segment after it is checked in the same way, and so on.

Optimized implementation scheme

1. IDs are currently served from buffer1; every request takes its ID from buffer1.
2. When the IDs in buffer1 have been used up to 100, 10% of the range (with step = 1000) has been consumed.
3. Once 10% is reached, check whether buffer2 has already been filled. If not, immediately start a thread that requests a new segment and stores it in buffer2.
4. When buffer1 runs out, the service switches automatically to buffer2.
5. When 10% of buffer2 has been used, a thread is started again to fetch a segment and store it in buffer1.
6. The two buffers alternate in this way.
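The switching logic in these steps can be sketched as follows. To keep the example deterministic, the next segment is fetched synchronously at the 10% mark instead of in a separate thread, which is a simplification of the scheme. The class name `DoubleBufferIds` and the stub `fake_fetch` are hypothetical.

```python
class DoubleBufferIds:
    """Serves IDs from one segment while a second one is kept ready."""

    def __init__(self, fetch_segment, threshold_percent=10):
        self.fetch_segment = fetch_segment  # returns (first_id, last_id)
        self.threshold_percent = threshold_percent
        self.first, self.last = fetch_segment()
        self.next_value = self.first  # next ID to hand out
        self.standby = None           # prefetched (first_id, last_id) or None

    def next_id(self):
        if self.next_value > self.last:
            # Current buffer is exhausted: switch to the standby buffer.
            if self.standby is None:
                self.standby = self.fetch_segment()
            self.first, self.last = self.standby
            self.next_value = self.first
            self.standby = None
        value = self.next_value
        self.next_value += 1
        used = self.next_value - self.first
        size = self.last - self.first + 1
        if self.standby is None and used * 100 >= size * self.threshold_percent:
            # Assumption: fetched inline here; the text uses a separate thread.
            self.standby = self.fetch_segment()
        return value


fetch_calls = []


def fake_fetch():
    """Stub for the database call: hands out consecutive segments of 1000."""
    first = len(fetch_calls) * 1000 + 1
    fetch_calls.append(first)
    return first, first + 999


buffers = DoubleBufferIds(fake_fetch)
first_99 = [buffers.next_id() for _ in range(99)]
calls_before_threshold = len(fetch_calls)
hundredth = buffers.next_id()
calls_after_threshold = len(fetch_calls)
remaining = [buffers.next_id() for _ in range(900)]
first_of_next_segment = buffers.next_id()
```

The second segment is requested as soon as the 100th ID is handed out, so by the time ID 1000 is used the switch to [1001, 2000] needs no database call.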

The double buffer scheme is well thought out: a separate thread watches for when the next buffer should be filled, and switching between the two buffers avoids the concurrency problems that could arise from updating the segment in the database at the last moment. This increases the availability of business IDs held in the JVM. The recommended segment length is 100 times the peak QPS of the business (a rule of thumb; set it according to your own business), so that even if the DB goes down, ID generation can continue for a fairly long time, and occasional problems such as network jitter are tolerated.

Advantages:

1. The basic problems of the database scheme are solved, and the scheme works in practice.
2. Storing the double-buffered segments in the JVM reduces database queries and network dependence and improves efficiency.

Disadvantages:

1. The segment length is fixed. Under heavy traffic, the allocated segment may be used up in a short time, so segments are fetched frequently.
2. If the segment length is too large and a node’s cached segment is not yet used up, the segments obtained by other nodes contain IDs far larger than the ones that node is still handing out.

To address these shortcomings, Meituan proposed dynamically adjusting the segment length.

Dynamically adjust Step

In general, if your business has no obvious peaks and troughs, you need not care much about adjusting the step, because a stable business settles on a roughly fixed step over the long run. But if, like Meituan, you run promotions with pronounced traffic swings, the step needs enough flexibility to adapt to the highs and lows of different periods.

Implementation scheme

Assume the service QPS is Q, the segment length is L, and the segment update period is T, so that Q * T = L. At first L is fixed, so as Q grows, T gets smaller and smaller. What this scenario really requires, however, is a fixed T. If L is made positively correlated with Q, T can approach a fixed value. Therefore, each time the segment is updated, the program determines the next segment length nextStep from the period T of the last update and the current segment length step. A simple algorithm illustrates the idea of dynamic updating:

T < 15min: nextStep = step * 2
15min < T < 30min: nextStep = step
T > 30min: nextStep = step / 2

Of course, when traffic suddenly jumps by tens or hundreds of times, this scheme still cannot guarantee that the system keeps running stably while the database is unavailable for a period of time. In essence, although the scheme adds some fault tolerance in front of the DB layer, handing out ID segments ultimately still depends on the DB, so sufficient effort must still go into the high availability of the database.
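The adjustment rule can be written as a small function. The text leaves the exact boundaries (T equal to 15 or 30 minutes) open, so the sketch assigns them to the upper branch. The `min_step` and `max_step` bounds are additional assumptions that keep the step from collapsing to 0 or growing without limit.

```python
def next_step(step, minutes_since_last_update, min_step=1, max_step=1_000_000):
    """Pick the next segment length from the last update period T.

    Assumptions: T is given in minutes, the boundary values 15 and 30 fall
    into the upper branch, and min_step / max_step are illustrative bounds.
    """
    if minutes_since_last_update < 15:
        candidate = step * 2   # segment was used up quickly: grow it
    elif minutes_since_last_update < 30:
        candidate = step       # update period is in the target range: keep it
    else:
        candidate = step // 2  # segment lasted a long time: shrink it
    return max(min_step, min(max_step, candidate))


grown = next_step(1000, minutes_since_last_update=10)
kept = next_step(1000, minutes_since_last_update=20)
shrunk = next_step(1000, minutes_since_last_update=45)
```

Since Q * T = L, doubling L when T drops below 15 minutes roughly doubles T again at the same QPS, which is how the rule pulls the update period back towards a fixed range.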