motivation

Why did you think of using UUID as the database primary key? Consider a scenario where there are multiple production environments running separately, sometimes deriving from one environment to another, and therefore primary keys are required to be non-conflicting. Self-incrementing integers cannot satisfy this requirement. How about a central transmitter service? Many Internet companies have done this. However, in the above scenario, each production environment relies on the same identifier service, which is difficult to cross data centers or even geographically. In fact, this requires a decentralized strategy of issuing numbers, where no matter which server can send numbers, and these numbers do not conflict with each other.

There’s a decentralized numbering implementation out there, and that’s UUID! UUID is called the Universal Unique Identifier, which is a 128bit integer, calculated by the algorithm. As long as any server in the distributed system creates the UUID value according to the standard, it is guaranteed that there will be no duplication at the global scope.

For example, UUID. RandomUUID () in Java is a completely random number implemented with SecureRandom. This is a random number that is secure in the sense of cryptography, and the more random bits, the safer it is (you can’t guess the rule, and it is not easy to repeat). Since 128 bits are all random bits, it is theoretically assumed that they will not be repeated on Earth.

inspection

It can be used as a primary key in a database. It can be used as a primary key in a database.

advantages

  • Decentralized generation -> has no single point of risk
  • Decentralized generation -> can be generated in parallel without design (auto-incremented primary keys are generated serially and are slow)
  • Stateless generation -> data records can have primary keys before they are stored in the database (auto-increment primary keys can only be obtained after the data records are stored in the database), making programming easier

disadvantages

  • Unordered -> cannot use primary key sort instead of time sort to avoid loading data records (some implementations can order, but Java built-in implementation is completely unordered)
  • Unordered -> reduces database write performance
  • Unordered -> reduces database read performance

The advantages go without saying, let’s discuss the disadvantages.

Start by figuring out how a relational database system works. Similar to a file system, a relational database system stores data in pages. A page (typically 4KB, 8KB, or 16KB) can hold a batch of data records. The whole page is also read or written when reading or writing. For tables with primary key columns, the default primary key index is provided. The index entries contain the primary key values and are sorted by the primary key. The primary key index of MySQL adopts cluster index. Indexes are integrated with data, and data records are stored in the order of primary keys. PostgreSQL’s primary key index uses a non-clustered index, and the index and data are stored in one copy (the index is equivalent to the “primary key value -> data location map table”). There are dedicated index pages and data pages.

(1) It is not possible to use primary key sort instead of time sort to avoid loading data records. When a query only needs to return the primary key but needs to be sorted by time, it can be sorted by the primary key if the primary key is chronologically ordered. The query can be completed by simply accessing the index, which contains the primary key value, rather than the data record.

(2) Reduce database write performance When adding data records with INSERT statements, update the primary key index. The randomness of the UUID primary key causes MySQL data pages and PostgreSQL index pages to be written at random, most likely only one line is added to a page, so many pages need to be read and written (read from disk to memory, modify and write back, cache hits do not read from disk but still write back). Ordered primary keys are more likely to add multiple lines to the same page, so fewer pages can be written. (Instead of brushing every line, database systems buffer it slightly, making it possible for a page to receive a few more lines.) In short, a random primary key cannot take advantage of the spatial locality of the data. PostgreSQL is only affected by index pages, which is better than MySQL data pages. But PostgreSQL Wal’s FULL_PAGE_WRITES feature also makes it vulnerable to writes.

(3) Reduce database read performance Data records added during adjacent times will be randomly distributed in many data pages of MySQL. New data records may be hot, but they are randomly placed on many data pages, and none of them is hot. And some range queries, such as by creation time queries, need to read many pages of data to get all the matching data records. Again, PostgreSQL only affects index pages, which is better than MySQL data pages.

Another problem is that the UUID (128bit) is larger than the BIGINT (64bit), and the CHAR (32) to hold the UUID would require 256bit, which is more expensive for storage and computation.

This article, written by Percona’s lead architect, discusses MySQL’s performance using the UUID primary key. https://www.percona.com/blog/… These two articles, written by a PostgreSQL expert, discuss the performance of PostgreSQL using the UUID primary key and the write amplification issues encountered by FULL_PAGE_WRITES. https://www.2ndquadrant.com/e… https://www.2ndquadrant.com/e…

URL encoding

Let’s think about URL encoding. The primary key needs to be used to identify the resource in a REST-style URL, such as /users/{id}. The id can be any primary key value. The primary key is then encoded as a string, and the string form of the UUID must be at least 32 bytes (36 bytes if the ‘-‘ delimiter is retained, such as Java UUID.toString()). This string form is actually the hexadecimal representation of the UUID.

If the UUID is Base64 encoded, it can be compressed to 22 bytes. Note that the URL SAFE Base64 encoding is used (Java provides this encoding scheme) because standard Base64 contains URL reserved characters that require the ID string to be escaped.

Some versions of UUIDs are ordered, and their string form satisfies the ASCII order, and Base64 encodes it so that it does not obey the ASCII order. Can use the Firebase – style Base64 encoding, see https://firebase.googleblog.c… .

Order ID

Can you get the benefits of both temporal ordering and safe randomness? You can.

A time-ordered decentralized unique ID implementation:

  1. Generates a completely random UUID with a length of 128bits
  2. Prefix the current UNIX time so that the total length is 192 bits
  3. Firebase-style Base64 encoding, so the total length is 32 bytes, which is equivalent to a normal UUID string

A smaller implementation is UNIX Time with a 64bit random value, which is 32 bytes like a UUID and 22 bytes encoded in Firebase-style Base64. However, it may have an effect on the collision rate, which needs theoretical proof. Another implementation is 48bit UNIX time with 80bit random number, before encoding CHAR (32), CHAR (22) after encoding, this kind of algorithm has Firebase in the production environment, recommended. Cassandra uses UUID V1 and relies on microsecond time and MAC address.

The encoding of the ordered ID

Firebase-style Base64 has a problem: the current ID always starts with a ‘-‘. The user experience is not good because it is easy for the user to ignore this character and think it is not part of an ID.

So I redesigned the code and gave the reader a copy of the Java implementation (48bit UNIX time with 80bit random number, ASCII order URL Safe Base64 encoding).

/** * (22 chars) 48bit milliseconds + 80bit random value (ASCII-ordered URL-safe Base64-encoded) */ public final class UniqueIdUtil { private static final SecureRandom secureRandom = new SecureRandom(); private static final byte[] remapper = new byte[128]; static { byte[] oldCodes = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_".getBytes(StandardCharsets.US_ASCII); byte[] newCodes = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~".getBytes(StandardCharsets.US_ASCII); for (int i = 0; i < oldCodes.length; i++) { remapper[oldCodes[i]] = newCodes[i]; } } private UniqueIdUtil() {} public static String newId() { ByteBuffer byteBuffer = ByteBuffer.wrap(new byte[18]); byteBuffer.putLong(System.currentTimeMillis()); byte[] randomBytes = new byte[10]; secureRandom.nextBytes(randomBytes); byteBuffer.put(randomBytes); return newId(byteBuffer); } static String newId(ByteBuffer byteBuffer) { byte[] original = Arrays.copyOfRange(byteBuffer.array(), 2, 18); byte[] encoded = Base64.getUrlEncoder().withoutPadding().encode(original); for (int i = 0; i < encoded.length; i++) { encoded[i] = remapper[encoded[i]]; } return new String(encoded, StandardCharsets.US_ASCII); }}

To compare two kinds of ID encoding: Firebase-style Base64 encoding: -MPZFW-83QDUZ_VQ6UAMDF homemade Base64 encoding: 0NQ_LNK8M ~ CV5UYUAOTZUG

Is it better?