Interviewer: How to deal with the id primary key after dividing the database into different tables?

This post was first posted by Yanglbme on GitHub’s Doocs (stars over 30K). Project address: github.com/doocs/advan…

The interview questions

How to deal with the id primary key after the database is divided into tables?

Interviewer psychoanalysis

In fact, this is one of the questions that you’re going to have to deal with after you split the tables, is how do I get my ID? If you split multiple tables and each table starts accumulating at 1, you need a globally unique ID to support this. So these are all things you have to think about in your actual production environment.

Analysis of interview questions

Implementation scheme based on database

The id of the database is automatically added

This means that every time you get an ID in your system, you insert a non-business entry into a table in a library, and then get an id from the database. Get this ID and then write to the corresponding sub-database sub-table.

The advantage of this scheme is that it is convenient and simple, and anyone can use it; The disadvantage is that the single library generates the increment ID, if the high concurrency, there will be a bottleneck; If you really want to improve, then open a special service out, this service each time get the current MAXIMUM ID, and then increment several ids, return a batch of ids, and then change the current maximum ID value to a value after increasing several ids; But in any case it’s based on a single database.

Suitable scenario: you are divided into two reasons, or the single library concurrency is too high, or the single library data volume is too large; Unless your concurrency is not high, but the amount of data is too large to cause the expansion of the sub-database sub-table, you can use this scheme, because the maximum concurrency per second may be up to a few hundred, then go to a separate library and table to generate autoincrement primary key.

Set the step of a database sequence or an auto-increment field

You can scale horizontally by setting the database sequence or the increment field step of the table.

For example, there are eight service nodes. Each service node uses a sequence function to generate ids. Each sequence starts with a different ID and increases in step by 8.

Suitable scenario: This scenario is relatively simple to implement and achieves performance goals when the user prevents duplicate ids from being generated. However, the service nodes are fixed and the step size is fixed, so it will be difficult to add service nodes in the future.

UUID

The advantage is that it’s locally generated, not database-based; On the downside, the UUID is too long, takes up too much space, and performs poorly as a primary key. More importantly, UUID is not ordered, which can result in too many random writes to B+ tree indexes (sequential ids can produce partial sequential writes), and since sequential append operations cannot be generated on writes, inserts are required. The entire B+ tree node will be read into memory, and the entire node will be written back to disk after the record is inserted. This operation will significantly degrade performance when the record occupies a large space.

Suitable for: if you want to generate random file names, numbers, etc., you can use UUID, but not for primary key.

UUID. RandomUUID (). The toString (). The replace (" - ", "") - > sfsdf23423rr234sfdafCopy the code

Obtain the current system time

This is to get the current time, but the problem is that when the concurrency is very high, such as thousands of concurrent a second, there will be repeated situations, this is definitely not appropriate. It’s almost out of the question.

Suitable scenario: In general, if you use this scheme, the current time is joined together with many other business fields, as an ID, if the business is acceptable to you, then it is also ok. You can concatenate other business field values with the current time to form a globally unique number.

Snowflake algorithm

Snowflake is an open source distributed ID generation algorithm for Twitter. Implemented in Scala, snowflake takes a 64-bit LONG ID and uses 41 bits of it as the number of milliseconds. Use 10 bits as the work machine ID and 12 bits as the serial number.

1 Bit: No, why not? Since the first bit in binary is a 1, it is a negative number, but the generated ID is a positive number, so the first bit is always 0.
41 bit: Indicates the time stamp, in milliseconds. 41 bits can represent as many numbers as possible2 ^ 1, that is, it can be identified2 ^ 1Milliseconds, which is 69 years in adult years.
10 bit: Records the working machine ID, which indicates that the service can be deployed on up to 2^10 machines, or 1024 machines. However, five bits represent the machine room ID and five bits represent the machine ID. That means at most2 ^ 5One machine room (32 machine rooms), each machine room can be represented2 ^ 5One machine (32 machines).
12 bit: This is used to record different ids generated in the same millisecond. The largest positive integer represented by 12 bit is2 to the 12 minus 1 is 4096That is to say, we can use this 12 bit number to distinguishIn the same millisecond4096 different ids.

0 | 0001100 10100010 10111110 10001001 01011100 00 | 10001 | 1 1001 | 0000 00000000
Copy the code

public class IdWorker {

    private long workerId;
    private long datacenterId;
    private long sequence;

    public IdWorker(long workerId, long datacenterId, long sequence) {
        // sanity check for workerId
        // The machine id cannot exceed 32 and cannot be less than 0
        if (workerId > maxWorkerId || workerId < 0) {
            throw new IllegalArgumentException(
                    String.format("worker Id can't be greater than %d or less than 0", maxWorkerId));
        }
        if (datacenterId > maxDatacenterId || datacenterId < 0) {
            throw new IllegalArgumentException(
                    String.format("datacenter Id can't be greater than %d or less than 0", maxDatacenterId));
        }
        System.out.printf(
                "worker starting. timestamp left shift %d, datacenter id bits %d, worker id bits %d, sequence bits %d, workerid %d",
                timestampLeftShift, datacenterIdBits, workerIdBits, sequenceBits, workerId);

        this.workerId = workerId;
        this.datacenterId = datacenterId;
        this.sequence = sequence;
    }

    private long twepoch = 1288834974657L;

    private long workerIdBits = 5L;
    private long datacenterIdBits = 5L;

    // This is a binary operation, i.e., a maximum of 31 digits in 5 bits, i.e., a maximum of 32 machine ids
    private long maxWorkerId = -1L ^ (-1L << workerIdBits);

    A maximum of 31 digits can be contained in 5 bits. A maximum of 32 digits can be contained in the machine room ID
    private long maxDatacenterId = -1L ^ (-1L << datacenterIdBits);
    private long sequenceBits = 12L;

    private long workerIdShift = sequenceBits;
    private long datacenterIdShift = sequenceBits + workerIdBits;
    private long timestampLeftShift = sequenceBits + workerIdBits + datacenterIdBits;
    private long sequenceMask = -1L ^ (-1L << sequenceBits);

    private long lastTimestamp = -1L;

    public long getWorkerId(a) {
        return workerId;
    }

    public long getDatacenterId(a) {
        return datacenterId;
    }

    public long getTimestamp(a) {
        return System.currentTimeMillis();
    }

    public synchronized long nextId(a) {
        // Get the current timestamp in milliseconds
        long timestamp = timeGen();

        if (timestamp < lastTimestamp) {
            System.err.printf("clock is moving backwards. Rejecting requests until %d.", lastTimestamp);
            throw new RuntimeException(String.format(
                    "Clock moved backwards. Refusing to generate id for %d milliseconds", lastTimestamp - timestamp));
        }

        if (lastTimestamp == timestamp) {
            // A maximum of 4096 digits in a millisecond
            // No matter how many bits you pass in, the bit operation should always be in the range of 4096. Avoid passing a sequence beyond the range of 4096 yourself
            sequence = (sequence + 1) & sequenceMask;
            if (sequence == 0) { timestamp = tilNextMillis(lastTimestamp); }}else {
            sequence = 0;
        }

        // The last time the id was generated is in milliseconds
        lastTimestamp = timestamp;

        // Move the timestamp left to 41 bit;
        // Move the machine room ID left to 5 bit;
        // Move the machine ID left to 5 bit; Put the serial number in the last 12 bits;
        // Finally concatenated into a 64 bit binary number, converted to base 10 is a long
        return ((timestamp - twepoch) << timestampLeftShift) | (datacenterId << datacenterIdShift)
                | (workerId << workerIdShift) | sequence;
    }

    private long tilNextMillis(long lastTimestamp) {
        long timestamp = timeGen();
        while (timestamp <= lastTimestamp) {
            timestamp = timeGen();
        }
        return timestamp;
    }

    private long timeGen(a) {
        return System.currentTimeMillis();
    }

    // --------------- test ---------------
    public static void main(String[] args) {
        IdWorker worker = new IdWorker(1.1.1);
        for (int i = 0; i < 30; i++) { System.out.println(worker.nextId()); }}}Copy the code

41 bit is a timestamp of the current millisecond, that’s all; Then the 5 bits are the machine ID you passed in (up to 32), the 5 bits are the machine ID you passed in (up to 32), and the remaining 12 bits are the serial number. If the last time you generated the ID is less than a millisecond, The order will be accumulated for you, up to 4096 numbers.

So you take this tool class, you create your own service, and you initialize something for every machine in every room, and the machine in that room starts with the number zero. Then each time you receive a request that says this machine in this machine room needs to generate an ID, you find the corresponding Worker generation.

With The Snowflake algorithm, you can develop your own company’s services, even for machine ids and machine ids, which are reserved for you with 5 bits plus 5 bits, and you can switch to something else with business meaning.

The Snowflake algorithm is relatively reliable, so if you really want to do distributed ID generation, if it is high concurrency, then it should perform better, usually tens of thousands of concurrent scenarios, you can use it.

Welcome to follow my wechat public account “Doocs Open Source Community” and push original technical articles as soon as possible.