MySQL character set

The following issues are discussed in this article:

1. GBK and UTF8 occupy several bytes

2. How many bytes the ASCII code occupies in different character sets

MySQL UTF8MB4 occupies 4 bytes of storage space.

The character set problem is rather boring. In the MySQL DBA class, I often encountered some students who were very interested in the character set. I thought there should be many students who also had this problem. So we will sort out the internal proof method in the course, and let you feel: know the style of the practice class of MySQL DBA: it is dry.

Q1 GBK and UTF8 take up a few bytes

First, GBK, UTF8 occupies a few bytes:

As can be seen from the picture above, the same word “zhi” is used

GBK: “d6 AA “two bytes

UTF8: “e7 9f a5” three bytes

0a is the return character of “\n” echo.

From the information above non-ASCII characters occupy different bytes in different character sets.

a

Q2 Does the ASCII code occupy the same number of bytes in different character sets

Let’s see if ASCII is the same in different character sets

As you can see, ASCII codes in both GBK and UTF8 are: 61 takes up 1 byte.

From the above two examples, the tiger character set can be viewed with the help of hexadecimal.

Q3 MySQL UTF8MB4 occupies 4 bytes in storage space.

For simple analysis, create a table with only one field: tb_vARCHar

Character set-based environment:

Take a look at the table and count the bytes

Let’s use hexdump again:

The first line of content: 10 a is equivalent to 10 61, find 61 and 10 above, look ahead, you can find the logo of 10.

Then you can see that a 09 corresponds to nine bytes, corresponding to nine 61’s

Further down: 07 is seven bytes for seven 61s

Further down: 06 is six bytes for six 61s

Look further down: 08 is eight bytes for eight 61s

Look further down: 09 is nine bytes corresponding to three “e7, 9F, a5” characters

At UTF8MB4, the ASCII code occupies 3 bytes, not 4 bytes.

But for the sorting part the upper allocation should be allocated by 4 bytes, this code needs to be confirmed.

You may be wondering: what is the length of the variable-length paragraph after the actual content of the field? This is the content of InnoDB row structure, which will not be expanded here. If you are particularly interested in InnoDB, you can pay attention to the MySQL DBA optimization class of Zhidu Tang. Teacher Ye will explain the core structure of InnoDB personally.

conclusion

It can be seen from the above experiments:

1. GBK and UTF8 occupy several bytes

Answer: GBK occupies 2 bytes, UTF8 occupies 3 bytes; ASCII characters occupy the same size of a byte in different character sets.

2. How many bytes the ASCII code occupies in different character sets

A: ASCII takes up bytes consistently across character sets. Non-ascii characters occupy different bytes in different character sets.

MySQL UTF8MB4 occupies 4 bytes of storage space.

A: Not all characters in UTF8MB4 take up 4 bytes. For example, ASCII characters still take up 1 byte, normal Chinese characters still take up 3 bytes, and currently known emojis take up 4 bytes.