Today a colleague suddenly asked me: what is the difference between utf8 and utf8mb4 in MySQL? I did not have a clear answer at the time, so I went and looked into it.

Character set

utf8 and utf8mb4 are both MySQL character sets. So what is a character set?

Concept

Many people confuse the concepts of character, character set, and character encoding. Let's take a closer look at each.

What are characters?

A character is the general term for letters and symbols, including text characters, graphic symbols, mathematical symbols and so on. Each of the twenty-six English letters is a character, and so is each Chinese character.

So what is a character set?

A character set is a "collection" of abstract characters. For example, all Chinese characters form a character set, and all English letters form a character set. Notice that I put "collection" in quotes. The reason is that a character set is not simply a collection of characters; to be precise, it is a set of symbols together with rules for encoding them. A character set relies on a character encoding to represent and store its characters. Inside a computer, all information is ultimately a binary value. Each bit has two states, 0 and 1, and using different combinations of zeros and ones to represent different characters is what we call encoding.

So what is character encoding?

Characters are ultimately stored on disk in binary form, which is why character encoding is necessary: the encoding rule specifies which binary value represents which character. For example, in the familiar ASCII table, the binary number 01000011 corresponds to the decimal number 67, which represents the English letter C. To be precise, a character encoding represents each character of a character set as a binary value of one or more bytes. Each character set has its own encoding scheme (and some, such as Unicode, have several), so the same character may have different binary forms under different encodings.
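To make this concrete, here is a small Python sketch (Python is only my choice for illustration; it has nothing to do with MySQL itself). It reproduces the ASCII example and shows one character under three different encodings. The sample character 中 and the three encodings are my own picks.

```python
# A character maps to a number (its code point); an encoding maps that number to bytes.
letter = "C"
code_point = ord(letter)            # decimal value of the character
bits = format(code_point, "08b")    # the same value as an 8-bit binary string

# The same character under three encodings (sample choices, for illustration only).
han = "中"
encoded = {
    "utf-8": han.encode("utf-8").hex(),
    "utf-16-be": han.encode("utf-16-be").hex(),
    "gbk": han.encode("gbk").hex(),
}

print(letter, code_point, bits)
for name, hex_bytes in encoded.items():
    print(name, hex_bytes)
```

The character is the same in all three cases; only the bytes that end up on disk differ.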

Now that we know that utf8 and utf8mb4 are both character sets, what is the difference between them?

utf8

utf8 is the UTF-8 character set that MySQL has offered since its early versions. It can store most Chinese characters, which is more than enough for everyday use.

It uses at most three bytes per character, so the largest Unicode code point it can represent is U+FFFF, which is the end of the Basic Multilingual Plane (BMP) in Unicode.

That is, any Unicode character outside the Basic Multilingual Plane cannot be stored using MySQL's utf8 character set.
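If you want to check in advance whether a string would fit into the three-byte utf8, the rule above can be sketched in Python as a simple code point test. The function names here are hypothetical helpers of my own, not anything MySQL provides.

```python
BMP_MAX = 0xFFFF  # highest code point in the Basic Multilingual Plane


def fits_in_three_byte_utf8(text: str) -> bool:
    """Return True if every character of text lies inside the BMP."""
    return all(ord(ch) <= BMP_MAX for ch in text)


def first_non_bmp(text: str):
    """Return the first character outside the BMP, or None if there is none."""
    return next((ch for ch in text if ord(ch) > BMP_MAX), None)


print(fits_in_three_byte_utf8("hello 中文"))
print(fits_in_three_byte_utf8("hi \U0001F600"))  # contains an emoji
```

A check like this only mirrors the code point limit described above; how MySQL reacts to such a string (error or truncation) depends on the server settings.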

utf8mb4

MySQL added the utf8mb4 character set starting with version 5.5.3. The mb4 stands for "most bytes 4", that is, up to four bytes per character, and it exists specifically to support four-byte Unicode characters.

The extra byte allows it to store emoji (emoji are special Unicode characters commonly seen on iOS and Android phones), many less commonly used Chinese characters, and any newly added Unicode characters.
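The byte counts are easy to verify yourself. The following Python sketch prints how many bytes UTF-8 needs for a few sample characters; the samples are my own choice, written as escapes so that the code points are unambiguous.

```python
# Sample characters, from one byte up to four bytes in UTF-8.
samples = [
    "A",           # U+0041, Latin letter
    "\u00e9",      # U+00E9, e with acute accent
    "\u4e2d",      # U+4E2D, a common Chinese character
    "\U0001F600",  # U+1F600, an emoji
    "\U00020000",  # U+20000, a rare Chinese character outside the BMP
]


def utf8_length(ch: str) -> int:
    """Number of bytes needed to encode a single character in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} needs {utf8_length(ch)} byte(s)")
```

The last two samples need four bytes each, so they are the kind of character that requires utf8mb4.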

Use

In my opinion, you should always use utf8mb4 instead of utf8 for better compatibility.

For CHAR data, utf8mb4 consumes more space than utf8, because a CHAR column has to reserve room for the longest possible encoding of each character. According to MySQL's official advice, we can use VARCHAR instead of CHAR.

Collation

When creating a database, we often use a statement like the following:

CREATE DATABASE dbname DEFAULT CHARSET utf8 COLLATE utf8_general_ci;

Now that we know that CHARSET stands for the character set, what does COLLATE stand for? It stands for the collation, that is, the rules for comparing and sorting strings.

Concept

A MySQL collation is the set of rules for comparing and sorting strings within a character set. MySQL collations have the following properties:

- Two different character sets cannot share the same collation;

- Each character set has a default collation;

- Collation names follow a convention: a name starts with the name of the associated character set, usually includes a language name in the middle, and ends with _ci (case insensitive), _cs (case sensitive), or _bin (binary).
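The naming convention from the list above can be sketched as a tiny parser. This is only an illustration of the convention as described here; the type and function names are hypothetical, and the sketch does not try to cover names that carry extra suffixes.

```python
from typing import NamedTuple, Optional

# Meaning of the trailing suffix, as described in the list above.
SUFFIXES = {"ci": "case-insensitive", "cs": "case-sensitive", "bin": "binary"}


class CollationName(NamedTuple):
    charset: str             # leading part, e.g. "utf8"
    middle: Optional[str]    # middle part (language or "general"), None if absent
    comparison: str          # meaning of the trailing suffix


def parse_collation(name: str) -> CollationName:
    """Split a collation name into charset, middle part and comparison type."""
    parts = name.lower().split("_")
    if len(parts) < 2 or parts[-1] not in SUFFIXES:
        raise ValueError(f"unrecognized collation name: {name!r}")
    middle = "_".join(parts[1:-1]) or None
    return CollationName(parts[0], middle, SUFFIXES[parts[-1]])


print(parse_collation("utf8_general_ci"))
print(parse_collation("utf8_bin"))
```

Reading a name this way is usually enough to tell which character set a collation belongs to and whether it is case-sensitive.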

In practice, we care less about the details of a collation and more about whether it is case-sensitive.

For example, suppose the system uses the utf8 character set. With the utf8_bin collation, string comparisons in SQL queries are case-sensitive. With utf8_general_ci, they are case-insensitive (utf8_general_ci is the default collation for the utf8 character set).
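As a rough analogy in Python (not MySQL's actual implementation), a _bin collation behaves like comparing the encoded bytes, while a _ci collation behaves like comparing after folding case. Both helper names below are hypothetical.

```python
def equal_bin_like(a: str, b: str) -> bool:
    """Byte-for-byte comparison, similar in spirit to a _bin collation."""
    return a.encode("utf-8") == b.encode("utf-8")


def equal_ci_like(a: str, b: str) -> bool:
    """Comparison that ignores letter case, similar in spirit to a _ci collation."""
    return a.lower() == b.lower()


print(equal_bin_like("MySQL", "mysql"))
print(equal_ci_like("MySQL", "mysql"))
```

This is why a query such as a lookup by user name can match "Admin" and "admin" alike under one collation and only the exact spelling under another.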

Difference between utf8_unicode_ci and utf8_general_ci

Currently, the utf8_unicode_ci collation only partially implements the Unicode Collation Algorithm. Some characters are still not supported, and combining marks are not fully supported either. This mainly affects Vietnamese and some minority languages in Russia, such as Udmurt, Tatar, Bashkir and Mari.

The main feature of utf8_unicode_ci is that it supports expansions, that is, cases where one character compares as equal to a combination of other characters. For example, 'ß' is equal to 'ss' in German and some other languages.

utf8_general_ci is a legacy collation that does not support expansions. It can only compare characters one to one. This means that comparisons with utf8_general_ci are faster, but less accurate than comparisons with utf8_unicode_ci.

Use

A language-specific utf8 collation is provided only when sorting with utf8_unicode_ci does not work well for that language. For example, utf8_unicode_ci works well enough for German and French that there is no need to create special utf8 collations for these two languages.

utf8_general_ci also works for German and French, except that 'ß' is equal to 's', not 'ss'. If your application can accept this, you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate.
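The 'ß' difference can be imitated with a toy model in Python. This is only a sketch under my own assumptions. The one-entry mapping table is hypothetical and stands in for one-to-one comparison. Python's str.casefold(), which expands 'ß' to 'ss', stands in for a collation with expansions. Neither function is MySQL's real algorithm.

```python
# Toy one-to-one mapping (hypothetical): each character maps to exactly one character.
ONE_TO_ONE = {"ß": "s"}


def general_like_key(text: str) -> str:
    """Compare character by character; 'ß' can only become a single 's'."""
    return "".join(ONE_TO_ONE.get(ch, ch) for ch in text.lower())


def unicode_like_key(text: str) -> str:
    """Allow expansions; casefold() turns 'ß' into the two letters 'ss'."""
    return text.casefold()


def equal_general_like(a: str, b: str) -> bool:
    return general_like_key(a) == general_like_key(b)


def equal_unicode_like(a: str, b: str) -> bool:
    return unicode_like_key(a) == unicode_like_key(b)


print(equal_general_like("ß", "s"), equal_general_like("ß", "ss"))
print(equal_unicode_like("ß", "s"), equal_unicode_like("ß", "ss"))
print(equal_unicode_like("Straße", "STRASSE"))
```

In the toy model, the one-to-one version treats "Straße" and "STRASSE" as different, while the expansion version treats them as equal, which is the trade-off between speed and accuracy described above.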

Conclusion

I did not expect the CHARSET and COLLATE in a simple CREATE DATABASE statement to involve so much. A programmer's learning really never stops.

If you are interested, you can visit my blog at death00.github.io or follow my official account, The Path of Health.