Preface

MySQL is a relational database widely used in business back-end systems. A solid knowledge of MySQL is a must for developers, and MySQL questions are often used in interviews to gauge a candidate's ability.

As service volume grows, the demands on MySQL performance keep rising, and indexes are a key area of performance optimization. A deep understanding of MySQL indexes therefore pays off in later optimization work.

Anyone who has used MySQL knows that when you create a database, you need to specify a character set (encoding).

Most people regard UTF-8 as the universal encoding, so most of the time they choose utf8 by default.

However, this choice often leaves a serious pitfall in your database.

MySQL support for Unicode

The Unicode character set, now an industry standard in computer science, organizes and encodes most of the world’s writing systems, making it easier for computers to present and process text.

The Unicode Transformation Format (UTF) family of encodings was designed to meet different data storage and transmission requirements. It includes UTF-8, UTF-16, and UTF-32.
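To make the difference between these encodings concrete, here is a minimal Python sketch. The sample characters and the helper name encoded_lengths are illustrative choices, not something taken from the MySQL documentation. It reports how many bytes the same character occupies in UTF-8, UTF-16, and UTF-32.

```python
# Sample characters: an ASCII letter, an accented Latin letter,
# a common Chinese character, and an emoji outside the BMP.
SAMPLES = ["A", "\u00e9", "\u4e2d", "\U0001F600"]


def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes ch occupies in each UTF encoding.

    The little-endian codec variants are used so that Python does not
    prepend a byte order mark (BOM), which would inflate the counts.
    """
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

UTF-8 is variable-width (one to four bytes per character), UTF-16 uses two or four bytes, and UTF-32 always uses four.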

According to the official MySQL documentation (dev.mysql.com/doc/refman/…), MySQL supports the following Unicode character sets: utf8, ucs2, utf8mb3, utf8mb4, utf16, utf16le, and utf32.

Different character sets differ in the number of characters they contain and the amount of storage required.

The MySQL documentation carries a prominent note about these character sets:

The utf8mb3 character set is deprecated and will be removed in a future MySQL release; please use utf8mb4 instead. In the current version, 8.0, utf8 refers to utf8mb3. It may be changed to refer to utf8mb4 in the future, but to avoid ambiguity, consider specifying utf8mb4 explicitly instead of utf8 in character set references.

That is, when we specify utf8 as the character set in MySQL 8.0, we are actually using utf8mb3.

So, let's start with utf8mb3.

utf8mb3

The utf8mb3 character set is an early character set supported by MySQL. It has the following characteristics:

1. BMP characters only (supplementary characters are not supported)

2. Each multi-byte character requires a maximum of three bytes

Note that only BMP characters are supported, so what are BMP characters?

BMP is short for Basic Multilingual Plane: the set of characters with code points between 0 and 65535 (U+0000 to U+FFFF).

The BMP does not contain supplementary characters, that is, characters with code points between U+10000 and U+10FFFF. Examples of supplementary characters include some rare Chinese characters and many emoji.

That is, if you specify utf8mb3 (which is what utf8 means) when creating a table, you won't be able to store those rare characters or emoji.
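The two code point ranges can be checked directly. The following is a minimal Python sketch; the function names and the three sample characters are illustrative assumptions.

```python
BMP_MAX = 0xFFFF          # last code point of the Basic Multilingual Plane
UNICODE_MAX = 0x10FFFF    # last code point defined by Unicode


def is_bmp(ch: str) -> bool:
    """True if the single character ch has a code point in U+0000..U+FFFF."""
    return ord(ch) <= BMP_MAX


def is_supplementary(ch: str) -> bool:
    """True if ch has a code point in U+10000..U+10FFFF."""
    return BMP_MAX < ord(ch) <= UNICODE_MAX


if __name__ == "__main__":
    # U+4E2D: a common Chinese character; U+20BB7: a rare Chinese
    # character; U+1F600: an emoji. The samples are illustrative.
    for ch in ["\u4e2d", "\U00020BB7", "\U0001F600"]:
        plane = "BMP" if is_bmp(ch) else "supplementary"
        print(f"U+{ord(ch):04X} -> {plane}")
```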
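One way to see which characters are affected is to scan text on the client side before it reaches the database. The sketch below is a hypothetical pre-check written in Python: the helper names are made up for illustration, it does not talk to MySQL, and it does not reproduce what the server itself does with such input.

```python
def chars_rejected_by_utf8mb3(text: str) -> list:
    """Return the characters in text that lie outside the BMP.

    These are exactly the characters whose UTF-8 encoding needs four
    bytes, so a character set limited to three bytes per character
    cannot represent them.
    """
    return [ch for ch in text if ord(ch) > 0xFFFF]


def fits_utf8mb3(text: str) -> bool:
    """True if every character of text is a BMP character."""
    return not chars_rejected_by_utf8mb3(text)


if __name__ == "__main__":
    comment = "Nice work \U0001F600"   # ends with an emoji (U+1F600)
    print(fits_utf8mb3("Nice work"))   # BMP-only text
    print(fits_utf8mb3(comment))       # contains a supplementary character
    print([f"U+{ord(ch):04X}" for ch in chars_rejected_by_utf8mb3(comment)])
```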

utf8mb4

In its early days, Unicode only assigned code points in the range 0 to 0xFFFF, known as the BMP. As a result, MySQL's original UTF-8 character set, utf8mb3, was designed to cover only the BMP. As the number of characters grew, Unicode began to support characters beyond that range, and it became clear that three bytes were not enough to represent all of them.

The earlier utf8mb3 was therefore no longer sufficient in some scenarios, so MySQL added the utf8mb4 character set in version 5.5.3.

The utf8mb4 character set has the following characteristics:

1. Supports BMP and supplementary characters.

2. Each multi-byte character requires a maximum of 4 bytes.

utf8mb4 compares with utf8mb3, which supports only BMP characters and uses up to three bytes per character, as follows:

For BMP characters, utf8mb4 and utf8mb3 have identical storage characteristics: the same code values, the same encoding, and the same length.

For supplementary characters, utf8mb4 requires four bytes to store each one, while utf8mb3 cannot store them at all. This is why we say utf8mb4 is a superset of utf8mb3.

Therefore, for compatibility reasons, it is generally recommended to create MySQL tables using utf8mb4 instead of utf8.

utf8mb3 and utf8mb4

The utf8mb3 and utf8mb4 character sets compare as follows:

utf8mb3 supports only Basic Multilingual Plane (BMP) characters. utf8mb4 supports supplementary characters in addition to the BMP.

utf8mb3 uses a maximum of three bytes per character. utf8mb4 uses a maximum of four bytes per character.

utf8mb4 can represent more characters than utf8mb3 (the supplementary ones), but may take up more space.
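The space trade-off can be put into numbers. The following Python sketch is worst-case arithmetic based only on the per-character maxima listed above; the helper names are hypothetical, and how MySQL actually allocates storage for a particular column type or storage engine is not modelled.

```python
# Maximum bytes per character, as stated for each character set.
MAX_BYTES_PER_CHAR = {"utf8mb3": 3, "utf8mb4": 4}


def worst_case_bytes(char_length: int, charset: str) -> int:
    """Upper bound on the bytes needed for char_length characters."""
    return char_length * MAX_BYTES_PER_CHAR[charset]


def actual_utf8_bytes(text: str) -> int:
    """Bytes the text really occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    # A 10-character column: the upper bound differs by charset ...
    for charset in MAX_BYTES_PER_CHAR:
        print(charset, worst_case_bytes(10, charset))
    # ... but BMP-only text encodes to the same bytes either way.
    print(actual_utf8_bytes("MySQL"), actual_utf8_bytes("\u4e2d\u6587"))
```

The extra space is only a ceiling: text made up of BMP characters is encoded identically in both character sets.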

Convert from utf8mb3 to utf8mb4

First of all, converting a character set from utf8mb3 to utf8mb4 does not cause problems for existing data:

For BMP characters, utf8mb4 and utf8mb3 have identical storage characteristics: the same code values, the same encoding, and the same length.

For supplementary characters, utf8mb4 requires four bytes to store each one, while utf8mb3 cannot store them at all. When converting a utf8mb3 column to utf8mb4, you therefore don't have to worry about converting supplementary characters, because the column cannot contain any.

Suppose we have an existing table that uses utf8mb3:

CREATE TABLE t1 (
  col1 CHAR(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
  col2 CHAR(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
) CHARACTER SET utf8;

The following statement converts t1 to utf8mb4:

ALTER TABLE t1
  DEFAULT CHARACTER SET utf8mb4,
  MODIFY col1 CHAR(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  MODIFY col2 CHAR(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;

The author | talk at random programming

Source | talk at random programming (ID: mhcoding)