Interviewer: Why is utF-8 not recommended for MySQL?

The text is as follows

Remember last year I was storing emoji into MySQL???????? Has been error, cannot import. I found a way to do this by changing utF8 to UTF8MB4.

A year later, I read an article about emoji characters that are four bytes long and usually have to be received using UTF-8. Other encodings can be wrong. It occurred to me that I was working with MySQL last year to change utF8 to UTF8MB4.

Huh? It itself is not utF8 code! So I had to change a hammer?

MySQL utF8 is not a true UTF-8 encoding. ! Oh, my God. MySQL has a bug! With questions to query a lot of relevant materials, only to find that this is actually a historical legacy of MySQL ~~ I laughed, DID not expect such a cow B MySQL will have this story of the past.

One, error review

Write emoji text directly into SQL, execute insert statement error;

INSERT INTO `csjdemo`.`student` (`ID`.`NAME`.`SEX`.`AGE`.`CLASS`.`GRADE`.`HOBBY`)
VALUES ('20'.'Old brother???? '.'male'.'20'.'181'.'grade 9'.'Watch the movie');
Copy the code

[Err] 1366 – Incorrect string value: ‘\xF0\x9F\x98\x93’ for column ‘NAME’ at row 1

Alter database encoding, system encoding and table field encoding format → UTF8MB4, then you can:

INSERT INTO `student` (`ID`.`NAME`.`SEX`.`AGE`.`CLASS`.`GRADE`.`HOBBY`)
VALUES (null, 'Old brother???????? '.'male'.'20'.'181'.'grade 9'.'Watch the movie');
Copy the code

MySQL utf8

MySQL’s “UTF8” is not really UTF-8.

In MySQL, the “UTF8” encoding only supports up to three bytes per character, while true UTF-8 supports up to four bytes per character.

In UTF8 encoding, Chinese is 3 bytes, other numbers, English, symbols are 1 byte.

However, emoji symbols account for 4 bytes, as do some more complex characters and traditional characters. Write failure, should be changed to UTf8MB4.

As shown in the figure above, this is the data stored after the encoding is changed to UTF8MB4, you can clearly compare the number of characters and bytes. Because of this, 4 bytes of content into the UTF8 encoding, certainly not, insert ah, is it (Pan hand). \

MySQL has never fixed this bug, and in 2010 they released a character set called “UTf8MB4” that cleverly circumvents the problem.

1. Utf8mb4 is the real UTF-8

Yes, MySQL’s “UTF8MB4” is the real “UTF-8”.

MySQL’s “UTF8” is a “proprietary encoding” that can encode only a few Unicode characters.

All MySQL and MariaDB users using “UTF8” should switch to “UTf8MB4” and never use “UTF8” again.

So what is coding? What is UTF-8?

As we all know, computers use zeros and ones to store text. For example, if the character “C” is saved as “01000011”, then the computer needs to go through two steps to display this character:

The computer reads “01000011” and gets the number 67 because 67 is encoded as “01000011”.
The computer looked for 67 in the Unicode character set and found “C”.

The same:

My computer maps “C” to 67 in the Unicode character set.
My computer encoded 67 as “01000011” and sent it to the Web server.

Almost all web applications use the Unicode character set because there is no reason to use any other character set.

The Unicode character set contains millions of characters. The simplest encoding is UTF-32, which uses 32 bits per character. This is the easiest way to do it, because computers have always treated 32 bits as numbers, and that’s what computers are good at. The problem is, it’s a waste of space.

Utf-8 saves space. In UTF-8, the character “C” only needs 8 bits, and some infrequently used characters, such as “????” You need 32 bits. Other characters may use 16 or 24 bits. An article like this one, if encoded in UTF-8, would take up about a quarter of the space of UTF-32.

A brief history of UTF8

Why did MySQL developers disable “UTF8”? The answer may be found in the MySQL version commit log.

MySQL has supported UTF-8 since version 4.1, in 2003, and the UTF-8 standard used today (RFC 3629) came later.

Older versions of the UTF-8 standard (RFC 2279) supported up to 6 bytes per character. On March 28, 2002, MySQL developers used RFC 2279 in the first MySQL 4.1 preview release.

In September of that year, they made a tweak to the MySQL source code: “UTF8 now only supports 3-byte sequences at most.”

Who submitted the code? Why would he do that? That question is unknown. After moving to Git (MySQL originally used BitKeeper), many of the submitter names in the MySQL code base were lost. There is no clue on the September 2003 mailing list to explain the change.

But we can try to guess:

In 2002, MySQL made the decision that if users could ensure that every row of a data table used the same number of bytes, MySQL would get a big performance boost. To do this, the user needs to define text columns as “CHAR”, each of which always has the same number of characters. If the number of characters inserted is less than the defined number, MySQL will fill in the following space. If the number of characters inserted is more than the defined number, the following portion will be truncated.

MySQL developers initially experimented with UTF-8 using 6 bytes per character, CHAR(1) using 6 bytes, CHAR(2) using 12 bytes, and so on.

It should be said that they did the right thing in the first place, but that version was never released. But that’s what it says in the documentation, and it’s widely circulated, and everyone who knows UTF-8 agrees with what it says in the documentation.

But obviously, MySQL developers or vendors are worried about users doing two things:

Using CHAR to define columns (CHAR is ancient by now, but it was faster to use it in MySQL back then, though not since 2005).
Set the CHAR column encoding to “UTF8”.

My guess is that the MySQL developers were trying to help users who wanted a win-win in space and speed, but they screwed up the “UTF8” coding.

So the result is no winner. Users who want to win both space and speed actually use more space and slower than expected when they use the CHAR column “UTF8”. Users who want to be correct, when they use “utf8” encoding, can’t save things like “????” Such a character because “????” It’s four bytes.

After the invalid character set was published, MySQL could not fix it because it would require all users to rebuild their databases. Finally, MySQL rereleased “UTF8MB4” in 2010 to support true UTF-8.

Third, summary

The main reason is that almost all articles on the web now refer to “UTF8” as the real UTF-8, including my previous articles and projects (face covering); So I hope more friends can read this article.

I believe that there are many people in the same boat as me, this is inevitable.

So, when you set up MySQL and MariaDB databases in the future, remember to change the database encoding to UTF8MB4. One of these days, a programmer in your place or your boss will find out that you’re a technical genius.

From: blog.csdn.net/qq_39390545/article/details/106946166

Interviewer: Why is utF-8 not recommended for MySQL?

The text is as follows

One, error review

MySQL utf8

Third, summary

Related Posts

Elasticsearch technical analysis: Analyze the indexing principle of Elasticsearch

Custom annotations based on Spring AOP

6 ways to quickly count your code execution time