I recently ran into a bug: I tried to save a UTF-8 string into a MariaDB “utf8” column via Rails and got a bizarre error:

Incorrect string value: '\xF0\x9F\x98\x83 <...' for column 'summary' at row 1

I was using a UTF-8 client, the server was UTF-8, the database was UTF-8, and even the string I was trying to save, “😃 <…”, was valid UTF-8 (those are the \xF0\x9F\x98\x83 bytes in the error).

The problem is that MySQL’s “utf8” is not really UTF-8.

“Utf8” supports at most three bytes per character, while real UTF-8 characters can take up to four bytes.
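The difference is easy to see from the raw bytes. A minimal Python sketch (standard library only), using the emoji from the error message above as an example:

```python
# "C" fits in one byte under UTF-8; the emoji needs four.
ascii_bytes = "C".encode("utf-8")
emoji_bytes = "\U0001F603".encode("utf-8")  # U+1F603, the 😃 character

print(len(ascii_bytes))   # 1
print(len(emoji_bytes))   # 4
print(emoji_bytes.hex())  # f09f9883 -- the \xF0\x9F\x98\x83 bytes MySQL rejected
```

Any column declared with MySQL’s “utf8” will refuse that four-byte sequence, which is exactly the error I hit.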

MySQL never fixed the bug; instead, in 2010 it released a new character set called “utf8mb4” to work around the problem.

Of course, they didn’t advertise the new character set (probably because they were embarrassed by the bug), so advice on the web still tells developers to use “utf8” — and that advice is wrong.

Briefly summarized as follows:

  • MySQL’s “utf8mb4” is real UTF-8.
  • MySQL’s “utf8” is a proprietary encoding that covers only a subset of Unicode characters (nothing outside the Basic Multilingual Plane).

I want to be clear here: every MySQL and MariaDB user currently using “utf8” should switch to “utf8mb4” and never use “utf8” again.

So what is coding? What is UTF-8?

As we all know, computers use zeros and ones to store text. For example, if the character “C” is saved as “01000011”, then the computer needs to go through two steps to display this character:

  1. The computer reads “01000011” and gets the number 67, because “01000011” is 67 in binary.
  2. The computer looks up 67 in the Unicode character set and finds “C”.

The same happens in reverse when I type:

  1. My computer maps “C” to 67 in the Unicode character set.
  2. My computer encoded 67 as “01000011” and sent it to the Web server.
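The steps above can be sketched in a few lines of Python, using the byte string and code point from the example in the text:

```python
# Display direction: stored bits -> number -> character.
bits = 0b01000011
print(bits)        # 67
print(chr(bits))   # C  (code point 67 in the Unicode character set)

# Sending direction: character -> number -> bits.
codepoint = ord("C")             # 67
print(format(codepoint, "08b"))  # 01000011
```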

Almost all web applications use the Unicode character set because there is no reason to use any other character set.

The Unicode character set has room for more than a million characters. The simplest encoding is UTF-32, which uses 32 bits per character. It is the easiest scheme, because computers handle 32-bit numbers natively, and that is what they are good at. The problem is that it wastes space.

UTF-8 saves space. In UTF-8, the character “C” needs only 8 bits, while rarely used characters, such as emoji, need 32 bits. Other characters use 16 or 24 bits. An article like this one, encoded in UTF-8, takes up about a quarter of the space it would in UTF-32.
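The variable width is easy to verify. A small Python sketch comparing UTF-8 and UTF-32 sizes (the sample characters are my own illustrative picks):

```python
# Bits per character in UTF-8: ASCII, Latin, CJK, emoji.
for ch in ("C", "é", "中", "\U0001F603"):
    print(ch, len(ch.encode("utf-8")) * 8, "bits in UTF-8")  # 8, 16, 24, 32

# UTF-32 spends 32 bits on every character, common or rare.
print(len("C".encode("utf-32-be")) * 8)  # 32
```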

MySQL’s “utf8” character set doesn’t follow this standard, which makes it incompatible with other programs.

A brief history of MySQL

Why did MySQL’s developers cripple “utf8”? The answer may be found in the commit log.

MySQL has supported UTF-8 since version 4.1, in 2003, and the UTF-8 standard used today (RFC 3629) came later.

An older version of the UTF-8 standard (RFC 2279) allowed up to six bytes per character. On March 28, 2002, MySQL developers adopted RFC 2279 for the first MySQL 4.1 preview release.

In September of that year, they made a tweak to the MySQL source code: “UTF8 now only supports 3-byte sequences at most.”

Who committed that change, and why? Nobody knows. After the move to Git (MySQL originally used BitKeeper), many of the committer names in the MySQL code base were lost. The September 2002 mailing list offers no clue to explain the change.

But I can try to guess.

In 2002, MySQL decided that if users could guarantee that every row of a table used the same number of bytes, MySQL could gain a big performance boost. To do that, a user defines text columns as “CHAR”, each of which always holds the same number of characters. If a value has fewer characters than defined, MySQL pads it with trailing spaces; if it has more, MySQL truncates the excess.
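A hedged sketch of that fixed-width behavior — this simulates the idea in Python (MySQL’s actual storage code is far more involved; `char_column` is a hypothetical helper of mine):

```python
def char_column(value: str, n: int) -> str:
    """Simulate a CHAR(n) column: truncate long values, pad short ones with spaces."""
    return value[:n].ljust(n)

print(repr(char_column("ab", 5)))       # 'ab   '  (padded to 5 characters)
print(repr(char_column("abcdefg", 5)))  # 'abcde'  (truncated)

# With a 3-bytes-per-character ceiling, CHAR(5) always fits in 5 * 3 = 15 bytes.
print(5 * 3)  # 15
```

The fixed byte budget per row is what made the per-character byte limit tempting.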

MySQL developers initially experimented with six-byte-per-character UTF-8: CHAR(1) would use 6 bytes, CHAR(2) 12 bytes, and so on.

It should be said that they got it right the first time, but that version was never released. Yet that is what the documentation described, it circulated widely, and everyone who knew UTF-8 agreed with what the documentation said.

But apparently, MySQL developers or vendors worried that users would do these two things:

  1. Use CHAR to define columns (the CHAR format is ancient now, but back then it made MySQL faster — though that hasn’t been true since 2005).
  2. Set those CHAR columns’ encoding to “utf8”.

My guess is that the MySQL developers wanted to help users who hoped to win on both space and speed, but they botched the “utf8” encoding.

The result was that nobody won. Users who wanted both space and speed actually got columns that were bigger and slower than expected when they used “utf8” CHAR columns. And users who wanted correctness couldn’t store characters like emoji at all with “utf8”.

After this broken character set shipped, MySQL couldn’t fix it, because a fix would force every user to rebuild their databases. So in 2010, MySQL released “utf8mb4” to support real UTF-8.

Why is this so maddening

I’ve been freaking out all week because of this problem. I was fooled by “utf8”, and it took me a long time to find the bug. I can’t be the only one — almost every article on the web treats “utf8” as real UTF-8.

“Utf8” is only a proprietary character set; it gave us new problems without solving the problem it was meant to solve.

conclusion

If you’re using MySQL or MariaDB, don’t use the “utf8” encoding; use “utf8mb4” instead. Here (mathiasbynens.be/notes/mysql…) is a guide for converting an existing database’s character encoding from “utf8” to “utf8mb4”.

Original English article: medium.com/adamhooper…