Problem background

The growing number of mobile input methods and keyboards that support emoji increasingly breaks systems whose early designs never anticipated them.

  1. Mobile shopping is now the norm: mobile transactions reached 93.6% during "Double 11" in 2018.
  2. According to WeChat's annual report, the post-80s generation used the "grin" emoji more than 30.3 billion times in 2017.

Recently, our team hit a production issue caused by a user putting an emoji in an order note.

Working through this problem teaches us three things:

  1. MySQL character encoding concepts
  2. Changing character sets in MySQL with ALTER TABLE
  3. Using emoji in MySQL

Problem analysis

Yesterday, business feedback and a monitoring alarm arrived together: several orders were stuck in one system and had not been pushed downstream, so their delivery times were missed.

Searching the log center (ES) turned up an exception thrown when one system's job pulled data and saved it to the database.

Expanding the detailed logs revealed the key entry:

ERROR 1366: Incorrect string value: '\xF0\x9F\x99\x8F…' for column 'Remark' at row 2

Note: this error occurs when you try to write a 4-byte character to a column whose character set allows at most 3 bytes per character.
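To see why MySQL calls this a 4-byte character, the bytes quoted in the log can be decoded directly. This is a minimal sketch that uses only the four bytes visible in the error message; the rest of the value is elided in the log and is not reconstructed here.

```python
# The four bytes shown in the error message (the log elides what follows).
logged = bytes([0xF0, 0x9F, 0x99, 0x8F])

ch = logged.decode("utf-8")            # the bytes form exactly one character
code_point = f"U+{ord(ch):04X}"        # its Unicode code point
utf8_len = len(ch.encode("utf-8"))     # bytes it needs in UTF-8

print(code_point, utf8_len)            # U+1F64F 4
```

The code point lies above U+FFFF, which is exactly the range a 3-byte-per-character column cannot hold.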

We then looked at the user's Remark field: "** no one at home in the morning, please deliver in the afternoon, thank you 😊"

Well, that was a costly thank-you, buddy. Placing the order with an emoji worked without any problem, and the DBA confirmed that the historical order database supported it. That database, however, had recently been migrated from SQL Server to MySQL, and the error occurred when the order information was pulled for printing.

Root cause

Background on characters and encodings

The Unicode character set has several encodings, such as UTF-8, UTF-16, and UTF-32. UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character. UTF-16 uses 2 bytes for most characters and 4 bytes (a surrogate pair) for characters outside the Basic Multilingual Plane. UTF-32 always uses 4 bytes.
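The byte counts are easy to check with Python's built-in codecs. This is a small sketch; the helper name encoded_lengths is made up for illustration, and the "-le" codec variants are used only so that no BOM is counted.

```python
def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes one character needs in each encoding."""
    codecs = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-le"), ("UTF-32", "utf-32-le"))
    return {name: len(ch.encode(codec)) for name, codec in codecs}

# An ASCII letter, a CJK character, and an emoji (U+1F60A).
for sample in ("A", "\u4e2d", "\U0001F60A"):
    print(sample, encoded_lengths(sample))
```

The letter needs 1, 2, and 4 bytes; the CJK character 3, 2, and 4; the emoji 4 bytes in all three encodings.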

When software (on Windows, for example) says simply "Unicode", it usually means 2-byte little-endian UTF-16; the fixed 4-byte form is UTF-32.

Unicode assigns most emoji code points and meanings in its second plane (Plane 1, the Supplementary Multilingual Plane, SMP), so each of these emoji takes 4 bytes in UTF-8. The emoticon-style emoji, for example, occupy the range U+1F601 to U+1F64F.
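The plane of a character follows directly from its code point: each plane holds 0x10000 code points, so shifting right by 16 bits gives the plane number. In this sketch the helper names plane and in_emoticon_range are hypothetical, and the range bounds are the ones quoted in the text.

```python
def plane(ch: str) -> int:
    """Unicode plane number of a single character (0 = BMP, 1 = SMP)."""
    return ord(ch) >> 16

def in_emoticon_range(ch: str, lo: int = 0x1F601, hi: int = 0x1F64F) -> bool:
    """True if the character falls in the range quoted in the text."""
    return lo <= ord(ch) <= hi

grin = "\U0001F601"   # first code point of the quoted range
print(plane(grin), in_emoticon_range(grin))          # 1 True
print(plane("\u4e2d"), in_emoticon_range("\u4e2d"))  # 0 False
```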

In SQL Server tables, the NVARCHAR type is a variable-length Unicode string by default.

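Before MySQL 5.5.3, the only UTF-8 character set was utf8, which stores at most 3 bytes per character. utf8mb4, which stores up to 4 bytes per character, is supported in version 5.5.3 and later. The default character set of MySQL 8.0 is utf8mb4.

MySQL stores UTF-8 data without a BOM by default. UTF-8 is a byte sequence and has no byte order; MySQL's utf16 and utf32 character sets are big-endian, also without a BOM.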
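An application can check a value before writing it to a 3-byte utf8 column. The sketch below relies on the fact that every code point up to U+FFFF fits in 3 UTF-8 bytes and every code point above it needs 4; the function names are hypothetical.

```python
def four_byte_chars(text: str) -> list:
    """Characters that need 4 bytes in UTF-8 (code points above U+FFFF)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]

def fits_in_3_byte_utf8(text: str) -> bool:
    """True if every character fits a 3-byte-per-character utf8 column."""
    return not four_byte_chars(text)

remark = "please deliver in the afternoon, thank you \U0001F60A"
print(fits_in_3_byte_utf8(remark))                 # False
print(fits_in_3_byte_utf8("thank you \u8c22\u8c22"))  # True
```

Whether to reject, strip, or store such characters is a product decision; this check only tells you which values would trigger the error.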

Character set conversion process in MySQL

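[Figure: character set conversion process in MySQL; image from the Internet]

In short, the server interprets an incoming statement using character_set_client, converts it to character_set_connection, converts it again to the column's character set when storing it, and converts results to character_set_results before returning them. Once we understand these conversion rules, we can see why garbled characters and insertion failures occur.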
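Garbled characters appear when one step of the chain interprets bytes with the wrong character set. The following sketch simulates that outside MySQL: Python's latin-1 codec stands in for a mis-declared client character set, which is an assumption made for illustration only.

```python
original = "\u8c22\u8c22"               # "thank you" in Chinese

sent = original.encode("utf-8")         # bytes the client really sends
misread = sent.decode("latin-1")        # receiver assumes a different charset

# Each 3-byte character turns into three unrelated Latin characters.
print(len(original), len(misread))      # 2 6

# As long as the bytes were only misread, not converted, the text is recoverable.
recovered = misread.encode("latin-1").decode("utf-8")
print(recovered == original)            # True
```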

The solution

1. Check the current character encoding settings

Let’s take a look at the results of the system configuration:

mysql> show variables like '%char%';
+--------------------------+----------------------------------------------+
| Variable_name            | Value                                        |
+--------------------------+----------------------------------------------+
| character_set_client     | utf8                                         |
| character_set_connection | utf8                                         |
| character_set_database   | utf8mb4                                      |
| character_set_filesystem | binary                                       |
| character_set_results    | utf8                                         |
| character_set_server     | utf8mb4                                      |
| character_set_system     | utf8                                         |
| character_sets_dir       | D:\mysql\mysql-8.0.11-winx64\share\charsets\ |
+--------------------------+----------------------------------------------+
8 rows in set
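When reading this output, the rows to look at are the ones that are not utf8mb4. A minimal sketch of that check follows, with the values above copied into a dictionary by hand; the function name and the idea of skipping character_set_filesystem and character_set_system (which normally stay binary and utf8) are assumptions of this example.

```python
# Values copied from the output above.
variables = {
    "character_set_client": "utf8",
    "character_set_connection": "utf8",
    "character_set_database": "utf8mb4",
    "character_set_filesystem": "binary",
    "character_set_results": "utf8",
    "character_set_server": "utf8mb4",
    "character_set_system": "utf8",
}

# These two are not expected to be utf8mb4, so they are not flagged.
SKIPPED = {"character_set_filesystem", "character_set_system"}

def not_utf8mb4(settings: dict) -> list:
    """Names of character set variables that cannot carry 4-byte characters."""
    return sorted(name for name, value in settings.items()
                  if name not in SKIPPED and value != "utf8mb4")

print(not_utf8mb4(variables))
```

Here the client, connection, and results settings are flagged, which means an emoji can be lost on the way in or out even though the server and database defaults are utf8mb4.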

2. Specify the character set when designing the database

Specify the character set at the database, table, and column levels. A level that does not specify one inherits from the next level up, ending at the server default configured in my.ini. The exact CREATE statements are not covered here; please look them up.

3. Use one character set throughout

The goal is to avoid unnecessary conversions unless they are specifically required. Take special care that a conversion does not become irreversible because the character sets are incompatible. For example, some Unicode characters cannot be represented in MySQL's utf8, and the same is true of GBK. To be specific:

If the client's actual encoding, character_set_client, and the table's character set are identical, garbled characters will not be produced.

String literals in SQL statements are affected by character_set_connection or by an introducer, so be careful: operations such as comparisons can produce completely different results. The fix is to execute the following statement before sending queries: SET NAMES '***'

With utf8 as the value, this is equivalent to:

SET character_set_client = utf8;
SET character_set_results = utf8;
SET character_set_connection = utf8;

To carry emoji, the value must be utf8mb4 rather than utf8.

4. Do not repair corrupted data by forcibly converting the character set

ALTER TABLE … CHARSET=xxx or ALTER TABLE … CONVERT TO CHARACTER SET … can destroy the data completely, as described in Junyi Lu's blog (see reference 5).
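One way a conversion loses data for good is when the target character set has no representation for a character. The sketch below imitates that outside MySQL, using Python's gbk codec with the "replace" error handler as a stand-in for a lossy conversion; it is not a simulation of what ALTER TABLE does internally.

```python
def convert(text: str, target: str) -> str:
    """Round-trip text through a target charset, replacing what does not fit."""
    return text.encode(target, errors="replace").decode(target)

original = "\u8c22\u8c22 \U0001F60A"     # Chinese "thank you" plus an emoji

converted = convert(original, "gbk")     # GBK has no code for the emoji
print(converted.endswith("?"))           # True: the emoji became "?"

# Converting back to a Unicode encoding cannot restore what was replaced.
restored = convert(converted, "utf-8")
print(restored == original)              # False
```

This is why a backup and a trial run on a copy are worth the effort before any character set change on real data.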

Closing thoughts

With mobile input methods supporting emoji and young people loving them, the following points must be considered to avoid problems:

  1. Functional design should state whether each input box needs to support emoji
  2. Upstream and downstream systems should agree on how emoji are stored, displayed, and truncated
  3. Operate on and upgrade database character sets with caution to prevent data loss
  4. If an earlier version in the chain does not support utf8mb4, synchronization will fail
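String truncation deserves a concrete illustration, because cutting a UTF-8 value at a byte limit can split an emoji in the middle. The following sketch shows one way to truncate by bytes without leaving a broken character behind; the function name truncate_utf8 is hypothetical.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    cut = text.encode("utf-8")[:max_bytes]
    # A partial multi-byte sequence at the end is dropped, not kept as garbage.
    return cut.decode("utf-8", errors="ignore")

note = "ok\U0001F60A"                 # 2 ASCII bytes + one 4-byte emoji

print(truncate_utf8(note, 4))         # ok
print(truncate_utf8(note, 6) == note) # True
```

With a limit of 4 bytes the emoji is dropped whole rather than stored as two stray bytes, so every system downstream still receives valid UTF-8.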

References

  1. Understand the women behind the rapid growth of the global shopping carnival in 2018
  2. Ruan Yifeng introduction article: Emoji introduction
  3. An interesting article: Wave after wave of Emoji, how is your Emoji keyboard?
  4. Emoji: Building the Tower of Babel again
  5. 10 minutes: Learn how to understand and resolve MySQL garbled characters in 10 minutes
  6. MYSQL Help About Unicode:charset-unicode
  7. Common garbled code problem analysis and summary: garbled code analysis
