Let’s reproduce the problem. First, create the database test_db:

create database test_db default charset utf8 default collate utf8_general_ci;

Then create the table:

CREATE TABLE `article` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(120) NOT NULL DEFAULT '' COMMENT 'title',
  `abstract` varchar(600) NOT NULL DEFAULT '' COMMENT 'abstract',
  `created_at` int(11) NOT NULL DEFAULT '0',
  `updated_at` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB COMMENT='Article table';

The database’s character set and collation are utf8 and utf8_general_ci, respectively. Because the CREATE TABLE statement does not specify a character set or collation, the table inherits these database defaults.

Use PyMySQL to connect to the database and insert an emoji:

# coding=utf-8

import pymysql

connection = pymysql.connect(host='localhost',
                             user='user',
                             password='passwd',
                             db='test_db',
                             charset='utf8',
                             cursorclass=pymysql.cursors.DictCursor)

try:
    with connection.cursor() as cursor:
        sql = "INSERT INTO article (title) VALUES ('😄')"
        cursor.execute(sql)

    connection.commit()
finally:
    connection.close()

Running the script reports the following error:

pymysql.err.InternalError: (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x84' for column 'title' at row 1")

This error occurs because MySQL’s utf8 character set cannot store emoji.

Isn’t UTF-8 a Unicode encoding that should cover nearly every character in the world? It is, but in MySQL utf8 is an alias for utf8mb3, which uses at most 3 bytes per character. An emoji such as 😄 takes 4 bytes in UTF-8, so utf8 cannot store it.

To store emoji, we need MySQL’s full UTF-8 character set, utf8mb4, which uses up to 4 bytes per character.
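To make the byte counts concrete, here is a small Python sketch that uses only the standard library. The helper names utf8_length and fits_utf8mb3 are made up for this illustration; the second one simply treats any character whose UTF-8 encoding needs more than 3 bytes as something utf8mb3 cannot hold.

```python
# coding=utf-8

def utf8_length(ch):
    """Return the number of bytes the character occupies in UTF-8."""
    return len(ch.encode('utf-8'))


def fits_utf8mb3(text):
    """Hypothetical helper: True if every character needs at most 3 bytes."""
    return all(utf8_length(ch) <= 3 for ch in text)


# Print each character, its UTF-8 length, and its raw bytes.
for ch in ['a', '中', '😄']:
    print(repr(ch), utf8_length(ch), ch.encode('utf-8'))

print(fits_utf8mb3('hello'))  # True
print(fits_utf8mb3('😄'))     # False
```

The bytes printed for 😄 are the same four bytes (F0 9F 98 84) that appear in the error message above.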

Therefore, to support emoji, we can change the database character set to utf8mb4. How do we change it?

First, modify the string columns in the article table:

ALTER TABLE article MODIFY title varchar(120) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '' COMMENT 'title';
ALTER TABLE article MODIFY abstract varchar(600) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '' COMMENT 'abstract';

We can also change the table’s default character set so that string columns added later use utf8mb4 as well:

ALTER TABLE article DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

If you want new tables to default to the same character set, you can change the database character set:

ALTER DATABASE test_db DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Whether to modify columns, tables, or the whole database is up to you.

Next, in the script, change the character set used to connect to the database from

charset='utf8'

to

charset='utf8mb4'

The script now runs successfully.

At this point the fix is complete, but a few details are worth mentioning. Otherwise, when something goes wrong, you still won’t know what is going on.

When using the mysql client to connect to the database, we can specify the default character set:

$ mysql -u root -p --default-character-set=utf8mb4

The same default can also be set in the [client] section of the MySQL configuration file:

[client]
default-character-set=utf8mb4

In other words, if no character set is specified when connecting, the value from the configuration file is used; otherwise, the specified character set is used.

The character set chosen for a connection determines the following three variables:

character_set_client
character_set_connection
character_set_results

For example, with the configuration above, connecting without specifying a character set uses utf8mb4. If utf8 is specified instead, all three variables are set to utf8. The three variables can also be changed at runtime:

SET character_set_client = utf8mb4;
SET character_set_connection = utf8mb4;
SET character_set_results = utf8mb4;

The following single statement has the same effect as the three statements above:

SET NAMES utf8mb4;

When the values of these three variables do not match the database character set, errors or garbled characters may occur.
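To see which values the current session actually uses, the three variables can be read back through a cursor. The sketch below is one possible way to do that. The name session_charsets is invented here, and it assumes a cursor that returns rows as dicts, like the DictCursor used in the script above. The part that talks to a real server is left in comments because it needs a running MySQL instance, and its connection parameters are placeholders.

```python
# coding=utf-8

# Ask only for the three session variables discussed in this article.
SHOW_CHARSETS_SQL = (
    "SHOW SESSION VARIABLES WHERE Variable_name IN "
    "('character_set_client', 'character_set_connection', "
    "'character_set_results')"
)


def session_charsets(cursor):
    """Return the three session character set variables as a dict.

    Assumes rows come back as dicts (e.g. pymysql.cursors.DictCursor)
    with the keys 'Variable_name' and 'Value'.
    """
    cursor.execute(SHOW_CHARSETS_SQL)
    return {row['Variable_name']: row['Value'] for row in cursor.fetchall()}


# Usage against a real server (placeholder credentials):
#
# import pymysql
# connection = pymysql.connect(host='localhost', user='user',
#                              password='passwd', db='test_db',
#                              charset='utf8mb4',
#                              cursorclass=pymysql.cursors.DictCursor)
# try:
#     with connection.cursor() as cursor:
#         print(session_charsets(cursor))
# finally:
#     connection.close()
```

If the connection was opened with charset='utf8mb4', all three values should come back as utf8mb4.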

For example, select test_db and set the connection character set to utf8:

use test_db;
SET NAMES utf8;

Then execute:

INSERT INTO article (title) VALUES ('😄');

The statement fails with an error:

Incorrect string value: '\xF0\x9F\x98\x84' for column 'title' at row 1

Now select test_db and set the three variables to different values:

use test_db;
SET character_set_client = ascii;
SET character_set_connection = utf8mb4;
SET character_set_results = ascii;

Then execute:

INSERT INTO article (title) VALUES ('😄');

This time the insert goes through, but the stored data is garbled.

So, in the PyMySQL client, we need to specify utf8mb4 as the character set to make the script work.

How do these three variables work? Here is a brief explanation.

character_set_client is the character set of the content sent by the client, character_set_connection is the character set in which the server processes that content, and character_set_results is the character set of the response returned by the server.

We all know that computers only recognize zeros and ones, so the database server receives each request and returns each response as a sequence of bytes. When the server receives a request, it decodes the bytes using the character_set_client character set, re-encodes them using the character_set_connection character set, and then processes the statement. When processing is complete, the result is decoded using the character_set_connection character set, re-encoded using the character_set_results character set, and returned.

PS: Encoding can be understood simply as converting a string to bytes; decoding is the reverse.
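The decode-then-encode step on the request side can be imitated with Python’s own codecs. This is only a rough analogy, not how MySQL is implemented: simulate_request is a made-up function, Python’s 'utf-8' codec stands in for utf8mb4, and undecodable bytes become the replacement character U+FFFD here, whereas MySQL typically stores a question mark.

```python
# coding=utf-8

def simulate_request(raw, client_charset, connection_charset):
    """Decode the request bytes as character_set_client, then re-encode
    them as character_set_connection. Bad bytes are replaced, not rejected."""
    text = raw.decode(client_charset, errors='replace')
    return text.encode(connection_charset, errors='replace')


# The bytes the client really sends for the emoji.
raw = '😄'.encode('utf-8')

# Client and connection settings match the real encoding: bytes survive.
ok = simulate_request(raw, 'utf-8', 'utf-8')
print(ok == raw)  # True

# The server is told the client sends ASCII: each of the four bytes is
# invalid ASCII, so the emoji is lost and four replacement characters remain.
bad = simulate_request(raw, 'ascii', 'utf-8')
print(bad.decode('utf-8'))
```

The second case mirrors the ascii experiment above: the bytes on the wire are the same, but the server was told to interpret them with the wrong character set.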

Therefore, if the character set of character_set_connection is a superset of those of character_set_client and character_set_results, things may still work properly. For example, with character_set_connection set to utf8mb4 and both character_set_client and character_set_results set to utf8, the following statement works:

INSERT INTO article (title) VALUES ('hello');

However, this only holds as long as the characters in the SQL statement stay within the range of character_set_client. If we instead execute

INSERT INTO article (title) VALUES ('😄');

The data will be garbled.

Similarly, the character range of character_set_connection must not exceed that of the database character set; otherwise, garbled characters may occur.

Usually, we set all three variables to the same character set as the database, which keeps the configuration clear. Whether the character set is specified by the client or taken from the server configuration depends on the scenario. I think it’s a good idea to keep a default character set of utf8 on the server and then specify the character set explicitly when the client connects.