This article briefly introduces the concepts of character sets and character encodings, along with some common techniques for diagnosing garbled text.

Background: Character sets and encodings are a headache for IT novices and seasoned gurus alike. When complex character sets, exotic symbols, and garbled text are involved, locating a problem can become very difficult. This article gives a brief, principle-level introduction to character sets and encodings. A word of caution before we start: if you want very precise definitions of the terms, refer to Wikipedia. This article is the blogger's own understanding, digested into simple, easy-to-follow explanations.

What is a character set

Before we get to character sets, let's look at why we need them in the first place. What we see on a computer screen is rendered text; what is actually stored on the computer's storage medium is a binary bit stream. The conversion rules between the two need a unified standard. Otherwise, plugging our USB drive into the boss's computer would scramble the document, and a file a friend sends us over QQ would turn to gibberish when opened locally. Character set standards emerged to provide exactly that conversion standard. Simply put, a character set specifies the mapping from a character to its corresponding binary digits (encoding) and from a string of binary digits back to the character it represents (decoding).

So why are there so many character set standards? The question almost answers itself: why won't our plugs work in the UK? Why does a monitor have DVI, VGA, HDMI, and DP ports at the same time? Many specifications and standards were initially developed without any expectation that they would become globally applicable, or were deliberately made different from existing standards in the interests of the organization behind them. The result is a pile of mutually incompatible standards that all do the same job.

Having said that, let's look at a practical example. Below are the hexadecimal and binary encodings of the character 屌 under several character sets. Feeling 屌 yet?

| Character set | Hexadecimal encoding | Corresponding binary data     |
| ------------- | -------------------- | ----------------------------- |
| UTF-8         | 0xE5B18C             | 1110 0101 1011 0001 1000 1100 |
| UTF-16        | 0x5C4C               | 0101 1100 0100 1100           |
| GBK           | 0x8CC5               | 1000 1100 1100 0101           |
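As a quick sanity check, the table above can be reproduced with Python's built-in codecs (a sketch; `utf-16-be` is used here so no byte-order mark is emitted):

```python
# Encode the same character under three character sets; the resulting
# byte sequences should match the table above.
ch = "\u5C4C"  # 屌
print(ch.encode("utf-8").hex().upper())      # E5B18C
print(ch.encode("utf-16-be").hex().upper())  # 5C4C
print(ch.encode("gbk").hex().upper())        # 8CC5
```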

What is character encoding

A character set by itself is just the name of a set of rules; a real-life analogy would be the name of a language, such as English, Chinese, or Japanese. Correctly encoding and transcoding a character requires three key elements: the character repertoire, the coded character set, and the character encoding form. The character repertoire is effectively a database of every readable or displayable character; it determines the full range of characters the character set can represent. The coded character set assigns each character a coded value, the code point, which indicates that character's position within the repertoire. The character encoding defines the conversion between the coded character set and the values actually stored. Usually the code point is stored directly as the encoded value. For example, in ASCII, A is the 65th entry in the table, so the encoded value of A is 0100 0001, the binary form of decimal 65.

Many readers may have the same question I had at this point: the repertoire and the coded character set both seem essential, and since every character in the repertoire already has its own serial number, why not simply store that serial number as the content? Why bother with a character encoding that converts the serial number into yet another storage format? The reason is easy to understand: the unified repertoire aims to cover every character in the world, but in actual use any one text touches only a tiny fraction of it. For example, programs in Chinese rarely need Japanese characters, while in some English-speaking countries even the simple ASCII repertoire meets basic needs.
If each character were stored as its ordinal in the repertoire, every character would need 3 bytes (in the case of the Unicode repertoire), which would triple the storage cost for English-speaking countries accustomed to one-byte ASCII. To put it more directly, the same hard drive that holds 1,500 articles in ASCII would hold only 500 under 3-byte ordinal storage. Hence variable-length encodings such as UTF-8: ASCII characters, which used to need only one byte, still take only one byte in UTF-8, while complex characters such as Chinese and Japanese take two to three bytes.
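The storage difference is easy to observe directly; a minimal sketch:

```python
# ASCII text costs 1 byte per character in UTF-8; CJK text costs 3.
ascii_text = "hello"
cjk_text = "你好"
print(len(ascii_text.encode("utf-8")))  # 5 bytes for 5 characters
print(len(cjk_text.encode("utf-8")))    # 6 bytes for 2 characters
```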

The relationship between UTF-8 and Unicode

After the two concepts above, the relationship between UTF-8 and Unicode is easy to state: Unicode is the coded character set mentioned above, and UTF-8 is a character encoding, i.e., one implementation of the Unicode repertoire. With the development of the Internet, the need for a single shared character set became more and more urgent, and the Unicode standard naturally emerged. It covers nearly every possible symbol and script in every national language and assigns each a number. See: Unicode on Wikipedia. Unicode numbers run from 0000 up to 10FFFF and are divided into 17 planes, with 65,536 code points in each plane. The three-byte UTF-8 sequences described below cover only the first plane (the Basic Multilingual Plane); full UTF-8 uses up to four bytes and covers all of Unicode. However, some widely deployed implementations, such as MySQL's legacy utf8 character set, stop at three bytes, which makes special characters outside the first plane difficult to handle in some scenarios (as discussed below).
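The distinction between a Unicode number (code point) and its UTF-8 byte encoding, and between plane 0 and the higher planes, can be sketched as:

```python
ch = "\u5C4C"         # a plane 0 (BMP) character, 屌
emoji = "\U0001F601"  # a plane 1 character (an Emoji)
print(hex(ord(ch)))                # 0x5c4c — the Unicode code point
print(len(ch.encode("utf-8")))     # 3 — bytes in its UTF-8 encoding
print(len(emoji.encode("utf-8")))  # 4 — beyond what 3-byte-only implementations accept
```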

Introduction to UTF-8 encoding

To better understand the practical material that follows, let us briefly introduce how UTF-8 encoding is implemented, i.e., the conversion between UTF-8's physical storage and Unicode ordinals. UTF-8 is a variable-length encoding whose minimum code unit is one byte. The leading bits of each byte (one to four of them) are a descriptive part; the remaining bits carry the actual ordinal:

  • If a byte's first bit is 0, the current character is a single-byte character occupying one byte of space. The 7 bits after the 0 represent the ordinal in Unicode.
  • If a byte starts with 110, the current character is a double-byte character occupying 2 bytes of space. The 5 bits after 110, plus the 6 bits of the next byte after its 10 prefix, represent the ordinal in Unicode. The second byte starts with 10.
  • If a byte starts with 1110, the current character is a three-byte character occupying 3 bytes of space. The 4 bits after 1110, plus the 6 bits of each of the remaining two bytes after their 10 prefixes (12 bits in total), represent the ordinal in Unicode. The second and third bytes start with 10.
  • If a byte starts with 10, it is a continuation byte of a multi-byte character. The 6 bits after the 10 join the preceding parts to form the ordinal in Unicode.

The bit pattern of each byte is summarized in the table below, where x marks the ordinal part; concatenating all the x bits of the bytes yields the character's ordinal in the Unicode repertoire.

| Byte 1    | Byte 2    | Byte 3    |
| --------- | --------- | --------- |
| 0xxx xxxx |           |           |
| 110x xxxx | 10xx xxxx |           |
| 1110 xxxx | 10xx xxxx | 10xx xxxx |
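The bit rules above can be turned into a small decoder. This is a sketch covering only the 1- to 3-byte forms discussed here (real UTF-8 also has a 4-byte form); the function name is illustrative:

```python
def decode_first_utf8_char(data):
    """Decode the first UTF-8 character (1-3 bytes, per the bit patterns above).

    Returns (code_point, bytes_consumed).
    """
    b0 = data[0]
    if b0 >> 7 == 0b0:        # 0xxx xxxx: single byte
        return b0 & 0x7F, 1
    if b0 >> 5 == 0b110:      # 110x xxxx 10xx xxxx
        return ((b0 & 0x1F) << 6) | (data[1] & 0x3F), 2
    if b0 >> 4 == 0b1110:     # 1110 xxxx 10xx xxxx 10xx xxxx
        return ((b0 & 0x0F) << 12) | ((data[1] & 0x3F) << 6) | (data[2] & 0x3F), 3
    raise ValueError("not a 1-3 byte UTF-8 leading byte")

cp, n = decode_first_utf8_char(b"\xe5\xb1\x8c")
print(hex(cp), n)  # 0x5c4c 3
```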

Let's look at three examples of UTF-8 encoding, from one byte to three bytes:

| Character | Hex ordinal in Unicode | Binary ordinal in Unicode | UTF-8 encoded binary          | UTF-8 hex |
| --------- | ---------------------- | ------------------------- | ----------------------------- | --------- |
| $         | 0024                   | 010 0100                  | 0010 0100                     | 24        |
| ¢         | 00A2                   | 000 1010 0010             | 1100 0010 1010 0010           | C2 A2     |
| €         | 20AC                   | 0010 0000 1010 1100       | 1110 0010 1000 0010 1010 1100 | E2 82 AC  |
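The examples above can be reproduced with Python's built-in codec, as a sketch:

```python
# Print each character's Unicode code point and its UTF-8 byte sequence.
for ch in "$\u00A2\u20AC":  # $, ¢, €
    print(hex(ord(ch)), ch.encode("utf-8").hex().upper())
# 0x24 24
# 0xa2 C2A2
# 0x20ac E282AC
```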

Careful readers can easily derive the following rules from the brief introduction above:

  • The hex encoding of a 3-byte UTF-8 character must begin with E.
  • The hex encoding of a 2-byte UTF-8 character must begin with C or D.
  • The hex encoding of a 1-byte UTF-8 character must begin with a hex digit smaller than 8.

Why is it garbled

Garbled text is also known as mojibake (a transliteration from the Japanese). Simply put, mojibake occurs when text is encoded with one character set and decoded with a different, incompatible one. A real-life analogy: an Englishman writes bless on a piece of paper (the encoding process). A Frenchman picks up the paper and, since the similar word blessé means "injured" in French, concludes the writer was hurt (the decoding process). It is the same in computing: a string encoded in UTF-8 gets decoded as GBK. Because the two character sets have different tables, the same bytes land at different positions in each table, and garbled characters come out in the end. Let's look at an example. Suppose we use UTF-8 to store the word 很屌 ("very cool"); we get the following conversion:

| Character | UTF-8 hex |
| --------- | --------- |
| 很        | E5BE88    |
| 屌        | E5B18C    |

So the stored bytes are E5BE88E5B18C. When displaying, however, GBK is used to decode them. Looking the bytes up in the GBK table yields:

| Two-byte hex value | Character after GBK decoding |
| ------------------ | ---------------------------- |
| E5BE               | 寰                           |
| 88E5               | 堝                           |
| B18C               | 睂                           |

After decoding, we get the wrong result, 寰堝睂: not only is the text wrong, even the number of characters has changed.
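The whole mishap can be reproduced in a few lines of Python, as a sketch (what the garbled characters look like depends on the GBK table; the byte values are the point):

```python
# Encode with UTF-8, then (incorrectly) decode with GBK — classic mojibake.
data = "\u5F88\u5C4C".encode("utf-8")  # 很屌
print(data.hex().upper())              # E5BE88E5B18C
garbled = data.decode("gbk")           # three unrelated GBK characters
print(garbled)
print(len(garbled))                    # 3 — the character count changed too
```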

How to identify the original intended expression of garbled text

Reverse-engineering the original, correct characters from garbled text requires a deep understanding of each character set's encoding rules, but the principle is very simple. Here we take the most common case, UTF-8 content incorrectly displayed as GBK, as an example to illustrate the recovery and identification process.

Step 1: Encode

Suppose we see garbled characters like 寰堝睂 on a page and learn that the browser currently uses GBK encoding. The first step is to encode the garbled text back into its binary form via GBK. Looking it up in a table by hand is very inefficient; we can let the MySQL client do the encoding work with the following SQL statement:

```sql
mysql [localhost] {msandbox} > select hex(convert('寰堝睂' using gbk));
+----------------------------------+
| hex(convert('寰堝睂' using gbk)) |
+----------------------------------+
| E5BE88E5B18C                     |
+----------------------------------+
1 row in set (0.01 sec)
```

Step 2: Identify

Now we have the encoded byte string E5BE88E5B18C. Next, we break it down byte by byte:

| Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
| ------ | ------ | ------ | ------ | ------ | ------ |
| E5     | BE     | 88     | E5     | B1     | 8C     |

Then, applying the rules outlined in the UTF-8 encoding introduction above, it is not hard to see that these six bytes comply with the UTF-8 encoding rules: each E5 starts a three-byte sequence, and the bytes that follow are 10xx xxxx continuation bytes. If the entire data stream conforms to this rule, we can boldly assume that the character set in use before the text was garbled was UTF-8.
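In practice, a quick way to run this identification step is to simply attempt a strict UTF-8 decode; a minimal sketch (the function name is illustrative):

```python
def looks_like_utf8(data):
    """Heuristic from Step 2: does the byte stream obey UTF-8's bit rules?"""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8(bytes.fromhex("E5BE88E5B18C")))  # True
print(looks_like_utf8(bytes.fromhex("E5BE")))          # False: truncated sequence
```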

Step 3: Decode

Then we take E5BE88E5B18C, decode it with UTF-8, and see the text as it was before it got garbled. Again, SQL gives us the result directly, with no table lookup:

```sql
mysql [localhost] {msandbox} ((none)) > select convert(0xE5BE88E5B18C using utf8);
+------------------------------------+
| convert(0xE5BE88E5B18C using utf8) |
+------------------------------------+
| 很屌                               |
+------------------------------------+
1 row in set (0.00 sec)
```
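The same round trip works in Python, as a sketch:

```python
# Decode the recovered bytes with the correct character set.
raw = bytes.fromhex("E5BE88E5B18C")
print(raw.decode("utf-8"))  # the original text, 很屌
```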

FAQ: handling Emoji

Emoji are characters in the Unicode range \u1F601 through \u1F64F, which is clearly beyond the BMP range \u0000–\uFFFF that the commonly deployed three-byte form of UTF-8 can represent. Emoji have become more common with the popularity of iOS and the support of WeChat.

So what impact do Emoji characters have on our daily development and operations? The most common problem arises when storing them in a MySQL database. Generally speaking, the default character set of a MySQL database is utf8 (at most three bytes); utf8mb4 was not supported until 5.5, and very few DBAs proactively change the system default to utf8mb4. The consequence is that storing a character requiring a 4-byte UTF-8 encoding produces an error: ERROR 1366: Incorrect string value: '\xF0\x9D\x8C\x86' for column. If you have read the explanation above carefully, this error is not hard to understand: we are trying to insert a string of bytes whose first byte is \xF0, meaning a four-byte UTF-8 sequence, into a column whose character set is utf8. MySQL cannot store such characters when the table and column character set is set to utf8.

So how do we handle this situation? There are two approaches. The first is to upgrade MySQL to 5.6 or later and switch the table's character set to utf8mb4. The second is to filter the content before storing it in the database: replace each Emoji character with a special text code, store that code, and convert it back into the Emoji when reading from the database or rendering on the front end. For the second approach, we can use -*-1F601-*- to replace the 4-byte Emoji; for details, see the answer on Stack Overflow with example Python code.
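The second workaround can be sketched as follows. The -*-XXXX-*- placeholder format follows the article; the function names are illustrative, not from any library, and a real implementation should also guard against the placeholder pattern occurring in user input:

```python
import re

def encode_non_bmp(s):
    """Replace each character outside the BMP (e.g. Emoji) with -*-HEX-*-."""
    return re.sub(r"[\U00010000-\U0010FFFF]",
                  lambda m: "-*-%X-*-" % ord(m.group(0)), s)

def decode_non_bmp(s):
    """Turn -*-HEX-*- placeholders back into the original characters."""
    return re.sub(r"-\*-([0-9A-F]+)-\*-",
                  lambda m: chr(int(m.group(1), 16)), s)

stored = encode_non_bmp("nice\U0001F601")
print(stored)                  # nice-*-1F601-*-  — safe for a 3-byte utf8 column
print(decode_non_bmp(stored))  # nice😁
```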