Have you ever copied a document that looked fine on someone else's computer, only to find it garbled on your own? Or, as a programmer, opened someone else's code and found the Chinese comments garbled while the code itself was fine? These problems plagued me for a long time; switching character encodings often fixed them, but I never really understood the reason behind them. So I want to use this article to make the matter as clear as possible.

Origin of character encoding - ASCII

Computers store data in binary form, which humans can't read directly. Text stored on a computer therefore needs a mapping table to convert between binary and characters. For example, you can agree that 0000 0001 represents the letter A and 0000 0010 represents the letter B; then when you save the text ABA, the computer actually stores 0000 0001 0000 0010 0000 0001. Conversely, when you open the file, the editor reads the binary data and converts it back to ABA through the same lookup table.

In the early days of computing, one such mapping table was established as the standard for encoding and decoding: ASCII. It uses seven bits of a byte and encodes 128 characters.

A Hundred Flowers Bloom - multi-byte character sets

Latin-script countries could meet their local encoding needs by extending ASCII into the eighth bit, but for non-Latin countries a single-byte character set (an encoding in which every character fits in one byte is called a single-byte character set; encodings that use more than one byte per character are called multi-byte character sets) cannot cover the native language, so they use multiple bytes to represent a character. China, for example, adopted double-byte character encodings, of which the most widely used is GB2312. But GB2312 can only encode simplified Chinese, so GBK emerged: besides simplified Chinese, GBK also supports traditional Chinese as well as Japanese and Korean characters, making it a unified encoding.

GBK uses 1-2 bytes per character. The encoding is divided into many code pages, and each code page spans one byte, i.e. 0x00-0xFF. Decoding starts by looking up the first byte in a top-level table (figure below).

If the high bit of the first byte is 0, that is, the byte is in the range 0x00-0x7F, the character is looked up directly in the table, exactly as in traditional ASCII. For example, the percent sign % is encoded as 0x25.

If the high bit of the first byte is 1, that is, the byte is in the range 0x81-0xFF, the first byte selects a code page and the second byte gives the position within that page. For example, a character encoded as 0x8144 lives on code page 0x81 at position 0x44.

Because each country devised its own encoding, and often several of them, the inconsistent rules made text conversion very troublesome. So the ANSI convention was established: each country designates one standard multi-byte encoding. China's designated standard, for example, is GB2312.

Global Unity - Unicode

Because each country had developed its own multi-byte character encoding, there were many character sets in the world, which made converting text between countries very troublesome. So the major players sat down and agreed on a universal encoding: a single code for every character in the world, and that was Unicode. In the original scheme each character is two bytes, so 256*256 = 65,536 characters can be encoded, which was thought roughly sufficient for the world's characters.

For example, the Chinese character 汉 (han) has the Unicode encoding \u6c49; the \u marks it as a Unicode escape, and the value consists of the bytes 0x6C and 0x49. Representing it in two bytes raises the question of order: it can be stored as 6C 49 or as 49 6C. These two byte orders are called big-endian and little-endian mode.

Unicode is actually a fairly general concept; the encoding rules commonly used in practice are UCS-2, UTF-16, and UTF-8. The following sections describe each of them.

UCS-2 and UTF-16

UCS-2 is the original implementation of Unicode encoding: every character is encoded as exactly two bytes. The two possible byte orders give UCS-2 big-endian and UCS-2 little-endian modes. But UCS-2 can only encode BMP characters, while UTF-16 uses a variable-length scheme to accommodate the rest, with a minimum of two bytes. For BMP characters UTF-16 is identical to UCS-2; characters outside the BMP are encoded in four bytes.

UTF-8

Finally, UTF-8, which we are all familiar with. When Unicode came along, countries using the Latin script found themselves at a disadvantage: characters that used to take one byte now needed two. So UTF-8 was devised.

UTF-8 encoding rules

[1] UTF-8 is variable length, using 1 to 6 bytes per character;

[2] The number of consecutive leading 1 bits in the first byte gives the total number of bytes in the sequence;

[3] If the first byte starts with a 0 bit, the character occupies a single byte;

[4] In a multi-byte character, every byte after the first begins with 10.

The rules above are clearer in table form:

Bytes used   First byte range   Complete representation
1 byte       0x00-0x7F          0xxxxxxx
2 bytes      0xC0-0xDF          110xxxxx 10xxxxxx
3 bytes      0xE0-0xEF          1110xxxx 10xxxxxx 10xxxxxx
4 bytes      0xF0-0xF7          11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes      0xF8-0xFB          111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes      0xFC-0xFD          1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Using these rules we can split a UTF-8 string into its constituent characters. The C++ code below does exactly that in order to mask the middle of a name:

#include <string>
#include <vector>

inline std::string GetHideName(const std::string& sUtf8Data)
{
    std::vector<std::string> vName;

    for (size_t i = 0, len = 0; i < sUtf8Data.length(); i += len) {
        // The lead byte's value determines the sequence length.
        unsigned char byte = static_cast<unsigned char>(sUtf8Data[i]);
        if (byte >= 0xFC)      // 1111110x: 6-byte sequence
            len = 6;
        else if (byte >= 0xF8) // 111110xx: 5-byte sequence
            len = 5;
        else if (byte >= 0xF0) // 11110xxx: 4-byte sequence
            len = 4;
        else if (byte >= 0xE0) // 1110xxxx: 3-byte sequence
            len = 3;
        else if (byte >= 0xC0) // 110xxxxx: 2-byte sequence
            len = 2;
        else                   // 0xxxxxxx: single byte
            len = 1;
        vName.push_back(sUtf8Data.substr(i, len));
    }

    std::string sQxName;
    if (vName.size() <= 2)
        sQxName = !vName.empty() ? (vName.front() + "*") : sUtf8Data;
    else
        sQxName = vName.front() + "**" + vName.back();
    return sQxName;
}

Common questions

[1] What is the difference between GBK and GB2312? GBK is a superset of GB2312: roughly, GB2312 encodes simplified Chinese, while GBK adds traditional Chinese plus Japanese and Korean characters on top of it. Since GBK is an extension of GB2312, simplified Chinese characters use the same codes in both.