ASCII, GBK, Unicode and UTF-8

Inside a computer, all information ultimately boils down to binary values. Each bit has two states, 0 and 1, and eight bits grouped together are called a byte. A byte can therefore represent 256 different states, ranging from 00000000 to 11111111, and each state can be mapped to a symbol.

In the 1960s, the United States developed a character encoding scheme based on the Latin alphabet for representing modern English. Known as ASCII (American Standard Code for Information Interchange), it is still in use today.

ASCII defines a total of 128 character codes, and each character is commonly stored in one byte. The first bit of the byte is always 0, and the remaining 7 bits hold the character's code point. It is worth noting that an ASCII code point is simply the character's ordinal number in the ASCII character set: for example, the uppercase letter A is stored as the binary value 01000001, and its ASCII code point is 65, a direct one-to-one correspondence.
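
A minimal sketch of that correspondence, printing the code point and the bit pattern of the letter A:

#include <bitset>
#include <iostream>

int main()
{
  char letter = 'A';
  // The numeric value of 'A' is its ASCII code point.
  std::cout << static_cast<int>(letter) << '\n';   // 65
  // The byte that stores it: a leading 0 followed by the 7-bit code point.
  std::cout << std::bitset<8>(letter) << '\n';     // 01000001
  return 0;
}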

While ASCII is sufficient for English, it is not sufficient for other languages. Chinese, for example, has on the order of 100,000 characters. The Chinese government therefore introduced GB 2312 (Basic Set of Chinese Coded Characters for Information Interchange), which consists of two parts: a coded character set and an encoding method. I won't go into the details here; the important point is that this split between character set and encoding is the same one we will meet again with Unicode and UTF-8/UTF-16. It is worth mentioning that the two-byte scheme of GB 2312 can theoretically hold at most 256 x 256 = 65536 characters, far more than GB 2312 itself assigns, so there was room for extension: the character set commonly used today is actually GBK (Chinese Character Code Extension Specification, version 1.0). GBK itself, however, is not a national standard. It is an extension made by Microsoft (operating systems evolve far faster than national standards, so the vendor had to solve users' pain points first), which is why it carries no GB standard number.

So what is Unicode? Just as the Chinese government introduced the GB 2312 character set, other countries and multinational companies naturally introduced character sets of their own. Imagine a character set as a classroom: every student sitting at a desk is a character, and every student has a student ID. It is easy to see that different classrooms make up their own numbering rules, that the same student may sit in a different seat in different classrooms, and that the same student ID may therefore point to different students in different classrooms. There was an urgent need for a rule that would put every student in the world into one classroom and give each of them a unique student ID, so that any student could be looked up unambiguously. That is the Unicode character set: as its name implies, a single set containing all characters.

However, this leads to a new series of problems. Unicode, as an independent body, wants to push the world toward unified character set and encoding standards, but it cannot abolish the local encoding schemes already in use. Unicode therefore defines its own independent numbering, the Unicode Scalar Values, which are assigned quite differently from ASCII and the other familiar internal numbering schemes, and to remain compatible with the mainstream schemes it introduced the Unicode Transformation Formats (UTF), commonly seen as UTF-8, UTF-16 and UTF-32. UTF-32 is a fixed four-byte encoding with a beautiful one-to-one correspondence between its code units and Unicode Scalar Values; UTF-16 switches between two and four bytes; UTF-8 is variable-length, with its single-byte form compatible with ASCII.

The character set also keeps growing. 👨‍👩‍👧 (family: man, woman, girl) started out as one emoji, but a man + woman + girl/boy family could not cover everyone's needs, so more combinations were added: 👩‍👩‍👦 (woman, woman, boy), 👩‍👩‍👧 (woman, woman, girl), 👩‍👩‍👧‍👦 (woman, woman, girl, boy), 👨‍👨‍👧‍👦 (man, man, girl, boy), 👨‍👨‍👧‍👧 (man, man, girl, girl)... Later, skin tone could no longer be fixed as white either, so yellow, black, even alien variants were added on top.
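
A quick way to see this growth in practice is to count the code points behind one of the family emoji. The following is a minimal sketch, assuming a compiler whose narrow string literals are encoded as UTF-8 (the default for GCC and Clang):

#include <cstring>
#include <iostream>

int main()
{
  // Family emoji "man + woman + girl": three emoji joined by two
  // ZERO WIDTH JOINER (U+200D) characters, written here with escapes.
  const char *family = "\U0001F468\u200D\U0001F469\u200D\U0001F467";

  size_t bytes = std::strlen(family);
  size_t codePoints = 0;
  for (size_t i = 0; family[i]; ++i)
    if ((family[i] & 0xC0) != 0x80)  // count non-continuation bytes
      ++codePoints;

  // Prints "18 bytes, 5 code points": what renders as a single emoji
  // is really five Unicode scalar values glued together.
  std::cout << bytes << " bytes, " << codePoints << " code points\n";
  return 0;
}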

UTF-8 is by far the most widely used Unicode encoding, both for economic reasons (encoding plain ASCII English with fixed four-byte UTF-32 wastes space) and for future-proofing reasons (a fixed four bytes will not necessarily always hold the ever-growing set of Unicode characters).

UTF-8 rules

The UTF-8 encoding rules are simple; there are only two:

  1. For a single-byte character, the first bit of the byte is set to 0 and the remaining 7 bits hold the character's code point. UTF-8 is therefore identical to ASCII for English letters.
  2. For an n-byte character (n > 1), the first n bits of the first byte are set to 1, the (n + 1)-th bit is set to 0, and the first two bits of every following byte are set to 10. All remaining bits, not mentioned so far, are filled with the character's code point.

The conversion relationship between Unicode and UTF-8 (x marks the bits occupied by the code point):

| Bits in code point | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | | | |
| 11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | | | |
| 16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | | | |
| 21 | U+10000 | U+1FFFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | | |
| 26 | U+200000 | U+3FFFFFF | 5 | 111110xx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | |
| 31 | U+4000000 | U+7FFFFFFF | 6 | 1111110x | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
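
To make the table concrete, here is a minimal sketch of an encoder that turns a single code point into UTF-8 bytes, limited to the one- to four-byte rows actually used by Unicode today (the function name encodeUtf8 is just for illustration):

#include <cstdint>
#include <iostream>
#include <string>

// Encode one Unicode code point into UTF-8 following the table above
// (restricted to the 1- to 4-byte forms).
static std::string encodeUtf8(uint32_t cp)
{
  std::string out;
  if (cp <= 0x7F) {                       // 0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp <= 0x7FF) {               // 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp <= 0xFFFF) {              // 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                                // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}

int main()
{
  // U+4E2D ("中") falls in the U+0800..U+FFFF row, so it needs three bytes.
  for (unsigned char b : encodeUtf8(0x4E2D))
    std::cout << std::hex << static_cast<int>(b) << ' ';
  std::cout << '\n';   // prints: e4 b8 ad
  return 0;
}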

Issues that need attention

  1. A Chinese character in UTF-8 is not necessarily three bytes long

    I often see experienced programmers assume that a Chinese character takes two bytes in GBK and three bytes in UTF-8. The UTF-8 half of that is not always true: whether a Chinese character needs three bytes depends on whether it lies in the Unicode basic plane, and unusual characters outside it take four to six bytes. Because GBK was introduced early, the characters it covers essentially all sit in the basic plane, which is why the three-byte rule of thumb usually appears to hold (a quick check follows this list).

  2. Calculating the length of a UTF-8 encoded string should not be taken for granted

    Since Cocos2d-x does not natively provide an API for measuring UTF-8 strings, I have seen plenty of creative ways to compute the length of mixed Chinese and English text: assuming, say, that every Chinese character in the string takes a fixed four bytes; or calling down into native Objective-C or Java String library functions; and so on. However, once you have a length you usually also need to truncate the string, and then it is unclear what value to pass as the truncation parameter. Moreover, mainstream phones all support emoji input, and when a player's text contains many emoji the truncation result can be very unsatisfactory.
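
As a quick check on point 1, this minimal sketch prints the UTF-8 byte counts of a character inside the basic plane and one outside it, assuming the compiler encodes narrow string literals as UTF-8 (the default for GCC and Clang):

#include <cstring>
#include <iostream>

int main()
{
  // U+4E2D ("中") lies in the basic plane: 3 bytes in UTF-8.
  std::cout << std::strlen("\u4E2D") << '\n';     // 3
  // U+20BB7 ("𠮷") lies outside the basic plane: 4 bytes in UTF-8.
  std::cout << std::strlen("\U00020BB7") << '\n'; // 4
  return 0;
}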

An example that calculates the length of a UTF-8 encoded string

#include <iostream>

static inline size_t utf8Length(const char *s)
{
  size_t i = 0, j = 0;
  while (s[i])
  {
    // Count only the bytes that are not continuation bytes (10xxxxxx):
    // every UTF-8 character contributes exactly one non-continuation byte.
    // Equivalent with binary literals: if ((s[i] & 0b11000000) != 0b10000000) j++;
    if ((s[i] & 0xc0) != 0x80)
      j++;
    i++;
  }
  return j;
}

int main()
{
  // The original article used a 32-character Chinese poem here (four lines of
  // seven characters plus punctuation); the literal below is an approximate
  // reconstruction, and any UTF-8 text of 32 characters gives the same result.
  const auto &utf8 =
      "天上有井独自空，松柏岛上只慕枫。芜园枯藤叶兰空，星落天川遥映瞳。";
  auto size = utf8Length(utf8);
  std::cout << size << std::endl;
  return 0;
}
32

Process finished with exit code 0
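
Counting characters is only half of the problem from point 2; truncation is the other half. Below is a minimal sketch of a helper (the name utf8Truncate is just for illustration, not a Cocos2d-x API) that keeps at most a given number of UTF-8 characters without ever splitting a multi-byte sequence. Note that it truncates by code point, so ZWJ emoji sequences such as the family emoji above can still be visually broken apart.

#include <cstddef>
#include <iostream>
#include <string>

// Keep at most maxChars UTF-8 characters (code points) of s,
// never cutting a multi-byte sequence in the middle.
static std::string utf8Truncate(const std::string &s, size_t maxChars)
{
  size_t i = 0, chars = 0;
  while (i < s.size())
  {
    if ((s[i] & 0xC0) != 0x80)   // a non-continuation byte starts a new character
    {
      if (chars == maxChars)
        break;
      ++chars;
    }
    ++i;
  }
  return s.substr(0, i);
}

int main()
{
  // "你好世界" (4 characters, 12 bytes); keep the first two characters.
  // Assumes narrow string literals are encoded as UTF-8.
  std::cout << utf8Truncate("\u4F60\u597D\u4E16\u754C", 2) << std::endl;  // 你好
  return 0;
}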