In a Chinese-language environment, programmers regularly run into garbled Chinese text, and the root cause is character encoding. Before you understand the underlying principles, the problem can feel mysterious: text turns into gibberish for no apparent reason and is hard to put right.

Character encoding is a cornerstone of computer technology. This article aims to help you sort out character encoding thoroughly, so that you know not only what is so but also why it is so, and stop feeling at the mercy of garbled Chinese text.

Before we get into Chinese encoding, we need to talk about how English is encoded, and the solution there is ASCII.

ASCII

ASCII, the American Standard Code for Information Interchange, is a character encoding scheme based on the Latin alphabet. Each character is stored in a single byte, and only 7 of the byte's 8 bits are actually needed for its 128 characters.

Inside a computer, all information is ultimately a binary value. Each bit has two states, 0 and 1, and a group of eight bits is called a byte.

In other words, a byte can represent 256 different states, and each state can correspond to one symbol, i.e. 256 symbols, from 0000 0000 to 1111 1111. The count is simply 2^8 = 256.

Some ASCII codes are listed below:

Binary      Hexadecimal   Character
0010 0000   20            (space)
0010 0001   21            !
0011 0001   31            1
0011 1101   3D            =
0100 0001   41            A
0110 0001   61            a
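
The same mapping can be checked from a program. Here is a quick Python look at the characters in the table above, using only the built-ins ord, hex and encode:

# Each ASCII character fits in one byte, and its value matches the table above
for ch in [' ', '!', '1', '=', 'A', 'a']:
    code = ord(ch)                                 # numeric code of the character
    print(repr(ch), hex(code), format(code, '08b'))

# 'A' encoded as ASCII is exactly one byte with value 0x41
print('A'.encode('ascii'))                         # b'A'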

English needs only the 26 basic Latin letters, the Arabic numerals, and some punctuation marks, so the 128 symbols defined by ASCII are enough for it. ASCII is hopelessly inadequate for larger writing systems, however, such as Chinese, which has roughly 100,000 characters (there is no exact count, and several thousand are in daily use).

A single byte, which can represent only 256 symbols, is clearly not enough, so more than one byte per symbol must be used. To display Chinese characters correctly, the State Administration of Standards of China issued the "Basic Set of Chinese Coded Characters for Information Interchange", which took effect on May 1, 1981 and is usually referred to as GB2312.

The GB family

GB2312 is supported by almost all Chinese systems and internationalized software in mainland China; it is the common encoding for simplified Chinese. It uses two bytes to represent one Chinese character, so it can represent at most 256 × 256 = 65,536 symbols.
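
The two-byte representation is easy to verify in Python; a minimal sketch (the byte values in the comments are what the gb2312 codec produces):

# A common Chinese character takes two bytes in GB2312, an ASCII character only one
print('中'.encode('gb2312'))          # b'\xd6\xd0', two bytes
print(len('中'.encode('gb2312')))     # 2
print(len('A'.encode('gb2312')))      # 1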

The GB2312 standard contains 6,763 Chinese characters in total: 3,755 level-1 characters (commonly used) and 3,008 level-2 characters (less commonly used). It also contains 682 other characters, including Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters.

GB2312 basically meets the needs of processing simplified Chinese on a computer; the characters it includes cover 99.75% of everyday usage by frequency. Rare characters and traditional characters, however, cannot be handled by GB2312, hence the creation of GBK and GB18030.

The relationship between the three is as follows:

GBK is a superset of GB2312 and is fully backward compatible with it: compatibility here means not only that every GB2312 character is included, but also that each of those characters keeps exactly the same code. GB18030 is in turn backward compatible with both GBK and GB2312, and it is a variable-length encoding (a character occupies 1, 2, or 4 bytes).
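
This compatibility can be checked directly with Python's codecs. A minimal sketch, using 中 as a GB2312 character and the traditional character 愛 as an example of a character outside GB2312:

# A GB2312 character has identical bytes under all three GB encodings
for codec in ('gb2312', 'gbk', 'gb18030'):
    print(codec, '中'.encode(codec))   # b'\xd6\xd0' in every case

# A traditional character is not in GB2312, but the superset encodings handle it
print('愛'.encode('gbk'))              # works, two bytes
print('愛'.encode('gb18030'))          # works, same bytes as GBK
try:
    '愛'.encode('gb2312')
except UnicodeEncodeError as err:
    print('not representable in GB2312:', err)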

But encoding schemes like the GB family share a common problem: they let a computer handle a bilingual environment, that is, Latin letters plus one local language, but not a multilingual environment in which many languages are mixed together. Unicode was created to solve this problem.

Unicode

There are many languages in the world, such as Spanish, Korean and Russian, and each has its own encoding methods, so the same binary value can be interpreted as different symbols under different encodings. To open a text file correctly, you must know which encoding it uses; otherwise you get garbled text.

Now suppose there were a single code that included every symbol in the world, with each symbol given a unique code; the garbling problem would disappear. That is Unicode: one code for all symbols.

Unicode has evolved together with the Universal Character Set standard. The version current at the time of writing, 12.1.0, released in May 2019, contains more than 130,000 characters. Besides the visual glyphs, encoding methods, and standard character codes, it also defines character properties such as upper and lower case.

However, Unicode is only a character set; it does not say how characters are represented inside the computer. It specifies only each symbol's binary code point, not how that code point should be stored.

Thus, two problems arise:

  • How does the computer know that, say, three bytes represent one symbol rather than three separate symbols?
  • A single byte is enough for English letters, but if every symbol were stored with three or four bytes, English text would take three or four times as much space, which would be an enormous waste (see the quick comparison below).
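
The scale of that waste is easy to see by comparing UTF-8 with UTF-32, the fixed-width Unicode encoding that always uses four bytes per character (utf-32-le is used below so the byte-order mark is not counted):

text = 'Hello, world'                   # plain English text, 12 characters
print(len(text.encode('utf-8')))        # 12 bytes, one byte per character
print(len(text.encode('utf-32-le')))    # 48 bytes, four bytes per character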

With the development of the Internet, more and more text in different languages is transmitted across it, so a unified encoding method is badly needed. UTF-8 is the Unicode implementation most widely used on the Internet.

UTF-8

It is worth stressing that UTF-8 is one implementation of Unicode, not the only one, and it is not the same thing as Unicode itself. Besides UTF-8 there are also UTF-16 and UTF-32, though they are used far less often.

The defining feature of UTF-8 is that it is a variable-length encoding: it uses 1 to 4 bytes to represent a symbol, and the length varies with the symbol.

The coding rules are simple:

  • For a single-byte symbol, the highest bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. For English letters, therefore, UTF-8 is identical to ASCII.
  • For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n + 1 is set to 0, the first two bits of each following byte are set to 10, and all the remaining bits are filled with the symbol's Unicode code point.

Unicode and UTF-8 byte mapping:

Unicode range         UTF-8 byte sequence
U+0000  ~ U+007F      0xxxxxxx
U+0080  ~ U+07FF      110xxxxx 10xxxxxx
U+0800  ~ U+FFFF      1110xxxx 10xxxxxx 10xxxxxx
U+10000 ~ U+10FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
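
The table can be turned into code almost line by line. Below is a minimal Python sketch of the rule (the function name utf8_encode_codepoint is made up for illustration; in practice you would simply call str.encode('utf-8')):

def utf8_encode_codepoint(cp):
    # Choose the byte template according to the Unicode range in the table above
    if cp <= 0x7F:                        # U+0000 ~ U+007F: 0xxxxxxx
        return bytes([cp])
    elif cp <= 0x7FF:                     # U+0080 ~ U+07FF: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    elif cp <= 0xFFFF:                    # U+0800 ~ U+FFFF: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    else:                                 # U+10000 ~ U+10FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])

# Sanity check against the built-in codec
for ch in ('A', 'é', '爱', '😀'):
    assert utf8_encode_codepoint(ord(ch)) == ch.encode('utf-8')
    print(ch, utf8_encode_codepoint(ord(ch)).hex())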

Take the Chinese character 爱 ("love") as an example. Its Unicode code point is U+7231, which falls in the range U+0800 ~ U+FFFF, so it is encoded with 3 bytes; in binary the code point is 01110010 00110001.

Filling those 16 bits, from the highest bit down, into the template 1110xxxx 10xxxxxx 10xxxxxx gives 11100111 10001000 10110001, i.e. the UTF-8 bytes E7 88 B1.
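
The same result can be confirmed with Python's built-in codec:

ch = '爱'
print(hex(ord(ch)))               # 0x7231, the Unicode code point
print(ch.encode('utf-8'))         # b'\xe7\x88\xb1'
print(ch.encode('utf-8').hex())   # e788b1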

Afterword

Default Chinese encodings in common operating systems:

  • The default encoding for Chinese on Windows is GBK (a superset of GB2312).
  • The default encoding for Chinese on Linux is UTF-8.

On Linux, you can run the following command to check the system locale and encoding:

echo $LANG
en_US.UTF-8

To check a file's original encoding, or to convert it, you can use the enca command, which can be installed with apt-get install enca.

enca -L zh_CN file                          # check the encoding of the file
enca -L zh_CN -x UTF-8 file                 # convert the file to UTF-8 in place
enca -L zh_CN -x UTF-8 < file_1 > file_2    # keep the original file, write the converted copy to file_2
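
If enca is not available, the conversion can also be done with a few lines of Python. This is a minimal sketch that assumes the source file is GBK-encoded; in.txt and out.txt are placeholder file names:

# Read a GBK-encoded file and write it back out as UTF-8, keeping the original file
with open('in.txt', encoding='gbk') as src, open('out.txt', 'w', encoding='utf-8') as dst:
    dst.write(src.read())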

Character encoding selection suggestions:

  1. English text only: choose ASCII.
  2. Mostly Chinese text, and storage size matters: choose GB2312 (see the size comparison below).
  3. Portability and simple processing come first: choose UTF-8.
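
The storage difference behind suggestion 2 is easy to measure in Python: common Chinese characters take two bytes in GB2312 but three bytes in UTF-8.

text = '中文编码'                       # four common Chinese characters
print(len(text.encode('gb2312')))      # 8 bytes, two per character
print(len(text.encode('utf-8')))       # 12 bytes, three per character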

Finally, I have written a book entitled "Deep Understanding of NLP Chinese Word Segmentation: From Principle to Practice" to help you master Chinese word segmentation from scratch and step through the door of NLP. If this article has been helpful, I hope you will like, share, and comment. Thank you!