After writing code for so long, I suddenly realized that I didn't really understand character encoding. How do the code and files we write every day actually exist on a computer? As we all know, the only things a computer can recognize are zeros and ones, so whatever data you have, it is ultimately just a sequence of zeros and ones. How, then, does a computer make sense of these sequences? That is what character encoding is for. In simple terms, a character encoding maps information to particular sequences of zeros and ones according to fixed rules, so that the computer can recover the real information behind the bits on disk. Common encodings such as ASCII, UTF-8, and GBK are typical examples, and how a computer recognizes them comes down to how each encoding works.

ASCII

ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet. Note that the name ends in the letters "II", not the Roman numeral 2. ASCII is a standard single-byte character encoding scheme developed by the American National Standards Institute for basic text data. It originated in the late 1950s and was finalized in 1967. Originally a U.S. national standard, adopted as a common standard for encoding Western characters so that different computers could exchange data, it was later established by the International Organization for Standardization as the international standard known as ISO 646. It applies to all Latin letters.

Standard ASCII uses a single byte, that is, eight binary bits, to represent a character. The highest bit is set to 0 and the remaining seven bits encode the character, so ASCII defines a total of 128 characters, including all upper- and lower-case letters, the digits 0 to 9, punctuation marks, and some special control characters.
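
To see this concretely, here is a minimal Java sketch (my own illustration, not from the original article) that prints a character's ASCII value and its 8-bit pattern, showing the leading 0:

```java
public class AsciiDemo {
    public static void main(String[] args) {
        char c = 'A';
        // 'A' has ASCII value 65; the highest bit of a standard ASCII byte is 0
        System.out.println((int) c);  // 65
        String bits = String.format("%8s", Integer.toBinaryString(c)).replace(' ', '0');
        System.out.println(bits);     // 01000001
    }
}
```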

The standard ASCII code table maps the values 0 to 127 to control characters, punctuation, digits, and upper- and lower-case letters.

ASCII is an American standard and cannot meet the needs of other languages, such as the pound sign or Chinese characters. Some Western countries use all 8 bits of a byte to represent characters, extending the set to 256 characters. Obviously, for Chinese characters, one byte is still not enough.

ANSI

In order to extend ASCII so that it could display their own languages, different countries and regions developed different standards, producing GB2312, BIG5, JIS, and other encodings. These extended encodings, which use two bytes to represent a character, are collectively called ANSI encodings, also known as MBCS (Multi-Byte Character Set). On a Simplified Chinese system, "ANSI" refers to GB2312; on a Japanese operating system, it refers to JIS. So on Chinese Windows, converting text to GB2312 or GBK only requires saving it as "ANSI". Different ANSI encodings are incompatible with one another, and the same binary value may represent different characters under different encodings, which eventually led to Unicode. Before introducing Unicode, let's take a quick look at the Chinese encodings in the ANSI family.
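
This incompatibility is easy to reproduce. Below is a minimal Java sketch (my own example; it assumes your JDK ships the GBK charset, which standard JDK distributions do): the same GBK bytes decode correctly as GBK but come out garbled as UTF-8:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String s = "中文";
        byte[] gbkBytes = s.getBytes(Charset.forName("GBK"));  // two bytes per Chinese character
        // Decoding with the right charset recovers the text...
        System.out.println(new String(gbkBytes, Charset.forName("GBK")));  // 中文
        // ...decoding the same bytes as UTF-8 produces garbled output
        System.out.println(new String(gbkBytes, StandardCharsets.UTF_8));  // garbled
    }
}
```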

GB2312

GB2312 is one of the ANSI encodings, extending the original ASCII code. To meet the need to process Chinese characters on domestic computers, China's State Bureau of Standards issued a series of national standard character set codes, collectively called GB codes, or national standard codes. The most influential of these is the "Basic Set of the Chinese Coded Character Set for Information Interchange", published in 1980 with the standard number GB 2312-1980; because it is so widely used, it is often simply called the national standard code. GB2312 is widely used in mainland China, and Singapore and some other regions use it as well. Almost all Chinese systems and internationalized software support GB 2312.

GB2312 is a simplified Chinese character set consisting of 6,763 commonly used Chinese characters and 682 full-width non-Chinese characters. The Chinese characters are divided into two levels by frequency of use: 3,755 first-level characters and 3,008 second-level characters.

GBK

The emergence of GB2312 basically met the needs of computer processing of Chinese characters, but it cannot handle rare characters found in personal names, classical Chinese, and so on, which led to the emergence of GBK.

GBK uses double-byte encoding. The overall code range is 0x8140–0xFEFE: the first byte lies between 0x81 and 0xFE, the second byte between 0x40 and 0xFE, with the column xx7F excluded. GBK defines 23,940 code points in total and encodes 21,886 characters and graphic symbols, comprising 21,003 Chinese characters (including radicals and components) and 883 graphic symbols.
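
To make those ranges concrete, here is a minimal Java sketch (my own illustration; the helper name isGbkDoubleByte is hypothetical) that checks whether a byte pair falls inside the GBK double-byte range described above:

```java
public class GbkRange {
    // GBK double-byte range: first byte 0x81-0xFE, second byte 0x40-0xFE, excluding xx7F
    static boolean isGbkDoubleByte(int b1, int b2) {
        return b1 >= 0x81 && b1 <= 0xFE
            && b2 >= 0x40 && b2 <= 0xFE && b2 != 0x7F;
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = "中".getBytes("GBK");  // GBK encodes 中 as 0xD6 0xD0
        System.out.printf("%02X %02X -> %b%n",
                bytes[0] & 0xFF, bytes[1] & 0xFF,
                isGbkDoubleByte(bytes[0] & 0xFF, bytes[1] & 0xFF));  // D6 D0 -> true
    }
}
```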

Unicode

The obvious disadvantage of ANSI encodings is that the same encoded value represents different characters in different encoding systems, which easily produces garbled text. If there were one code that included every symbol in the world, with a unique code value for each symbol, the garbling problem would disappear. That is Unicode.

Unicode is a very large set, now big enough to hold more than a million symbols, each with its own code. Strictly speaking, Unicode is not an encoding: it is a character set that assigns each symbol a number (its code point), but it does not specify how that number should be stored as bytes. Unicode has several concrete encoding forms, of which UTF-8 is the most widely used.
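
A minimal Java sketch (my own example, using the character 严 that also appears in the reference articles) makes the distinction visible: the code point is the number the character set assigns, while UTF-8 decides the bytes actually stored:

```java
import java.nio.charset.StandardCharsets;

public class CodePointDemo {
    public static void main(String[] args) {
        String s = "严";
        // The character set assigns the number (the code point)...
        System.out.printf("U+%04X%n", s.codePointAt(0));  // U+4E25
        // ...while the encoding decides how that number is stored as bytes
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b);                // E4 B8 A5
        }
    }
}
```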

UTF-8

UTF-8 is the most widely used Unicode encoding form. It is a variable-length encoding that uses 1 to 4 bytes to represent a symbol, choosing the length according to the symbol, which makes Unicode storage more efficient. Let's look at the UTF-8 layout for each byte count, with x marking the bits available for the code point:

Bytes   Encoding pattern                      Representable characters
1       0xxxxxxx                              2^7  = 128
2       110xxxxx 10xxxxxx                     2^11 = 2048
3       1110xxxx 10xxxxxx 10xxxxxx            2^16 = 65536
4       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   2^21 = 2097152

From the table above, the UTF-8 encoding rules are easy to see:

  • For a single-byte symbol, the first bit is 0 and the following 7 bits hold the symbol's Unicode code point. For single-byte symbols, UTF-8 is therefore exactly the same as ASCII.

  • For an n-byte symbol, the first n bits of the first byte are all 1, the (n+1)-th bit is 0, the first two bits of every subsequent byte are 10, and the remaining bits are filled with the character's Unicode code point.

With this variable-length encoding, single-byte symbols need only one byte, with no waste, while Chinese characters generally take three bytes. It is also easy for a computer to tell how many bytes a character occupies (see the sketch after this list):

  • If the first bit of a byte is 0, the byte is a complete character by itself.
  • If the first bit is 1, the number of consecutive leading 1s indicates how many bytes the current character occupies.
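
Here is a minimal Java sketch of that leading-byte rule (my own illustration; sequenceLength is a hypothetical helper name):

```java
public class Utf8Length {
    // Number of bytes in a UTF-8 sequence, judged from its first byte
    static int sequenceLength(int firstByte) {
        if ((firstByte & 0b1000_0000) == 0) return 1;           // 0xxxxxxx
        if ((firstByte & 0b1110_0000) == 0b1100_0000) return 2; // 110xxxxx
        if ((firstByte & 0b1111_0000) == 0b1110_0000) return 3; // 1110xxxx
        if ((firstByte & 0b1111_1000) == 0b1111_0000) return 4; // 11110xxx
        throw new IllegalArgumentException("not a leading byte: " + firstByte);
    }

    public static void main(String[] args) {
        byte[] bytes = "严a".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(sequenceLength(bytes[0] & 0xFF)); // 3 (Chinese character)
        System.out.println(sequenceLength(bytes[3] & 0xFF)); // 1 ('a')
    }
}
```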

In addition to UTF-8, there is also UTF-16. The basic code unit of UTF-8 is 8 binary bits, while that of UTF-16 is 16 binary bits. Sixteen bits can directly represent 65,536 characters, so in UTF-16 Chinese characters and English letters have equal status: each takes one 16-bit unit, that is, two bytes per character. This is wasteful for English text but saves storage space for Chinese text.
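
A minimal Java sketch (my own example; it uses UTF-16BE so the byte count is not inflated by a byte-order mark) comparing the two encodings:

```java
import java.nio.charset.StandardCharsets;

public class Utf8VsUtf16 {
    public static void main(String[] args) {
        String english = "hello";
        String chinese = "你好";
        // UTF-8: 1 byte per English letter, 3 bytes per Chinese character
        System.out.println(english.getBytes(StandardCharsets.UTF_8).length);    // 5
        System.out.println(chinese.getBytes(StandardCharsets.UTF_8).length);    // 6
        // UTF-16: 2 bytes per character in both cases
        System.out.println(english.getBytes(StandardCharsets.UTF_16BE).length); // 10
        System.out.println(chinese.getBytes(StandardCharsets.UTF_16BE).length); // 4
    }
}
```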

This should give you a general understanding of character encoding. In daily work, try to keep encodings consistent to avoid garbled text.

Reference articles:

  • Character encoding notes: ASCII, Unicode and UTF-8
  • Why is UTF-8 more wasteful than UTF-16?
  • Talk about Unicode encoding
