Character encoding notes

Preface

I believe many of you have run into garbled characters during development.

Opening a file with the wrong character encoding, or using one encoding on the front end and a different one when the back end parses parameters, can both produce garbled text.

The character encodings we know and use most often are ASCII, GBK, Unicode, and UTF-8.

But do we really understand them? Why are there so many character encodings? Why not just use one? How are they related? (The soul-searching questions.)

Fat Hao only knew the what, not the why, so he decided to find out.

Main text

ASCII

ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet, used mainly to represent modern English. ASCII itself is equivalent to the international standard ISO/IEC 646, while its extended versions (extended ASCII) partially support other Western European languages.

Simply put, ASCII is a single agreed mapping between English letters, digits, and symbols on one side and binary values on the other. It defines 128 character codes in total. Each is stored in one byte but uses only the low 7 bits; the highest bit is always 0.
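The 7-bit rule can be checked with a few lines of Python. This is only a small sketch; the helper name `is_ascii_byte` is made up for illustration.

```python
def is_ascii_byte(value: int) -> bool:
    """Return True if the highest bit of a byte value is 0 (hypothetical helper)."""
    return (value & 0x80) == 0

for ch in ("A", "z", "0", "~"):
    code = ord(ch)                 # numeric code of the character
    stored = ch.encode("ascii")    # the single byte that stores it
    # Print the character, its code, the 8 bits of the byte, and the 7-bit check.
    print(ch, code, format(code, "08b"), is_ascii_byte(stored[0]))
```

Every printed bit pattern starts with 0, because all ASCII codes are below 128.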

GB

The national standard codes (Guobiao, abbreviated GB) are the encoding sets for commonly used Chinese characters in the People's Republic of China. They are also used in Singapore.

Currently GB 18030-2005 is officially mandatory in the People's Republic of China, but GB 2312-80 is still used in some fields.

Mandatory standards are prefixed with GB, recommended standards with GB/T, and national standardization guidance technical documents with GB/Z.

GB2312 (also known as GB0): the first national coded character set for Chinese. Each Chinese character takes two bytes, and the set contains 6,763 Chinese characters.

GBK: The Chinese language is vast, and 6,763 characters cannot cover all the characters people need. GBK therefore extends the set to 21,003 Chinese characters while staying compatible with GB2312 and ASCII.

GB18030: More than 20,000 characters still did not meet the need, and two bytes can distinguish at most 65,536 values. GB18030 therefore uses four bytes for the additional characters, extending GBK to 70,244 Chinese characters.
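The relationship between the three GB encodings can be seen with Python's built-in codecs `gb2312`, `gbk`, and `gb18030`. The sketch below assumes two sample characters chosen for illustration: "中" (U+4E2D), which is in GB2312, and U+20000, a rare character outside GBK. The helper `try_encode` is hypothetical.

```python
def try_encode(text: str, encoding: str):
    """Return the encoded bytes, or None if the encoding cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

common = "中"          # U+4E2D, a common character present in GB2312
rare = chr(0x20000)    # U+20000, a CJK Extension B character outside GBK

for name in ("gb2312", "gbk", "gb18030"):
    # The common character is stored as the same two bytes in all three;
    # the rare one can only be stored by GB18030, as four bytes.
    print(name, try_encode(common, name), try_encode(rare, name))
```

The ASCII range is kept as single bytes in all three encodings, so `try_encode("A", "gbk")` gives the same byte as ASCII would.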

Unicode

Unicode, also called universal code, international code, or unified code, is an industry standard in computing. It organizes and encodes most of the world's writing systems so that computers can present and process text more easily.

Different countries and regions have their own languages and encodings, and a file written in one encoding and opened in another shows up as garbled text. To solve this problem, the ambitious Unicode was born, aiming to unify all the symbols in the world.

Unicode keeps expanding. The latest version at the time of writing, 13.0.0, was released in March 2020 and contains more than 140,000 characters.

There is a catch, however: Unicode is only a character set. It assigns a binary code (a code point) to each symbol but does not specify how that code should be stored. This raises two problems:

1) How can Unicode be told apart from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols?

2) We already know that a single byte is enough for an English letter. If Unicode required three or four bytes for every symbol, each English letter would have to be padded with two or three zero bytes. That is a huge waste of storage: a text file would grow to three or four times its size, which is unacceptable.

To solve the storage problem, a unified storage format is needed, and UTF-8, described next, is one such format.
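The storage waste is easy to measure. The sketch below uses Python's `utf-32-be` codec as a stand-in for "a fixed four bytes per symbol"; the sample text is arbitrary.

```python
text = "hello"

fixed = text.encode("utf-32-be")   # fixed width: 4 bytes per code point, no byte order mark
variable = text.encode("utf-8")    # variable width: 1 byte per English letter

print(fixed.hex(" ", 4))           # each letter is padded with three zero bytes
print(len(fixed), len(variable))   # total size of the fixed and the variable form
```

For pure English text the fixed-width form is four times the size of the single-byte form.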

UTF-8

UTF (Unicode Transformation Format)

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode and also a prefix code. It encodes every valid code point in the Unicode character set in one to four bytes and is part of the Unicode standard.

UTF-8 is one of the most widely used implementations of Unicode on the Internet. Its main feature is variable-length encoding: a symbol takes one to four bytes depending on its code point.

The UTF-8 encoding rules are simple. There are only two:

1) For a single-byte symbol, the first bit of the byte is 0 and the remaining 7 bits are the symbol's Unicode code. For English letters, UTF-8 is therefore identical to ASCII.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are 1 and bit n + 1 is 0; the first two bits of each following byte are 10. All remaining bits hold the symbol's Unicode code, filled in from the last bit backwards, with any unused leading bits set to 0.

Code point range (hex)	UTF-8 byte pattern	Number of code points
000000 - 00007F	0xxxxxxx (00-7F)	128
000080 - 0007FF	110xxxxx (C2-DF) 10yyyyyy (80-BF)	1920
000800 - 00D7FF, 00E000 - 00FFFF	1110xxxx (E0-EF) 10yyyyyy 10zzzzzz	61440
010000 - 10FFFF	11110www (F0-F4) 10xxxxxx 10yyyyyy 10zzzzzz	1048576
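The table translates almost line for line into code. The following is a minimal sketch of an encoder, not a replacement for the built-in `str.encode`; the function name `utf8_encode` is made up for illustration.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by filling the bit templates from the table."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        # Outside the Unicode range, or in the gap D800-DFFF that the table skips.
        raise ValueError("not an encodable Unicode code point")
    if code_point <= 0x7F:
        return bytes([code_point])                         # 0xxxxxxx
    if code_point <= 0x7FF:
        return bytes([0xC0 | (code_point >> 6),            # 110xxxxx
                      0x80 | (code_point & 0x3F)])         # 10yyyyyy
    if code_point <= 0xFFFF:
        return bytes([0xE0 | (code_point >> 12),           # 1110xxxx
                      0x80 | ((code_point >> 6) & 0x3F),   # 10yyyyyy
                      0x80 | (code_point & 0x3F)])         # 10zzzzzz
    return bytes([0xF0 | (code_point >> 18),               # 11110www
                  0x80 | ((code_point >> 12) & 0x3F),      # 10xxxxxx
                  0x80 | ((code_point >> 6) & 0x3F),       # 10yyyyyy
                  0x80 | (code_point & 0x3F)])             # 10zzzzzz

# Compare against Python's own encoder for one sample from each row of the table.
for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    print(hex(cp), utf8_encode(cp).hex(" "), utf8_encode(cp) == chr(cp).encode("utf-8"))
```

Each shift by 6 peels off the next group of six payload bits, which is what "fill from back to front" means in practice.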

Reading UTF-8 is fairly easy once you understand the rules and the table above. If the first bit of a byte is 0, that byte is a complete single-byte character. If the first bit is 1, the number of consecutive 1s at the top of the byte tells you how many bytes the character occupies.
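That reading rule is how a program knows where one character ends and the next begins. A rough sketch, with hypothetical helper names `utf8_sequence_length` and `split_characters`, and without the validation a real decoder performs on the following bytes:

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 character occupies, judged by its first byte."""
    if (first_byte & 0x80) == 0x00:    # 0xxxxxxx
        return 1
    if (first_byte & 0xE0) == 0xC0:    # 110xxxxx
        return 2
    if (first_byte & 0xF0) == 0xE0:    # 1110xxxx
        return 3
    if (first_byte & 0xF8) == 0xF0:    # 11110xxx
        return 4
    raise ValueError("continuation byte (10xxxxxx) or invalid first byte")

def split_characters(data: bytes) -> list:
    """Split well-formed UTF-8 bytes into one chunk of bytes per character."""
    chunks, i = [], 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks

sample = ("A中" + chr(0x1F600)).encode("utf-8")
print([chunk.hex(" ") for chunk in split_characters(sample)])
```

The sample mixes a one-byte, a three-byte, and a four-byte character, and the first byte of each is enough to find the boundaries.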

Here’s an example:

The letter "A" is 01000001 in ASCII and 00000000 01000001 in Unicode, which is 41 in hexadecimal. According to the table above it falls in the first range, so its UTF-8 encoding is the same single byte, 01000001.

The Chinese character "中" is 01001110 00101101 in Unicode, which is 4E2D in hexadecimal and 20013 in decimal. It falls in the third range, so we fill its bits from back to front into the template 1110xxxx 10yyyyyy 10zzzzzz, which gives 11100100 10111000 10101101 (E4 B8 AD).
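The second example can be double-checked with Python's built-in encoder. The snippet assumes the character in question is "中", which is the character whose Unicode binary value is 01001110 00101101.

```python
ch = "中"
code_point = ord(ch)

print(hex(code_point), code_point)     # → 0x4e2d 20013
print(format(code_point, "016b"))      # → 0100111000101101

# Show each byte of the UTF-8 encoding as 8 bits.
bits = " ".join(format(b, "08b") for b in ch.encode("utf-8"))
print(bits)                            # → 11100100 10111000 10101101
```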

Of course, there are other implementations such as UTF-16 (represented by 2 or 4 bytes) and UTF-32 (represented by 4 bytes), but I won’t go into detail here.
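For a quick feel of the difference, the byte counts of the three formats can be compared directly. This is only a sketch; `byte_lengths` is a hypothetical helper, and the big-endian codecs are used so that no byte order mark is added to the count.

```python
def byte_lengths(ch: str) -> dict:
    """Return the number of bytes one character takes in each UTF format."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }

for ch in ("A", "中", chr(0x1F600)):
    print(hex(ord(ch)), byte_lengths(ch))
```

Characters up to FFFF take 2 bytes in UTF-16 and those above take 4, while UTF-32 always takes 4.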


Ordinary changes will change the ordinary.

I am Fat Hao, a low-key young man on the Internet.
