Summary of UTF-8, GBK, and Unicode relationships

This is the 7th day of my participation in the November Gwen Challenge. Check out the details: The last Gwen Challenge 2021

Traditional encodings

1. The first byte determines which code table within the code page is used to parse a character

①. Code page: an internal code table that maps characters to single-byte or multi-byte values

②. Code table: the mapping from each character to its specific binary value (my personal understanding)

2. Chinese character sets

①. GB2312: historically the most widely used, covering 6,763 commonly used simplified Chinese characters plus some other characters

②. GBK (the K stands for "extension", from the pinyin "kuozhan"): builds on ① by adding traditional characters and others

③. In GBK, the first byte of a two-byte character has its highest bit set to 1 and lies between 0x81 (1000 0001) and 0xFE (1111 1110)

④. Characters in both character sets are represented by 1 or 2 bytes

⑤. English characters in GBK are represented by their single-byte ASCII values
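The first-byte rule above can be sketched in Python. This is a minimal illustration, not a full GBK validator: the helper name `split_gbk` is hypothetical, and it only checks the lead-byte range described above (it does not validate the second byte).

```python
def split_gbk(data: bytes) -> list:
    """Split a GBK byte string into one bytes object per character.

    Assumption: lead bytes below 0x80 are single-byte ASCII, and lead
    bytes in 0x81-0xFE start a two-byte character. Trail bytes are not
    validated in this sketch.
    """
    chars = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            width = 1  # highest bit is 0: plain ASCII
        elif 0x81 <= first <= 0xFE:
            width = 2  # highest bit is 1: two-byte GBK character
        else:
            raise ValueError(f"invalid GBK lead byte 0x{first:02X} at offset {i}")
        chars.append(data[i:i + width])
        i += width
    return chars


encoded = "A中文".encode("gbk")
pieces = split_gbk(encoded)
print([p.hex() for p in pieces])  # ['41', 'd6d0', 'cec4']
```

The ASCII letter keeps its single-byte value, while each Chinese character occupies two bytes whose first byte falls in the 0x81-0xFE range.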

3. ASCII has 128 characters in total; each is encoded in a single byte, using only the low 7 bits

4. "ANSI" encoding takes its name from the American National Standards Institute (ANSI), but in practice it usually means the platform's default encoding, such as ISO-8859-1 (a single-byte encoding with 256 code values) on English operating systems and GBK on Chinese systems

5. Garbled characters (mojibake) appear when a byte stream is decoded with a different character encoding than the one used to encode it
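A small Python sketch of how garbled text arises: the same bytes are decoded once with the wrong encoding and once with the right one. The sample string and variable names are illustrative assumptions; `errors="replace"` is used so that any byte pair that is invalid in GBK becomes a replacement character instead of raising an exception.

```python
original = "中文"

# Encode with UTF-8: each of these characters becomes three bytes.
utf8_bytes = original.encode("utf-8")

# Decode the UTF-8 bytes as if they were GBK: the byte boundaries no
# longer line up with the characters, so the result is garbled.
garbled = utf8_bytes.decode("gbk", errors="replace")

# Decode with the encoding that was actually used: the text round-trips.
restored = utf8_bytes.decode("utf-8")

print(utf8_bytes.hex(" "))
print(garbled)
print(restored)
```

The bytes themselves never change; only the encoding used to interpret them does, which is why the fix for garbled text is to decode with the encoding the data was written in.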

The Unicode character set

1. Theory

①. A large character set meant to cover the characters of all human writing systems, so that all of them can appear in a single document

②. Assigns a unique code point to each character

③. Separates the character set from the character encoding scheme; the specific encoding scheme determines the final byte stream

④. "Unicode encoding" is a general term for UTF-8 and other specific encoding schemes, not a specific encoding scheme itself
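To make the split between code point and encoding scheme concrete, here is a short Python sketch. One character has a single code point, but each encoding scheme turns it into a different byte sequence. The sample character and variable names are chosen for illustration only.

```python
char = "中"

# The character set assigns exactly one code point to the character.
code_point = ord(char)
print(f"U+{code_point:04X}")  # U+4E2D

# The encoding scheme decides the bytes that end up in the stream.
encoded = {name: char.encode(name) for name in ("utf-8", "utf-16-be", "gbk")}
for name, raw in encoded.items():
    print(f"{name:10} {raw.hex(' ')}")
# utf-8      e4 b8 ad
# utf-16-be  4e 2d
# gbk        d6 d0
```

Note that GBK is not a Unicode encoding; it is included here only to show that the same character maps to unrelated bytes under a traditional encoding.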

2. UTF-8 uses 1-4 bytes to encode each character

①. A byte whose highest bit is 0 is a single-byte ASCII character (0x00-0x7F), so all ASCII-encoded text is also valid UTF-8

②. A byte starting with 11 is the first byte of a multi-byte character, and the number of consecutive leading 1s gives the number of bytes in the character; for example, 110xxxxx is the first byte of a two-byte UTF-8 character

③. A byte starting with 10 is a continuation byte rather than a first byte; scan backward to find the first byte of the character

④. English letters are encoded as one byte, and Chinese characters usually as three bytes
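The byte patterns above can be checked with a few bit masks. The following Python sketch classifies each byte by its leading bits and scans backward from a continuation byte to the start of its character. The helper names `utf8_byte_kind` and `char_start` are hypothetical, and the sketch does not reject overlong or otherwise malformed sequences.

```python
def utf8_byte_kind(byte: int) -> str:
    """Classify one byte of a UTF-8 stream by its leading bits."""
    if (byte & 0b1000_0000) == 0b0000_0000:
        return "ascii"         # 0xxxxxxx
    if (byte & 0b1100_0000) == 0b1000_0000:
        return "continuation"  # 10xxxxxx
    if (byte & 0b1110_0000) == 0b1100_0000:
        return "lead-2"        # 110xxxxx
    if (byte & 0b1111_0000) == 0b1110_0000:
        return "lead-3"        # 1110xxxx
    if (byte & 0b1111_1000) == 0b1111_0000:
        return "lead-4"        # 11110xxx
    return "invalid"


def char_start(data: bytes, index: int) -> int:
    """Scan backward from index to the first byte of its character."""
    while index > 0 and utf8_byte_kind(data[index]) == "continuation":
        index -= 1
    return index


data = "a中".encode("utf-8")  # 61 e4 b8 ad
kinds = [utf8_byte_kind(b) for b in data]
print(kinds)   # ['ascii', 'lead-3', 'continuation', 'continuation']

start = char_start(data, 3)
print(start)   # 1
```

Because a continuation byte can never be mistaken for a first byte, a reader dropped into the middle of a UTF-8 stream can always resynchronize within a few bytes.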

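3. UTF-16: characters in the Basic Multilingual Plane (U+0000-U+FFFF) are represented by 2 bytes; all other characters take 4 bytes (a surrogate pair)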
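A quick Python check of UTF-16 widths, as a sketch with sample characters chosen for illustration. A character in the Basic Multilingual Plane takes 2 bytes, while a character above U+FFFF is stored as a 4-byte surrogate pair. The `utf-16-be` codec is used so that no byte order mark is prepended to the output.

```python
bmp_char = "中"              # U+4E2D, inside the Basic Multilingual Plane
astral_char = "\U0001F600"   # U+1F600, outside the Basic Multilingual Plane

bmp_bytes = bmp_char.encode("utf-16-be")
astral_bytes = astral_char.encode("utf-16-be")

print(len(bmp_bytes), bmp_bytes.hex())        # 2 4e2d
print(len(astral_bytes), astral_bytes.hex())  # 4 d83dde00
```

The two 16-bit units d83d and de00 are the high and low surrogates that together stand for the single code point U+1F600.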

4. BOM (Byte Order Mark): a marker placed at the start of a text to indicate the byte order of the Unicode encoding in use
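The BOM can be observed directly with Python's standard `codecs` module. This is a minimal sketch with an illustrative sample string: the `utf-8-sig` codec writes and strips the UTF-8 BOM, and the plain `utf-16` codec writes a BOM so that a reader can tell the byte order.

```python
import codecs

text = "hi"

# "utf-8-sig" prepends the UTF-8 BOM (EF BB BF) when encoding.
with_bom = text.encode("utf-8-sig")
print(with_bom.hex(" "))  # ef bb bf 68 69

# Decoding with "utf-8-sig" strips the BOM; plain "utf-8" keeps it as U+FEFF.
stripped = with_bom.decode("utf-8-sig")
kept = with_bom.decode("utf-8")

# The generic "utf-16" codec writes a BOM in the machine's native byte order.
utf16 = text.encode("utf-16")
has_utf16_bom = utf16.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```

Decoding BOM-prefixed UTF-8 with the plain `utf-8` codec leaves an invisible U+FEFF at the start of the string, which is a common source of subtle comparison bugs.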
