
UTF-8, UTF-16, GB2312, ASCII: we use these every day, so why are there so many encodings? Why does an English letter take one byte while a Chinese character takes two or more? Does the same character produce the same bytes under different encodings? And why does text sometimes come out garbled?

If you don't understand these things, you will run into plenty of pitfalls when manipulating strings.

Basic concepts

Char is short for character: the letters, punctuation marks, and other symbols we use in writing.

Charset is short for character set; Unicode and ASCII are both character sets. The characters in a set are converted into binary codes that computers can recognize and store, and the rules for that conversion are the character encoding.

A character set can have more than one encoding: UTF-8, UTF-16, and UTF-32 are all encodings of the Unicode character set. In the early days there was a one-to-one correspondence between character sets and encodings, but as programming languages and platforms developed, multiple encodings appeared for the same character set to suit different development and application scenarios:

For example, the HTML tag <meta charset="utf-8"> and the XML declaration attribute encoding="UTF-8" both name the same encoding of the same character set.
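
To make the distinction concrete, here is a minimal Python sketch (the sample character 汉 is my own choice): the same character from the same character set yields different bytes under different encodings.

```python
# Same character set (Unicode), three different encodings,
# three different byte sequences.
s = "汉"                          # U+6C49
print(s.encode("utf-8"))         # b'\xe6\xb1\x89'
print(s.encode("utf-16-le"))     # b'Il' (bytes 0x49 0x6C)
print(s.encode("gb2312"))        # b'\xba\xba'
```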

The riddle of garbled text

Garbled text appears when the bytes of one encoding are interpreted with a different, incompatible encoding or character set:

  • ASCII, an early character set and encoding, supports only Latin letters and common symbols. It uses 1 byte (8 bits) per character, which allows 256 values, but only 128 are actually defined, leaving room for extensions; the most common extension today is ISO-8859-1.

  • China's GB2312 encoding was created so that computers could input and output Chinese characters; it maps each Chinese character to 2 bytes and is backward compatible with ASCII. But the world has far too many writing systems, and when two encoding systems are incompatible they talk past each other: characters encoded under system A decode to the wrong characters under system B, producing "garbled text" (see the sketch after this list).
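
A minimal Python sketch of how this happens, assuming GB2312-encoded bytes are mistakenly decoded as ISO-8859-1 (Python's "latin-1" codec):

```python
# Mojibake in miniature: encode with one system, decode with another.
gb_bytes = "你好".encode("gb2312")  # b'\xc4\xe3\xba\xc3'
print(gb_bytes.decode("latin-1"))   # ÄãºÃ   <- wrong decoder: garbled
print(gb_bytes.decode("gb2312"))    # 你好   <- matching decoder: correct
```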

Unicode and code points

To resolve these incompatibilities and "unify" the world's characters, the Unicode character set was created (developed by the Unicode Consortium, kept in sync with ISO/IEC 10646). It covers most of the world's writing systems, effectively forming a union of the existing character sets.

As the Unicode Consortium's website puts it, the goal is to provide a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

This unique number is the core of Unicode; it is called a "code point":

  • Code points range from U+0000 to U+10FFFF, where "U+" stands for Unicode; the range supports 1,114,112 code points.
  • To manage Unicode more easily, the code points are divided evenly into 17 groups of 65,536. These 17 groups are called "planes"; the first is Plane 0.
  • Plane 0 is also called the BMP (Basic Multilingual Plane), covering code points U+0000 to U+FFFF; it contains most of the characters used in everyday languages. The remaining 16 planes are called Supplementary Planes, and include the SMP (Supplementary Multilingual Plane) and the SIP (Supplementary Ideographic Plane). See the sketch after this list.
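
A quick Python sketch of code points and planes; the sample characters are my own picks:

```python
# ord() returns a character's code point; chr() is the inverse.
print(f"U+{ord('你'):04X}")   # U+4F60 -> inside the BMP (Plane 0)
emoji = chr(0x1F600)          # 😀
# Each plane holds 0x10000 (65,536) code points, so integer
# division by 0x10000 gives the plane number.
print(ord(emoji) // 0x10000)  # 1 -> Plane 1, the SMP
```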

Encoding methods

An encoding method is a set of rules for converting code points into binary.

The Unicode character set is encoded with the Unicode Transformation Formats (UTF): UTF-8, UTF-16, and UTF-32. All three can represent every code point from U+0000 to U+10FFFF. Let's look at how code points map to bytes under each of them.

UTF-8: the most common encoding, variable length, 1 to 4 bytes per character. The conversion splits the binary digits of the code point and fills them into byte templates chosen by the code point's range:

| Code point range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
| --- | --- | --- | --- | --- |
| U+0000 – U+007F | 0xxxxxxx | | | |
| U+0080 – U+07FF | 110xxxxx | 10xxxxxx | | |
| U+0800 – U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| U+10000 – U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
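
A small Python sketch showing the four length classes in practice (sample characters are my own picks; bytes.hex with a separator needs Python 3.8+):

```python
# One character from each UTF-8 length class: 1, 2, 3, and 4 bytes.
for ch in ("A", "é", "你", "😀"):
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded.hex(" "))
# A 1 41
# é 2 c3 a9
# 你 3 e4 bd a0
# 😀 4 f0 9f 98 80
```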

UTF-16: a mix of fixed-length and variable-length behavior. Characters in the basic plane (BMP) occupy two bytes; characters in the supplementary planes occupy four bytes, encoded as a pair of two-byte "surrogates".

UTF-32: fixed length, always 4 bytes. The conversion is the simplest: zero-pad the code point on the left until it fills 4 bytes. Its biggest drawback is the wasted space.
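
To compare the three encodings side by side, a minimal Python sketch (sample characters are my own picks; the -be suffixes just pin the byte order so no BOM is written):

```python
# Bytes per character under each UTF encoding.
for ch in ("A", "你", "😀"):
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        data = ch.encode(enc)
        print(f"{ch} {enc:9} {len(data)} bytes: {data.hex(' ')}")
# 'A' is 1 / 2 / 4 bytes; '你' is 3 / 2 / 4; '😀' is 4 / 4 / 4
# (the UTF-16 result for '😀' is the surrogate pair d8 3d de 00)
```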

For example, take the character '你' ("you"), whose Unicode code point is U+4F60. The binary form of 0x4F60 is 0100111101100000. Since U+4F60 falls in the range U+0800 – U+FFFF, we split its 16 bits 4-6-6 and fill them into the three-byte template:

1110(0100) 10(111101) 10(100000) = 11100100 10111101 10100000 = E4 BD A0

0x4F60 is a 16-bit number, so it needs 3 bytes in UTF-8.
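
A minimal sketch that reproduces the 4-6-6 split by hand and checks the result against Python's built-in encoder (the helper utf8_3byte is hypothetical, written just for this illustration):

```python
# The 3-byte UTF-8 rule (U+0800..U+FFFF), written out by hand.
def utf8_3byte(cp: int) -> bytes:
    assert 0x0800 <= cp <= 0xFFFF
    b1 = 0b11100000 | (cp >> 12)          # 1110 + top 4 bits
    b2 = 0b10000000 | ((cp >> 6) & 0x3F)  # 10 + middle 6 bits
    b3 = 0b10000000 | (cp & 0x3F)         # 10 + low 6 bits
    return bytes([b1, b2, b3])

print(utf8_3byte(0x4F60).hex(" ").upper())       # E4 BD A0
assert utf8_3byte(0x4F60) == "你".encode("utf-8")
```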

At this point, we can see why different characters occupy different numbers of bytes.

That's all; thank you for reading! If anything above is inaccurate or wrong, please leave a comment and I will correct it right away.

Writing this summary was not easy; please do not repost it without permission.

Fellow developers are welcome to get in touch: WeChat 1296386616.

References:

"What is Unicode?", Unicode Consortium, www.unicode.org/standard/Wh…

"Character Sets and Encodings (4): Unicode", Xiao Guodong, xiaogd.net/md/%e5%ad%9…

"Thoroughly Understand Unicode", Li Yucang, liyucang-git.github.io/2019/06/17/…