Introduction

As we all know, the computer was invented by Americans. Once it was invented, a problem inevitably followed: how should each country store the characters of its own language on a computer?

The origin of ASCII

At the time, the Americans thought this was simple. They listed the characters they needed one by one and gave each character a unique number, its code point, and the corresponding character was then stored as binary information.

(The original article shows the full ASCII table here; figure omitted.)

If that chart is not very intuitive, here is a simplified view of the same information:

Code point   Character                 Binary
0            Null (NUL)                0000 0000
1            Start of Heading (SOH)    0000 0001
2            Start of Text (STX)       0000 0010
...          ...                       ...
125          }                         0111 1101
126          ~                         0111 1110
127          Delete (DEL)              0111 1111

These characters were basically enough for Americans. Later, Europeans began to use computers and found that their own characters were missing, so they extended ASCII with the values 128 to 255 (this part is called the extended ASCII character set).

The origin of GB2312

Later, we Chinese also began to use computers and wanted to display Chinese characters. We have far more characters than the extended ASCII character set can hold, so we needed our own character set; but we could not break compatibility with other people's ASCII character set, so we built on top of ASCII.

Our first Chinese character set, GB2312, uses zone/position (区位) management: 94 zones, each containing 94 positions, for a total of 94 × 94 = 8,836 code points.

Zones 01-09 contain 682 non-Chinese characters (punctuation, symbols, and so on)

Zones 10-15 are blank and unused

Zones 16-55 contain 3,755 first-level Chinese characters, sorted by pinyin

Zones 56-87 contain 3,008 second-level Chinese characters, sorted by radical and stroke

Zones 88-94 are blank and unused

The full specification is linked in the original article; here is a simple example.

Below is the zone 17 section of the GB2312 code table; the character 本 ("běn") sits at position 30 of zone 17. The characters are arranged zone by zone in order, so every character now has a code point, but we still have no rule for how to store it. The zone/position code of 本 is 17-30, which we then convert to hexadecimal. (You can count along zone 17 in the table and confirm that 本 is indeed at position 30.) We can also lay it out as follows.

Double-byte encoding: GB2312 specifies that each character is represented by two bytes. The first byte is the "high byte", corresponding to the 94 zones; the second byte is the "low byte", corresponding to the 94 positions. So the zone/position code range is 0101-9494. Adding 0xA0 to the zone number and to the position number gives the GB2312 encoding. For example, for the last code point, 94-94, the zone and position in hexadecimal are both 0x5E, and 0x5E + 0xA0 = 0xFE, so its GB2312 encoding is FEFE.

GB2312 encoding range: A1A1-FEFE, of which the range for Chinese characters is B0A1-F7FE. The first byte is 0xB0-0xF7 (corresponding to zones 16-87) and the second byte is 0xA1-0xFE (corresponding to positions 01-94).

Why start at A1? Because 0xA0 is 160, so both bytes are always above 127; this keeps GB2312 bytes outside the ASCII range and avoids any conflict with ASCII.

(Zone 17 of the GB2312 code table is listed here in the original article: the 94 positions of the zone with their characters; 本 appears at position 30.)
// Convert zone/position 17-30 to hexadecimal. From the table above we can work out the encoding of 本:
17 in hexadecimal is 0x11;  0x11 + 0xA0 = 0xB1
30 in hexadecimal is 0x1E;  0x1E + 0xA0 = 0xBE
GB2312 encoding: 0xB1 0xBE
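To make that arithmetic repeatable, here is a minimal JavaScript sketch of the zone/position-to-GB2312 conversion described above; the function name gb2312FromZonePosition is made up for illustration, and it only reproduces the +0xA0 arithmetic rather than looking up a real GB2312 table.

```javascript
// Hypothetical helper: convert a GB2312 zone (区) and position (位)
// into the two GB2312 bytes by adding 0xA0 to each.
function gb2312FromZonePosition(zone, position) {
  const high = zone + 0xA0;     // zone 17 -> 0x11 + 0xA0 = 0xB1
  const low = position + 0xA0;  // position 30 -> 0x1E + 0xA0 = 0xBE
  return [high, low].map(b => "0x" + b.toString(16).toUpperCase());
}

console.log(gb2312FromZonePosition(17, 30)); // [ "0xB1", "0xBE" ]  -> 本
```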

Later, as more Chinese characters were needed, GBK was created as an extension of GB2312, and GB18030 added still more characters, including those of ethnic minority scripts.

The origin of Unicode

As more countries began using computers, each with its own character set, the number of character sets kept growing. ISO decided to collect all the characters in the world into one set and number them, so that a single character set would suffice. Thus came the Unicode character set, also known as UCS (Universal Character Set).

Initially, the UCS-2 form was used, meaning two bytes per character. But two bytes are only 16 bits, and 2^16 = 65,536 code points cannot represent all the characters in the world.

Code point   Meaning               Encoding
0x0000       The first character   0x0000
0xFFFF       The last character    0xFFFF

So UCS-4 followed, using four bytes per character; that gives 32 bits, which is more than enough to represent all the characters in the world.

Code point     Meaning               Encoding
0x00000000     The first character   0x00000000
0xFFFFFFFF     The last character    0xFFFFFFFF

2^32 is about 4.3 billion code points, but because four bytes per character takes up a lot of space, UCS-4 was not widely adopted for a long time. Then the Internet took off, UTF-8 appeared, and Unicode was gradually accepted.

The original article shows another picture here, in which ISO lists all the characters in the world and assigns each one a number (the original image is linked there). What we call an encoding is the way a character is stored. For example, ASCII can store a character in one byte; that is one encoding method. GB2312 stores a Chinese character in two bytes; that is another. So what are the encodings of Unicode?

There are three main encoding methods of Unicode: UTF-8, UTF-16, and UTF-32.

The Unicode encoding space is divided into 17 planes, each containing 2^16 (65,536) code points. The code points of plane xx range from U+xx0000 to U+xxFFFF, where xx is a hexadecimal value from 0x00 to 0x10; that gives 17 planes in total.

UTF-8

UTF-8 is a variable-length character encoding for Unicode. It encodes every valid code point in the Unicode character set using one to four bytes. UTF-8 divides the Unicode code points into four ranges:

Bits in code point   First code point   Last code point       Bytes   Byte 1     Byte 2     Byte 3     Byte 4
7                    U+0000             U+007F (127)          1       0xxxxxxx
11                   U+0080             U+07FF (2047)         2       110xxxxx   10xxxxxx
16                   U+0800             U+FFFF (65535)        3       1110xxxx   10xxxxxx   10xxxxxx
21                   U+10000            U+10FFFF (1114111)    4       11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

UTF-8 encoding rules:

(1) Symbols in the ASCII range use single-byte encoding, with the same value as in ASCII (see the Unicode code chart U0000.pdf). ASCII values range from 0 to 0x7F, so the first bit of every single-byte encoding is 0 (this is exactly what distinguishes single-byte from multi-byte encodings).

(2) Other characters use multi-byte encoding (say N bytes). In the first byte, the first N bits are 1 and bit N+1 is 0; in each of the remaining N−1 bytes, the first two bits are 10. All remaining bits across the N bytes store the Unicode code point value. A minimal code sketch of these rules is shown below.
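Here is a minimal JavaScript sketch that applies rules (1) and (2) to encode a single code point into UTF-8 bytes by hand; encodeUtf8 is a made-up name, and the sketch only handles the four ranges in the table above (it does not validate surrogates).

```javascript
// Minimal sketch: encode one Unicode code point into UTF-8 bytes by hand.
function encodeUtf8(codePoint) {
  if (codePoint <= 0x7F) {            // 1 byte:  0xxxxxxx
    return [codePoint];
  } else if (codePoint <= 0x7FF) {    // 2 bytes: 110xxxxx 10xxxxxx
    return [0xC0 | (codePoint >> 6), 0x80 | (codePoint & 0x3F)];
  } else if (codePoint <= 0xFFFF) {   // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return [
      0xE0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3F),
      0x80 | (codePoint & 0x3F),
    ];
  } else {                            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [
      0xF0 | (codePoint >> 18),
      0x80 | ((codePoint >> 12) & 0x3F),
      0x80 | ((codePoint >> 6) & 0x3F),
      0x80 | (codePoint & 0x3F),
    ];
  }
}

console.log(encodeUtf8(0x6211).map(b => b.toString(16))); // [ "e6", "88", "91" ]
```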

A simple exercise: the Unicode code point of the character 我 ("I") is U+6211. What is its UTF-8 encoding? Let's work it out.

0x6211 falls into the U+0800 to U+FFFF (65535) row of the table above,

so the corresponding UTF-8 template is 1110xxxx 10xxxxxx 10xxxxxx.

A character in this range is therefore represented by three bytes. The three-byte template provides 16 x positions in total, which will hold the 16 bits of 0x6211; any unused high-order positions are filled with 0.

In case this is unclear, a short explanation: one hexadecimal digit is at most F, which is 15 and corresponds to the four binary bits 1111, so each hex digit maps to four bits. That should make the next step easier to follow.

The binary for 0x6211 is 0110 0010 0001 0001. We fill these bits into the x positions of the template 1110xxxx 10xxxxxx 10xxxxxx, starting from the least significant end.

11100110 10001000 10010001
Convert to hexadecimal: 11100110 (0xE6)  10001000 (0x88)  10010001 (0x91)
=> 0xE6 0x88 0x91

That is how UTF-8 encoding works; the bytes stored in memory for 我 are 0xE6 0x88 0x91.
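If you want to double-check the result without doing the bit manipulation yourself, modern browsers and Node.js provide TextEncoder, which always encodes to UTF-8:

```javascript
// TextEncoder always produces UTF-8 bytes.
const bytes = new TextEncoder().encode("我");
console.log(bytes); // Uint8Array(3) [ 230, 136, 145 ]  -> 0xE6 0x88 0x91
```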

At this point, I'm sure you understand what UTF-8 encoding is.

UTF-16

The Unicode encoding space ranges from U+0000 to U+10FFFF, with a total of 1,112,064 code points available to map characters. It is divided into 17 planes, each containing 2^16 (65,536) code points. The code points of plane xx range from U+xx0000 to U+xxFFFF, where xx is a hexadecimal value from 0x00 to 0x10; that gives 17 planes. Within the basic multilingual plane, the range from U+D800 to U+DFFF is permanently reserved and never mapped to Unicode characters.

From U+xx0000 to U+xxFFFF, xx indicates the plane: 0x00 is 0 in decimal and 0x10 is 16, so xx takes 17 values (0 to 16), one per plane. Within a plane, the code points run from U+0000 to U+FFFF, i.e. 2^16 values, so in total there are 17 × 2^16 = 1,114,112 code points. The code points from U+D800 to U+DFFF are never assigned to characters; that range contains 0xDFFF − 0xD800 + 1 = 2048 code points. Subtracting them, 1,114,112 − 2048 = 1,112,064 code points remain for characters, which matches the figure on Wikipedia's UTF-16 page (zh.wikipedia.org/wiki/UTF-16).

So Unicode never assigns characters to the code points between U+D800 and U+DFFF; within the basic plane the usable ranges are U+0000 to U+D7FF and U+E000 to U+FFFF.

That covers the background; now let's go a little deeper and divide Unicode into its planes.

Plane                                      Code point start   Code point end   Binary
First plane (plane 0, basic plane)         U+0000             U+D7FF           0000 0000 0000 0000 - 1101 0111 1111 1111
                                           U+E000             U+FFFF           1110 0000 0000 0000 - 1111 1111 1111 1111
Second plane (plane 1, auxiliary plane)    U+10000            U+1FFFF          0001 0000 0000 0000 0000 - 0001 1111 1111 1111 1111
Third plane (plane 2, auxiliary plane)     U+20000            U+2FFFF
...
Seventeenth plane (plane 16, auxiliary)    U+100000           U+10FFFF

Hopefully this table makes things a bit clearer; if anything is wrong or unclear, please leave a comment. UTF-16 uses two bytes for code points in the first plane (U+0000 to U+FFFF) and four bytes for code points in the other planes (U+010000 to U+10FFFF).

Code points in the auxiliary planes are encoded in UTF-16 as a pair of 16-bit code units (32 bits, four bytes in total), called a surrogate pair.

A code unit is the shortest combination of bits used to represent a unit of encoded text. For UTF-8, a code unit is 8 bits long; for UTF-16, 16 bits; for UTF-32, 32 bits. "Code value" is an outdated term for the same concept.

In UTF-16, each code unit is a 16-bit binary number.

The specific steps are as follows:

  1. Subtract 0x10000 from the code point; the result is a 20-bit value in the range 0x00000 to 0xFFFFF.

Why subtract 0x10000? For a Unicode code point, (code point − 0x10000) is either greater than or equal to 0, or less than 0: if it is ≥ 0, the code point lies in a supplementary plane and needs four bytes; if it is < 0, the code point lies in the basic plane and is stored directly as one 16-bit code unit.

The largest Unicode code point is 0x10FFFF; after subtracting 0x10000, the resulting value U' has a maximum of 0xFFFFF, so it can always be represented in 20 binary bits.

0x10FFFF − 0x10000 = 0xFFFFF (binary 1111 1111 1111 1111 1111), and the range 0x00000 to 0xFFFFF contains 2^20 = 1,048,576 values, so 20 bits are enough to represent it. That covers the first step.

Split the 20 bits into a high 10 bits and a low 10 bits. The high 10 bits are mapped into 0xD800 to 0xDBFF (a space of 2^10 values), and the low 10 bits are mapped into 0xDC00 to 0xDFFF (also 2^10 values).

  2. The high 10-bit value (range 0 to 0x3FF) is added to 0xD800 to get the first code unit (the high surrogate), which lies in the range 0xD800 to 0xDBFF. Because high surrogates take smaller values than low surrogates, the Unicode standard also calls them lead surrogates to avoid confusion.

Why 0 to 0x3FF? Because the maximum value of a 10-bit binary number (11 1111 1111) is 0x3FF in hexadecimal.

Why add 0xD800 to the high part? Because the range 0xD800 to 0xDFFF is reserved: these code points do not correspond to any characters, so they can be used to map the characters of the supplementary planes. As explained above, 10 bits can hold values from 0 to 0x3FF, so starting at 0xD800 and adding at most 0x3FF gives the high-surrogate range 0xD800 to 0xDBFF.

  3. The low 10-bit value (also in the range 0 to 0x3FF) is added to 0xDC00 to get the second code unit (the low surrogate), which lies in the range 0xDC00 to 0xDFFF. Because low surrogates take larger values than high surrogates, the Unicode standard also calls them trail surrogates to avoid confusion.

Why is 0xDC00 to 0xDFFF the low-surrogate range?

A 10-bit value runs from 0 to 2^10 − 1 = 1023 (0x3FF), i.e. 1024 values. The high surrogates already occupy 0xD800 to 0xDBFF (1024 values), and the low surrogates need another 1024 values of their own, so they start right after the high range, at 0xDBFF + 1 = 0xDC00, and end at 0xDC00 + 0x3FF = 0xDFFF. I hope that is clear.

In this way, a character from an auxiliary plane is split into two code units that both lie in the reserved surrogate area of the basic plane, for a total of four bytes; every code point beyond the basic plane is encoded with such a high/low surrogate pair. A minimal code sketch of these three steps is shown below, followed by a worked example.
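Here is a minimal JavaScript sketch of those three steps; toSurrogatePair is a made-up name, and the sketch assumes the code point is above U+FFFF.

```javascript
// Minimal sketch: split a supplementary-plane code point into a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const u = codePoint - 0x10000;    // step 1: a 20-bit value
  const high = 0xD800 + (u >> 10);  // step 2: high 10 bits -> lead (high) surrogate
  const low = 0xDC00 + (u & 0x3FF); // step 3: low 10 bits  -> trail (low) surrogate
  return [high, low].map(v => "0x" + v.toString(16).toUpperCase());
}

console.log(toSurrogatePair(0x10000)); // [ "0xD800", "0xDC00" ]
```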

  • Example

Let's pick the character U+1141B from one of the other Unicode planes; its HTML character reference is &#x1141B; (displayed as 𑐛).

  1. First step: 0x1141B − 0x10000 = 0x0141B, which in binary is 0000 0001 0100 0001 1011
  2. Split high bits and low bits

High 10 bits: 00 0000 0101 = 0x0005

Low 10 bits: 00 0001 1011 = 0x001B

  3. Add the corresponding surrogate base values

0xD800 + 0x0005 = 0xD805

0xDC00 + 0x001B = 0xDC1B

So the corresponding UTF-16 code units are 0xD805 0xDC1B; in binary, 11011000 00000101 11011100 00011011.
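You can verify this result directly in JavaScript: since strings are stored as UTF-16 code units, charCodeAt exposes the two surrogates, while codePointAt reassembles the original code point.

```javascript
const ch = String.fromCodePoint(0x1141B);     // "𑐛"
console.log(ch.charCodeAt(0).toString(16));   // "d805"  (lead surrogate)
console.log(ch.charCodeAt(1).toString(16));   // "dc1b"  (trail surrogate)
console.log(ch.codePointAt(0).toString(16));  // "1141b"
```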

This example brings up one more topic: byte order. Different systems store the two bytes of a 16-bit code unit in different orders. In big-endian order the high byte comes first, so 0xD805 0xDC1B is stored as D8 05 DC 1B; in little-endian order the low byte comes first, so it is stored as 05 D8 1B DC. These are the big-endian and little-endian byte orders. Let's put this into a table.

Encoding name   Byte order       Bytes
UTF-16 LE       little-endian    05 D8 1B DC
UTF-16 BE       big-endian       D8 05 DC 1B
The BOM (byte order mark) is used to indicate byte order. It consists of the character U+FEFF placed at the beginning of the data stream, where it serves as a signature for the byte order and encoding form, mainly for plain-text streams. A BOM at the start of the text is useful precisely because it is otherwise unclear whether the text is big-endian or little-endian; it acts as a hint.
If a BOM is included, FF FE at the start of the text indicates little-endian and FE FF indicates big-endian.
Encoding name   Byte order       BOM + bytes
UTF-16 LE       little-endian    FF FE | 05 D8 1B DC
UTF-16 BE       big-endian       FE FF | D8 05 DC 1B
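As a small illustration of how a decoder might use the BOM, here is a sketch that inspects the first two bytes of a buffer; sniffUtf16ByteOrder is an invented name, and real decoders of course do more than this.

```javascript
// Hypothetical sketch: look for a UTF-16 BOM in the first two bytes.
function sniffUtf16ByteOrder(bytes) {
  if (bytes[0] === 0xFF && bytes[1] === 0xFE) return "UTF-16 LE";
  if (bytes[0] === 0xFE && bytes[1] === 0xFF) return "UTF-16 BE";
  return "unknown (no BOM)";
}

console.log(sniffUtf16ByteOrder(new Uint8Array([0xFF, 0xFE, 0x05, 0xD8]))); // "UTF-16 LE"
```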

UTF-32

UTF-32 is a 32-bit Unicode transformation format. Each Unicode code point is encoded in 32 bits, and each 32-bit value in UTF-32 represents exactly one code point, with the same numeric value as that code point.

  • Advantages of UTF-32
  1. Text can be indexed directly by Unicode code point, since every code point has the same fixed width.
  • Disadvantages of UTF-32
  1. It uses four bytes for every code point, which wastes a lot of space.

Like UTF-16, UTF-32 also comes in big-endian and little-endian forms; refer to the UTF-16 section above, so I won't repeat the details here.
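To show how direct the mapping is, here is a rough JavaScript sketch that encodes a string as UTF-32 BE, four bytes per code point; encodeUtf32BE is a made-up name (JavaScript has no built-in UTF-32 encoder).

```javascript
// Minimal sketch: encode a string as UTF-32 BE, four bytes per code point.
function encodeUtf32BE(str) {
  const codePoints = Array.from(str, c => c.codePointAt(0)); // iterates by code point
  const bytes = new Uint8Array(codePoints.length * 4);
  const view = new DataView(bytes.buffer);
  codePoints.forEach((cp, i) => view.setUint32(i * 4, cp, false)); // false = big-endian
  return bytes;
}

console.log(encodeUtf32BE("我")); // Uint8Array(4) [ 0, 0, 98, 17 ]  -> 0x00 0x00 0x62 0x11
```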

JavaScript

The JavaScript language uses the Unicode character set and was designed around the UCS-2 approach, with two bytes per character (in effect, strings are sequences of UTF-16 code units). So in JavaScript, a character that needs four bytes is treated as two two-byte units, and the string functions are affected by this. Take U+10000 as an example. I will continue to add to this section later.

"𐀀".length / / 2

"0x10000"= = ="𐀀" //false

"𐀀".charAt(0) // \ud800
"𐀀".charAt(1) // \udc00

"𐀀".charCodeAt(0) / / 55296

"𐀀"= = ='\uD800\uDC00' // true

String.fromCharCode("0x10000") / / "𐀀"
Copy the code
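If you need to work code point by code point rather than code unit by code unit, the newer APIs (for...of, Array.from, codePointAt, String.fromCodePoint) iterate over whole code points:

```javascript
const s = "𐀀";                                     // U+10000
console.log(Array.from(s).length);                 // 1  (by code point, unlike s.length === 2)
console.log(s.codePointAt(0).toString(16));        // "10000"
console.log(String.fromCodePoint(0x10000) === s);  // true
```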

References

1. Unicode Frequently Asked Questions
2. Wikipedia: UTF-16
3. Wikipedia: UTF-32
4. Unicode