Introduction

Our CPU can only recognize binary zeros and ones, but how do computers present us with colorful interfaces and text?

The answer is encoding.

This article focuses on text encoding, which means establishing a correspondence between numbers and characters. For example, the Chinese character 你 ("you") corresponds to the number 20320 (U+4F60 in hexadecimal). This correspondence is not invented out of thin air; it is looked up in the Unicode character set table. The colors we see on a computer are encoded too, for example as the 24-bit RGB color used in CSS, which is why #000000 can represent a color (black) in CSS.
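As a small illustration of this mapping, the sketch below uses JavaScript's built-in `codePointAt` and `String.fromCodePoint` to move between a character and its number. The helper `parseHexColor` is a hypothetical name introduced only for this example; it is not part of any library.

```javascript
// Look up the Unicode code point of a character, and go back again.
const codePoint = "你".codePointAt(0);             // the number assigned to 你
const hex = codePoint.toString(16).toUpperCase();  // the same number in hexadecimal
const character = String.fromCodePoint(20320);     // the character assigned to 20320

// Hypothetical helper: split a 24-bit CSS color such as "#000000"
// into its red, green and blue components (8 bits each).
function parseHexColor(color) {
  const value = parseInt(color.slice(1), 16);
  return {
    red: (value >>> 16) & 0xFF,
    green: (value >>> 8) & 0xFF,
    blue: value & 0xFF,
  };
}

console.log(codePoint, hex, character);
console.log(parseHexColor("#000000"));
```

Both the text and the color are just numbers underneath; only the table used to interpret them differs.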

Tip: the escaped byte sequence for 你好 ("hello") in the computer is "\304\343\272\303" (the GBK bytes C4 E3 BA C3 written as octal escapes). We type 你好 on the keyboard -> the operating system handles it as this escaped byte sequence -> and maps it to Unicode characters.

Unicode

The original Unicode encoding was fixed-width at 16 bits: 2 bytes per character, for a total of 65,536 characters. That is clearly not enough to represent every character of every language. The Unicode 4.0 specification accounts for this and defines a set of supplementary characters beyond the 16-bit range (detailed below).

The Unicode code space runs from U+0000 to U+10FFFF. That is 1,114,112 code points, of which 1,112,064 are available to map characters once the 2,048 surrogate code points (U+D800 to U+DFFF) are excluded.

Planes

A plane is a range of code points. Unicode divides its code space into 17 planes of 65,536 (0x10000) code points each.

Plane     Range                  Name                                   Abbreviation
0         U+0000 – U+FFFF        Basic Multilingual Plane               BMP
1         U+10000 – U+1FFFF      Supplementary Multilingual Plane       SMP
2         U+20000 – U+2FFFF      Supplementary Ideographic Plane        SIP
3         U+30000 – U+3FFFF      Tertiary Ideographic Plane             TIP
4 to 13   U+40000 – U+DFFFF      (not in use)
14        U+E0000 – U+EFFFF      Supplementary Special-purpose Plane    SSP
15        U+F0000 – U+FFFFF      Private Use Area-A (reserved)          PUA-A
16        U+100000 – U+10FFFF    Private Use Area-B (reserved)          PUA-B

Plane 0 contains the characters we use most often in Chinese, English, Korean and Japanese.

Plane 1 records ancient scripts that we rarely use, such as Aegean numbers and ancient Greek numbers.

Plane 2 records infrequently used Chinese, Korean and Japanese characters.

Planes 4 to 13 are not in use. The TIP (Plane 3) is intended for oracle bone script, bronze script, small seal script and other ideographic characters. PUA-A and PUA-B are reserved for private use: anyone can use them to store custom characters of their own.
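Because every plane holds exactly 0x10000 code points, the plane number of a code point is simply the code point divided by 0x10000, i.e. shifted right by 16 bits. A minimal sketch, where the function name `planeOf` is an assumption made for this example:

```javascript
// Return the plane (0 to 16) that a Unicode code point belongs to.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("Not a Unicode code point");
  }
  // Each plane spans 0x10000 code points, so the plane is the value above bit 15.
  return codePoint >>> 16;
}

console.log(planeOf(0x4F60));   // a common Chinese character, in the BMP
console.log(planeOf(0x20000));  // first code point of the SIP
console.log(planeOf(0x10FFFF)); // last code point of PUA-B
```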

UTF-32, UTF-16 and UTF-8

A computer stores only the code points of characters. The Unicode character set specifies which code point each character has, not how that code point is laid out in storage. To store characters we therefore use UTF-32, UTF-16 or UTF-8.

UTF-32

UTF-32 is an encoding form for Unicode that encodes every code point in 32 bits. The leading 11 bits must be zero, so only 2^21 values can be represented, which is more than enough for all currently known characters.

The main advantage of UTF-32 is that it can be indexed directly by code point: finding the nth code point in an encoded sequence is a constant-time operation. Variable-length encodings, by contrast, require sequential access to find the nth code point.

In most texts, however, characters outside the Basic Multilingual Plane are rare, which makes UTF-32 nearly twice as large as UTF-16 and up to four times as large as UTF-8 (depending on the proportion of ASCII characters: the more ASCII characters the text contains, the closer the ratios come to these theoretical values).
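The difference can be seen from JavaScript, whose strings are indexed by 16-bit code units. The sketch below stores the same text as 32-bit values in a `Uint32Array` (standing in for UTF-32) and compares the two ways of indexing; the sample text is an arbitrary choice for illustration.

```javascript
// "a", U+1F600 (an emoji outside the BMP), "b"
const text = "a\u{1F600}b";

// One 32-bit element per code point: position n is code point n.
const utf32 = new Uint32Array(Array.from(text, (ch) => ch.codePointAt(0)));

console.log(utf32.length);                 // number of code points
console.log(utf32[1].toString(16));        // the emoji, found by direct indexing

// The string itself is a sequence of UTF-16 code units, so index 1
// lands on the first half of a surrogate pair, not on a whole character.
console.log(text.length);                  // number of 16-bit code units
console.log(text.charCodeAt(1).toString(16));
```

With the fixed-width array, the third character is always at index 2; in the string it is at index 3 because the emoji occupies two code units.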
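To get a feel for these ratios, the following sketch counts the bytes a string would occupy in each encoding. It assumes an environment that provides the standard `TextEncoder` global (modern browsers and recent Node.js versions); `encodedSizes` is a name made up for this example.

```javascript
// Byte counts of the same text in the three encodings (no byte order mark).
function encodedSizes(text) {
  return {
    utf8: new TextEncoder().encode(text).length, // TextEncoder always produces UTF-8
    utf16: text.length * 2,                      // JS strings are UTF-16 code units
    utf32: Array.from(text).length * 4,          // 4 bytes per code point
  };
}

console.log(encodedSizes("hello"));
console.log(encodedSizes("你好"));
```

For pure ASCII text the sizes are in the ratio 1 : 2 : 4. For the two Chinese characters, UTF-8 needs 3 bytes per character and is therefore larger than UTF-16, so the ranking depends on the text.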

UTF-16

Code point range (hex)   Code point range (decimal)   UTF-16 representation (binary)             Bytes
U+0000 – U+FFFF          0 – 65535                    xxxx xxxx xxxx xxxx                        2
U+10000 – U+10FFFF       65536 – 1114111              1101 10yy yyyy yyyy  1101 11xx xxxx xxxx   4

As a reminder, UTF-32 does not mean that a code point takes 32 bytes, nor do UTF-16 and UTF-8 mean 16 and 8 bytes. The number after "UTF" is the minimum number of bits needed to represent a character's code point in that encoding. As the table above shows, a character in UTF-16 needs at least 16 bits (2 bytes).

So when does UTF-16 need 32 bits to represent a character?

Take the character whose code point is 0x64321. Since 0x64321 > 0xFFFF, 16 bits are not enough.

The calculation process is as follows:

    V  = 0x64321
    Vx = V - 0x10000 = 0x54321 = 0101 0100 0011 0010 0001
    Vh = 01 0101 0000    // the high 10 bits of Vx
    Vl = 11 0010 0001    // the low 10 bits of Vx
    w1 = 0xD800          // initial value of the first 16 bits of the result
    w2 = 0xDC00          // initial value of the last 16 bits of the result
    w1 = w1 | Vh = 1101 1000 0000 0000 | 01 0101 0000 = 1101 1001 0101 0000 = 0xD950
    w2 = w2 | Vl = 1101 1100 0000 0000 | 11 0010 0001 = 1101 1111 0010 0001 = 0xDF21

Calculation rules:

If the code point lies between 0x010000 and 0x10FFFF, then:

1. Subtract 0x10000 from the code point. This gives a number between 0x00000 and 0xFFFFF.
2. Convert this number to a 20-bit binary number, padding with zeros on the left if necessary, and write it as yyyy yyyy yyxx xxxx xxxx.
3. Take the high 10 bits, yy yyyyyyyy, and OR them with 11011000 00000000.
4. Take the low 10 bits, xx xxxxxxxx, and OR them with 11011100 00000000.
5. Concatenate the results of steps 3 and 4 to get 110110yy yyyyyyyy 110111xx xxxxxxxx.
6. Read the first and the last 16 bits of the result of step 5 as hexadecimal numbers.
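The rules above translate almost line for line into code. The following is a minimal sketch; the function name `toSurrogatePair` is an assumption made for this example, and it reuses the code point 0x64321 from the worked calculation.

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into two 16-bit
// UTF-16 code units, following the numbered rules above.
function toSurrogatePair(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("Expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;     // steps 1 and 2: a 20-bit value
  const high = 0xD800 | (offset >>> 10);  // step 3: high 10 bits
  const low = 0xDC00 | (offset & 0x3FF);  // step 4: low 10 bits
  return [high, low];                     // steps 5 and 6: the two 16-bit units
}

const [w1, w2] = toSurrogatePair(0x64321);
console.log(w1.toString(16).toUpperCase(), w2.toString(16).toUpperCase());
```

Running it on 0x64321 reproduces the values derived by hand, D950 and DF21.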

UTF-8

Code point range (hex)           Characters   Code point (binary)                                UTF-8 bytes (binary / hex)                           Notes
000000 – 00007F                  128          00000000 00000000 0zzzzzzz (7 z)                   0zzzzzzz (00-7F)                                     ASCII range; the byte starts with 0
000080 – 0007FF                  1920         00000000 00000yyy yyzzzzzz (5 y, 6 z)              110yyyyy (C0-DF) 10zzzzzz (80-BF)                    The first byte starts with 110, the following byte with 10
000800 – 00D7FF, 00E000 – 00FFFF 61440        00000000 xxxxyyyy yyzzzzzz (4 x, 6 y, 6 z)         1110xxxx (E0-EF) 10yyyyyy 10zzzzzz                   The first byte starts with 1110, the following bytes with 10
010000 – 10FFFF                  1048576      000wwwxx xxxxyyyy yyzzzzzz (3 w, 6 x, 6 y, 6 z)    11110www (F0-F7) 10xxxxxx 10yyyyyy 10zzzzzz          The first byte starts with 11110, the following bytes with 10

The 21 payload bits of the four-byte form can represent at most 2^21 values, which matches the 2^21 values that UTF-32 can represent.

Note what the leading bits of the first byte tell the parser:

  • A first byte starting with 110 means that the UTF-8 parser needs to read two bytes to decode the character.

  • A first byte starting with 1110 means that the UTF-8 parser needs to read three bytes to decode the character.

  • A first byte starting with 11110 means that the UTF-8 parser needs to read four bytes to decode the character.

Characters that cannot be represented in ASCII are therefore stored as multi-byte sequences, as the table above shows. The leading bits of the first byte give the length of the current character: the number of leading 1s is the number of bytes. In the third row, for example, the program reads the first byte and sees three leading 1s, so it reads two more bytes to make up three bytes. It then joins the payload bits (the x, y and z bits in the table) to obtain the complete code point.
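The reading procedure described above can be sketched as follows. The function name `decodeFirst` is an assumption made for this example, and the sketch only follows the bit patterns in the table: it does not reject overlong forms or surrogate code points, which a full UTF-8 decoder would also have to do.

```javascript
// Decode the first code point from an array of UTF-8 bytes.
// Returns the code point and the number of bytes it occupied.
function decodeFirst(bytes) {
  const lead = bytes[0];
  let length;
  let codePoint;
  // The leading bits of the first byte give the sequence length;
  // the remaining bits of that byte are the top payload bits.
  if ((lead & 0x80) === 0x00) { length = 1; codePoint = lead; }
  else if ((lead & 0xE0) === 0xC0) { length = 2; codePoint = lead & 0x1F; }
  else if ((lead & 0xF0) === 0xE0) { length = 3; codePoint = lead & 0x0F; }
  else if ((lead & 0xF8) === 0xF0) { length = 4; codePoint = lead & 0x07; }
  else { throw new RangeError("Invalid UTF-8 lead byte"); }

  if (bytes.length < length) {
    throw new RangeError("Truncated UTF-8 sequence");
  }
  // Every following byte must look like 10zzzzzz and contributes 6 bits.
  for (let i = 1; i < length; i++) {
    const next = bytes[i];
    if ((next & 0xC0) !== 0x80) {
      throw new RangeError("Invalid UTF-8 continuation byte");
    }
    codePoint = (codePoint << 6) | (next & 0x3F);
  }
  return { codePoint, length };
}

// E4 BD A0 is the three-byte UTF-8 form of 你 (U+4F60).
const decoded = decodeFirst([0xE4, 0xBD, 0xA0]);
console.log(decoded.codePoint.toString(16).toUpperCase(), decoded.length);
```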

Comparison & Application

Comparison

UTF-32 can locate any character in nearly constant time, but it takes up a lot of space.

UTF-16 is a middle-ground option: it represents the characters we normally use within 16 bits. Compared with UTF-8, it wastes space (how much depends on the proportion of ASCII characters), while lookups are faster on average. Compared with UTF-32, it saves space, while lookups are slower on average.

UTF-8 is the encoding we use most often. The characters we type most when programming are a-z, A-Z and 0-9. As long as there are no other special characters (comment markers such as // or # also have code points no greater than 2^7 - 1), UTF-8 takes the least space and is the fastest to parse.

Application

Java programmers commonly use UTF-8 for their source files. Inside the JVM, however, strings are represented in UTF-16.

UTF-8 is also the encoding commonly used over HTTP.

JS implementation

    /**
     * Encode a character or code point as UTF-32, UTF-16 and UTF-8.
     * @param x {String} the first character is used
     *          {Number} a code point
     * @author Liulinwj
     */
    function encode(x) {
      const cp = typeof x === "string" ? x.codePointAt(0) : Math.floor(x);
      // Number.isInteger also rejects NaN and undefined (empty string input).
      if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF) {
        throw new TypeError("Invalid Code Point!");
      }

      // Format bytes as two-digit uppercase hex, separated by spaces.
      const combine = (bytes) =>
        bytes.map((n) => n.toString(16).toUpperCase().padStart(2, "0")).join(" ");

      // UTF-32: one 32-bit unit, most significant byte first (big-endian).
      const utf32 = [cp >>> 24, (cp >>> 16) & 0xFF, (cp >>> 8) & 0xFF, cp & 0xFF];

      // UTF-16: one 16-bit unit, or a surrogate pair above U+FFFF.
      let utf16;
      if (cp > 0xFFFF) {
        const c = cp - 0x10000;
        // The low surrogate carries 10 bits, so the mask is 0x3FF.
        utf16 = [(c >>> 10) + 0xD800, (c & 0x3FF) + 0xDC00];
      } else {
        utf16 = [cp];
      }

      // UTF-8: one to four bytes depending on the size of the code point.
      let utf8;
      if (cp < 0x80) {
        utf8 = [cp];
      } else if (cp < 0x800) {
        utf8 = [(cp >>> 6) | 0xC0, (cp & 0x3F) | 0x80];
      } else if (cp < 0x10000) {
        utf8 = [(cp >>> 12) | 0xE0, ((cp >>> 6) & 0x3F) | 0x80, (cp & 0x3F) | 0x80];
      } else {
        utf8 = [
          (cp >>> 18) | 0xF0,
          ((cp >>> 12) & 0x3F) | 0x80,
          ((cp >>> 6) & 0x3F) | 0x80,
          (cp & 0x3F) | 0x80,
        ];
      }

      // Little-endian reverses the bytes inside each code unit.
      return {
        UTF32LE: combine([...utf32].reverse()),
        UTF32BE: combine(utf32),
        UTF16LE: combine(utf16.flatMap((u) => [u & 0xFF, u >>> 8])),
        UTF16BE: combine(utf16.flatMap((u) => [u >>> 8, u & 0xFF])),
        UTF8: combine(utf8),
      };
    }

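What this code does: given a character or a code point, it returns the hexadecimal byte sequences of that code point in UTF-32 and UTF-16 (each in both byte orders) and in UTF-8.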
