This article summarizes my attempt to understand the relationship between Unicode and the UTF-8, UTF-16, and UTF-32 encodings. The last section introduces the principle and a simple implementation of Base64 encoding. Too many articles online describe Base64 as an encryption method without explaining what it actually is, which easily misleads newcomers; since Unicode is also an encoding, this seemed a good place to record the nature of Base64 as well.

What is Unicode?

Unicode is a standard in computer science that organizes and encodes most of the world's writing systems.

Its goal: to provide a unique identifier for all characters in the world.

What problem does it solve?

One limitation of traditional character encoding schemes is that encodings from different countries are incompatible with one another.

Unicode and UTF-8, UTF-16, UTF-32

UTF: Unicode Transformation Format

UTF-8, UTF-16, and UTF-32 are just different implementations of Unicode: different ways of storing its characters.

Code points: as noted above, the goal of Unicode is to provide a unique identifier for each character. This unique identifier is a number, known as a code point.

Unicode currently defines 17 planes, each of which has 2^16 (65,536) code points and can therefore represent 65,536 characters. Together the 17 planes cover code points 0 through 0x10FFFF, so every Unicode code point is a number between 0 and 0x10FFFF (at most 21 bits, which fits in three bytes).
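The plane of a code point can be read off directly: it is just the code point divided by 0x10000. A quick sketch in JS (the helper name is mine, for illustration only):

```javascript
// The plane number is the code point divided by 0x10000 (65536),
// i.e. everything above the low 16 bits.
const plane = cp => cp >> 16;

plane(0x4E2D);  // 0: the Basic Multilingual Plane (the CJK ideograph 中)
plane(0x20BB7); // 2: a supplementary plane (the character 𠮷)
```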

UTF-32

UTF-32 uses four bytes per character, so every Unicode code point can be stored directly. The problem is that storage utilization is very low.

Concrete implementation:

Code point    Encoding
0x10FFFF      00000000 00010000 11111111 11111111

Even the largest code point uses only 21 of the 32 bits, so utilization is below two-thirds; this wastes a lot of storage space and is a burden for network transmission.
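A minimal sketch of the fixed-width rule: every code point becomes exactly four bytes (big-endian here), matching the bit pattern shown above. The function name is mine, for illustration.

```javascript
// UTF-32BE: one code point -> exactly four bytes, highest byte first.
const utf32be = cp => [cp >>> 24 & 255, cp >>> 16 & 255, cp >>> 8 & 255, cp & 255];

utf32be(0x10FFFF); // [0, 16, 255, 255] = 00000000 00010000 11111111 11111111
utf32be(0x41);     // [0, 0, 0, 65]: 'A' wastes three of its four bytes
```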

UTF-16

Basic Multilingual Plane (BMP): the first plane of Unicode, with code points ranging from 0 to 0xFFFF.

Supplementary planes: the code points outside the first plane belong to the supplementary planes, of which there are 16 in total.

Concrete implementation:

Code point            Encoding
0x0000 – 0xFFFF       xxxxxxxx xxxxxxxx
0x10000 – 0x10FFFF    110110yy yyyyyyyy 110111xx xxxxxxxx
  1. Code points in the basic plane are encoded with two bytes.

  2. Code points in the supplementary planes are encoded with four bytes. But this raises a problem: when parsing, how does the computer know whether four bytes represent one supplementary character or two basic-plane characters?

    Solution: introduce surrogate pairs. Two ranges of code points in the basic plane (0xD800–0xDBFF and 0xDC00–0xDFFF) are reserved and represent no characters on their own; these are the surrogates. Each surrogate is two bytes, so a surrogate pair is four bytes: a high surrogate followed by a low surrogate together encode one supplementary character.

  3. The specific rules are as follows:

    1. Subtract 0x10000 from the code point, giving a number in the range 0x00000–0xFFFFF (at most 20 bits).
    2. Pad the result on the left with zeros to exactly 20 bits, then split it into two halves (yyyyyyyyyyxxxxxxxxxx). Add 0xD800 to the high (front) 10 bits to get the first code unit, the high surrogate (110110yy yyyyyyyy); add 0xDC00 to the low (back) 10 bits to get the second code unit, the low surrogate (110111xx xxxxxxxx).
    3. This yields two code units forming a surrogate pair, which is decoded by reversing the rules. Surrogates must appear in pairs; if one is found without its partner, the encoding is invalid and parsing fails.
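The rules above can be sketched in a few lines of JS (the function name is mine, for illustration only):

```javascript
// Split a supplementary-plane code point into a high and a low surrogate,
// following the UTF-16 rules described above.
function toSurrogatePair(codePoint) {
    if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
        throw new RangeError('not a supplementary-plane code point');
    }
    const offset = codePoint - 0x10000;   // at most a 20-bit value
    const high = 0xD800 + (offset >> 10); // top 10 bits
    const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
    return [high, low];
}

// 𠮷 is U+20BB7
toSurrogatePair(0x20BB7).map(n => n.toString(16)); // ['d842', 'dfb7']
```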

UTF-8

UTF-8, like UTF-16, is a variable-length character encoding. UTF-16 characters are 2 or 4 bytes long, while UTF-8 uses 1 to 4 bytes and can represent any character in the Unicode character set as follows:

Code point            Encoding
0x0000 – 0x007F       0xxxxxxx                              (up to 7 bits, max 0x7F)
0x0080 – 0x07FF       110xxxxx 10xxxxxx                     (up to 11 bits, max 0x7FF)
0x0800 – 0xFFFF       1110xxxx 10xxxxxx 10xxxxxx            (up to 16 bits, max 0xFFFF)
0x10000 – 0x10FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   (up to 21 bits, max 0x10FFFF)

UTF-8 is compatible with ASCII: the first 128 code points are each encoded as a single byte whose binary value is identical to the corresponding ASCII encoding.

The specific coding rules are as follows:

UTF-8 uses a prefix on each byte to mark how many bytes the current character occupies:

  1. If a byte starts with the bit 0, it represents a single-byte character on its own
  2. If a byte starts with the bits 10, it is a continuation byte inside a multi-byte character
  3. If a byte starts with the bits 110, it is the first byte of a two-byte character
  4. If a byte starts with the bits 1110, it is the first byte of a three-byte character
  5. If a byte starts with the bits 11110, it is the first byte of a four-byte character

In other words, the first byte of a character is a length marker and the remaining bytes are continuation bytes. If the first byte starts with 0, the character occupies a single byte; otherwise the number of leading 1 bits tells you how many bytes the character occupies. The continuation bytes are more uniform: each has the fixed form 10xxxxxx, carrying six more bits of the code point.
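A minimal sketch of these prefix rules, encoding a single code point into UTF-8 bytes (the function name is mine; for brevity it does not reject surrogate code points):

```javascript
// Encode one Unicode code point to its UTF-8 byte sequence,
// choosing the prefix pattern from the table above.
function utf8Encode(cp) {
    if (cp <= 0x7F) return [cp];                       // 0xxxxxxx
    if (cp <= 0x7FF) return [                          // 110xxxxx 10xxxxxx
        0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
    if (cp <= 0xFFFF) return [                         // 1110xxxx 10xxxxxx 10xxxxxx
        0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
    return [                                           // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
}

utf8Encode(0x41);   // [0x41]: ASCII 'A' is unchanged
utf8Encode(0x4E2D); // [0xE4, 0xB8, 0xAD]: 中 takes three bytes
```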

Byte order

Byte order: on almost all machines, multi-byte objects (such as Unicode code units) are stored as contiguous sequences of bytes, and byte order refers to the order in which those bytes are stored. When the low-order byte is placed at the smaller address and the high-order byte at the larger address, the object is little-endian; when the low-order byte is at the larger address and the high-order byte at the smaller address, it is big-endian.

Example: suppose a variable x at address 0x100 has the value 0x01234567, occupying addresses 0x100–0x103. Stored little-endian: 0x100: 0x67, 0x101: 0x45, 0x102: 0x23, 0x103: 0x01. Stored big-endian: 0x100: 0x01, 0x101: 0x23, 0x102: 0x45, 0x103: 0x67.

Take a closer look at the example above: big-endian storage matches the order in which we write the number. For more, see Ruan Yifeng's article "Understanding byte order", which gives a brief summary of the topic.
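In JS you can observe byte order directly with typed arrays; DataView lets you choose the order explicitly:

```javascript
// Write 0x01234567 into a 4-byte buffer in both byte orders
// and inspect the raw bytes.
const buf = new ArrayBuffer(4);
const view = new DataView(buf);

view.setUint32(0, 0x01234567, true);   // little-endian write
[...new Uint8Array(buf)];              // [0x67, 0x45, 0x23, 0x01]

view.setUint32(0, 0x01234567, false);  // big-endian write
[...new Uint8Array(buf)];              // [0x01, 0x23, 0x45, 0x67]
```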

Encoding     BOM
UTF-16 LE    FF FE
UTF-16 BE    FE FF
UTF-32 LE    FF FE 00 00
UTF-32 BE    00 00 FE FF
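A sketch of sniffing these marks from the first bytes of a buffer (the function name is mine; note that the UTF-32 LE check must come before UTF-16 LE, because FF FE is a prefix of FF FE 00 00):

```javascript
// Identify the encoding from a leading byte order mark, if any.
function sniffBOM(bytes) {
    const [b0, b1, b2, b3] = bytes;
    if (b0 === 0xFF && b1 === 0xFE && b2 === 0x00 && b3 === 0x00) return 'UTF-32 LE';
    if (b0 === 0x00 && b1 === 0x00 && b2 === 0xFE && b3 === 0xFF) return 'UTF-32 BE';
    if (b0 === 0xFF && b1 === 0xFE) return 'UTF-16 LE';
    if (b0 === 0xFE && b1 === 0xFF) return 'UTF-16 BE';
    return 'unknown';
}

sniffBOM([0xFF, 0xFE, 0x41, 0x00]); // 'UTF-16 LE'
sniffBOM([0xFF, 0xFE, 0x00, 0x00]); // 'UTF-32 LE'
```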

The relationship with JS

  1. charCodeAt: returns the code unit at the given index, as a number from 0 to 65535 (16 bits)
  2. String.fromCharCode: converts code units (0–65535) into the corresponding characters; the counterpart of charCodeAt
  3. codePointAt: a method added in ES6 to fully support UTF-16; it extracts the Unicode code point at a given index in a string, handling surrogate pairs
  4. String.fromCodePoint: converts code points into the corresponding characters; the counterpart of codePointAt
  5. The regular expression u flag: regex matching in JS assumes a character is a single 16-bit code unit, so characters above 65535 are seen as two characters; the u flag makes the regex treat the string as Unicode characters

The following is an example:

const str = '𠮷'; // \uD842\uDFB7
str.length;          // 2
str.codePointAt(0);  // 134071
str.codePointAt(1);  // 57271
str.charCodeAt(0);   // 55362
str.charCodeAt(1);   // 57271
const reg = /^.$/;
reg.test(str);       // false
const regU = /^.$/u;
regU.test(str);      // true

Additional topic: Base64 encoding

First, a declaration: Base64 is only an encoding, not an encryption method, though it can be used for simple data obfuscation. (Many online articles talk about "encrypting with Base64"; to be clear, Base64 has no cryptographic strength and can only lightly obscure data. Encryption means that as long as the key is not leaked, the encrypted data is safe; Base64 offers nothing of the kind, and anyone with a little technical knowledge can recognize Base64-encoded data at a glance.)

Base64 encoding: select 64 (2^6) characters (A–Z, a–z, 0–9, +, /) as the Base64 character set, plus a padding character =, for 65 characters in total, and use these characters to represent arbitrary binary data.

The coding rules are as follows:

  1. Take the binary data three bytes (24 bits) at a time
  2. Divide the 24 bits into four groups of six bits each (2^6)
  3. Prepend 00 to each group, extending the total to 32 bits (four bytes)
  4. Map each resulting value (0–63) to the corresponding character in the Base64 table
  5. If the total length of the binary data is a multiple of three, simply repeat the steps above; otherwise the length mod three is 1 or 2, and the remainder is handled as follows
  6. If two bytes remain, split their 16 bits into groups of 6, 6 and 4 according to the rules above, pad the last group with 00 to six bits, encode three characters, and append one = to make four characters
  7. If one byte remains, split its 8 bits into groups of 6 and 2, pad the last group with 0000 to six bits, encode two characters, and append == to make four characters

The following is an example:

ASCII code point of m: 109 (01101101)

ASCII code point of a: 97 (01100001)

ASCII code point of n: 110 (01101110)

  • The Base64 encoding of the string man works as follows:

    1. The ASCII code points are m: 109, a: 97, n: 110
    2. Three characters form exactly one 24-bit group: 011011010110000101101110, divided into four 6-bit groups: 011011-010110-000101-101110
    3. 011011 is 27 (b), 010110 is 22 (W), 000101 is 5 (F), 101110 is 46 (u); the result is bWFu
  • When the string is ma, the process is:

    1. The first step is the same, but with only two characters: m is 109, a is 97
    2. 0110110101100001 is only 16 bits, so it can only be split into three groups: 011011-010110-0001. The last group has only 4 bits; pad it with two zeros to six bits: 011011-010110-000100
    3. 011011 is 27 (b), 010110 is 22 (W), 000100 is 4 (E). That yields only three characters, so one = is appended to make four: bWE=
  • When the string is just m, the process is:

    1. Only one character this time: m is 109 (01101101)
    2. A single character has only 8 bits, split into two groups: 011011-01. The last group has only 2 bits; pad it with four zeros to six bits: 011011-010000
    3. 011011 is 27 (b), 010000 is 16 (Q). That yields only two characters, so == is appended to make four: bQ==

JS supports Base64 natively via btoa (binary to ASCII) and atob (ASCII to binary). Note the following when using these methods:

  1. btoa: this function treats every character as a single byte, regardless of how many bytes actually make up the character. If any character's code point falls outside the range 0x00–0xFF, it throws an InvalidCharacterError.
  2. atob: since atob and btoa are a matching pair, decoding also assumes one byte per character. If data encoded from a binary stream is decoded with this method, the result will not match the original content unless every unit of the stream was a single byte.
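The worked examples above can be checked against these built-ins (available in browsers and in modern Node):

```javascript
// btoa/atob operate on Latin-1 strings: one character = one byte.
btoa('man');  // 'bWFu'
btoa('ma');   // 'bWE='
btoa('m');    // 'bQ=='
atob('bQ=='); // 'm'
// btoa('𠮷') would throw InvalidCharacterError: its surrogate code
// units (0xD842, 0xDFB7) lie outside the 0x00-0xFF range.
```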

A simple Base64 encoding implementation:

const b64ch = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=';
const b64chs = Array.prototype.slice.call(b64ch);
function base64Encode (bin) {
    let u32, c0, c1, c2, asc = '';
    const pad = bin.length % 3; // 0, 1 or 2 leftover bytes
    for (let i = 0; i < bin.length;) {
        // read three bytes; charCodeAt past the end yields NaN,
        // which the bitwise operators below treat as 0
        if ((c0 = bin.charCodeAt(i++)) > 255 ||
            (c1 = bin.charCodeAt(i++)) > 255 ||
            (c2 = bin.charCodeAt(i++)) > 255)
            throw new TypeError('The string to be encoded contains characters outside of the Latin1 range');
        // pack the three bytes into one 24-bit number
        u32 = (c0 << 16) | (c1 << 8) | c2;
        // split into four 6-bit groups and look up the Base64 characters
        asc += b64chs[u32 >> 18 & 63]
            + b64chs[u32 >> 12 & 63]
            + b64chs[u32 >> 6 & 63]
            + b64chs[u32 & 63];
    }
    // drop the characters produced from the phantom bytes and pad with '='
    return pad ? asc.slice(0, pad - 3) + "===".substring(pad) : asc;
}

A simple Base64 decoding implementation:

const b64ch = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=';
const b64chs = Array.prototype.slice.call(b64ch);
// reverse lookup table: character -> 6-bit value ('=' maps to 64)
const b64tab = ((a) => {
    let tab = {};
    a.forEach((c, i) => tab[c] = i);
    return tab;
})(b64chs);
const b64re = /^(?:[A-Za-z\d+/]{4})*?(?:[A-Za-z\d+/]{2}(?:==)?|[A-Za-z\d+/]{3}=?)?$/;
const _fromCC = String.fromCharCode.bind(String);
function base64Decode (asc) {
    asc = asc.replace(/\s+/g, '');
    if (!b64re.test(asc)) throw new TypeError('malformed base64.');
    // normalize the length to a multiple of four by appending '='
    asc += '=='.slice(2 - (asc.length & 3));
    let u24, bin = '', r1, r2;
    for (let i = 0; i < asc.length;) {
        // pack four 6-bit values back into one 24-bit number
        u24 = b64tab[asc.charAt(i++)] << 18
            | b64tab[asc.charAt(i++)] << 12
            | (r1 = b64tab[asc.charAt(i++)]) << 6
            | (r2 = b64tab[asc.charAt(i++)]);
        // '=' (value 64) marks padding: emit 1, 2 or 3 bytes accordingly
        bin += r1 === 64 ? _fromCC(u24 >> 16 & 255)
            : r2 === 64 ? _fromCC(u24 >> 16 & 255, u24 >> 8 & 255)
            : _fromCC(u24 >> 16 & 255, u24 >> 8 & 255, u24 & 255);
    }
    return bin;
}

The Base64 encode and decode implementations above are extracted from the polyfill in js-base64.

Looking at Unicode and Base64 together, you can see that an encoding is, at bottom, just a rule for mapping between data and binary.

References

  • Unicode encoding and UTF-32, UTF-16, and UTF-8
  • Understand ECMAScript 6
  • Ruan Yifeng Base64 notes
  • Js base64 implementation