Regular study notes covering ES6, Promise, Node.js, Webpack, HTTP principles, and the Vue ecosystem, with TypeScript, Vue 3, and common interview questions possibly to follow.


The development of character encodings

ASCII is still the common baseline for English text, while Chinese characters are now usually stored and transmitted as UTF-8.

The main encodings can be grouped as follows:

  • ASCII
  • GB2312
  • GBK
  • GB18030 (DBCS)
  • Unicode
  • UTF-8 / UTF-16

ASCII

ASCII is a 7-bit code that defines 128 characters. Each one is stored in an 8-bit byte, which can take 256 different states.

Codes 0 to 31 are control codes: by convention, a terminal or printer performs an agreed action whenever one of these bytes arrives. (On receiving 0x0A the terminal starts a new line; on 0x07 it beeps; and so on.)

The space, punctuation marks, digits, and upper- and lower-case letters are assigned consecutive byte values, up to number 127. This lets the computer store English text using one byte per character.
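As a quick illustration in Node.js (the sample string is arbitrary and not from the original notes), each ASCII character maps to a single byte whose value is at most 127:

```javascript
// Each ASCII character occupies one byte with a value from 0 to 127.
const text = 'Hi!';
const bytes = [...Buffer.from(text, 'ascii')];
console.log(bytes); // [ 72, 105, 33 ]

// Control codes are not printable: 0x0A is the line feed.
console.log('\n'.charCodeAt(0)); // 10
console.log(bytes.every((b) => b <= 127)); // true
```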

Because computers were first used in the United States, only the first 128 values (0 to 127) were needed. But the world has many languages, and Chinese alone has tens of thousands of characters, so this encoding is clearly not enough.

GB2312

Some Western European countries use letters that English lacks, so they used the values after 127 for the new letters, all the way up to the last one, 255. (The meaning varied from place to place: 130 stood for é on some systems, but for Gimel (ג) on Hebrew ones.)

In China, the symbols after 127 were dropped so that those values could be used to represent Chinese characters.

A byte less than 128 keeps its original meaning. But when two bytes larger than 127 appear together, they represent one Chinese character. The first byte (high byte) runs from 0xA1 to 0xF7, and the second byte (low byte) runs from 0xA1 to 0xFE. Counting both endpoints, this gives (247-161+1)*(254-161+1) = 87*94 = 8178 code positions, of which GB2312 assigns 6,763 to simplified Chinese characters and 682 to other symbols.
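The size of that two-byte range can be checked with a little arithmetic. The sketch below counts both endpoints of each range; `isInTwoByteRange` is a hypothetical helper introduced only for illustration:

```javascript
// Inclusive byte ranges quoted in the text.
const HIGH_MIN = 0xa1, HIGH_MAX = 0xf7;
const LOW_MIN = 0xa1, LOW_MAX = 0xfe;

const highCount = HIGH_MAX - HIGH_MIN + 1; // 87
const lowCount = LOW_MAX - LOW_MIN + 1;    // 94
console.log(highCount * lowCount); // 8178

// Hypothetical helper: does this byte pair fall inside the two-byte range?
function isInTwoByteRange(high, low) {
  return high >= HIGH_MIN && high <= HIGH_MAX && low >= LOW_MIN && low <= LOW_MAX;
}
console.log(isInTwoByteRange(0xb0, 0xa1)); // true
console.log(isInTwoByteRange(0x41, 0x42)); // false (two plain ASCII bytes)
```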

Mathematical symbols, Japanese kana, and even the digits, punctuation marks, and letters already in ASCII were all re-encoded as two-byte codes. These are the full-width characters, while the original ones below 128 are the half-width characters.

This character scheme is called GB2312. GB2312 is a Chinese extension of ASCII.

GBK

In use, it was found that GB2312 was not enough. So the low byte was no longer required to be a code above 127: as long as the first byte is greater than 127, it always marks the start of a Chinese character.
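The lead-byte rule above is enough to split a byte stream into characters. The following is a minimal sketch, assuming well-formed input; `splitDoubleByte` and the sample bytes are made up for illustration, and the second byte is not validated:

```javascript
// Hypothetical helper: split a GBK-style byte sequence into characters
// using only the lead-byte rule (a byte above 127 starts a two-byte character).
function splitDoubleByte(bytes) {
  const chars = [];
  let i = 0;
  while (i < bytes.length) {
    const width = bytes[i] > 127 ? 2 : 1;
    chars.push(bytes.slice(i, i + width));
    i += width;
  }
  return chars;
}

// 'A', a pair with a high second byte, a pair with a low second byte, 'B'.
const sample = [0x41, 0xb0, 0xa1, 0x81, 0x40, 0x42];
console.log(splitDoubleByte(sample));
// [ [ 65 ], [ 176, 161 ], [ 129, 64 ], [ 66 ] ]
```

Note that the third character has a second byte (0x40) below 128, which the earlier scheme would not have allowed.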

Nearly 20,000 new Chinese characters (including traditional ones) and symbols were added.

GB18030 / DBCS

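GBK was later expanded into GB18030, which added several thousand characters used by China's ethnic minority scripts. These standards are collectively called DBCS (Double Byte Character Set).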

In the DBCS family of standards, the biggest feature is that two-byte Chinese characters and one-byte English characters coexist in the same encoding scheme.

In this way, each country came up with its own encoding standard. As a result, many different encodings existed at the same time, none of which understood or supported the others.

Unicode

Unicode encoding should be familiar to developers.

ISO: International Organization for Standardization

Unicode: Universal Multiple-Octet Coded Character Set (UCS), commonly known as Unicode

ISO set aside all the regional encoding schemes and started over with a single code that covers every culture, every letter, and every symbol on the planet.

Unicode is a very large set: it now has room for more than one million symbols.

The original ISO design required every character to be represented by two bytes (16 bits). Unicode keeps the ASCII half-width characters at their original code values but widens them from 8 to 16 bits, while characters from other cultures and languages are all re-encoded.

From Unicode onwards, a half-width English letter and a full-width Chinese character are each treated uniformly as one character. In this two-byte form, each also occupies the same two bytes.
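A short Node.js check of this idea, using 'A' and 莫 (U+83AB) as sample characters chosen for illustration. The `'utf16le'` encoding stands in for the two-byte form described above, with the low byte stored first:

```javascript
// Every character has one code point, regardless of script.
const codePoint = '莫'.codePointAt(0);
console.log('A'.codePointAt(0));     // 65
console.log(codePoint.toString(16)); // 83ab

// Both count as one character.
console.log('A'.length, '莫'.length); // 1 1

// In a 16-bit encoding both take two bytes (little-endian byte order here).
console.log(Buffer.from('A', 'utf16le'));  // <Buffer 41 00>
console.log(Buffer.from('莫', 'utf16le')); // <Buffer ab 83>
```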

Unicode was not widely adopted for a long time, until the Internet arrived.

UTF

To solve the problem of how Unicode is transported over the network, a number of transport-oriented UTF standards have emerged.

UTF: UCS (Universal Character Set) Transformation Format

UTF-8 is the most widely used implementation of Unicode on the Internet.

UTF-8 transfers data in 8-bit units, while UTF-16 transfers data in 16-bit units.

UTF-8 is a variable-length encoding. A common Chinese character takes 2 bytes in the two-byte Unicode form (UTF-16), but 3 bytes in UTF-8.
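To make the variable-length rule concrete, here is a minimal sketch of UTF-8 encoding for a single code point. `utf8Encode` and `hex` are hypothetical helpers written for this note; the sketch does not reject invalid code points such as surrogates:

```javascript
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
function utf8Encode(cp) {
  if (cp < 0x80) return [cp]; // 1 byte: plain ASCII
  if (cp < 0x800) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  if (cp < 0x10000) {
    // 3 bytes: where most common Chinese characters fall
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

const hex = (bytes) => bytes.map((b) => b.toString(16)).join(' ');
console.log(hex(utf8Encode(0x41)));   // 41
console.log(hex(utf8Encode(0x83ab))); // e8 8e ab

// Node's built-in byte counts agree: 3 bytes in UTF-8, 2 in UTF-16.
console.log(Buffer.byteLength('莫', 'utf8'));    // 3
console.log(Buffer.byteLength('莫', 'utf16le')); // 2
```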

In other words, the UTF encodings are simply concrete implementations of Unicode.

Base64 encoding specification

Base64 is one of the most common ways to transmit 8-bit byte data over the network. In development it can stand in for a file path (for example, by inlining a resource as Base64 text), and it can also be used for transport.

Base64 encoding maps binary data to printable characters, so it can be used to carry longer identity information in an HTTP environment. The encoded text is not human-readable and must be decoded before it can be read.

Base64 takes every three 8-bit bytes (24 bits) and splits them into four 6-bit groups, then pads each group with two high-order zero bits to form four 8-bit bytes.
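The three-bytes-in, four-characters-out ratio can be seen with Node's built-in Base64 support. The sample string 'Man' is an arbitrary three-byte input chosen for illustration:

```javascript
// Three input bytes become four Base64 characters.
const input = Buffer.from('Man'); // bytes 4d 61 6e
const encoded = input.toString('base64');
console.log(encoded);                      // TWFu
console.log(input.length, encoded.length); // 3 4

// Decoding restores the original bytes.
console.log(Buffer.from(encoded, 'base64').toString()); // Man
```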

We usually use the encodeURIComponent() method to percent-encode text, which writes each UTF-8 byte of a non-ASCII character as % followed by two hexadecimal digits.
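For example, with 莫 as a sample character, the percent-encoded output shows the same three UTF-8 bytes in hexadecimal:

```javascript
// Non-ASCII characters are written as their UTF-8 bytes, one %XX per byte.
const encodedMo = encodeURIComponent('莫');
console.log(encodedMo); // %E8%8E%AB

// Reserved ASCII characters such as the space are escaped too.
console.log(encodeURIComponent('a b')); // a%20b

// decodeURIComponent reverses the process.
console.log(decodeURIComponent(encodedMo)); // 莫
```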

How the conversion works

As we know, a Chinese character takes 3 bytes in UTF-8, and a byte is 8 bits. Base64 regroups those bits as 3 * 8 = 4 * 6, which means that one Chinese character becomes four bytes and the result is one third larger than the input.

Let's first take a Chinese character and look at its bytes in hexadecimal.

// Buffer.from turns a string into its UTF-8 bytes, shown here in hexadecimal
let r = Buffer.from('莫');
console.log(r); // <Buffer e8 8e ab>

As we can see, the result is 3 bytes in hexadecimal. We convert them to binary one by one.

console.log((0xe8).toString(2)); // 11101000
console.log((0x8e).toString(2)); // 10001110
console.log((0xab).toString(2)); // 10101011

The binary strings are then concatenated and split into four 6-bit groups, and each group is prefixed with two zeros to make an 8-bit value. This gives the following result.

// 11101000 10001110 10101011
// 00111010 00001000 00111010 00101011

Each of these values can be at most 00111111 (63). Next we convert them to decimal.

console.log(parseInt('00111010', 2)); // 58
console.log(parseInt('00001000', 2)); // 8
console.log(parseInt('00111010', 2)); // 58
console.log(parseInt('00101011', 2)); // 43

So every value we get is less than 64 (0 to 63).

Looking each value up in the standard Base64 index table (A-Z are 0-25, a-z are 26-51, 0-9 are 52-61, + is 62, / is 63), we find that the Base64 encoding of 莫 (Mo), the character encoded above, is 6I6r.

Putting this idea together, we get the following code.

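// The standard Base64 alphabet: index 0-63 maps to one character
const CHARTS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
function transfer(str){
  let buf = Buffer.from(str); // UTF-8 bytes of the input
  let result = '';
  for(let b of buf){
      // Pad each byte to 8 bits so leading zeros are not lost
      result += b.toString(2).padStart(8, '0');
  }
  // Split into 6-bit groups and look each one up in the alphabet.
  // No '=' padding: this assumes the byte length is a multiple of 3.
  return result.match(/\d{6}/g).map(v => parseInt(v, 2)).map(v => CHARTS[v]).join('');
}
let r = transfer('莫');
console.log(r); // 6I6r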

This completes our Base64 conversion.
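The walk-through above covers input whose byte length is a multiple of 3. As a possible extension, here is a sketch that also emits the usual '=' padding for leftover bytes. `toBase64` and `ALPHABET` are names introduced for this example, not part of the original notes:

```javascript
const ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Hypothetical helper: Base64-encode a string's UTF-8 bytes, with '=' padding.
function toBase64(str) {
  const buf = Buffer.from(str);
  let out = '';
  for (let i = 0; i < buf.length; i += 3) {
    const chunk = buf.subarray(i, i + 3);
    // Pack up to three bytes into one 24-bit number (missing bytes count as 0).
    const n = (chunk[0] << 16) | ((chunk[1] || 0) << 8) | (chunk[2] || 0);
    out += ALPHABET[(n >> 18) & 63] + ALPHABET[(n >> 12) & 63];
    out += chunk.length > 1 ? ALPHABET[(n >> 6) & 63] : '=';
    out += chunk.length > 2 ? ALPHABET[n & 63] : '=';
  }
  return out;
}

console.log(toBase64('莫')); // 6I6r
console.log(toBase64('Ma')); // TWE=
console.log(toBase64('M'));  // TQ==
```

In practice, `Buffer.from(str).toString('base64')` gives the same results in Node.js; the manual version is only meant to show the bit regrouping.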

This article was written by Mo Xiaoshang. If you find any problems or omissions, corrections and discussion are welcome. You can also follow my personal site, Cnblogs, and Juejin; I upload each article to these platforms once it is finished. Thank you for your support!