1. A history of character encodings

1.1 Bytes

  • Inside a computer, all information is ultimately stored as binary values.
  • Each bit has two states, 0 and 1, so eight bits can represent 2^8 = 256 states; a group of eight bits is called a byte.

1.2 Units

  • 8 bits = 1 byte
  • 1024 bytes = 1K
  • 1024K = 1M
  • 1024M = 1G
  • 1024G = 1T
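The 1024-based chain above can be sketched in JavaScript by repeatedly dividing by 1024 and stepping up one unit (`formatBytes` is an illustrative helper, not a standard API):

```javascript
// Convert a raw byte count into the units listed above.
function formatBytes(bytes) {
  const units = ['B', 'K', 'M', 'G', 'T'];
  let i = 0;
  while (bytes >= 1024 && i < units.length - 1) {
    bytes /= 1024; // one division per step up the unit chain
    i++;
  }
  return bytes + units[i];
}

console.log(formatBytes(1024));    // "1K"
console.log(formatBytes(1048576)); // "1M"
```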

1.3 Base in JavaScript

1.3.1 Base representation

let a = 0b10100; // binary
let b = 0o24;    // octal
let c = 20;      // decimal
let d = 0x14;    // hexadecimal

console.log(a == b); // true
console.log(b == c); // true
console.log(c == d); // true

1.3.2 Base conversion

  • Decimal to any base: number.toString(targetBase)
    console.log(c.toString(2)); // "10100"
  • Any base to decimal: parseInt('string', sourceBase)
    console.log(parseInt('10100', 2)); // 20
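Putting the two rules together, any base can be converted to any other by going through decimal (`convertBase` is an illustrative helper, not a built-in):

```javascript
// parseInt() reads the string in the source base,
// toString() writes the number out in the target base.
function convertBase(str, fromBase, toBase) {
  return parseInt(str, fromBase).toString(toBase);
}

console.log(convertBase('10100', 2, 16)); // "14"
console.log(convertBase('ff', 16, 2));    // "11111111"
```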

1.4 ASCII

At first, computers were used only in the United States, and an eight-bit byte can represent 256 different states. States 0-31 were reserved for special purposes: whenever a terminal or printer encounters one of these agreed-upon bytes, it must perform some agreed-upon action, for example:

  • when it encounters 0x0A, the terminal starts a new line;
  • when it encounters 0x07, the terminal beeps;

All the spaces, punctuation marks, digits, and upper- and lowercase letters were then represented by consecutive byte states, up to number 127, so that the computer could store English text using distinct bytes.

These 128 symbols (including the 32 non-printable control symbols) occupy only the lower seven bits of a byte; the highest bit is uniformly set to zero.

This scheme is called ASCII encoding

ASCII: American Standard Code for Information Interchange.
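The consecutive code assignments can be inspected from JavaScript, where charCodeAt returns a character's code and String.fromCharCode goes the other way:

```javascript
// ASCII letters and digits occupy consecutive code ranges below 128.
console.log('A'.charCodeAt(0));       // 65
console.log('a'.charCodeAt(0));       // 97
console.log(String.fromCharCode(48)); // "0"
console.log(' '.charCodeAt(0) < 128); // true -- all ASCII fits in 7 bits
```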

1.5 GB2312

Later, some countries in Western Europe, which do not use English, found that their letters were not in ASCII. To store their own text, they used the slots after 127 for new letters, up to the last slot, 255. For example, the French é was encoded as 130. Of course, which symbol a code stood for varied from country to country: 130 represents é in French code pages but gimel (ג) in Hebrew ones.

The set of characters from 128 to 255 is called the extended character set.

In order to represent Chinese characters, China drew up an encoding scheme that discarded the extended symbols after 127:

  • A byte less than 127 keeps its original meaning, but two bytes both greater than 127 joined together represent one Chinese character.
  • The leading byte (called the high byte) ranges from 0xA1 to 0xF7, and the trailing byte (the low byte) from 0xA1 to 0xFE.
  • In this way about 7,000 simplified Chinese characters can be composed: (247 - 161) × (254 - 161) = 7998 slots.
  • Mathematical symbols, Japanese kana, and even the ASCII digits, punctuation marks, and letters were re-encoded as two-byte codes. These are the full-width characters; the original single-byte ones below 127 are the half-width characters.
  • This character scheme is called GB2312. GB2312 is a Chinese extension of ASCII.
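A minimal sketch of the byte ranges above (`isGB2312Pair` is an illustrative helper, not a full decoder):

```javascript
// A byte pair encodes a GB2312 character when the high byte is in
// 0xA1-0xF7 and the low byte is in 0xA1-0xFE.
function isGB2312Pair(hi, lo) {
  return hi >= 0xA1 && hi <= 0xF7 && lo >= 0xA1 && lo <= 0xFE;
}

// The (247-161)*(254-161) figure from the text:
console.log((0xF7 - 0xA1) * (0xFE - 0xA1)); // 7998
console.log(isGB2312Pair(0xD6, 0xD0));      // true  -- "中" in GB2312
console.log(isGB2312Pair(0x41, 0x42));      // false -- plain ASCII bytes
```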

1.6 GBK

Later, even that was not enough, so the requirement that the low byte be greater than 127 was simply dropped: as long as the first byte is greater than 127, it is taken as the start of a Chinese character. Nearly 20,000 new Chinese characters (including traditional characters) and symbols were added.

1.7 GB18030 / DBCS

Later still, with thousands of characters for ethnic minority languages added, GBK was extended into GB18030; this family of encodings came to be called DBCS.

DBCS: Double Byte Character Set.

The defining feature of the DBCS family of standards is that two-byte Chinese characters and one-byte English characters coexist in the same encoding scheme.

Every country, like China, developed its own encoding standard, with the result that no one understood anyone else's encoding and no one's software supported everyone else's.

1.8 Unicode

ISO, the international standardization organization, scrapped all the regional encoding schemes and started over with a single encoding that includes every culture and every letter and symbol on the planet. Unicode is, of course, a large set, now with room for more than a million symbols.

  • ISO: International Organization for Standardization.
  • UCS: Universal Multiple-Octet Coded Character Set, commonly known as Unicode.

ISO initially specified that every character must be represented in two bytes, that is, 16 bits. For the half-width ASCII characters, Unicode keeps the original code values unchanged but expands their storage from 8 bits to 16 bits, while the characters of other cultures and languages are entirely re-encoded.

From Unicode onwards, a half-width English letter and a full-width Chinese character are equally one character each, and each occupies the same two bytes.

  • A byte is an 8-bit physical unit of storage;
  • a character is a culturally defined symbol.

1.9 UTF-8

Unicode was not widely adopted for a long time, until the advent of the Internet. The transmission-oriented UTF standards were created to solve the problem of how Unicode should travel over the network.

UTF: UCS (Universal Character Set) Transformation Format.

  • UTF-8 is the most widely used implementation of Unicode on the Internet.
  • UTF-8 transmits data eight bits at a time.
  • UTF-16 transmits sixteen bits at a time.
  • UTF-8's biggest feature is that it is a variable-length encoding.
  • Unicode stores each Chinese character in 2 bytes, while UTF-8 uses 3 bytes per Chinese character.
  • UTF-8 is one implementation of Unicode.
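The variable length is easy to observe in NodeJS, where Buffer.byteLength counts the UTF-8 bytes a string needs while .length counts characters:

```javascript
// One character is one character either way, but its UTF-8 size varies.
console.log('a'.length, Buffer.byteLength('a', 'utf8'));   // 1 1
console.log('万'.length, Buffer.byteLength('万', 'utf8')); // 1 3
console.log(Buffer.from('万', 'utf8')); // <Buffer e4 b8 87>
```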

1.10 UTF-8 Encoding Rules

  1. For a single-byte symbol, the first bit is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. So for English letters, UTF-8 encoding is identical to ASCII.
  2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n + 1 is set to 0, and the first two bits of every following byte are set to 10. All the remaining, unmentioned bits are filled with the symbol's Unicode code point.
 Unicode range (hexadecimal)  | UTF-8 encoding (binary)
------------------------------+--------------------------------------
 0000 0000 - 0000 007F        | 0xxxxxxx
 0000 0080 - 0000 07FF        | 110xxxxx 10xxxxxx
 0000 0800 - 0000 FFFF        | 1110xxxx 10xxxxxx 10xxxxxx
 0001 0000 - 0010 FFFF        | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  • Encoding a Unicode code point in the three-byte range (0x0800 - 0xFFFF) by hand:

function transfer(num) {
  let ary = ['1110', '10', '10'];
  let binary = num.toString(2);
  ary[2] = ary[2] + binary.slice(binary.length - 6);
  ary[1] = ary[1] + binary.slice(binary.length - 12, binary.length - 6);
  ary[0] = ary[0] + binary.slice(0, binary.length - 12).padStart(4, '0');
  let result = ary.join('');
  return parseInt(result, 2).toString(16);
}

let result = transfer(0x4E07); // "e4b887" -- the UTF-8 bytes of U+4E07 (万)
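Going the other way, a sketch for the three-byte case only (`untransfer` is a hypothetical helper): strip the 1110/10/10 prefixes with bit masks and splice the payload bits back together.

```javascript
// Decode three UTF-8 bytes (given as a hex string) back to a code point.
function untransfer(hexStr) {
  const bytes = [];
  for (let i = 0; i < hexStr.length; i += 2) {
    bytes.push(parseInt(hexStr.slice(i, i + 2), 16));
  }
  // 1110xxxx -> 4 bits, 10xxxxxx -> 6 bits, 10xxxxxx -> 6 bits
  const code = ((bytes[0] & 0x0F) << 12) |
               ((bytes[1] & 0x3F) << 6) |
               (bytes[2] & 0x3F);
  return code.toString(16);
}

console.log(untransfer('e4b887')); // "4e07" -- back to 万
```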

1.11 "Unicom" is worse than "Mobile"

A classic demonstration of encoding confusion: type the word 联通 ("Unicom") into the old Windows Notepad, save the file in the ANSI (GBK) encoding, and reopen it; the text comes back as garbage, while 移动 ("Mobile") survives. The GBK bytes of 联通 are:

C1 1100 0001
AA 1010 1010
CD 1100 1101
A8 1010 1000

By coincidence, both byte pairs match the two-byte UTF-8 pattern 110xxxxx 10xxxxxx, so Notepad guesses the file is UTF-8 and splices the payload bits together instead:

0000000001101010 -> 006A (106) -> j
0000001101101000 -> 0368 (872) -> ?
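The misreading can be reproduced with the standard two-byte UTF-8 bit arithmetic (`decodeTwoByteUtf8` mimics a lenient decoder; a strict modern one would reject 0xC1 as an overlong sequence):

```javascript
// 110xxxxx 10xxxxxx -> take 5 payload bits from the lead byte
// and 6 from the continuation byte.
function decodeTwoByteUtf8(hi, lo) {
  return ((hi & 0x1F) << 6) | (lo & 0x3F);
}

console.log(decodeTwoByteUtf8(0xC1, 0xAA).toString(16)); // "6a" -> "j"
console.log(decodeTwoByteUtf8(0xCD, 0xA8).toString(16)); // "368"
```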

1.12 Text encoding

When using NodeJS to write front-end tooling, most operations involve text files, so file encoding comes into play. The common text encodings are UTF8 and GBK, and UTF8 files may also carry a BOM. When reading text files with different encodings, the contents must be converted to the UTF8-encoded strings JS uses before they can be processed normally.

1.12.1 Removing BOM

A BOM marks a text file as Unicode-encoded; it is itself a Unicode character ("\uFEFF") placed at the head of the text file. Under the different Unicode encodings, the BOM character corresponds to the following bytes:

 Bytes      Encoding
----------------------------
 FE FF       UTF16BE
 FF FE       UTF16LE
 EF BB BF    UTF8

Therefore, we can determine whether a text file starts with a BOM, and which Unicode encoding it uses, by checking what its first few bytes equal. However, while the BOM marks the file's encoding, it is not itself part of the file's content, and leaving it in place causes problems in some scenarios. For example, if several JS files are concatenated into one, a BOM character in the middle of the file will trigger a browser JS syntax error. Therefore, when reading text files with NodeJS, the BOM is usually removed:

var fs = require('fs');

function readText(pathname) {
    var bin = fs.readFileSync(pathname);
    // The UTF8 BOM is the three bytes EF BB BF; strip them if present.
    if (bin[0] === 0xEF && bin[1] === 0xBB && bin[2] === 0xBF) {
        bin = bin.slice(3);
    }
    return bin.toString('utf-8');
}

1.12.2 GBK to UTF8

NodeJS supports specifying the text encoding when reading a text file or converting a Buffer to a string, but unfortunately GBK is not among the encodings NodeJS supports natively. Therefore, we usually use the third-party package iconv-lite to convert encodings. After installing it with npm, we can write a function that reads a GBK text file as follows:

var fs = require('fs');
var iconv = require('iconv-lite');

function readGBKText(pathname) {
    var bin = fs.readFileSync(pathname);
    return iconv.decode(bin, 'gbk');
}