The story of why "Unicom" is not as good as "Mobile"

A story has long circulated in the programming community that "Unicom" is not as good as "Mobile"…

Please don't get me wrong: the carriers China Unicom and China Mobile really have nothing to do with the encodings discussed in this article. But let their Chinese names, "联通" (Unicom) and "移动" (Mobile), help us run a little experiment first, and then we will talk about encodings in detail.

On Windows, right-click on the desktop to create a new Notepad (text) file, open it, type the two Chinese characters "联通" (Unicom), press Ctrl+S to save, and close it.

Double-click to open it again and what do you see? Strange: how did the text become garbled?

OK, create a new file, and this time type "移动" (Mobile), save, and try again. Amazing: "移动" displays perfectly.

Well, no more stories. This interesting phenomenon is just a way into talking about "encoding" in computers, and then explaining why "Mobile is better than Unicom".

A brief history of character encodings

In a computer, all stored data is represented in binary. Letters, numbers, and other characters are no exception. The computer's smallest unit is the bit (0 or 1), and 8 bits make one byte, so a byte can take 2^8 = 256 distinct states and can, in theory, represent 256 different characters. Which binary value stands for which character is decided by people: these agreed mappings are what we call character encodings.

Computers were first invented in the English-speaking world. English has only 26 letters plus a modest set of punctuation marks, digits, and symbols, so it could be fully covered by the ASCII code.

ASCII

ASCII was originally used only in the United States. Of the 256 possible states of a byte, the first 32 (codes 0 through 31) were given special control purposes: once a terminal or printer encounters one of these agreed bytes, it performs an agreed action, such as starting a new line when it encounters byte 10 (0x0A, line feed).

The space, punctuation marks, digits, and upper- and lower-case letters were then assigned to the consecutive byte states up to number 127, so that the computer could store English text using these bytes.

I remember that when I learned C, I knew some common ASCII values by heart: uppercase 'A' is 65, lowercase 'a' is 97, and so on.
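
These values are easy to verify in JavaScript, the language used for the code later in this article:

```javascript
// Look up ASCII values with charCodeAt, and map back with fromCharCode
console.log('A'.charCodeAt(0));       // 65
console.log('a'.charCodeAt(0));       // 97
console.log(String.fromCharCode(66)); // 'B'
```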

English is fine, but there are many other languages in the world besides English. Chinese characters alone number in the tens of thousands, so these 8 bits are nowhere near enough. What should we do?

GB2312

Apart from Chinese, some European languages also have special letters, such as Russian and Greek. So those countries used the space after 127 to represent their own letters. Of course, this got confusing, because every country made different assignments: 130, for example, is the letter é in French encodings, but in Hebrew encodings 130 is the letter ג.

Chinese is even harder to handle: even using all 8 bits of a byte, we cannot begin to cover tens of thousands of Chinese characters. So China developed its own encoding, GB2312.

In order to represent Chinese characters, the Chinese standard abolished the symbols after 127 and stipulated:

  • A byte less than 128 keeps its original (ASCII) meaning, but two consecutive bytes greater than 127 are linked together to represent one Chinese character.
  • The first byte (called the high byte) ranges from 0xA1 to 0xF7, and the second byte (the low byte) from 0xA1 to 0xFE;
  • So we can compose about 8,000 ((247-161) * (254-161) = 86 * 93 = 7,998) simplified Chinese characters.
  • Mathematical symbols, Japanese kana, and the ASCII digits, punctuation marks, and letters were also re-encoded as two-byte codes; these are the "full-width" characters, while the original ones below 128 are "half-width" characters. This scheme was named GB2312, a Chinese extension of ASCII.
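
The code-space arithmetic in the last list item can be checked in one line:

```javascript
// GB2312 code space: high byte 0xA1-0xF7, low byte 0xA1-0xFE
console.log((0xF7 - 0xA1) * (0xFE - 0xA1)); // 86 * 93 = 7998
```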

GBK

Later it was found that GB2312 solved the problem of encoding Chinese, but it still had gaps.

There was an incident back then: when I looked up my score in the college entrance examination registration system, my given name could not be displayed, only my family name. That was because the character "玥" (Yuè) in my name falls outside the code range of GB2312, so it simply was not included.

As a result, the requirement that the low byte must also fall in the range after 127 was dropped: as long as the first byte was greater than 127, it was fixed that this was the beginning of a Chinese character. Nearly 20,000 new Chinese characters (including traditional characters) and symbols were added.

This more comprehensive encoding is GBK.
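
The scanning rule above can be sketched as a tiny helper; countGbkChars is a hypothetical name, and the byte values are illustrative (0xC1 0xAA is the GB2312 code of "联", used again at the end of this article):

```javascript
// Count characters in a GBK-style byte stream:
// a lead byte greater than 127 starts a two-byte Chinese character.
function countGbkChars(bytes) {
  let count = 0;
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] > 127) i++; // consume the trailing byte as well
    count++;
  }
  return count;
}

// 'A' (0x41), one two-byte Chinese character (0xC1 0xAA), then 'B' (0x42)
console.log(countGbkChars([0x41, 0xC1, 0xAA, 0x42])); // 3
```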

Unicode

As things developed, every country created its own encoding for its own language, and it became a mess: we didn't know what others were using, and they didn't know what we were using. So the standards bodies stepped in.

The ISO standards organization saw the mess and came up with Unicode to deal with it. The approach was simple: no matter which language a character comes from, it would be represented by two bytes (two 8-bit bytes, 16 bits). For the half-width characters of ASCII, Unicode keeps the original code values unchanged but widens them from 8 to 16 bits, while characters from other scripts and languages are re-encoded entirely.

From Unicode onwards, whether it is a half-width English letter or a full-width Chinese character, each is uniformly one character, and each occupies the same two bytes.
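
JavaScript strings happen to mirror this design: they are stored as 16-bit code units, so a half-width letter and a common Chinese character each count as exactly one unit:

```javascript
console.log('A'.length);                       // 1
console.log('玥'.length);                      // 1
console.log('玥'.codePointAt(0).toString(16)); // '73a5'
```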

UTF8

Unicode was created in 1990 and put into official use in 1994, which is almost ancient history by now, and it was not widely adopted at first because the Internet was still undeveloped.

With the development of the Internet, a series of UTF standards emerged to solve the problem of transmitting Unicode:

  • UTF-8 is the most widely used implementation of Unicode on the Internet
  • UTF-8 transmits data in units of 8 bits at a time
  • UTF-16 transmits 16 bits at a time
  • One of UTF-8's biggest features is that it is a variable-length encoding
  • Unicode uses 2 bytes per Chinese character, while UTF-8 uses 3 bytes per (common) Chinese character
  • UTF-8 is an implementation of Unicode

Because UTF-8 is one of Unicode's implementations, the two are convertible: a Unicode code point can be turned into UTF-8 by a set of rules:

Unicode code point range (hex)    UTF-8 encoding (binary)
0000 0000 - 0000 007F             0xxxxxxx
0000 0080 - 0000 07FF             110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF             1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF             11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

As you can see, for a single-byte symbol the first bit is set to 0 and the remaining seven bits hold the symbol's Unicode code. So for English letters, UTF-8 and ASCII are identical (see the first row of the table above).

For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of each following byte are set to 10. All the remaining, unmentioned bits are filled with the symbol's Unicode code.
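
As a sketch, the ranges in the table translate directly into a byte-count rule (utf8ByteLength is a hypothetical helper name):

```javascript
// How many UTF-8 bytes does a code point need? (per the table above)
function utf8ByteLength(codePoint) {
  if (codePoint <= 0x7F) return 1;     // 0xxxxxxx
  if (codePoint <= 0x7FF) return 2;    // 110xxxxx 10xxxxxx
  if (codePoint <= 0xFFFF) return 3;   // 1110xxxx 10xxxxxx 10xxxxxx
  return 4;                            // up to 0x10FFFF
}

console.log(utf8ByteLength(0x41));   // 1 -- 'A'
console.log(utf8ByteLength(0x73A5)); // 3 -- a common Chinese character
```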

That's a little abstract. For example, when a Chinese character arrives encoded in UTF-8, how does the computer know?

Since common Chinese characters occupy three bytes (don't ask why three, that's what the table says), the first byte starts with three 1s followed by a 0, and each following byte starts with 10, so the sequence must look like this: 1110xxxx 10xxxxxx 10xxxxxx.

OK, the computer sees this pattern and understands at a glance: here comes a Chinese character!

For another example, let's pick a character from the Unicode table and convert it to UTF-8 by hand. Take the "玥" (Yuè) character from my name; its Unicode code point is U+73A5.

The first step is to convert the hexadecimal to binary, which gives 111001110100101. So how do we split this binary value? Since each UTF-8 continuation byte holds six bits of the Unicode code, we count off six bits at a time from right to left and pad the leftover leading bits with 0s.

11100111 10001110 10100101
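
Node.js can confirm the hand calculation, since Buffer encodes strings as UTF-8 by default:

```javascript
// U+73A5 should encode to the three bytes worked out above: E7 8E A5
console.log(Buffer.from('\u73a5').toString('hex')); // 'e78ea5'
console.log(parseInt('11100111', 2).toString(16));  // 'e7'
```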

As a developer, you can do this conversion in code. Here is a real transcoding implementation in Node.js:

function transferToUTF8(unicode) {
  // Bit prefixes of a three-byte UTF-8 sequence: 1110xxxx 10xxxxxx 10xxxxxx
  let code = ['1110', '10', '10'];

  let binary = unicode.toString(2); // Convert the code point to binary

  code[2] = code[2] + binary.slice(-6);      // Extract the last 6 bits
  code[1] = code[1] + binary.slice(-12, -6); // Extract the middle 6 bits
  code[0] = code[0] + binary.slice(0, binary.length - 12).padStart(4, '0'); // Take the remaining leading bits, padded with 0s

  code = code.map(item => parseInt(item, 2)); // Parse each binary string into a byte value

  return Buffer.from(code).toString(); // Use Buffer to decode the UTF-8 bytes into the character
}

console.log(transferToUTF8(0x73a5));

Running results:

玥

The code above defines a transferToUTF8 function. It receives a numeric value representing a Unicode code point, internally converts it to binary, assembles the UTF-8 bytes according to the rules above, and finally decodes them into the character using the Node.js Buffer. As you can see, the Chinese character "玥" is printed correctly.

That is a simple walk-through of the conversion between Unicode and UTF-8.

Why is "Unicom" inferior to "Mobile"?

The story is almost over. Having covered so much about encodings, we can now look back at why "联通" (Unicom) became garbled at the beginning. In Windows, Notepad's default encoding for saving Chinese text is GB2312. Looking it up, the GB2312 code of the character "联" is 0xC1AA, which in binary is 1100000110101010: exactly 16 bits, two bytes. Split into 8-bit groups, it exactly matches the second UTF-8 template in the table, 110xxxxx 10xxxxxx. So when you open the file again, Notepad scans its contents and decides it is a UTF-8 file, not GB2312! Parsing the content as UTF-8 then, of course, produces garbage.
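
We can check that byte pattern directly; looksLikeUtf8Pair is a hypothetical helper that tests a byte pair against the two-byte template 110xxxxx 10xxxxxx:

```javascript
// Does a byte pair match the two-byte UTF-8 template 110xxxxx 10xxxxxx?
function looksLikeUtf8Pair(b1, b2) {
  return (b1 & 0b11100000) === 0b11000000 && // first byte:  110xxxxx
         (b2 & 0b11000000) === 0b10000000;   // second byte: 10xxxxxx
}

// The GB2312 bytes of '联' (0xC1, 0xAA) fit the template perfectly
console.log(looksLikeUtf8Pair(0xC1, 0xAA)); // true
```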

At this point you can use "Save As" and explicitly set the file's encoding to GB2312; open it again, and "联通" finally displays.

This example is extreme; the garbling of "联通" is really just a coincidence of encodings. But understanding the details of encodings helps us quickly grasp the essence of the problems we meet in development and solve them. These are my notes, shared so we can learn and improve together.