What is character encoding?

Computers can only process numbers; to process text, they must first convert the text to numbers. The earliest computers were designed with eight bits to a byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255). To represent larger integers, more bytes must be used: the maximum integer two bytes can represent is 65,535, and the maximum four bytes can represent is 4,294,967,295.

ASCII code:

Since computers were invented in the United States, only 127 characters were encoded at first: upper and lower case English letters, digits, and some symbols. This code table is called ASCII; for example, 65 stands for uppercase A and 122 for lowercase z.
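As a quick check (a minimal Python sketch), the built-in ord() and chr() functions convert between a character and its code number:

    # Convert between characters and their code numbers.
    print(ord('A'))    # 65
    print(ord('z'))    # 122
    print(chr(65))     # A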

To process Chinese, however, one byte is clearly not enough: at least two bytes are needed, and the encoding must not conflict with ASCII. For this reason China developed the GB2312 encoding for Chinese characters.
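A small Python sketch illustrates this (Python ships a gb2312 codec): ASCII characters keep their single byte, while a Chinese character takes two.

    # GB2312 keeps ASCII at one byte and encodes a Chinese character in two.
    print('A'.encode('gb2312'))   # b'A'        -> 1 byte
    print('中'.encode('gb2312'))  # b'\xd6\xd0' -> 2 bytes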

There are hundreds of languages in the world. Japan encodes Japanese in Shift_JIS, Korea encodes Korean in EUC-KR, and so on; with each country using its own standard, conflicts are inevitable, and multilingual text ends up garbled.
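The garbling is easy to reproduce in Python (a minimal sketch): decode bytes with the wrong codec and you get gibberish.

    # Bytes written in one encoding, read back with another: garbage.
    data = '中文'.encode('gb2312')   # b'\xd6\xd0\xce\xc4'
    print(data.decode('latin-1'))    # ÖÐÎÄ  (garbled)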

Unicode code:

Hence Unicode. Unicode unifies all languages into a single character set, so the garbling problem disappears. The Unicode standard is still evolving, but the most common form represents a character with two bytes (four bytes for very rare characters). Modern operating systems and most programming languages support Unicode directly.
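For instance (a minimal Python sketch), every character from any script maps to a single Unicode number, and only unusual characters such as emoji fall outside the two-byte range:

    # One Unicode number per character, regardless of script.
    print(ord('中'))   # 20013  (fits in two bytes)
    print(ord('한'))   # 54620  (fits in two bytes)
    print(ord('😀'))   # 128512 (beyond two bytes)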

The difference between ASCII encoding and Unicode encoding:

ASCII uses 1 byte per character, while Unicode usually uses 2 bytes, as the following examples show.

The ASCII code for the letter A is 65 in decimal and 01000001 in binary;

The ASCII encoding for the character ‘0’ is 48 in decimal and 00110000 in binary. Note that the character ‘0’ is different from the integer 0;

The Chinese character 中 is beyond the range of ASCII; its Unicode encoding is 20013 in decimal and 01001110 00101101 in binary.

If A, encoded in ASCII, is re-encoded in Unicode, you only need to pad it with zeros in front; the Unicode encoding of A is therefore 00000000 01000001.
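These numbers are easy to verify; a minimal Python sketch prints the decimal and 16-bit binary code points used above:

    # Decimal and 16-bit binary code points for the examples above.
    for ch in ['A', '0', '中']:
        print(ch, ord(ch), format(ord(ch), '016b'))
    # A 65 0000000001000001
    # 0 48 0000000000110000
    # 中 20013 0100111000101101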

UTF-8:

A new problem arises. With a unified Unicode encoding the garbling problem disappears, but if your text is written mostly in English, Unicode needs twice as much storage space as ASCII, which is uneconomical for storage and transfer.

Hence the UTF-8 encoding, which turns Unicode into a “variable-length encoding”. UTF-8 encodes a Unicode character into 1 to 4 bytes, depending on its numeric value: common English letters take 1 byte, Chinese characters usually take 3 bytes, and only rare characters need 4 bytes. If the text you’re transmitting contains mostly English characters, UTF-8 saves space:

character | ASCII    | Unicode           | UTF-8
A         | 01000001 | 00000000 01000001 | 01000001
中        | –        | 01001110 00101101 | 11100100 10111000 10101101

As you can see from the table above, an added benefit of UTF-8 is that ASCII can be regarded as a subset of UTF-8, so a lot of legacy software that only supports ASCII can continue to work under UTF-8.
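A minimal Python sketch confirms both rows of the table: the UTF-8 encoding of ‘A’ is the same single byte as its ASCII encoding, while 中 takes three bytes.

    # UTF-8 byte sequences match the table above.
    print('A'.encode('utf-8'))        # b'A' -> 1 byte, identical to ASCII
    print('中'.encode('utf-8'))       # b'\xe4\xb8\xad' -> 3 bytes
    print(len('中'.encode('utf-8')))  # 3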

Strictly speaking, Unicode, like ASCII, is a character set: it assigns a number (a code point) to every character. UTF-8, UTF-16 and the like are encoding forms that define how those numbers are stored as bytes, saving space and improving performance in storage and transmission.
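The distinction shows up clearly in Python (a minimal sketch): one code point, several byte representations.

    # One code point, different byte representations.
    s = '中'
    print(hex(ord(s)))            # 0x4e2d  (the Unicode code point)
    print(s.encode('utf-8'))      # b'\xe4\xb8\xad'  (3 bytes)
    print(s.encode('utf-16-be'))  # b'N-', i.e. the two bytes 0x4E 0x2D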

How character encoding commonly works in computer systems:

In computer memory, text is handled as Unicode; when it needs to be saved to hard disk or transmitted, it is converted to UTF-8.
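In Python terms (a minimal sketch), str is the in-memory Unicode form and bytes is the UTF-8 form used for storage and transmission:

    # str in memory (Unicode) <-> bytes on disk or on the wire (UTF-8).
    s = 'Hello, 中文'
    b = s.encode('utf-8')    # to bytes before saving or sending
    s2 = b.decode('utf-8')   # back to str after reading or receiving
    assert s == s2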

When you edit a file in Notepad, the UTF-8 bytes read from the file are converted to Unicode characters in memory; when you save after editing, the Unicode text is converted back to UTF-8 and written to the file.
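The same round trip in Python (a sketch; ‘notes.txt’ is a hypothetical file name): open() with encoding='utf-8' decodes the file into a Unicode str, and writing encodes it back.

    # Read: UTF-8 bytes on disk -> Unicode str in memory.
    with open('notes.txt', 'r', encoding='utf-8') as f:   # hypothetical file
        text = f.read()

    # Edit in memory, then write: Unicode str -> UTF-8 bytes on disk.
    text = text + '\nedited'
    with open('notes.txt', 'w', encoding='utf-8') as f:
        f.write(text)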


When you browse a web page, the server converts the dynamically generated Unicode content into UTF-8 before transmitting it to the browser.
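Roughly, in Python (a sketch with a made-up page body), the page is built as a Unicode str and encoded to UTF-8 bytes just before it goes on the wire:

    # The page is a Unicode str; only its UTF-8 bytes travel to the browser.
    body = '<html><body>你好, world</body></html>'   # made-up page content
    payload = body.encode('utf-8')                   # bytes sent over the network
    print('Content-Type: text/html; charset=UTF-8')
    print(len(payload), 'bytes on the wire')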


That is why the source of many web pages contains a line like <meta charset="UTF-8" />, indicating that the page is encoded in UTF-8.