Because computers can only process numbers, text must first be converted to numbers before it can be processed. The earliest computers were designed with eight bits to a byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255). To represent larger integers, more bytes must be used. For example, the maximum integer representable in two bytes is 65,535, and in four bytes it is 4,294,967,295.
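These maximums follow directly from the number of bits: an n-byte unsigned integer can hold values up to 2^(8n) − 1. A quick Python check (just an illustrative sketch):

```python
# Maximum unsigned integer that fits in 1, 2, and 4 bytes: 2**(8*n) - 1
for n_bytes in (1, 2, 4):
    print(n_bytes, "byte(s):", 2 ** (8 * n_bytes) - 1)

# Output:
# 1 byte(s): 255
# 2 byte(s): 65535
# 4 byte(s): 4294967295
```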

Since the computer was invented in the United States, only 128 characters (codes 0 through 127) were originally encoded: the upper- and lower-case English letters, digits, and some symbols. This code table is called ASCII; for example, uppercase A is 65 and lowercase z is 122.
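In Python, for example, the built-in ord() and chr() functions convert between a character and its code point, which makes these ASCII values easy to verify:

```python
# ord() returns a character's code point; chr() does the reverse
print(ord('A'))   # 65
print(ord('z'))   # 122
print(chr(65))    # A
```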

But one byte is obviously not enough to handle Chinese: at least two bytes are needed, and the encoding must not conflict with ASCII. So China developed the GB2312 encoding for Chinese characters.
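As an illustration (assuming Python's built-in gb2312 codec), encoding a Chinese character with GB2312 yields two bytes, while ASCII characters keep their one-byte values:

```python
# GB2312 uses two bytes per Chinese character; ASCII bytes are unchanged
print('中'.encode('gb2312'))   # b'\xd6\xd0' -- two bytes
print('A'.encode('gb2312'))    # b'A'        -- one byte, same as ASCII
```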

As you can imagine, there are hundreds of languages around the world. Japan encoded Japanese in Shift_JIS, Korea encoded Korean in EUC-KR, and so on: each country had its own standard, conflicts were inevitable, and mixed-language text came out garbled.

Hence Unicode. Unicode unifies all languages into a single character set, so the garbling problem disappears.

The Unicode standard is still evolving, but the most common form uses two bytes per character (four bytes for very rare characters). Modern operating systems and most programming languages support Unicode directly.
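In Python 3, for instance, the str type is Unicode-based, so English and Chinese characters are handled uniformly (a brief sketch):

```python
# Python 3 strings are Unicode: each element is a character, not a byte
s = 'ABC中文'
print(len(s))   # 5 -- five characters, regardless of how many bytes each needs
print(s[3])     # 中
```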

Now, to clarify the difference between ASCII and Unicode: ASCII uses 1 byte per character, while Unicode usually uses 2 bytes.

The ASCII code for the letter A is 65 in decimal and 01000001 in binary;

The ASCII encoding for the character ‘0’ is 48 in decimal and 00110000 in binary. Note that the character ‘0’ is different from the integer 0;

The Chinese character 中 is beyond the range of ASCII; its Unicode encoding is 20013 in decimal and 01001110 00101101 in binary.

As you can guess, to convert ASCII A to Unicode you only need to pad it with zeros in front, so the Unicode encoding of A is 00000000 01000001.
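These three examples are easy to check in Python, since ord() returns the Unicode code point, which for ASCII characters is the same as the ASCII code:

```python
# Check the code points quoted above, in decimal and 16-bit binary
for ch in ('A', '0', '中'):
    print(ch, ord(ch), format(ord(ch), '016b'))

# Output:
# A 65 0000000001000001
# 0 48 0000000000110000
# 中 20013 0100111000101101
```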

A new problem arises: with Unicode the garbling problem disappears, but if your text is mostly English, Unicode requires twice as much storage space as ASCII, which is wasteful for storage and transmission.
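To see the overhead concretely, we can compare plain ASCII against a two-byte Unicode representation for an English string. The sketch below approximates the two-byte form with Python's utf-16-le codec (chosen here because it has no byte-order mark); it is only an illustration:

```python
text = 'Hello, world'                 # plain English text
print(len(text.encode('ascii')))      # 12 bytes in ASCII
print(len(text.encode('utf-16-le')))  # 24 bytes in a two-byte Unicode encoding
```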

So, in the spirit of saving space, the UTF-8 encoding turns Unicode into a “variable-length encoding.” UTF-8 encodes a Unicode character into 1 to 6 bytes depending on its numeric value: common English letters take 1 byte, Chinese characters usually take 3 bytes, and only very rare characters take 4 to 6 bytes. If the text you are transmitting contains a lot of English characters, UTF-8 can save space:

character | ASCII    | Unicode           | UTF-8
A         | 01000001 | 00000000 01000001 | 01000001
中        | n/a      | 01001110 00101101 | 11100100 10111000 10101101
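The UTF-8 column can be reproduced in Python by encoding the same characters and printing each byte in binary (a small sketch):

```python
# Reproduce the UTF-8 column: 'A' stays 1 byte, '中' becomes 3 bytes
for ch in ('A', '中'):
    encoded = ch.encode('utf-8')
    print(ch, len(encoded), 'byte(s):', ' '.join(format(b, '08b') for b in encoded))

# Output:
# A 1 byte(s): 01000001
# 中 3 byte(s): 11100100 10111000 10101101
```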

As the table also shows, UTF-8 has the added benefit that ASCII can be regarded as a subset of UTF-8, so a lot of legacy software that only supports ASCII continues to work under UTF-8.
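This compatibility is easy to demonstrate: bytes written as ASCII decode unchanged under UTF-8 (an illustrative sketch):

```python
# Pure ASCII bytes are already valid UTF-8
data = 'Hello'.encode('ascii')
print(data)                  # b'Hello'
print(data.decode('utf-8'))  # Hello -- decodes identically
```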

Knowing the relationship between ASCII, Unicode, and UTF-8, we can summarize the way character encodings commonly used in today’s computer systems work:

In memory, the computer works with Unicode; when text needs to be saved to disk or transmitted, it is converted to UTF-8.
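In Python terms, this is the str/bytes boundary: str holds Unicode text in memory, and encode()/decode() convert to and from UTF-8 bytes for storage or transmission:

```python
# In memory: Unicode text (str). On disk / on the wire: UTF-8 bytes.
text = '中文ABC'
data = text.encode('utf-8')     # str -> bytes for saving or sending
print(data)                     # b'\xe4\xb8\xad\xe6\x96\x87ABC'
print(data.decode('utf-8'))     # bytes -> str when reading back: 中文ABC
```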

When you edit with Notepad, the UTF-8 text read from the file is converted to Unicode in memory. When you finish editing and save, the Unicode text is converted back to UTF-8 and written to the file.
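The same round trip looks like this in Python, where open() accepts an encoding parameter (the file name here is just a placeholder):

```python
# Write Unicode text to disk as UTF-8, then read it back into memory
with open('notes.txt', 'w', encoding='utf-8') as f:
    f.write('中文ABC')    # str in memory -> UTF-8 bytes on disk

with open('notes.txt', 'r', encoding='utf-8') as f:
    print(f.read())       # UTF-8 bytes on disk -> str in memory: 中文ABC
```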



www.liaoxuefeng.com/wiki/101695…