Unicode

Unicode is an industry standard in computing that covers a character set, encoding schemes, and related specifications. Unicode was created to overcome the limitations of traditional character encoding schemes: it assigns a unified, unique code to every character in every language, so that text can be converted and processed across languages and platforms. Development began in 1990, and the standard was officially announced in 1994.

The origin of Unicode

Because computers can only process numbers, text must first be converted to numbers before it can be processed. The earliest computers were designed with eight bits per byte, so the largest integer one byte can represent is 255 (binary 11111111 = decimal 255). The values 0-127 were used to represent upper- and lower-case letters, digits, and some symbols; this code table is known as ASCII. For example, the code for the capital letter A is 65, and the code for the lowercase letter z is 122.
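
These mappings can be inspected directly in Python with ord() and chr() (a minimal sketch; the characters are just examples):

    # ord() returns the numeric code of a character; chr() is the inverse
    print(ord('A'))   # 65
    print(ord('z'))   # 122
    print(chr(65))    # 'A'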

If you want to represent Chinese, one byte is obviously not enough; at least two bytes are needed, and the encoding must not conflict with ASCII. For this reason, China developed the GB2312 encoding for Chinese characters.

Other languages such as Japanese and Korean face the same problem. Unicode was created to unify the encoding of all characters: it consolidates all languages into a single code table, so garbled text is no longer a problem.

Unicode usually uses two bytes to represent a character, and the original English encoding is changed from a single byte to two bytes by filling the high byte with zeros.
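
This zero-filled high byte is visible when encoding with a two-byte format such as UTF-16 big-endian (a minimal Python sketch):

    # 'A' becomes two bytes in UTF-16-BE: a zero high byte plus the ASCII value
    print('A'.encode('utf-16-be'))   # b'\x00A'
    print('A'.encode('ascii'))       # b'A'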

Because Python was born before the Unicode standard was published, the original Python only supported ASCII encoding; an ordinary string 'ABC' was ASCII-encoded inside Python.

Unicode was developed to overcome the limitations of traditional character encoding schemes. For example, the encodings defined by ISO 8859 are widely used in different countries, but they are often incompatible with each other. A common problem with many traditional encodings is that they allow computers to handle bilingual environments (usually Latin letters plus the local language) but not multilingual environments (many languages mixed in the same text).

Unicode includes characters with different written forms, such as “ɑ/a” and the variants of the character for “household”, “戶/户/戸”. For Chinese characters, however, this has caused disputes over the identification of variant forms.

In word processing, Unicode defines a unique code (that is, an integer) for each character rather than for each glyph. In other words, Unicode handles characters abstractly (that is, numerically) and leaves the visual interpretation (size, shape, typeface, style, and so on) to other software, such as web browsers or word processors.

The same binary data can be interpreted as different symbols. Therefore, to open a text file correctly, you must know how it was encoded; if it is interpreted with the wrong encoding, the result is garbled text. Why do e-mails often contain garbled characters? Because the sender and the receiver used different encodings.
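
The effect is easy to reproduce by decoding bytes with the wrong codec (a minimal Python sketch):

    data = '中文'.encode('gbk')        # bytes produced with one encoding
    print(data.decode('gbk'))          # 中文  (correct codec)
    print(data.decode('latin-1'))      # ÖÐÎÄ  (wrong codec: garbled text)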

Imagine a code that included every symbol in the world, with each symbol given a unique code: the garbling problem would disappear. This is Unicode, which, as its name suggests, is a code for all symbols.

Unicode is, of course, a large set, with room for more than a million symbols. For example, U+0639 stands for the Arabic letter Ain, U+0041 for the English capital letter A, and U+4E25 for the Chinese character 严 (yán). For the full symbol mapping tables, see unicode.org or a dedicated Chinese character correspondence table.
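
These code points can be checked directly (a minimal Python sketch):

    print(chr(0x0639))       # ع  (Arabic letter Ain)
    print(chr(0x0041))       # A
    print(chr(0x4E25))       # 严
    print(hex(ord('严')))    # 0x4e25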

Role

Unicode enables computers to convert and process text across languages and platforms.

Levels

The Unicode encoding system can be divided into two levels: the encoding scheme (which code point each character is assigned) and the implementation scheme (how those code points are actually stored, such as UTF-8 or UTF-16).
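
For example, one code point can be stored under several implementation schemes (a minimal Python sketch):

    s = '严'                      # Unicode code point U+4E25
    print(s.encode('utf-8'))      # b'\xe4\xb8\xa5'
    print(s.encode('utf-16-be'))  # b'N%'          (bytes 4E 25)
    print(s.encode('utf-32-be'))  # b'\x00\x00N%'  (bytes 00 00 4E 25)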

2. Non-ASCII encoding

128 symbols are enough to encode English, but not enough to represent other languages. In French, for example, letters with diacritical marks cannot be represented in ASCII. So some European countries decided to use the unused highest bit of the byte to encode new symbols. For example, the code for é in French is 130 (binary 10000010). As a result, the encoding systems used in these European countries could represent up to 256 symbols.

But this brings a new problem. Different countries use different letters, so even though they all use 256-symbol codes, the same value does not stand for the same letter. For example, 130 stands for é in French encodings, for the letter Gimel (ג) in Hebrew encodings, and for yet another symbol in Russian encodings. In all of these encodings, however, the symbols from 0 to 127 are the same; only 128 to 255 differ.
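
This can be reproduced by decoding the same byte under different legacy code pages (a minimal Python sketch; the DOS code pages cp850, cp862, and cp866 are used here as illustrations):

    b = bytes([130])            # the single byte 0x82
    print(b.decode('cp850'))    # é   (Western European code page)
    print(b.decode('cp862'))    # ג   (Hebrew code page: Gimel)
    print(b.decode('cp866'))    # В   (Russian code page)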

As for Asian scripts, there are even more symbols; Chinese alone has about 100,000 characters. A single byte, which can represent only 256 symbols, is clearly not enough, so more than one byte must be used per symbol. For example, the common encoding for simplified Chinese is GB2312, which uses two bytes per Chinese character and can therefore represent at most 256 x 256 = 65,536 symbols in theory.
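
For instance (a minimal Python sketch; Python ships a gb2312 codec):

    s = '中文'
    data = s.encode('gb2312')
    print(data)        # b'\xd6\xd0\xce\xc4'
    print(len(data))   # 4 bytes: two bytes per character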

The problem of Chinese encodings deserves its own article and is not covered in this note. It is only pointed out here that, although a symbol is represented by multiple bytes, the GB family of Chinese character encodings has nothing to do with the Unicode and UTF-8 discussed below.

4. The problem of Unicode

It is important to note that Unicode is just a set of symbols: it only specifies the binary code point of each symbol, not how that code point should be stored.

For example, the Unicode code point of the Chinese character 严 is the hexadecimal number 4E25, which is a full 15 bits in binary (100111000100101), meaning that representing this symbol requires at least two bytes. Other, larger symbols might take three or four bytes, or even more.
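
A quick check (a minimal Python sketch):

    cp = ord('严')
    print(hex(cp))           # 0x4e25
    print(bin(cp))           # 0b100111000100101
    print(cp.bit_length())   # 15 bits -> at least two bytes are needed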

There are two serious problems here. The first is: how do you distinguish Unicode from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? The second problem is that we already know one byte is enough for English letters. If Unicode uniformly required three or four bytes for every symbol, each English letter would have to be padded with two or three bytes of zeros, which would be a huge waste of storage: a text file would become two or three times larger, which is unacceptable.
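
The second point is easy to see by comparing a fixed four-byte storage format with ASCII (a minimal Python sketch):

    text = 'Hello'
    print(len(text.encode('ascii')))       # 5 bytes
    print(len(text.encode('utf-32-be')))   # 20 bytes: four bytes per letter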

The result was: 1) the emergence of multiple storage schemes for Unicode, meaning there are many different binary formats that can be used to represent Unicode; 2) Unicode was not widely adopted for a long time, until the advent of the Internet.

5. UTF-8

The popularity of the Internet created a strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (two or four bytes per character) and UTF-32 (four bytes per character), though they are rarely used on the Internet. So, to repeat, the relationship here is:

UTF-8 is an implementation of Unicode.

One of the biggest features of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, with the length varying according to the symbol.

The UTF-8 encoding rules are simple; there are only two:

For single-byte symbols, the first bit of the byte is set to 0 and the remaining 7 bits hold the Unicode code of the symbol. So for English letters, UTF-8 encoding is identical to ASCII.

For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n + 1 is set to 0, and the first two bits of each of the following bytes are set to 10. The remaining bits, not mentioned here, are filled with the Unicode code of the symbol.

The following table summarizes the encoding rules, with the letter X representing the bits available for the code.

    Unicode code point range (hex)  |  UTF-8 encoding (binary)
    --------------------------------+--------------------------------------
    0000 0000 - 0000 007F           |  0XXXXXXX
    0000 0080 - 0000 07FF           |  110XXXXX 10XXXXXX
    0000 0800 - 0000 FFFF           |  1110XXXX 10XXXXXX 10XXXXXX
    0001 0000 - 0010 FFFF           |  11110XXX 10XXXXXX 10XXXXXX 10XXXXXX

According to the table above, interpreting a UTF-8 encoding is very simple. If the first bit of a byte is 0, the byte on its own is a single character. If the first bit is 1, the number of consecutive leading 1s indicates how many bytes the current character occupies.
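
That rule can be written directly in code (a minimal Python sketch; the helper name is just illustrative):

    def utf8_char_length(first_byte: int) -> int:
        # Count the leading 1 bits of the first byte to get the sequence length.
        if first_byte < 0b10000000:
            return 1                          # 0xxxxxxx: a single ASCII byte
        n = 0
        while first_byte & (0b10000000 >> n):
            n += 1
        return n                              # e.g. 1110xxxx -> 3 bytes

    print(utf8_char_length('严'.encode('utf-8')[0]))   # 3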

The following uses the Chinese character 严 as an example to demonstrate how its UTF-8 encoding is derived.

The Unicode code point of 严 is 4E25 (100111000100101). According to the table above, 4E25 falls in the range of the third line (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 requires three bytes; that is, the format is 1110XXXX 10XXXXXX 10XXXXXX. Then, starting from the last binary digit of 4E25, the Xs in the format are filled in from back to front, and any remaining Xs are filled with zeros. The result is that the UTF-8 encoding of 严 is 11100100 10111000 10100101, which in hexadecimal is E4B8A5.
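
The same steps can be written out and checked against Python's built-in codec (a minimal sketch; the function name is just illustrative and only handles the three-byte case):

    def utf8_encode_3byte(code_point: int) -> bytes:
        # Apply the 1110xxxx 10xxxxxx 10xxxxxx pattern for U+0800..U+FFFF.
        assert 0x0800 <= code_point <= 0xFFFF
        b1 = 0b11100000 | (code_point >> 12)           # top 4 bits
        b2 = 0b10000000 | ((code_point >> 6) & 0x3F)   # middle 6 bits
        b3 = 0b10000000 | (code_point & 0x3F)          # low 6 bits
        return bytes([b1, b2, b3])

    print(utf8_encode_3byte(0x4E25).hex())   # e4b8a5
    print('严'.encode('utf-8').hex())        # e4b8a5 (matches)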

Supplement

Now, to clarify the differences between ASCII and Unicode encodings: ASCII encodings are 1 byte, while Unicode encodings are usually 2 bytes.

The ASCII code for the letter A is 65 in decimal and 01000001 in binary;

The ASCII encoding for character 0 is 48 in decimal and 00110000 in binary. Note that the character ‘0’ is different from the integer 0;

The Chinese character 中 is beyond the range of ASCII; its Unicode encoding is 20013 in decimal and 01001110 00101101 in binary.

As you can guess, if you encode ASCII A in Unicode, you only need to prefix it with zeros, so the Unicode encoding for A is 00000000 01000001.

A new problem arises: if Unicode is adopted, the garbling problem disappears. However, if your text is written mostly in English, Unicode requires twice as much storage space as ASCII, making it uneconomical for storage and transfer.

So, in the spirit of frugality, UTF-8 turns Unicode into a variable-length encoding. UTF-8 encodes a Unicode character into 1 to 4 bytes, depending on its numeric value. Common English letters are encoded into 1 byte, Chinese characters usually into 3 bytes, and only very rare characters into 4 bytes. If the text you are transmitting contains a lot of English characters, UTF-8 saves space:

    Character  |  ASCII     |  Unicode            |  UTF-8
    -----------+------------+---------------------+-----------------------------
    A          |  01000001  |  00000000 01000001  |  01000001
    中         |  (none)    |  01001110 00101101  |  11100100 10111000 10101101

As you can also see from the table above, UTF-8 has the added benefit that ASCII encoding can actually be regarded as a subset of UTF-8, so a lot of legacy software that only supports ASCII can continue to work under UTF-8 encoding.
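
A quick demonstration (a minimal Python sketch):

    text = 'Hello'
    print(text.encode('ascii') == text.encode('utf-8'))   # True: identical bytes
    print(b'Hello'.decode('utf-8'))                       # plain ASCII bytes are valid UTF-8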

The Unicode encoding is used in computer memory, and is converted to UTF-8 encoding when it needs to be saved to hard disk or transferred.

When you edit a file with Notepad, the UTF-8 characters read from the file are converted to Unicode characters in memory. When you save after editing, the Unicode characters are converted back to UTF-8 and written to the file.
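
In Python, this conversion is exactly what the encoding parameter of open() handles (a minimal sketch; the file name is just illustrative):

    # Writing: the in-memory str (Unicode) is encoded to UTF-8 bytes on disk.
    with open('note.txt', 'w', encoding='utf-8') as f:
        f.write('中文 and English')

    # Reading: the UTF-8 bytes are decoded back into a Unicode str in memory.
    with open('note.txt', 'r', encoding='utf-8') as f:
        print(f.read())   # 中文 and English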