Encoding

First of all, what is an encoding?

Since computers only understand zeros and ones, a uniform set of rules is needed to represent characters, implementing mappings such as 0100 0001 -> A. Such a set of rules is called an encoding.

At the same time, as computers spread, the number of characters that needed to be represented kept growing, and so did the number of encodings.

ASCII

The most basic encoding, defined in the United States, uses 1 byte (8 bits) per character to cover all the characters its designers needed.

Since English has only 26 letters, 256 (2^8) possible values are more than enough to cover every required character.

In fact, ASCII uses only the last seven bits (the first bit is always 0, giving the form 0xxx xxxx) and defines 128 characters. Using A as an example:

A -> 65 -> 0100 0001
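
A couple of lines of Python (used here purely as an illustration; any language with an ord-style lookup would do) confirm this mapping:

```python
# ASCII maps each character to a number in the range 0..127.
print(ord('A'))                  # 65 -- the value assigned to 'A'
print(format(ord('A'), '08b'))   # '01000001', i.e. 0100 0001
print('A'.encode('ascii'))       # the single byte b'A' (0x41)
```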

Unicode

As other countries began defining encodings for their own characters, it became clear that one byte was simply not enough:

  1. A native language may require more than 256 characters
  2. The same value can be interpreted differently under different encodings. For example, the 65 above means A in the United States, but the same value may map to another character elsewhere, which makes exchanging text difficult and forces constant conversions

This is where Unicode comes in. Unicode is one super-large dictionary (solving problem 1) in which every character has a unique value (solving problem 2).

But Unicode is only a dictionary; it does not fix how those values are stored as bytes:

A -> 65 -> ?

Unicode specifies only the mapping from each character to a number (its code point); how that number is laid out in bytes is left to concrete encodings.
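
A short sketch in Python makes the distinction concrete: the code point of a character is fixed, but different encodings turn the same code point into different bytes (the euro sign is just an illustrative choice):

```python
ch = '€'                       # illustrative choice: the euro sign
print(ord(ch))                 # 8364 -- the Unicode code point, always the same
print(hex(ord(ch)))            # 0x20ac
# The same code point becomes different bytes under different encodings:
print(ch.encode('utf-8'))      # 3 bytes: 0xe2 0x82 0xac
print(ch.encode('utf-16-le'))  # 2 bytes: 0xac 0x20
```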

UTF-8

  • UTF-8 is one implementation of Unicode
  • It is backward compatible with the existing ASCII encoding
  • It uses 1 to 4 bytes per character (see the sketch below)
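
For example, in Python the same string method reveals all four lengths (the sample characters are an arbitrary choice):

```python
# One character from each length class, chosen purely as examples.
for ch in ('A', 'é', '€', '😀'):
    data = ch.encode('utf-8')
    print(ch, hex(ord(ch)), len(data), data.hex(' '))
# A  0x41     1  41
# é  0xe9     2  c3 a9
# €  0x20ac   3  e2 82 ac
# 😀 0x1f600  4  f0 9f 98 80
```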

Encoding rules:

  1. When a character needs only 1 byte, the first bit is 0 and the remaining 7 bits hold the character's Unicode code point. This is identical to ASCII encoding, so UTF-8 is fully backward compatible with ASCII

  2. When a character needs N bytes (N > 1), the first N bits of the first byte are 1, bit N+1 is 0, and the first two bits of each of the remaining N-1 bytes are 10. All the remaining bits together hold the character's Unicode code point

Rule 2 is a bit more complicated, but it’s easier to understand if you look at the table:

Unicode code point range (hex)    UTF-8 binary
0000 0000 ~ 0000 007F             0xxxxxxx
0000 0080 ~ 0000 07FF             110xxxxx 10xxxxxx
0000 0800 ~ 0000 FFFF             1110xxxx 10xxxxxx 10xxxxxx
0001 0000 ~ 0010 FFFF             11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
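
The table translates almost line by line into code. Below is a minimal sketch of such an encoder written only from the rules above (the function name utf8_encode is my own; in practice you would simply call str.encode('utf-8')):

```python
def utf8_encode(ch: str) -> bytes:
    """Encode a single character to UTF-8, following the table above.

    A sketch only: real encoders also reject surrogates and values
    above 0x10FFFF, which is omitted here.
    """
    cp = ord(ch)                         # the Unicode code point
    if cp <= 0x7F:                       # 1 byte:  0xxxxxxx (same as ASCII)
        return bytes([cp])
    if cp <= 0x7FF:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 4 bytes: 11110xxx 10xxxxxx ...
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Check against the built-in implementation:
for ch in ('A', '€', '😀'):
    assert utf8_encode(ch) == ch.encode('utf-8')
    print(ch, utf8_encode(ch).hex(' '))
```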

Conclusion

  • ASCII: the basic encoding; 1 byte per character, of which only 128 values are actually defined
  • Unicode: a mapping rule that assigns a number to every character, but does not specify how those numbers are stored
  • UTF-8: a concrete implementation of Unicode; 1 to 4 bytes per character, backward compatible with ASCII
