Encoding

First of all, what is an encoding?

Since computers only understand zeros and ones, a uniform set of rules is needed to represent characters, implementing mappings such as 0100 0001 -> A. Such a set of rules is called an encoding.

At the same time, as computers spread, the number of characters that needed to be represented kept growing, and so did the number of encodings.

ASCII

The most basic encoding, defined in the United States, uses 1 byte (8 bits) per character to cover all the characters its designers needed.

Since English has only 26 letters, 256 (2^8) possible values are more than enough to cover every required character.

In fact, ASCII uses only the last seven bits (the first bit is always 0, giving the form 0xxx xxxx) and defines 128 characters. Using A as an example:

A -> 65 -> 0100 0001
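
A couple of lines of Python (used here purely as an illustration; any language with an ord-style lookup would do) confirm this mapping:

```python
# ASCII maps each character to a number in the range 0..127.
print(ord('A'))                  # 65 -- the value assigned to 'A'
print(format(ord('A'), '08b'))   # '01000001', i.e. 0100 0001
print('A'.encode('ascii'))       # the single byte b'A' (0x41)
```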

Unicode

As other countries began defining encodings for their own characters, it became clear that one byte was simply not enough:

  1. A native language may require more than 256 characters
  2. The same value can be interpreted differently under different encodings. For example, the 65 above means A in the United States, but the same value may map to another character elsewhere, which makes exchanging text difficult and forces constant conversions

This is where Unicode comes in. Unicode is one super-large dictionary (solving problem 1) in which every character has a unique value (solving problem 2).

But Unicode is only a dictionary; it does not fix how those values are stored as bytes:

A -> 65 -> ?

Unicode specifies only the mapping from each character to a number (its code point); how that number is laid out in bytes is left to concrete encodings.
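
A short sketch in Python makes the distinction concrete: the code point of a character is fixed, but different encodings turn the same code point into different bytes (the euro sign is just an illustrative choice):

```python
ch = '€'                       # illustrative choice: the euro sign
print(ord(ch))                 # 8364 -- the Unicode code point, always the same
print(hex(ord(ch)))            # 0x20ac
# The same code point becomes different bytes under different encodings:
print(ch.encode('utf-8'))      # 3 bytes: 0xe2 0x82 0xac
print(ch.encode('utf-16-le'))  # 2 bytes: 0xac 0x20
```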

UTF-8

  • UTF-8 is one implementation of Unicode
  • It is backward compatible with the existing ASCII encoding
  • It uses 1 to 4 bytes per character (see the sketch below)
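
For example, in Python the same string method reveals all four lengths (the sample characters are an arbitrary choice):

```python
# One character from each length class, chosen purely as examples.
for ch in ('A', 'é', '€', '😀'):
    data = ch.encode('utf-8')
    print(ch, hex(ord(ch)), len(data), data.hex(' '))
# A  0x41     1  41
# é  0xe9     2  c3 a9
# €  0x20ac   3  e2 82 ac
# 😀 0x1f600  4  f0 9f 98 80
```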

Encoding rules:

  1. When a character needs only 1 byte, the first bit is 0 and the remaining 7 bits hold the character's Unicode code point. This is identical to ASCII encoding, so UTF-8 is fully backward compatible with ASCII

  2. When a character needs N bytes (N > 1), the first N bits of the first byte are 1, bit N+1 is 0, and the first two bits of each of the remaining N-1 bytes are 10. All the remaining bits together hold the character's Unicode code point

Rule 2 is a bit more complicated, but it’s easier to understand if you look at the table:

Unicode code point range (hex)    UTF-8 binary
0000 0000 ~ 0000 007F             0xxxxxxx
0000 0080 ~ 0000 07FF             110xxxxx 10xxxxxx
0000 0800 ~ 0000 FFFF             1110xxxx 10xxxxxx 10xxxxxx
0001 0000 ~ 0010 FFFF             11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
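
The table translates almost line by line into code. Below is a minimal sketch of such an encoder written only from the rules above (the function name utf8_encode is my own; in practice you would simply call str.encode('utf-8')):

```python
def utf8_encode(ch: str) -> bytes:
    """Encode a single character to UTF-8, following the table above.

    A sketch only: real encoders also reject surrogates and values
    above 0x10FFFF, which is omitted here.
    """
    cp = ord(ch)                         # the Unicode code point
    if cp <= 0x7F:                       # 1 byte:  0xxxxxxx (same as ASCII)
        return bytes([cp])
    if cp <= 0x7FF:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 4 bytes: 11110xxx 10xxxxxx ...
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Check against the built-in implementation:
for ch in ('A', '€', '😀'):
    assert utf8_encode(ch) == ch.encode('utf-8')
    print(ch, utf8_encode(ch).hex(' '))
```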

Conclusion

  • ASCII: the basic encoding; 1 byte per character, of which only 128 values are actually defined
  • Unicode: a mapping rule that assigns a number to every character, but does not specify how those numbers are stored
  • UTF-8: a concrete implementation of Unicode; 1 to 4 bytes per character, backward compatible with ASCII
