Character encoding

I had tried to understand character encoding before, but it never became clear. Today I read the chapter on strings and encoding in Liao Xuefeng's Python tutorial and finally felt that I understood a lot more. I then went back and reread the material I had looked at earlier, put the two together, and found the small piece I had missed that had kept me from understanding. I am writing it down here to make it stick.

We know that computers store and transmit everything as 0s and 1s. They rely on a variety of rules for sending, receiving, encoding and decoding, so that after storage or transmission the content can still be recognized for what it is. Character encodings and protocols are essentially these rules.

Let's start with bits and bytes. A bit stores a binary value, either 0 or 1. A byte is 8 bits, so it can hold 2^8 = 256 different values (00000000 to 11111111), that is 0 to 255, or 0x00 to 0xFF.
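The range of a single byte can be checked directly in Python. This is only a small illustration, and the variable names are my own.

```python
# One byte is 8 bits, so it can take 2**8 distinct values.
BITS_PER_BYTE = 8
value_count = 2 ** BITS_PER_BYTE        # 256
largest = value_count - 1               # 255

# Show the smallest and largest byte values in binary and hex.
smallest_bits = format(0, "08b")        # '00000000'
largest_bits = format(largest, "08b")   # '11111111'
largest_hex = format(largest, "#04x")   # '0xff'

print(value_count, smallest_bits, largest_bits, largest_hex)

# A bytes object enforces this range: bytes([256]) raises ValueError.
```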

Computers originally had none of the characters we see today, only zeros and ones. So a correspondence was defined in which one byte represents one character, for example 0100 0001 for the character 'A'. Whenever the byte 0100 0001 has to be displayed, the shape A is drawn on the screen, and the desired effect is achieved. When the original rules were made, the Americans only considered upper- and lower-case English letters, digits and punctuation marks, which is why the ASCII code contains only these printable characters plus some control characters. ASCII uses only 128 values (7 bits), so the highest bit of the byte is always 0.
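The mapping between the byte 0100 0001 and the letter A can be seen with Python's built-in ord and chr. A minimal sketch, with variable names chosen just for illustration:

```python
# The ASCII table maps the number 65 (binary 0100 0001) to the letter 'A'.
code = ord("A")                  # character -> number
bits = format(code, "08b")       # '01000001'
letter = chr(0b01000001)         # number -> character

# ASCII defines 128 code points (0..127), so the top bit of each byte is 0.
all_ascii = bytes(range(128)).decode("ascii")
top_bit_clear = all((b & 0b10000000) == 0 for b in all_ascii.encode("ascii"))

print(code, bits, letter, len(all_ascii), top_bit_clear)
```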

However, as computers spread beyond the United States, many more characters needed to be represented, such as Japanese and Chinese characters, so countries developed encodings for their own languages. For Chinese, the most common are GB2312 and GBK. Because there are so many Chinese characters, a single byte cannot express them, so these encodings use 2 bytes per Chinese character. GB2312 itself is compatible with ASCII: the range 00-7F has the same content as ASCII.
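Python ships a "gb2312" codec, so the two-byte claim and the ASCII compatibility can be checked with a short sketch. The sample character is my own choice.

```python
text = "晴"  # a common Chinese character, chosen here as a sample

gb_bytes = text.encode("gb2312")      # a Chinese character takes 2 bytes
ascii_in_gb = "A".encode("gb2312")    # ASCII characters keep their 1-byte value

# Both bytes of a GB2312 Chinese character have the high bit set,
# which keeps them out of the 0x00-0x7F range reserved for ASCII.
high_bits_set = all(b >= 0x80 for b in gb_bytes)

print(len(gb_bytes), ascii_in_gb, high_bits_set)
```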

When different languages use different encodings, conflicts and incompatibilities appear. Suppose I finish a file and store it on disk in a Chinese encoding. If the file is then moved to another computer and read with a Japanese encoding, the meaning is lost: what shows up is a string of characters that means nothing even to a Japanese reader, and nobody can tell what I originally wrote. This is where Unicode comes in. Its goal is to represent every character of every language in the world. In its simplest fixed-width form (UCS-4, today's UTF-32), each character takes 4 bytes. UCS-4 was originally designed with room for about 2^31 values, roughly 2.1 billion; the Unicode standard now restricts code points to the range U+0000 to U+10FFFF, about 1.1 million, which is still enough for the whole world. For example, the character 晴 (qing) is \u6674, so in this 4-byte form it is stored as 00000000 00000000 01100110 01110100.
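Both points in this paragraph can be reproduced in a few lines: the 4-byte form of \u6674, and the garbled text that appears when bytes are read with the wrong encoding. This is a sketch under my own assumptions: "utf-32-be" stands in for the 4-byte form, and GBK and Shift_JIS stand in for "a Chinese encoding" and "a Japanese encoding".

```python
text = "晴"

# The Unicode code point is just a number; for this character it is 0x6674.
code_point = ord(text)

# UTF-32 (big-endian, no byte order mark) stores that number in 4 bytes.
fixed_width = text.encode("utf-32-be")
bits = " ".join(format(b, "08b") for b in fixed_width)
print(hex(code_point), bits)

# Garbled text: bytes written with a Chinese encoding, read with a Japanese one.
# errors="replace" keeps the sketch from raising on invalid byte sequences.
chinese_bytes = "中文".encode("gbk")
misread = chinese_bytes.decode("shift_jis", errors="replace")
print(misread)
```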

Unicode solves the conflict, but the fixed 4-byte form uses too many bytes. The first 128 Unicode code points are the same as ASCII, so text that is only English would spend 4 bytes on every character that ASCII stores in 1, a huge waste of storage space and network bandwidth. A naive variable-length scheme has a different problem: given a run of 4 bytes, there is no way to tell whether it represents one character or four. Hence UTF-8, a variable-length implementation of Unicode in which every byte shows whether it starts a character and how long that character is.
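The waste for English-only text is easy to measure. A small sketch, again using "utf-32-be" as a stand-in for the fixed 4-byte form:

```python
english = "hello"

fixed = english.encode("utf-32-be")    # 4 bytes for every character
variable = english.encode("utf-8")     # 1 byte for each ASCII character

# For pure ASCII text, the UTF-8 bytes are identical to the ASCII bytes.
same_as_ascii = variable == english.encode("ascii")

print(len(fixed), len(variable), same_as_ascii)
```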

There are two rules for UTF-8:

Rule 1: For single-byte characters, the first bit of the byte is 0 and the remaining 7 bits are the Unicode code of the symbol, so UTF-8 is identical to ASCII for English letters.

Rule 2: For characters that need n bytes (n>1), set the first n bits of the first byte to 1 and bit n+1 to 0, and set the first two bits of each following byte to 10. All the remaining bits are filled with the Unicode code of the symbol.
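The two rules can be sketched as a hand-written encoder and compared with Python's built-in str.encode. The function name utf8_encode is hypothetical, and the sketch assumes a valid code point (it does not reject surrogates or values above U+10FFFF).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand, following the two rules."""
    if code_point < 0x80:
        # Rule 1: one byte, leading bit 0, remaining 7 bits hold the code.
        return bytes([code_point])

    # Rule 2: choose the byte count n from the size of the code point.
    if code_point < 0x800:
        n = 2
    elif code_point < 0x10000:
        n = 3
    else:
        n = 4

    # Following bytes: '10' plus 6 payload bits, taken from the low end.
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    tail.reverse()

    # First byte: n ones, then a zero, then the remaining payload bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    return bytes([prefix | code_point] + tail)


samples = ["A", "é", "晴", "😀"]  # 1, 2, 3 and 4 bytes in UTF-8
results = {ch: utf8_encode(ord(ch)) for ch in samples}
for ch, encoded in results.items():
    print(ch, encoded.hex(), encoded == ch.encode("utf-8"))
```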

Thus, the character above (\u6674) falls in the three-byte range, so its UTF-8 encoding is 11100110 10011001 10110100.
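Going the other way makes the layout easier to see: take the three bytes, strip the marker bits, and the original code point comes back. A minimal sketch that only handles this three-byte case:

```python
encoded = "晴".encode("utf-8")
bit_groups = [format(b, "08b") for b in encoded]
print(" ".join(bit_groups))

# Drop the marker bits: '1110' from the first byte, '10' from the other two.
payload = bit_groups[0][4:] + bit_groups[1][2:] + bit_groups[2][2:]
code_point = int(payload, 2)
print(payload, hex(code_point))
```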

Understanding how Unicode, UTF-8 and ASCII relate makes it easier to see how character encoding works in a real computer system. In memory, text is handled as Unicode, and it is converted to another encoding when it is stored or transmitted. By far the most common of these is UTF-8. When we open a blank file and type into it, the text is held as Unicode. When we save it with UTF-8 specified, it is converted to UTF-8 for storage. The next time we open the file, it has to be opened as UTF-8 and not with some incompatible encoding, otherwise the text turns into garbled characters. The same is true for network transmission.
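The save-and-reopen cycle looks roughly like this in Python 3, where str is Unicode text in memory and the encoding argument of open controls the bytes on disk. The scratch file and the choice of "latin-1" as the wrong encoding are assumptions made for the sketch.

```python
import os
import tempfile

text = "晴 sunny"  # in memory this is a str, i.e. Unicode text

# Create a scratch file path (hypothetical file, removed at the end).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    # Saving: the str is converted to UTF-8 bytes on the way to disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Opening with the same encoding gives back the original text.
    with open(path, "r", encoding="utf-8") as f:
        same = f.read()

    # Opening with a different encoding produces garbled characters.
    with open(path, "r", encoding="latin-1") as f:
        garbled = f.read()
finally:
    os.remove(path)

print(same == text, garbled == text)
```

The same conversion happens without a file through text.encode("utf-8") and the matching bytes.decode("utf-8"), which is what a program does before sending text over the network and after receiving it.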

Most of the content above comes from an article I read earlier (unfortunately, I can no longer find the original). After reading it I had a rough feel for the topic, but I could not find a better article, so I gave up for a while, until I came across this section of Liao Xuefeng's Python tutorial. The point I had not understood was how the theory connects to actual practice, and Teacher Liao resolved exactly that confusion. My sincere thanks to him.
