1. Why encode?

Because a computer can only handle 1s and 0s (that is, two kinds of state: high and low voltage levels), English letters, numbers, and special symbols all have to be translated into 0s and 1s. For the computer to know how to perform this translation and by what rules, clever people invented a series of encoding rules that map characters to numbers. ASCII was invented first, and Unicode and UTF-8 evolved later.

2. The evolution of encoding formats

The world’s first computer was built at the University of Pennsylvania, so Americans were the earliest users of computers, and the earliest information interchange code was also created in the United States: ASCII (American Standard Code for Information Interchange). ASCII is essentially a mapping between numbers and characters. For example, the uppercase letter “A” corresponds to the decimal number 65, which is 01000001 in binary. A computer cannot store characters directly, but it can store 0s and 1s, so the letter “A” is actually stored in the computer as 01000001, taking up 8 bits, that is, 1 byte. The same goes for every other character: each corresponds to a decimal number, as shown in the standard ASCII table.

Simple enough, so why did people go on to create Unicode? Because ASCII is an American standard, the characters it covers only include a-z, A-Z, the digits 0-9, control characters, and some special symbols, 128 characters in total (codes 0 through 127). Later, as computers became widespread, these 128 symbols could no longer meet people’s needs, so IBM used codes 128 through 255 to supplement ASCII with additional symbols, Greek letters, and box-drawing characters; this part is called extended ASCII.

There are hundreds of languages in the world, and standard ASCII plus extended ASCII obviously still cannot meet the encoding needs of different countries. For example, the Chinese character “汉” (han) cannot be represented in ASCII at all. So the Chinese encoding GB2312 and the Japanese encoding Shift_JIS were created, but it is not practical for the same application to ship a different character set for users in different regions, and thus Unicode was born. Unicode usually uses 2 bytes per character, with some rarer characters using up to 4 bytes, so that a single encoding holds all characters and different countries and regions share one unified format. For example, the ASCII code of “A” is 01000001; to represent “A” in Unicode you simply pad zeros in front: 00000000 01000001. Likewise, “汉” can be represented in Unicode as 01101100 01001001.

It is not hard to see that if Unicode were used to encode all characters, the problem of garbled text would be solved. The catch is that if a text contains both English letters and Chinese characters, the English letters would also take 2 bytes (16 bits) each, which clearly wastes storage space. Is there an encoding that is both general and storage-saving? Of course: clever people invented UTF-8. UTF-8 is a variable-length encoding. Why is it called UTF-8, and what does the 8 stand for? The 8 means 8 bits, that is, one byte, but it does not mean that UTF-8 always represents a character with a single byte; rather, in UTF-8 one byte is the smallest unit a character can occupy. Put plainly, under UTF-8 different characters take up different amounts of space: a character may be 1 byte, 2 bytes, or 3 bytes (and for some characters even 4).
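To make the numbers above concrete, here is a minimal sketch (in Python 3, chosen here purely for illustration) that prints the ASCII value of “A”, the Unicode code point of “汉”, and how many bytes each character occupies once encoded as UTF-8.

```python
# ASCII: the letter "A" maps to decimal 65, i.e. 01000001 in binary.
print(ord("A"))                   # 65
print(format(ord("A"), "08b"))    # 01000001

# Unicode: the character "汉" has code point U+6C49,
# i.e. 01101100 01001001 when written as 2 bytes.
print(hex(ord("汉")))             # 0x6c49
print(format(ord("汉"), "016b"))  # 0110110001001001

# UTF-8 is variable length: "A" still takes 1 byte,
# while "汉" takes 3 bytes once encoded.
print(len("A".encode("utf-8")))   # 1
print(len("汉".encode("utf-8")))  # 3
print("汉".encode("utf-8"))       # b'\xe6\xb1\x89'
```

Notice that “A” still fits in a single byte under UTF-8, while “汉” expands to 3 bytes; that variable width is exactly the compromise between universality and storage savings described above.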

— — — — — to be updated