
Computer memory review

In the previous article, Computer Memory Basics, we learned that the smallest unit of computer memory is the bit, which stores only a 0 or a 1.

1 byte = 8 bits. Each bit can be either 0 or 1, so 8 bits give 2^8 = 256 combinations, which means one byte can represent numbers in the range 0 to 255. For example, 00100000 is 32.
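To make these numbers concrete, here is a quick check (Python is used for the examples in this article purely as illustration):

```python
# 8 bits, each either 0 or 1, give 2^8 combinations
print(2 ** 8)              # 256, so one byte holds values 0..255

# int() can parse a binary string: 00100000 is 32
print(int("00100000", 2))  # 32
print(bin(255))            # 0b11111111, the largest one-byte value
```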

ASCII

In the 1960s, the United States developed a character encoding that defines the mapping between English characters and binary numbers. It is called ASCII, and it is still in use today.

ASCII defines a total of 128 characters, with codes ranging from 0 (00000000) to 127 (01111111). For example, space is 32 and A is 65. Characters below 32, plus 127, are control characters; the rest are displayable characters.

Notice that these 128 characters occupy only the last 7 bits of a byte; the first bit is always 0.
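We can verify the ASCII codes and that leading zero bit directly:

```python
# ord() maps a character to its code, chr() goes the other way
print(ord(" "))                 # 32 (space)
print(ord("A"))                 # 65
print(chr(65))                  # A

# ASCII needs only 7 bits, so the first bit of the byte is always 0
print(format(ord("A"), "08b"))  # 01000001
```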

Non-ASCII

These 128 characters are enough for storing English. However, other countries have their own languages, and they also wanted to put their own characters in the free positions 128 to 255, so each country developed its own encoding.

For example, 130 stands for é in French encodings, for Gimel (ג) in Hebrew encodings, and for yet another character in Russian ones. As a result, two such languages cannot be displayed correctly on the same computer, because they are encoded differently. The problem is even worse in Asian countries: Chinese alone has around 100,000 characters, far more than a single byte can hold.

If we only ever use English, or never move strings between computers, this works fine. But once the Internet appeared, strings started traveling from one computer to another; read with a different encoding on the receiving end, they turn into meaningless garbage. Fortunately, Unicode was invented.

Unicode

Unicode is a character set that aims to include every symbol in the world, each with a unique code.

Unicode is, of course, a large set; it can now accommodate more than a million symbols. For example, U+0639 stands for the Arabic letter Ain, U+0041 for the English capital letter A, and U+4E25 for the Chinese character 严. For the full symbol mapping, see unicode.org; there are also dedicated Chinese character mapping tables.
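A small sketch of how these code points look in practice (ord() returns a character's Unicode code point):

```python
# ord() returns the Unicode code point of a character
print(hex(ord("A")))   # 0x41   -> U+0041
print(hex(ord("严")))  # 0x4e25 -> U+4E25

# A code point can also be written directly with a \u escape
print("\u4e25")        # 严
```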

But Unicode is just a set of symbols; it only specifies each symbol's code point. For example, the code for the Chinese character 严 is U+4E25, where U+ indicates a Unicode code point and 4E25 is its hexadecimal value, which takes 15 bits in binary (100111000100101). So how should we store it on a computer? This raises two problems:

  1. How to distinguish Unicode from ASCII, and how to know how many bytes represent a character?
  2. If every character were represented with three or four bytes, then for ASCII text the leading bytes would be all zeros, which would be a huge waste of space

UTF-8

UTF-8 is one implementation of Unicode, and it solves the two problems above. Other implementations include UTF-16 (two or four bytes per character) and UTF-32 (four bytes per character), though they are rarely used on the Internet.
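To see the difference between the three implementations, we can encode the same characters and compare byte counts (the -be variants are used here just to skip the byte-order mark):

```python
ch = "严"  # U+4E25

print(ch.encode("utf-8"))           # b'\xe4\xb8\xa5'
print(len(ch.encode("utf-8")))      # 3 bytes
print(len(ch.encode("utf-16-be")))  # 2 bytes
print(len(ch.encode("utf-32-be")))  # 4 bytes

print(len("A".encode("utf-8")))     # 1 byte -- ASCII stays a single byte in UTF-8
```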

One of UTF-8's biggest features is that it is a variable-length encoding: it uses 1 to 4 bytes per symbol, and the number of bytes varies with the symbol.

Its encoding rules are simple; there are only two:

  1. For a single-byte character, the first bit is set to 0 and the remaining 7 bits hold the character's Unicode code. For English letters, therefore, UTF-8 is identical to ASCII.
  2. For an n-byte character (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of each of the following bytes are set to 10. The remaining bits are filled with the character's Unicode code.

The rules are summarized in the following table, where the x positions are filled with the bits of the character's code point:

Unicode range (hexadecimal) | UTF-8 encoding (binary)
----------------------------+-------------------------------------
0000 0000 - 0000 007F       | 0xxxxxxx
0000 0080 - 0000 07FF       | 110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF       | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF       | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

For example, the code for 严 is 4E25 (100111000100101 in binary), which falls in the third row of the table above, so its encoding format is 1110xxxx 10xxxxxx 10xxxxxx. We take the character's binary code, fill the x positions from back to front starting with the last bit, and pad the leftover leading x positions with 0. After filling in the Unicode code, the result is 11100100 10111000 10100101.
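We can reproduce this by hand and compare the result with Python's built-in encoder:

```python
# Fill the template 1110xxxx 10xxxxxx 10xxxxxx with the bits of U+4E25
code_point = 0x4E25
bits = format(code_point, "016b")  # '0100111000100101', zero-padded to 16 bits

byte1 = "1110" + bits[0:4]   # 11100100
byte2 = "10" + bits[4:10]    # 10111000
byte3 = "10" + bits[10:16]   # 10100101
print(byte1, byte2, byte3)

manual = bytes(int(b, 2) for b in (byte1, byte2, byte3))
print(manual)                          # b'\xe4\xb8\xa5'
print(manual == "严".encode("utf-8"))  # True
```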

Conclusion

To sum up: in everyday development we need to pay attention to encodings, because a string is meaningless without knowing what encoding it uses. To open a text file correctly, we must know its encoding; with the wrong encoding we get garbled text. Why do e-mails so often contain garbled characters? Because the sender and the receiver use different encodings.
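Here is that garbling in miniature: encode with one encoding, decode with another:

```python
data = "严".encode("utf-8")    # b'\xe4\xb8\xa5'

# Decoding those bytes with the wrong encoding yields mojibake, not an error
print(data.decode("latin-1"))  # ä¸¥

# An incompatible encoding can also fail outright:
# data.decode("ascii")         # raises UnicodeDecodeError
```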

References for this article:

Ruan Yifeng's blog: www.ruanyifeng.com/blog/2007/1…

Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!): www.joelonsoftware.com/2003/10/08/…