How do computers store characters?

I learned some basics about computers in university. Computers can only work with binary data, because binary is the most convenient representation: electronic components can easily represent two states, such as high voltage and low voltage, which map to 1 and 0. If you tried to design around ten states, the hardware would become far more complex.

Computers also need to store the characters of the real world, the Chinese characters or alphabets we use all the time. The simplest way to do this is to assign each character a number, which can then be converted to binary; in effect, the computer stores characters indirectly. That is exactly what computer scientists did. As a result, various character sets were born, each giving the characters of a country or region their corresponding numbers.
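To make this concrete, here is a minimal Java sketch (the character choice is just illustrative) that prints the number assigned to a character and that number's binary form:

```java
public class CharAsNumber {
    public static void main(String[] args) {
        char c = 'A';
        int number = c; // a char is really just a number
        // prints: 'A' -> 65 -> 1000001
        System.out.println("'" + c + "' -> " + number
                + " -> " + Integer.toBinaryString(number));
    }
}
```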

What character sets are available

When computers were invented, their designers did not think that far ahead. As the birthplace of the modern computer, only the United States was considered, so 128 characters were enough. This is ASCII, sometimes called the ASCII encoding and sometimes the ASCII character set (the two terms differ, but I think character set is the more accurate name here). With a one-to-one mapping, 128 characters need only 7 binary digits. The smallest unit of computer storage, however, is the byte, which is 8 bits, so ASCII sets the highest bit to 0 and uses the remaining 7 bits to represent characters. The figure below shows the mapping between numbers and characters in ASCII.

The figure divides the ASCII character set into two main parts: printable and non-printable. The printable part is easy to explain: these characters appear on a display device, and you can find most of them on your keyboard. The non-printable part exists mostly for control purposes. For example, NUL means the null character, \n means newline, \r means carriage return, and so on.
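As a quick check of the high-bit-is-0 rule, here is a small sketch using Java's built-in US_ASCII charset; the example characters are my own choice:

```java
import java.nio.charset.StandardCharsets;

public class AsciiBytes {
    public static void main(String[] args) {
        // one printable character and one control character
        byte[] bytes = "A\n".getBytes(StandardCharsets.US_ASCII);
        for (byte b : bytes) {
            // prints 01000001 (A) and 00001010 (\n); the highest bit is always 0
            System.out.println(String.format("%8s",
                    Integer.toBinaryString(b & 0xFF)).replace(' ', '0'));
        }
    }
}
```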

ASCII is good enough for the US, but clearly not for many other countries. Here are some common extensions: ISO 8859-1 and Windows-1252. Both add Western European characters on top of ASCII, and they are almost identical. ISO 8859-1 is the official standard character set for Western European languages, but Windows-1252 is more commonly used in practice; the two differ only in the 0x80–0x9F range, where Windows-1252 places printable characters such as the euro sign instead of control codes.
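You can see that difference directly; a minimal sketch, assuming your JDK ships the windows-1252 charset (standard JDKs do):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WesternCharsets {
    public static void main(String[] args) {
        byte[] b = { (byte) 0x80 };
        // ISO 8859-1 maps 0x80 to an invisible control character
        System.out.println(new String(b, StandardCharsets.ISO_8859_1));
        // Windows-1252 maps the same byte to the euro sign
        System.out.println(new String(b, Charset.forName("windows-1252"))); // prints €
    }
}
```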

The first standard character set for Chinese was GB2312, with about 7,000 characters, including some rare and traditional ones. It uses two bytes to represent each Chinese character. GBK is based on GB2312 and is backward compatible with it; it contains about 21,000 Chinese characters in total, including traditional characters. GB18030 further extends GBK, again with backward compatibility, and contains about 76,000 characters, including many ethnic-minority scripts as well as the CJK Unified Ideographs shared by China, Japan, and Korea. Two bytes cannot represent every character in GB18030, so it uses a variable-length encoding: some characters take two bytes and some take four. For the two-byte encodings, the range is the same as GBK. Besides the mainland character sets, there is also Big5, which covers traditional Chinese characters and is widely used in Taiwan and Hong Kong.
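Here is a sketch of that variable-length behavior, assuming the JDK ships the GBK and GB18030 charsets (standard JDKs include both); the sample characters are arbitrary:

```java
import java.nio.charset.Charset;

public class ChineseCharsets {
    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        Charset gb18030 = Charset.forName("GB18030");
        String common = "中"; // a common character
        System.out.println(common.getBytes(gbk).length);      // 2
        System.out.println(common.getBytes(gb18030).length);  // 2
        // a character beyond the basic range is outside GBK's
        // two-byte area and takes four bytes in GB18030
        String rare = new String(Character.toChars(0x20000));
        System.out.println(rare.getBytes(gb18030).length);    // 4
    }
}
```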

After all of that, you can probably see how troublesome this is: different countries and regions need different character sets. All of the character sets above are collectively referred to as non-Unicode character sets. The Unicode character set was created to resolve the incompatibilities between the character sets of different countries. What it does is assign a unique number, called a code point, to every character in the world. There are more than 1.1 million possible code points, but most characters in common use fall between 0x0000 and 0xFFFF, a range of 65,536 values.

**These numbers are usually written in hexadecimal. Note that Unicode only assigns a number to each character; unlike the non-Unicode character sets above, where the number is also the binary form stored for the character, here the two are separate.** For the Unicode character set, there are several different schemes for turning each character's number into binary. The main ones are UTF-8, UTF-16, and UTF-32.
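A short sketch of that distinction, with characters chosen for illustration: the code point is just a number you can read off a string, independent of any byte encoding:

```java
public class CodePoints {
    public static void main(String[] args) {
        // the number assigned to '中' is U+4E2D, nothing more
        System.out.printf("U+%04X%n", "中".codePointAt(0));  // U+4E2D
        // an emoji lies beyond 0xFFFF, so its number needs more than 16 bits
        System.out.printf("U+%04X%n", "😀".codePointAt(0)); // U+1F600
    }
}
```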

UTF-32 is the simplest and matches our intuition: the binary form of a character is just its number written out directly. The downside is that every character takes four bytes, which is obviously wasteful for common characters. UTF-16 is like GB18030: some characters take two bytes and some take four. UTF-8 is the most widely used and is fully variable-length; the number of bytes per character depends on the size of its Unicode number. Small numbers use fewer bytes and large numbers use more, ranging from 1 to 4.
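The byte counts are easy to verify in Java; a minimal sketch (UTF_16BE is used to avoid the byte-order mark that plain UTF-16 prepends, and the sample characters are arbitrary):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UtfLengths {
    public static void main(String[] args) {
        Charset utf32 = Charset.forName("UTF-32");
        for (String s : new String[] { "A", "中", "😀" }) {
            System.out.printf("%s  UTF-8: %d  UTF-16: %d  UTF-32: %d%n",
                    s,
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16BE).length,
                    s.getBytes(utf32).length);
        }
        // prints 1/2/4 for "A", 3/2/4 for "中", and 4/4/4 for the emoji
    }
}
```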

How to recover garbled text

Having read this far, I believe you now know why text becomes garbled: it was encoded with one character set and decoded with a different one, so the bytes are parsed incorrectly. So how do you recover garbled text? If you know the original encoding, the fix is simple: switch to the corresponding character set. You can do this in text editors that let you change the encoding, in the HTML meta information settings, in IDE settings, and so on.
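For example, here is a sketch of the common case where UTF-8 bytes were mistakenly decoded as ISO 8859-1; re-encoding with the wrong charset and then decoding with the right one recovers the text (the sample string is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class FixKnownEncoding {
    public static void main(String[] args) {
        String original = "中文";
        // the mistake: UTF-8 bytes decoded as ISO 8859-1
        String garbled = new String(
                original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(garbled); // prints unreadable mojibake
        // the fix: reverse the wrong decoding, then decode correctly
        String recovered = new String(
                garbled.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
        System.out.println(recovered); // prints 中文
    }
}
```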

But if you don't know the original encoding, there is no better way than to try candidates one by one. We can write a small program that tries to recover the original text by iterating over common character sets, re-encoding and re-decoding the garbled string: in Java, newStr = new String(str.getBytes(charsets[i]), charsets[j]). Be sure to back up the original file before you try this, because repeated conversion errors may make it impossible to restore the original characters.
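Here is a runnable sketch of that idea; the charset list and the garbled input are placeholders you would replace with your own:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GuessEncoding {
    // common charsets to try; extend this list as needed
    private static final Charset[] CHARSETS = {
            StandardCharsets.ISO_8859_1,
            Charset.forName("windows-1252"),
            Charset.forName("GB18030"),
            Charset.forName("Big5"),
            StandardCharsets.UTF_8,
    };

    public static void main(String[] args) {
        String garbled = "ä¸­æ–‡"; // placeholder: your garbled text here
        for (Charset wrong : CHARSETS) {
            for (Charset right : CHARSETS) {
                if (wrong.equals(right)) {
                    continue;
                }
                // undo the suspected wrong decoding, then decode again
                String candidate = new String(garbled.getBytes(wrong), right);
                System.out.printf("%s -> %s : %s%n", wrong, right, candidate);
            }
        }
        // scan the output by eye for a line that reads as normal text
    }
}
```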