Preface

In day-to-day development, garbled text is a problem that has troubled all of us. So why does garbled text appear? And why doesn't the whole world simply use one encoding? This article answers these two questions by walking through the history of character sets. After reading it, you should have a fundamental understanding of why garbled characters occur.

A story to understand why we encode

Imagine two people, Zhang San and Li Si. Zhang San only speaks Chinese and Li Si only speaks English, so how do they communicate? The solution is to find a translator. The translation process can be understood as encoding: going from Chinese to English, or from English to Chinese, is an encoding process. The essence of encoding is to make your language understandable to the other party.

There are so many official languages and dialects that they cannot all be translated into one another, so how does this apply to computers? More importantly, human languages are not suitable for computers at all, so we need a language designed for computers: binary. Binary is the language of computers throughout the world today. (The former Soviet Union once built ternary computers, but they never caught on; look them up if you are interested.)

When we want to communicate with a computer, we convert our language into binary (encoding), and when the computer has finished processing, the result is converted back into human language (decoding). That is why we need encoding.
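As a minimal sketch in Java (the text and charset here are just for illustration), encoding and decoding are simply the two directions of this conversion:

// requires: import java.nio.charset.StandardCharsets;
String text = "hello";                                         // human-readable text
byte[] encoded = text.getBytes(StandardCharsets.UTF_8);        // encoding: characters -> bytes the machine stores
String decoded = new String(encoded, StandardCharsets.UTF_8);  // decoding: bytes -> characters humans read
System.out.println(decoded);                                   // hello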

Why is it garbled

But why does garbled text appear? Using the Zhang San and Li Si story again: suppose Zhang San utters a rare word that the translator has never seen. The translator does not know how to translate it, and with no better option renders it as "??". That is garbled text.

The same is true in the computer world. Suppose we want to send the characters "双子孤狼" from program A to program B. The computer converts the data into binary for transmission. Binary is not suitable for humans to read, so B has to decode it. However, B does not know which encoding A used, so it simply decodes the bytes as English text. The decoded result certainly does not match: the characters "双子孤狼" cannot be found in the English character set, and garbled text appears.

So the essence of garbled text is that the encoding currently in use cannot correctly interpret the received binary data.
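Here is a minimal sketch in Java of the situation described above (the string and charsets are only for illustration): A encodes with one charset, B decodes with another, and the result is garbage.

// requires: import java.nio.charset.StandardCharsets;
byte[] data = "双子孤狼".getBytes(StandardCharsets.UTF_8);   // A encodes the Chinese text as UTF-8
String wrong = new String(data, StandardCharsets.US_ASCII);  // B guesses wrong and decodes as ASCII
System.out.println(wrong);                                   // prints a run of '�' replacement characters
String right = new String(data, StandardCharsets.UTF_8);     // decoding with the matching charset recovers the text
System.out.println(right);                                   // 双子孤狼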

History of character sets

Once we know why encoding exists and why garbled text appears, another question naturally follows: if the whole world agreed on a single encoding, there would normally be no garbled text at all. In reality, however, a sea of different encodings washes over us programmers, and one careless step produces mojibake. To answer that question, we need to understand the history of character sets.

The birth of ASCII coding

Computers were first born in the United States, and they can only read binary, so common characters had to be associated with binary values. The Americans mapped common English characters and some control characters to binary data; for example, the familiar lowercase letter 'a' corresponds to 97 in decimal, or 01100001 in binary. A byte has 8 bits, so it can represent at most 256 values. English, however, needs only a few characters: 128 positions were enough for the letters and common symbols, so the Americans used positions 0 to 127 to form an encoding table. This is the ASCII (American Standard Code for Information Interchange) encoding. The full ASCII table is easy to look up, so it is not listed here.
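A quick check in Java confirms the mapping for the letter 'a':

char c = 'a';
System.out.println((int) c);                    // 97
System.out.println(Integer.toBinaryString(c));  // 1100001, i.e. 01100001 as a full byte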

The ISO-8859 coding family was born

As computers spread to Europe, it was found that common characters in Europe also needed to be encoded, so the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) decided to jointly develop another standard for the character set. The ISO-8859-1 character set was born.

Since ASCII used only positions 0-127 and left positions 128-255 unused (that is, the highest bit of a byte was never set), the Europeans took advantage of that eighth bit, and positions 128-255 came to be used for Western European characters. ISO-8859-1 is also called the Latin-1 encoding.

Over time, more and more European characters needed to be encoded, so a series of character sets was derived, from ISO-8859-1 through ISO-8859-16, each with minor adjustments, but all part of the ISO-8859 standard.

Note that the ISO-8859 standard is backward compatible with the ASCII character set, so in most common scenarios ISO-8859-1 is used by default rather than ASCII directly.

GB2312, GBK, and other double-byte encodings were born

As time went on, computers spread to Asia, reaching China and other countries. Many countries then developed their own national encodings for their common characters, and China was no exception.

However, all 8 bits of a byte were already in use, so the only option was to extend by another byte, that is, to store a character in two bytes. Two bytes raise another problem: when the computer reads two bytes, do they represent two single-byte characters or one double-byte Chinese character?

So China decided to design a Chinese encoding that would be compatible with ASCII. Since every ASCII single-byte character has a value less than 128, it was decided that Chinese double-byte characters would only use values of 128 and above: whenever two consecutive bytes both have values greater than 127, they represent one Chinese character. This encoding scheme was named GB2312.

It should be noted that GB2312 is not compatible with ISO-8859-N encoding sets, but is compatible with ASCII encoding.
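A small sketch in Java illustrating the rule (it assumes the runtime ships the GB2312 charset, which most JREs do):

// requires: import java.nio.charset.Charset;
byte[] ascii = "a".getBytes(Charset.forName("GB2312"));
byte[] chinese = "双".getBytes(Charset.forName("GB2312"));
System.out.println(ascii.length);    // 1 -- ASCII characters stay single-byte
System.out.println(chinese.length);  // 2 -- one Chinese character takes two bytes
for (byte b : chinese) {
    System.out.println(b & 0xFF);    // both unsigned byte values are greater than 127
}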

GB2312 contains 6763 commonly used Chinese characters and 682 non-Chinese graphic characters (including Latin letters, Greek letters, Japanese hiragana and katakana, Russian Cyrillic letters, and full-width characters).

With the further spread of computers, GB2312 also exposed a problem: the Chinese characters it included were only simplified, commonly used ones; rare characters and traditional characters were not covered. So GBK appeared.

Because GB2312 requires both bytes to have high values, even if every combination were used it could store at most 16,384 Chinese characters, which is far from enough once traditional and rare characters are added. GBK therefore only requires the first byte to be greater than 127; the second byte may be less than 128. In other words, whenever a byte greater than 127 is found, it and the following byte are treated together as one Chinese character, which allows up to 32,640 characters. Of course, GBK does not use them all: it includes 21,886 Chinese characters and graphic symbols in total, of which 21,003 are Chinese characters (including radicals and components) and 883 are graphic symbols.
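The difference in coverage can be seen with CharsetEncoder.canEncode. This is only a sketch; it assumes, to the best of my knowledge, that the traditional character 龍 is in GBK but not in GB2312, and that both charsets are available in the runtime:

// requires: import java.nio.charset.Charset;
System.out.println(Charset.forName("GB2312").newEncoder().canEncode("龍")); // false -- traditional form not in GB2312
System.out.println(Charset.forName("GBK").newEncoder().canEncode("龍"));    // true  -- GBK added traditional and rarer characters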

Later, as computers spread even further, other Chinese character sets were gradually introduced, such as GB18030, which extends this multi-byte approach.

By now I hope it is clear why an English character takes one byte while a Chinese character takes two or more. One reason is that the low values were already taken by ASCII; the other is that there are so many common Chinese characters that a single byte is far too small to hold them.

The Unicode character set is born

In fact, during the development of computing it was not just the United States, Europe, and China: many other countries, such as Japan and South Korea, created their own character sets as well. The situation was, frankly, chaotic, so the standards bodies decided to end the turmoil by creating yet another character standard. This is Unicode.

From the moment Unicode was born, it treated every existing encoding as something to be replaced: it was not going to be compatible with any of them, it simply defined a brand-new standard. Unicode originally used the UCS-2 scheme, which represents a character with two bytes. Two bytes, even if fully used, can store only 65,536 characters, which cannot possibly contain all the world's languages, control characters, and symbols, so the UCS-4 scheme followed, which can use four bytes for a single character. Four bytes are more than enough to cover essentially all of the world's languages and control characters.

It is important to note that Unicode only defines the character set; it does not specify how the characters should be stored. From a development perspective, Unicode defines the interface but does not provide the implementation.
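A small illustration in Java: the code point is the abstract number Unicode assigns to a character, independent of how any particular encoding stores it.

// requires: import java.nio.charset.StandardCharsets;
String s = "双";
System.out.println(Integer.toHexString(s.codePointAt(0)));         // 53cc -- the code point U+53CC is fixed
System.out.println(s.getBytes(StandardCharsets.UTF_8).length);     // 3 -- but UTF-8 stores it in 3 bytes
System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);  // 2 -- and UTF-16 stores it in 2 bytes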

The UTF coding family was born

UTF encodings are implementations of the Unicode character set in different ways, including UTF-8, UTF-16, and UTF-32.

UTF-32 encoding

UTF-32 is based on the Unicode character set and stores every symbol in four bytes. As you can imagine, this wastes a huge amount of space: English text takes four times the space it would in ASCII, and Chinese text twice the space it would in GBK. This waste is one reason Unicode was not widely accepted at first.

UTF-16 encoding

UTF-16 is a slight improvement over UTF-32: it stores characters in either two or four bytes, and most characters are stored in two bytes. When a character fits in two bytes, UTF-16 simply writes the Unicode code point out as binary; rare or less-used characters are stored in four bytes, and for the four-byte form an encoding conversion is required.

The following table shows the storage format of UTF-16:

Unicode encoding range (hexadecimal) | UTF-16 binary storage format
0x0000 0000 – 0x0000 FFFF | xxxxxxxx xxxxxxxx
0x0001 0000 – 0x0010 FFFF | 110110xx xxxxxxxx 110111xx xxxxxxxx

We will hold off on explaining this table until UTF-8 encoding has been covered.

UTF-8 encoding

UTF-8 is a variable-length encoding that is compatible with ASCII. To make this possible, there must be a storage format specification, so that when a program reads two or more bytes it can tell whether they represent several single-byte characters or one multi-byte character.

The storage specifications of UTF-8 are as follows:

Unicode encoding range (hexadecimal) | UTF-8 binary storage format
0x0000 0000 – 0x0000 007F | 0xxxxxxx
0x0000 0080 – 0x0000 07FF | 110xxxxx 10xxxxxx
0x0000 0800 – 0x0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0x0001 0000 – 0x0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Next, let's take the character 双 as an example:

The Unicode code point of 双 is U+53CC, which in binary is 101 0011 1100 1100 (15 significant bits). The first row of the table offers only 7 x positions, which is not enough; the second row offers only 11, still not enough; so it can only be stored using the third row, which offers 16 x positions. The bits are filled into the x positions from back to front, and the one position left vacant at the front is filled with 0. The result is 11100101 10001111 10001100, so 双 takes up 3 bytes in UTF-8. Some rare characters, of course, take up 4 bytes.
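We can verify the result in Java:

// requires: import java.nio.charset.StandardCharsets;
byte[] utf8 = "双".getBytes(StandardCharsets.UTF_8);
for (byte b : utf8) {
    System.out.println(Integer.toBinaryString(b & 0xFF)); // 11100101, 10001111, 10001100
}
System.out.println(utf8.length); // 3 bytes (E5 8F 8C in hex)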

Back to UTF-16: if a character is stored in two bytes, its code point is simply written out as binary. If it is stored in four bytes, the code point first has 0x10000 subtracted, and the resulting 20 bits fill the x positions; everything outside the x positions is a fixed pattern.

Note that a two-byte UTF-16 value could in principle also begin with 110110xx or 110111xx, the same patterns used by the four-byte form. Those patterns correspond to the ranges D800-DBFF and DC00-DFFF, so to avoid this ambiguity, these two ranges are deliberately left unassigned and are never used to encode characters.
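A small check in Java, using the same character 𤭢 (U+24B62) that appears later in this article: it lies beyond the two-byte range, so UTF-16 stores it as a pair of the reserved values above.

// requires: import java.nio.charset.StandardCharsets;
byte[] utf16 = "𤭢".getBytes(StandardCharsets.UTF_16BE); // big-endian, no BOM
for (byte b : utf16) {
    System.out.printf("%02X ", b); // D8 52 DF 62 -- the surrogate pair \uD852\uDF62
}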

Why is garbled text sometimes a question mark

In Java development, it is common to encounter garbled output that appears as "?", as in the following example:

String name = "双子孤狼";
byte[] bytes = name.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(new String(bytes));// Output: ????

The reason for this output is that Chinese characters cannot be represented in ISO_8859_1, which the example forces as the encoding.

When encoding, Java's ISO_8859_1 charset checks each character: if its value is greater than 255 it cannot be represented, so instead of being encoded it is replaced with the default value 63. The byte array in the example therefore contains 63, 63, 63, 63, and 63 is exactly the '?' character in ASCII.

So when we see garbled output consisting of question marks, it generally indicates that ISO_8859_1 was applied to characters whose values are greater than 255.
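A quick way to confirm this:

// requires: import java.nio.charset.StandardCharsets; import java.util.Arrays;
byte[] bytes = "双子孤狼".getBytes(StandardCharsets.ISO_8859_1);
System.out.println(Arrays.toString(bytes)); // [63, 63, 63, 63] -- every character was replaced with 63, i.e. '?'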

Extended knowledge

With the history of encodings behind us, let's look at a few related side topics.

Code points and code units

A String in Java is a sequence of char values, and each char is a UTF-16 code unit used to represent Unicode code points. This sentence involves both code points and code units, which can be confusing at first glance, but it becomes easier once you understand the Unicode character set and the UTF-16 encoding.

  • Code point: one code point corresponds to one Unicode character.
  • Code unit: in UTF-16, two bytes form one code unit; a code unit is the smallest indivisible part. In UTF-8, a code unit is one byte, because UTF-8 can represent a character with a single byte.

The string's length() method, which we usually call, returns the number of code units, not the number of code points. So if a string contains a character that needs 4 bytes in UTF-16 (such as a rare character), the number of code points will be smaller than the number of code units. To get the number of code points, use the codePointCount() method, as follows:

String name = "𤭢";//\uD852\uDF62
System.out.println(name.length());// Number of code units, output 2
System.out.println(name.codePointCount(0, name.length()));// Code points, output 1

Big-endian mode and little-endian mode

In a computer, data is stored byte by byte, so when a character occupies more than one byte, there is a question of the order in which those bytes should be stored.

Take the character 双 as an example. Its code point in binary is 01010011 11001100, which splits into two bytes: 01010011 and 11001100. The first part is called the high byte and the second the low byte, and the order in which the two parts are stored gives rise to big-endian mode and little-endian mode.

  • Big-endian mode: the high byte is stored first (on the left) and the low byte second (on the right). 双 is stored as: 01010011 11001100. This matches the way we normally write binary numbers, with the most significant part first.
  • Little-endian mode: the low byte is stored first (on the left) and the high byte second (on the right). 双 is stored as: 11001100 01010011.

Note: Java uses big-endian mode by default. The underlying processor may store bytes in a different order, but the JVM hides these details, so we seldom need to pay attention to them.
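A small Java sketch showing the two byte orders for the same character:

// requires: import java.nio.ByteBuffer; import java.nio.ByteOrder;
char shuang = '双'; // code point 0x53CC
byte[] big = ByteBuffer.allocate(2).order(ByteOrder.BIG_ENDIAN).putChar(shuang).array();
byte[] little = ByteBuffer.allocate(2).order(ByteOrder.LITTLE_ENDIAN).putChar(shuang).array();
System.out.printf("%02X %02X%n", big[0], big[1]);       // 53 CC -- high byte first
System.out.printf("%02X %02X%n", little[0], little[1]); // CC 53 -- low byte first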

BOM

If we have a file, how does the computer know whether it is stored in big-endian or little-endian mode?

BOM stands for Byte Order Mark, a marker that appears at the head of a text file to indicate whether the file is stored in big-endian or little-endian mode. Most of us have seen this: when saving a document in Notepad, you can choose between the big-endian and little-endian variants.

The UCS standard includes a character called ZERO WIDTH NO-BREAK SPACE, whose code is FEFF. It is a character that should not normally appear in actual data transmission.

However, to distinguish big-endian from little-endian mode, the UCS specification recommends transmitting the ZERO WIDTH NO-BREAK SPACE character before the byte stream: if the receiver sees the bytes FE FF, the data is big-endian; if it sees FF FE, the data is little-endian.

The following table shows the BOM for different encodings:

Encoding | Hexadecimal BOM
UTF-8 | EF BB BF
UTF-16 (big-endian) | FE FF
UTF-16 (little-endian) | FF FE
UTF-32 (big-endian) | 00 00 FE FF
UTF-32 (little-endian) | FF FE 00 00

With this specification, a program parsing a file can tell both the encoding and the byte order. Note that UTF-8 is special: because UTF-8 bytes follow a fixed sequence format of their own, UTF-8 does not need to distinguish between big-endian and little-endian modes.
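This is easy to observe in Java: the charset named "UTF-16" writes a big-endian BOM when encoding, while the UTF-16BE and UTF-16LE variants write none (a minimal sketch):

// requires: import java.nio.charset.StandardCharsets;
byte[] withBom = "双".getBytes(StandardCharsets.UTF_16);   // FE FF 53 CC -- BOM followed by big-endian data
byte[] be = "双".getBytes(StandardCharsets.UTF_16BE);      // 53 CC -- big-endian, no BOM
byte[] le = "双".getBytes(StandardCharsets.UTF_16LE);      // CC 53 -- little-endian, no BOM
System.out.printf("%02X %02X%n", withBom[0], withBom[1]); // FE FF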

Strictly speaking, UTF-8's own encoding format can be recognized even without a BOM. However, because Microsoft recommended adding a BOM to UTF-8 files, both UTF-8 files with a BOM and UTF-8 files without one exist, and the two may be incompatible in some scenarios, so it is worth keeping this in mind in everyday use.

Conclusion

This article started from the history of encodings, described how the various encodings store characters, analyzed the root cause of garbled text, and also covered the two byte-order storage modes and BOM-related issues. With this background, I believe you will have a clear approach for analyzing garbled-text problems when they appear in your projects.