What every programmer should know about character encoding

Original: Curly Brace MC (WeChat official account: Huakuohao-MC). Focused on Java fundamentals and big data, with an emphasis on sharing experience and personal growth.

Spy dramas often show a scene like this: an agent receives a message, opens it, and finds a string of numbers. He slips off to a secret place and takes out a codebook (it could just as well be an ordinary book, such as a Tang dynasty text). Following a rule known only to insiders, for example the first number is the page, the second is the line, and the third is the word, he translates the message piece by piece. If he uses the wrong codebook, or does not know the rule, decoding fails.

Encoding and decoding on a computer work the same way.

Computers only recognize zeros and ones, and all images and characters are eventually converted to binary that computers can understand. A bit can represent two states, 0 and 1. A byte is made up of eight bits, so a byte can represent 256 (2^8) states. If we specify that each state represents one character, then one byte can represent 256 characters.

ASCII

The computer was invented by Americans, so the original design only considered English. English needs very few characters: even with the special characters there are only about 100, 128 to be exact. One byte is more than enough for that. In fact one bit is left over: the first bit of the byte never takes part in the encoding, so it is always set to 0. This is the ASCII encoding. In ASCII, the space character is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001).
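The two values quoted above are easy to check. This is a small Python sketch (the sample string and variable names are only illustrative) that prints the ASCII numbers and confirms that the first bit of every ASCII byte is 0:

```python
# ASCII numbers for the two characters mentioned in the text.
space = ord(" ")
upper_a = ord("A")

print(space, format(space, "08b"))      # 32 00100000
print(upper_a, format(upper_a, "08b"))  # 65 01000001

# Every ASCII byte has its first (highest) bit clear, so its value is below 128.
ascii_bytes = "Hello, ASCII!".encode("ascii")
high_bit_clear = all(b < 128 for b in ascii_bytes)
print(high_bit_clear)  # True
```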

Non-ASCII

As computers spread to Europe, Europeans found that the 128 characters defined by ASCII were not enough. In French, for example, letters with accent marks above them cannot be represented in ASCII. So some European countries decided to use the idle first bit of the byte to encode new symbols. For example, the French é is 130 (binary 10000010) in some of these extended encodings, such as IBM code page 437. With the extra bit, these encodings can represent up to 256 symbols. The most commonly seen one is ISO-8859-1, also known as Latin-1, where é is 233 (binary 11101001).
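A quick way to see the extra bit at work is to encode é with a few single-byte encodings. The sketch below uses Python's built-in `latin-1` and `cp437` codecs. The choice of code page 437 is an assumption about where the value 130 comes from, since ISO-8859-1 itself places é at 233:

```python
e_acute = "\u00e9"  # the character é

# ISO-8859-1 (Latin-1) stores é in a single byte with the first bit set.
latin1 = e_acute.encode("latin-1")
print(list(latin1), format(latin1[0], "08b"))  # [233] 11101001

# The value 130 (binary 10000010) is what IBM code page 437 uses for é.
cp437 = e_acute.encode("cp437")
print(list(cp437), format(cp437[0], "08b"))    # [130] 10000010

# Plain ASCII cannot represent the character at all.
try:
    e_acute.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```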

Chinese encodings

As computers became popular in China, Chinese users found that none of the earlier encodings contained Chinese characters, so computers could not handle Chinese at all.

GB2312

In order to let computers handle Chinese characters, we decided to encode them ourselves. In the spirit of daring to think and act, we decided to use two bytes for each character. The rule is: a byte less than or equal to 127 keeps its ASCII meaning, but two bytes greater than 127 in a row form one Chinese character. The first byte is called the high byte and the second the low byte. This scheme encodes 6,763 simplified Chinese characters, and it is the familiar GB2312 encoding.
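The rule above can be followed by hand. The sketch below is a simplified illustration, not a full GB2312 decoder: `split_gb2312` is a hypothetical helper that only applies the "below 128 means ASCII, otherwise take two bytes" rule, and Python's built-in `gb2312` codec does the actual character lookup.

```python
def split_gb2312(data):
    """Split GB2312 bytes into per-character chunks using the rule in the text."""
    chunks = []
    i = 0
    while i < len(data):
        if data[i] < 128:        # first bit clear: a one-byte ASCII character
            chunks.append(data[i:i + 1])
            i += 1
        else:                    # first bit set: high byte of a two-byte character
            chunks.append(data[i:i + 2])
            i += 2
    return chunks

text = "A\u4e2dB\u6587"          # "A", 中, "B", 文
encoded = text.encode("gb2312")
chunks = split_gb2312(encoded)

print([chunk.hex() for chunk in chunks])  # ['41', 'd6d0', '42', 'cec4']
decoded = [chunk.decode("gb2312") for chunk in chunks]
print(decoded == list(text))              # True
```

Both bytes of each Chinese character (d6 d0 and ce c4) are greater than 127, which is how they are told apart from the ASCII letters around them.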

GBK

Obviously the 6,763 characters encoded by GB2312 cannot cover every usage scenario. For example, the character “zhe” is not included. So GB2312 was extended: as long as the first byte is greater than 127, the two bytes form a Chinese character, no matter whether the second byte is greater or less than 127. After this change the encoding holds more than 20,000 Chinese characters and symbols. This is what we call GBK.

Later the encoding was extended once more, producing GB18030, which covers even more characters than GBK.

In GB2312 and GBK, a Chinese character is represented by two bytes while an English character takes one. Some veteran programmers will remember counting one Chinese character as two English characters.
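The difference between the two encodings can be sketched with Python's built-in `gb2312` and `gbk` codecs. The character 镕 is an example chosen for this sketch (it is a well-known character missing from GB2312), not necessarily the one the text refers to. The byte pair 0x81 0x40 illustrates a second byte below 128.

```python
rare = "镕"  # example character picked for this sketch, absent from GB2312

try:
    rare.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

gbk_bytes = rare.encode("gbk")
print(in_gb2312)           # False
print(len(gbk_bytes))      # 2
print(gbk_bytes[0] > 127)  # True

# GBK allows a second byte below 128: 0x81 0x40 decodes to a single character.
low_trail = b"\x81\x40".decode("gbk")
print(len(low_trail))      # 1

# One Chinese character plus one English letter: 2 bytes + 1 byte.
mixed_length = len(("\u4e2d" + "a").encode("gbk"))
print(mixed_length)        # 3
```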

Big5

The encodings above are all aimed at simplified Chinese. GBK and GB18030 do contain some traditional characters, but traditional Chinese has its own dedicated encoding in Taiwan: Big5, often called the “big five” code.

One quick question

You may have noticed a problem with the single-byte encodings: the values from 128 to 255 are likely to mean different things in different countries. For example, 130 stands for é in a French encoding, Gimel (ג) in a Hebrew encoding, and yet another symbol in a Russian encoding. The double-byte encodings of Chinese have the same problem: Big5 and GBK are both double-byte encodings, but the same bytes stand for different characters.

It is as if agent A read a string of binary values as “hello” while agent B read the same string as “get lost”. Different spy organizations need different codebooks, but different encoding rules are a real nuisance for computers. Say you are chatting with Sister Zhiling in Taiwan: she sends you a message encoded in Big5, and you decode it with GBK… and there may well be no “and then”.
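The ambiguity described above can be reproduced in a few lines. The sketch assumes that the French, Hebrew and Russian encodings in question are the DOS code pages 437, 862 and 866 (Python ships codecs for all three), and it uses the message 你好 as a stand-in for the chat message:

```python
import unicodedata

raw = bytes([130])  # one byte, binary 10000010

as_french = raw.decode("cp437")   # Western European DOS code page
as_hebrew = raw.decode("cp862")   # Hebrew DOS code page
as_russian = raw.decode("cp866")  # Cyrillic DOS code page

for ch in (as_french, as_hebrew, as_russian):
    print(unicodedata.name(ch))
# LATIN SMALL LETTER E WITH ACUTE
# HEBREW LETTER GIMEL
# CYRILLIC CAPITAL LETTER VE

# The same thing with double-byte encodings: Big5 bytes read as GBK.
message = "你好"
big5_bytes = message.encode("big5")
misread = big5_bytes.decode("gbk", errors="replace")
print(misread == message)                     # False
print(big5_bytes.decode("big5") == message)   # True
```

Only the decoder that matches the encoder gets the original text back.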

Unicode

To solve this problem, the international standards organization ISO decided to abandon all regional encodings such as Big5 and GBK and define a new code set that contains every character, so that everyone uses the same one. Its full English name is “Universal Multiple-Octet Coded Character Set”, UCS for short, and it is commonly known as “Unicode”. The arrival of Unicode is comparable to Emperor Qin Shi Huang unifying weights, measures and currency.

Unicode is divided into 17 planes, numbered 0 to 16. Plane 0 is called the Basic Multilingual Plane (BMP). It contains the most frequently used characters and ranges from 0000 to FFFF, so it can hold 2^16 = 65,536 characters. Each of the other planes also ranges from 0000 to FFFF within the plane and can hold 65,536 characters as well, which gives 17 × 65,536 = 1,114,112 code points in total.
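The arithmetic can be checked directly, and the plane of a character falls out of its code point by dropping the low 16 bits. A short sketch (`plane_of` is a hypothetical helper name, and the emoji is just an example of a character outside the BMP):

```python
PLANE_SIZE = 2 ** 16   # code points per plane: 0000..FFFF
PLANE_COUNT = 17       # planes 0..16

total = PLANE_COUNT * PLANE_SIZE
print(total)           # 1114112
print(hex(total - 1))  # 0x10ffff, the highest code point

def plane_of(ch):
    """Return the plane number of a single character."""
    return ord(ch) >> 16   # shifting right by 16 bits divides by 65,536

print(plane_of("a"))           # 0
print(plane_of("\u4e2d"))      # 0, the Chinese character 中
print(plane_of("\U0001F600"))  # 1, an emoji outside the BMP
```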

The characters we use most are in the Basic Multilingual Plane, where every character can be encoded in two bytes (characters in the other planes need more). For example, the Unicode code for the Chinese character 中 (zhong) is 4E2D and the Unicode code for the lowercase letter a is 0061.

This raises two problems. The first is wasted storage: if every English character is stored as its Unicode code, it takes two bytes to do what one byte could do. The second is ambiguity: how does the computer know whether two bytes are one Unicode character or two ASCII characters?
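Both problems can be made concrete. In the sketch below, Python's `utf-16-be` codec stands in for "storing the two-byte Unicode code directly". That is an assumption of this example, chosen because it writes every BMP character as exactly two bytes with no marker in front:

```python
# Code points quoted in the text, as four hex digits.
print(format(ord("\u4e2d"), "04X"))  # 4E2D
print(format(ord("a"), "04X"))       # 0061

# Problem 1: English text doubles in size.
english = "hello"
ascii_size = len(english.encode("ascii"))
two_byte_size = len(english.encode("utf-16-be"))
print(ascii_size, two_byte_size)     # 5 10

# Problem 2: the same two bytes can be read two different ways.
two_bytes = bytes([0x4E, 0x2D])
as_one_char = two_bytes.decode("utf-16-be")
as_two_chars = two_bytes.decode("ascii")
print(as_one_char == "\u4e2d")       # True
print(as_two_chars)                  # N-
```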

UTF

UTF stands for Unicode Transformation Format. UTF-8 is a solution to the space wasted by storing Unicode codes directly. It stores Unicode codes in a variable number of bytes: English characters still take one byte, while Chinese characters take three. So how does UTF-8 do it?

First, for a single-byte symbol, the first bit of the byte is set to 0 and the remaining seven bits are the Unicode code of the symbol. So for English letters UTF-8 is identical to ASCII.

Second, for an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0, and the first two bits of each following byte are set to 10. All the remaining bits hold the Unicode code of the symbol.

The following table summarizes the encoding rules, with the letter x marking the bits available for the code.

Unicode symbol range (hexadecimal)	UTF-8 encoding (binary)
`0000 0000-0000 007F`	`0xxxxxxx`
`0000 0080-0000 07FF`	`110xxxxx 10xxxxxx`
`0000 0800-0000 FFFF`	`1110xxxx 10xxxxxx 10xxxxxx`
`0001 0000-0010 FFFF`	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`

With the table above, reading UTF-8 is straightforward: if the first bit of a byte is 0, the byte is a character by itself. If the first bit is 1, the number of consecutive 1s tells you how many bytes the current character occupies.
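The table translates almost line by line into code. The following is a minimal sketch rather than a production encoder: `utf8_encode` and `utf8_length` are hypothetical helper names, the encoder does not reject surrogate code points, and `utf8_length` expects the first byte of a character, not a `10xxxxxx` continuation byte.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point by filling in the x bits of the table."""
    if code_point <= 0x7F:
        return bytes([code_point])                           # 0xxxxxxx
    if code_point <= 0x7FF:
        return bytes([0b11000000 | (code_point >> 6),        # 110xxxxx
                      0b10000000 | (code_point & 0x3F)])     # 10xxxxxx
    if code_point <= 0xFFFF:
        return bytes([0b11100000 | (code_point >> 12),       # 1110xxxx
                      0b10000000 | ((code_point >> 6) & 0x3F),
                      0b10000000 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:
        return bytes([0b11110000 | (code_point >> 18),       # 11110xxx
                      0b10000000 | ((code_point >> 12) & 0x3F),
                      0b10000000 | ((code_point >> 6) & 0x3F),
                      0b10000000 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")


def utf8_length(first_byte):
    """Return how many bytes a character occupies, judged from its first byte."""
    if (first_byte & 0b10000000) == 0:
        return 1                     # first bit is 0: single-byte character
    ones = 0
    mask = 0b10000000
    while first_byte & mask:         # count the consecutive leading 1 bits
        ones += 1
        mask >>= 1
    return ones


samples = ["a", "\u00e9", "\u4e2d", "\U0001F600"]  # 1, 2, 3 and 4 bytes
for ch in samples:
    encoded = utf8_encode(ord(ch))
    print(format(ord(ch), "04X"), encoded.hex(), utf8_length(encoded[0]))
```

Each result can be compared against Python's own `str.encode("utf-8")`, which should produce the same bytes for these samples.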

For example, suppose that a string like “Hello world” has its Unicode encoding

H --0068 E --0065 L --006C L --006C o--006F world --4E16 world --754CCopy the code

According to utF-8 encoding rules, the following UTF-8 encoding can be obtained

H --01101000 E --01100101 L --01101100 L --01101100 O --01101111 World --11100100-10111000-10010110 world --11100111-10010101-10001100Copy the code

You can see that with UTF-8, English characters take one byte and Chinese characters take three, for a total of 11 bytes, compared with 14 bytes if the two-byte Unicode codes were stored directly. UTF-8 saves a lot of space for English but uses more space for Chinese.
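The byte counts can be confirmed with Python's built-in codecs. As an assumption of this sketch, `utf-16-be` again stands in for storing two-byte Unicode codes directly:

```python
text = "hello\u4e16\u754c"   # "hello" followed by the characters 世界

utf8 = text.encode("utf-8")
two_byte = text.encode("utf-16-be")   # two bytes per character, no marker

print(len(utf8))      # 11
print(len(two_byte))  # 14

def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as dash-separated binary."""
    return "-".join(format(b, "08b") for b in ch.encode("utf-8"))

print(utf8_bits("h"))       # 01101000
print(utf8_bits("\u4e16"))  # 11100100-10111000-10010110
print(utf8_bits("\u754c"))  # 11100111-10010101-10001100
```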

Little Endian and Big Endian

As mentioned above, Unicode uses two bytes to represent a character. If the high byte is stored first, the order is called “big endian”; if the low byte is stored first, it is “little endian”. The Unicode code for 世 (the first character of 世界, “world”) is 4E16: one byte is 4E and the other is 16. Storing 4E first gives big endian (4E 16), and storing 16 first gives little endian (16 4E).

So how does a computer know which way a file is encoded?

According to the Unicode specification, a file can begin with a character that indicates the byte order. This character is called ZERO WIDTH NO-BREAK SPACE and its code is FEFF. It is exactly two bytes long, and FF is one more than FE.

If the first two bytes of a text file are FE FF, the file is big endian. If the first two bytes are FF FE, the file is little endian.
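The byte-order check can be sketched as follows. `detect_utf16_order` is a hypothetical helper written for this example. It assumes the data is two-byte Unicode (UTF-16) and only looks at the first two bytes, while the marker constants come from Python's standard `codecs` module:

```python
import codecs

ch = "\u4e16"  # the character 世, code 4E16

print(ch.encode("utf-16-be").hex())  # 4e16
print(ch.encode("utf-16-le").hex())  # 164e

print(codecs.BOM_UTF16_BE.hex())     # feff
print(codecs.BOM_UTF16_LE.hex())     # fffe

def detect_utf16_order(data):
    """Guess the byte order of two-byte Unicode data from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    return "unknown"   # no marker at the start of the data

big = codecs.BOM_UTF16_BE + ch.encode("utf-16-be")
little = codecs.BOM_UTF16_LE + ch.encode("utf-16-le")
print(detect_utf16_order(big), detect_utf16_order(little))  # big little

# Python's generic "utf-16" decoder reads the marker and then drops it.
print(big.decode("utf-16") == ch, little.decode("utf-16") == ch)  # True True
```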

Conclusion

UTF-8 is an encoding built on the Unicode character set. Unicode is now supported by almost every programming language and operating system, and the days when one Chinese character counted as two English characters are behind us.

Regional encodings such as GBK only suit their own environment, in GBK's case simplified Chinese. GBK does take less space than UTF-8 for Chinese text, but the world has become a global village, so everyone is advised to use UTF-8.
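The space difference is easy to measure. A small sketch with two arbitrary sample strings (中文 and "code" are illustrative choices):

```python
chinese = "\u4e2d\u6587"   # the two characters 中文
english = "code"

chinese_sizes = (len(chinese.encode("gbk")), len(chinese.encode("utf-8")))
english_sizes = (len(english.encode("gbk")), len(english.encode("utf-8")))

print(chinese_sizes)  # (4, 6)
print(english_sizes)  # (4, 4)
```

For Chinese text GBK uses two bytes per character where UTF-8 uses three, while for English text the two are the same size.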

ANSI: On Windows, if you open a document with Notepad you will often see “ANSI” as the encoding. It is the default encoding on Windows, and what it means depends on the system language: ASCII for English documents and GB2312 for simplified Chinese documents (on the simplified Chinese edition of Windows only; the traditional Chinese edition uses Big5).