What is encoding

Encoding is the process of converting information from one form or format into another; decoding is the reverse process.

For example, suppose you and your friend are sitting in a classroom with several people between you. You want to pass the message “abcdefg” to your friend through the people in the middle, but you don’t want them to know its content, so you have to transform the message first. How? Suppose you and your friend have agreed that every letter passed on paper is shifted back by one position: a becomes b, b becomes c, and z, being the last letter, wraps around to the first letter, a. So you write “bcdefgh” on a piece of paper. Transforming the message from “abcdefg” to “bcdefgh” is encoding, and the encoding rule is “shift each letter of the English alphabet back by one”. Your friend receives the encoded message and shifts each letter forward by one to recover the real message; that reverse process is decoding.
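Here is a minimal Python sketch of this shift-by-one scheme (the names encode_shift and decode_shift are just for illustration):

```python
def encode_shift(message: str) -> str:
    """Shift each lowercase letter back by one: a -> b, ..., z wraps to a."""
    return "".join(
        chr((ord(c) - ord("a") + 1) % 26 + ord("a")) if c.islower() else c
        for c in message
    )

def decode_shift(message: str) -> str:
    """Reverse the shift: b -> a, ..., a wraps back to z."""
    return "".join(
        chr((ord(c) - ord("a") - 1) % 26 + ord("a")) if c.islower() else c
        for c in message
    )

print(encode_shift("abcdefg"))  # bcdefgh
print(decode_shift("bcdefgh"))  # abcdefg
```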

Why do characters in a computer need to be encoded

The message in the example above was encoded for secrecy. Do characters in a computer need to be encoded for the same reason? Of course not. Characters in a computer need to be encoded because all data in a computer can only be represented as binary numbers, that is, as 0s and 1s. To distinguish the characters we recognize, a computer system must encode each of them as a particular arrangement of 0s and 1s.

ASCII

ASCII stands for American Standard Code for Information Interchange, and it is the most widely adopted standard of its kind. How widely? As we will see, there are many character encodings in modern computer systems, but almost all of them are compatible with ASCII; that is, for the characters ASCII covers, their encodings agree with ASCII’s. So what characters can ASCII represent?

ASCII was first published as a formal standard in 1967 and was last updated in 1986. It defines a total of 128 characters, of which 95 are printable characters that can be shown in print or on a display; the remaining 33 are control characters or communication-specific characters, such as LF (line feed), CR (carriage return), and ACK (acknowledge, a special character used in communication).

In ASCII, every character encoding fits in seven bits. Computer systems, however, generally store data one byte (eight bits) at a time, so the remaining bit is kept at zero. For example, the ASCII code of the character A is 65 in decimal notation and 01000001 in binary; see an ASCII table for the other characters.
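We can verify this with a couple of Python built-ins:

```python
print(ord("A"))                 # 65, the ASCII code of 'A'
print(format(ord("A"), "08b"))  # 01000001, note the high bit is 0
print("A".encode("ascii"))      # b'A', a single byte
```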

ASCII extension

ASCII has its limitations. It covers only the 26 basic Latin letters, the Arabic numerals, and common English punctuation marks, so it is really only suitable for modern American English. Other Indo-European languages still have characters that ASCII cannot encode. How do we solve this? As mentioned earlier, ASCII uses only 7 bits of a byte, leaving 1 bit unused. Since 2^8 = 256, using all 8 bits lets us encode 256 characters; beyond the original 128 ASCII characters, an additional 128 positions become available, which is more than enough for other Indo-European languages. ISO 8859-1 is one such extension.
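Python’s built-in latin-1 codec (the common name for ISO 8859-1) shows how those extra 128 positions are used:

```python
print("A".encode("latin-1"))      # b'A', 65: the ASCII range is unchanged
print("é".encode("latin-1"))      # b'\xe9', 233: uses the high bit
print(b"\xe9".decode("latin-1"))  # é
```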

With the Indo-European languages taken care of, what can we do about the Sino-Tibetan family? Take Chinese as an example: there are about 2,500 characters in common use in mainland China alone, plus a large number of traditional and rare characters.

Chinese encodings

Obviously, one byte is no longer enough, and we must consider using multiple bytes to represent a Chinese character. GB 2312-80, commonly called GB2312, is a national standard issued by the General Administration of Standards of China in 1980 for the processing and interchange of Chinese characters in computer systems. The standard stores each Chinese character encoding in two bytes.

Using two bytes per character raises a question: how do we stay compatible with ASCII? Could we simply throw ASCII away and re-encode its 128 characters as two-byte codes? We could, but that would be wasteful. Characters that fit in one byte would be forced into two, wasting a great deal of storage, and even a document containing only the 128 American English characters could no longer be exchanged with computers that use ASCII, because the two sides would disagree on how those 128 characters are encoded.

Solving the compatibility problem with ASCII is clearly imperative, so let’s take a look at how GB2312 handles it.

GB2312 is a double-byte encoding that uses 7 bits of each byte, and it contains 6,763 Chinese characters and 682 non-Chinese graphic characters. How does it represent that many characters? Seven bits can represent 128 values, so if we treat the two bytes as the horizontal and vertical axes of a two-dimensional coordinate system, we get a plane that can hold 128 × 128 = 16,384 character positions. The usable number is smaller than that, of course: excluding the ASCII control characters 0–31, the space character 32, and delete 127 leaves only 94 values per byte, so GB2312 uses a 94 × 94 grid of location codes to arrange its characters.

Besides the location code there is another concept, the GB code: the GB code of a GB2312 character equals its location code plus 32 in each byte. Why exactly 32? Presumably to leave room for the 32 ASCII control characters. The GB code is still not compatible with ASCII, however, so a further concept is introduced: the internal code, which equals the hexadecimal GB code plus 8080H, that is, 80H added to each of the two bytes. This shifts each byte by 128 positions, and since ASCII can represent only 128 characters, the shift achieves full compatibility with ASCII. Decoding is also easy to disambiguate: the highest bit of an ASCII byte is 0, while after the shift the highest bit of every GB2312 byte is 1. That is a clean way to tell them apart.
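As a sanity check, here is a short Python sketch that walks a location code through these two offsets and decodes the result with Python’s built-in gb2312 codec. The row and column for the character 中, (54, 48), are taken from the standard’s code table:

```python
# Location code (row, column) of "中" in the 94x94 grid.
row, col = 54, 48

# GB code: location code + 32 in each byte.
gb_hi, gb_lo = row + 32, col + 32

# Internal code: GB code + 80H in each byte (sets the high bit to 1).
nei_hi, nei_lo = gb_hi + 0x80, gb_lo + 0x80

print(hex(nei_hi), hex(nei_lo))                  # 0xd6 0xd0
print(bytes([nei_hi, nei_lo]).decode("gb2312"))  # 中
```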

ANSI

When we edit a document in Windows Notepad and save it, we can choose the document’s encoding, and one of the options listed is ANSI. What is this? Strictly speaking, ANSI is not a character set at all: it stands for a different character encoding depending on the language edition of Windows. The notion exists only on Windows, where the active code page determines which encoding ANSI actually refers to. On a simplified Chinese system the code page is 936, corresponding to GBK (an extension of GB2312), while on a traditional Chinese system the code page is 950, corresponding to Big5.
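A small illustration using Python’s codec names gbk and big5, which correspond to code pages 936 and 950 (a traditional-character string is used for Big5, which has no simplified forms):

```python
# The same notion of "ANSI" maps to different byte sequences
# depending on the active Windows code page.
print("编码".encode("gbk"))   # code page 936, simplified Chinese Windows
print("編碼".encode("big5"))  # code page 950, traditional Chinese Windows
```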

Unicode

With a Chinese encoding in place, it is now possible to process and transmit Chinese characters on computers, but the problem is not over. GB2312 only guarantees that Chinese and English characters can be handled effectively; for other languages it is of no help. This is where Unicode comes in.

Unicode is a character encoding scheme developed by an international organization, with the goal of covering all the characters and symbols in the world.

Unicode originally used two bytes to store each character encoding, a form also known as UCS-2. Anticipating that two bytes might not be enough in the future, there is also UCS-4, which uses four bytes to represent a character.

It is important to note that Unicode is only a coding scheme: it specifies a numeric code for each character, while how that code is actually stored in a particular computer system is left to a concrete character encoding.
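Python makes the distinction visible: ord returns the abstract code point, while encode produces the concrete bytes under each encoding (utf-16-be and utf-32-be below correspond to the two-byte and four-byte UCS forms):

```python
ch = "中"
print(hex(ord(ch)))            # 0x4e2d, the abstract Unicode code
print(ch.encode("utf-16-be"))  # two bytes, 4E 2D, like UCS-2
print(ch.encode("utf-32-be"))  # four bytes, 00 00 4E 2D, like UCS-4
print(ch.encode("utf-8"))      # three bytes, E4 B8 AD
```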

UTF-8

UTF-8 is one implementation of Unicode. How does it store Unicode character encodings?

For example, the Chinese character 中 has the Unicode code 4E2D, which is 01001110 00101101 in binary, yet it is stored in UTF-8 as 11100100 10111000 10101101, or E4B8AD in hexadecimal. Notice that the original Unicode code is two bytes long while the UTF-8 encoding is three bytes long. Why? Again, for ASCII compatibility. ASCII stores each character encoding in one byte with the highest bit set to 0, so UTF-8 adopts a variable-length encoding. The rules are as follows:

For a single-byte character, the first bit of the byte is set to 0, and the encoding is exactly the same as ASCII.

If a multi-byte character takes up n bytes, the first n bits of the first byte are set to 1 and the (n+1)-th bit is set to 0, while the first two bits of every remaining byte are set to 10. The bit positions left over in all the bytes are then filled, in order, with the bits of the character’s Unicode code, padding the high end with zeros.

For example, the UTF-8 encoding of the character 中 mentioned above is 11100100 10111000 10101101. Since this character takes up three bytes, the first three bits of the first byte are 1 and the fourth bit is 0, while the second and third bytes both start with 10; this gives the template 1110xxxx 10xxxxxx 10xxxxxx. The Unicode code of 中 is 01001110 00101101; after removing the leading zero there are 15 significant bits, but the template has 16 slots, so the highest slot is filled with a 0 and the 15 bits are placed after it in order. That yields the final UTF-8 encoding.
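The rules are mechanical enough to write out directly. Here is a minimal Python sketch covering the one- to three-byte cases (real UTF-8 also defines a four-byte form for code points above U+FFFF, omitted here), checked against Python’s own encoder:

```python
def utf8_encode(ch: str) -> bytes:
    cp = ord(ch)
    if cp < 0x80:    # 1 byte: 0xxxxxxx, identical to ASCII
        return bytes([cp])
    if cp < 0x800:   # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (cp >> 12),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode("中").hex())                    # e4b8ad
print(utf8_encode("中") == "中".encode("utf-8"))  # True
```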