We know that inside a computer, all information is stored in binary form. Characters, video files, audio files: everything ultimately corresponds to a string of zeros and ones. The transition from human-readable information to machine-level binary can therefore be understood as a process of encoding, and naturally the reverse process is called decoding.

It is fair to say that all garbled text comes from a mismatch between the decoding method and the encoding method. It is as if I write you a letter in English (my message is encoded in English), but you only know Chinese and read the letter as Chinese (decode it as Chinese); to you, the whole letter is what we call "garbled". In fact, garbled text is not a complicated problem: the decoding method simply differs from the encoding method, and as soon as you pick the right decoding method, everything reads fine.

Following the history of computer character encoding, from the earliest ASCII standard to Unicode, this article discusses how "human information" is encoded into "computer-level information".

1. The always-compatible ASCII

In the middle of the last century, Americans invented the computer but did not anticipate how quickly it would spread, so they only developed a mapping between English characters plus some control characters and binary. That original standard is the ASCII encoding standard.

ASCII first numbers all the characters that need to be encoded (128 characters in total); for example, the digit 0 is numbered 48 and the lowercase letter a is numbered 97. ASCII then uses one byte (eight bits) to describe each character, converting its number from decimal to binary. The characters therefore occupy the range 00000000 to 01111111 (0-127). Any file using the ASCII encoding standard is interpreted as one character per byte, so all English letters, digits, and other symbols can be stored and read. A classic ASCII table is attached below:

Although ASCII really uses only seven bits of that byte, it still covers quite a lot of characters, which was entirely sufficient for Americans. But for some European countries, and certainly for China, one byte is far too little, so many regions created their own extended encoding standards. They all remain compatible with ASCII (it was the original, after all).
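
As a quick sanity check of the single-byte model just described, here is a minimal Java sketch (the class name is mine, for illustration only) that prints the ASCII number and the encoded byte of a few characters using the standard US-ASCII charset:

```java
import java.nio.charset.StandardCharsets;

public class AsciiDemo {
    public static void main(String[] args) {
        // '0' is numbered 48, 'a' is 97; US-ASCII stores each character in a single byte
        for (char c : new char[] {'0', 'a', 'A'}) {
            byte[] encoded = String.valueOf(c).getBytes(StandardCharsets.US_ASCII);
            System.out.printf("'%c' -> number %d -> byte 0x%02X%n", c, (int) c, encoded[0]);
        }
    }
}
```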

2. Windows-1252: the European extension

The American ASCII standard defines encodings for only 128 characters, occupying the binary range 00000000-01111111. Europeans therefore placed their own symbols directly in the remaining 128 values, 10000000-11111111 (128-255).

3. GB2312: the Chinese extension

Our great Chinese nation has thousands of Chinese characters; how could the American one-byte standard possibly suffice? GB2312 (the national standard code) mainly covers the simplified Chinese characters in daily use, 6763 of them in total. It encodes each Chinese character in two bytes and remains backward compatible with the ASCII standard.

This raises a question: ASCII characters are encoded in one byte, while our Chinese characters are encoded in two. When decoding, should the computer read one byte at a time and parse it as a character according to ASCII, or read two bytes at a time and parse them as a Chinese character according to GB2312?

GB2312 stipulates that the highest bit of the first of the two bytes encoding a Chinese character must be 1. Since the highest bit of every ASCII character (00000000-01111111) is 0, whenever the computer reads a byte whose highest bit is 1, it reads two bytes in a row and parses them as one Chinese character according to the GB2312 standard. Otherwise, the byte is treated as an ordinary character and parsed according to ASCII.
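
To make the rule concrete, here is a minimal Java sketch of the decode decision (the class name and the sample bytes are mine; the GB2312 bytes 0xD1 0xEE for 杨 are taken from the example later in this article, and error handling is omitted):

```java
public class Gb2312Sketch {
    public static void main(String[] args) {
        // "A杨" in GB2312: 'A' is one ASCII byte, 杨 is the two bytes 0xD1 0xEE
        byte[] data = {0x41, (byte) 0xD1, (byte) 0xEE};
        int i = 0;
        while (i < data.length) {
            if ((data[i] & 0x80) != 0) {   // highest bit is 1 -> two-byte GB2312 character
                System.out.printf("bytes %02X %02X -> one Chinese character%n", data[i], data[i + 1]);
                i += 2;
            } else {                       // highest bit is 0 -> one-byte ASCII character
                System.out.printf("byte  %02X    -> one ASCII character '%c'%n", data[i], (char) data[i]);
                i += 1;
            }
        }
    }
}
```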

Below, we briefly describe the specific encoding details of GB2312:

First, GB2312 arranges all characters through what is called partitioning (area/position codes):

  • Areas 01-09 hold some special symbols
  • Areas 16-55 hold the level-1 common Chinese characters
  • Areas 56-87 hold the level-2 Chinese characters
  • Areas 10-15 and 88-94 are left unassigned

GB2312's encoding rule is: 0xA0 + area number, 0xA0 + position number. For example, the area/position code of 杨 (Yang) is 4978 (position 78 in area 49), so the GB2312 encoding of 杨 is 0xA0 + 49, 0xA0 + 78, that is, 0xD1 0xEE. This is also why there used to be an "area/position" input method, in which you typed a character by entering the four digits of its area/position code. As for why 0xA0 is added, I looked up a lot of material and found no definitive statement; it may simply be a convention, although adding 0xA0 conveniently pushes both bytes above 0x7F, which satisfies the high-bit rule described earlier.
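
As a small sketch of that arithmetic (the class and method names are mine, for illustration only), the following Java snippet converts an area/position code to GB2312 bytes by adding 0xA0 to each part:

```java
public class QuweiToGb2312 {
    // Convert an area/position code to GB2312 bytes by adding 0xA0 to each part,
    // exactly as described above; no validation is performed.
    static byte[] toGb2312(int area, int position) {
        return new byte[] {(byte) (0xA0 + area), (byte) (0xA0 + position)};
    }

    public static void main(String[] args) {
        byte[] yang = toGb2312(49, 78);  // 杨 sits at area 49, position 78
        System.out.printf("0x%02X 0x%02X%n", yang[0], yang[1]);  // prints 0xD1 0xEE
    }
}
```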

It is important to understand that, when you think about it, encoding is a two-step process:

  • Assign every character that needs to be encoded a unique number (identifier)
  • Define a uniform rule that maps each identifier to the underlying binary

This is true of the ASCII standard, and of GB2312 as well.

For example, ASCII numbers all of its characters without duplication (step 1), and then makes the rule that the binary form of a character's number is its encoding (step 2).

Likewise, GB2312 numbers all Chinese characters without duplication (step 1), and then makes rules by which the binary encoding of a Chinese character can be derived from its area/position code (step 2).

GBK is backward compatible with GB2312 and extends it to 21003 Chinese characters. It still encodes each Chinese character in a fixed two bytes, but the value range of the high-order byte differs; we will not go into the details here.

4. The ambitious Unicode

Above we introduced the American encoding standard, a European encoding standard, and the Chinese encoding standards. Of course, this is just the tip of the iceberg: there are all kinds of encoding standards around the world. Computer manufacturers would have to build machines with different encoding standards for different regions, which is cumbersome and inefficient. Is there one encoding standard that can cover every character in the world and also provide a storage implementation?

Unicode was created to unify all the encodings in the world. It numbers nearly every character in the world, with more than 1.1 million code points ranging from 0x000000 to 0x10FFFF, although most characters fall in the range 0x0000 to 0xFFFF (below 65536). Every character has a Unicode number, typically written in hexadecimal with a U+ prefix. For example, the Unicode number of 杨 (Yang) is U+6768.
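
If you want to see a character's Unicode number yourself, a minimal Java sketch looks like this (assuming the source file is saved and compiled as UTF-8 so that the 杨 literal survives; the class name is mine):

```java
public class CodePointDemo {
    public static void main(String[] args) {
        String yang = "杨";
        int codePoint = yang.codePointAt(0);
        // Prints U+6768 (26472 in decimal), matching the number quoted above
        System.out.printf("U+%04X (decimal %d)%n", codePoint, codePoint);
    }
}
```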

Unicode is an encoding standard that numbers all the characters in the world; it does not itself specify how each character should be mapped to a binary string. The main implementations of Unicode are UTF-32, UTF-16, and UTF-8. Let's look at the implementation details of each.

1. UTF-32

This is the most brute-force implementation: every character is stored in a fixed four bytes. All characters take four bytes, which wastes space, so it is rarely used in practice.
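
A quick sketch of that waste, assuming the JDK at hand ships the optional UTF-32 charsets (typical desktop and server JDK builds do; the class name is mine): even a plain ASCII letter occupies four bytes.

```java
import java.nio.charset.Charset;

public class Utf32Demo {
    public static void main(String[] args) {
        // In UTF-32BE every character, even plain ASCII, occupies a fixed four bytes
        Charset utf32 = Charset.forName("UTF-32BE");
        for (String s : new String[] {"A", "杨"}) {
            byte[] bytes = s.getBytes(utf32);
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes) hex.append(String.format("%02X ", b));
            System.out.println(s + " -> " + bytes.length + " bytes: " + hex);
        }
    }
}
```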

2. UTF-16

For any Unicode storage implementation, the basic idea is that the more commonly used characters should be represented with fewer bytes, and the rarest characters with the most bytes. Let's look at the implementation details of UTF-16:

Unicode code points range from 0x000000 to 0x10FFFF, and a total of 1,112,064 characters can be encoded. UTF-16's policy is that the number range 0x0000-0xFFFF (0-65535) holds the common characters, stored in a fixed two bytes, where the value stored for a character is simply the binary form of the character's own number. However, the interval 0xD800-0xDFFF is not assigned any characters; it is reserved for encoding the supplementary character set, which we will come to shortly.

So common characters take two bytes. But uncommon does not mean unused, so let's look at how those supplementary character sets, the so-called uncommon characters, are encoded.

UTF-16 uses a fixed four bytes to store characters in the number range 0x10000-0x10FFFF. Notice that between 0x10000 and 0x10FFFF there are 0x100000, that is 2^20 = 1,048,576, code points, so 20 bits are needed to encode them all. Within the four bytes, the first two bytes must therefore provide at least 2^10 possibilities (ten bits), and the last two bytes must likewise provide 2^10 possibilities, so that together they can represent every supplementary character.

However, there is a problem: given a string of binary data, how do I determine whether a character is a common character (stored in a fixed two bytes) or a supplementary character (stored in four bytes)?

UTF-16's solution is as follows:

Every Unicode character has its own Unicode number, and supplementary characters are numbered 0x10000 or above. Subtracting 0x10000 from a supplementary character's number gives its ordinal within the supplementary character set. This ordinal always falls in the range 0x00000-0xFFFFF, which fits in 20 bits, because the largest supplementary number, 0x10FFFF, minus 0x10000 is 0xFFFFF.

For the first two bytes (called the lead surrogate, or leading proxy, on Wikipedia), their value is defined to lie in the range 0xD800 (0xD800 + 0x000) to 0xDBFF (0xD800 + 0x3FF, where 0x3FF is ten 1 bits), providing exactly 2^10 possible values.

For the last two bytes (the trail surrogate), their value is defined to lie in the range 0xDC00 (0xDC00 + 0x000) to 0xDFFF (0xDC00 + 0x3FF), which likewise provides exactly 2^10 possible values.

Therefore, if a two-byte value is found to lie in the range 0xD800-0xDBFF, the character is a supplementary character stored in four bytes, and those four bytes read in sequence are the encoding of the current character. Otherwise, it is a basic common character stored in a fixed two bytes and should be read as such.
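
Here is a minimal Java sketch of that detection rule on the decode side (the class name and sample data are mine): it walks 16-bit units, treats anything in 0xD800-0xDBFF as the lead half of a surrogate pair, and combines the pair back into a code point by reversing the arithmetic described above.

```java
public class Utf16DecodeSketch {
    // Combine a lead/trail surrogate pair back into a code point:
    // take 10 bits from each half and add 0x10000 back.
    static int combine(char lead, char trail) {
        return (((lead - 0xD800) << 10) | (trail - 0xDC00)) + 0x10000;
    }

    public static void main(String[] args) {
        char[] units = {0x0024, 0xD852, 0xDF62};   // "$" followed by U+24B62
        int i = 0;
        while (i < units.length) {
            if (units[i] >= 0xD800 && units[i] <= 0xDBFF) {   // lead surrogate -> four-byte character
                System.out.printf("U+%X%n", combine(units[i], units[i + 1]));
                i += 2;
            } else {                                          // ordinary two-byte character
                System.out.printf("U+%04X%n", (int) units[i]);
                i += 1;
            }
        }
    }
}
```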

Here are a few examples:

1. Unicode character U+0024

First, its number is less than 0x10000, so the character belongs to the common (basic) character set, and its UTF-16 encoding is simply the binary form of its number: 0x0024.

2. Unicode number U+24B62

First, the character's number is greater than 0x10000, so it belongs to the supplementary character set.

Subtracting 0x10000 from 0x24B62 gives the character's ordinal within the supplementary character set: 0x14B62.

Applying the UTF-16 encoding rules, we compute the lead surrogate and the trail surrogate; combined, they form the character's UTF-16 encoding. Here is how it works:

0x14B62 -> 0001 0100 1011 0110 0010

Lead surrogate: 0001010010 (0x052) + 0xD800 = 0xD852

Trail surrogate: 1101100010 (0x362) + 0xDC00 = 0xDF62

Therefore, the UTF-16 encoding of the character U+24B62 is 0xD852 0xDF62.
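
The same computation in code, as a minimal Java sketch (the class name is mine), cross-checked against the JDK's own Character.highSurrogate / Character.lowSurrogate helpers (available since Java 7):

```java
public class Utf16EncodeSketch {
    public static void main(String[] args) {
        int codePoint = 0x24B62;
        int offset = codePoint - 0x10000;                 // 0x14B62, fits in 20 bits
        char lead  = (char) (0xD800 + (offset >>> 10));   // high 10 bits -> 0xD852
        char trail = (char) (0xDC00 + (offset & 0x3FF));  // low 10 bits  -> 0xDF62
        System.out.printf("manual:  %04X %04X%n", (int) lead, (int) trail);
        // Cross-check against the JDK's own helpers
        System.out.printf("library: %04X %04X%n",
                (int) Character.highSurrogate(codePoint),
                (int) Character.lowSurrogate(codePoint));
    }
}
```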

To summarize the UTF-16 encoding standard: characters numbered below 65536 (0x10000) are encoded in a fixed two bytes holding the binary form of the number. For the supplementary character set (65536 and above), subtract 65536 from the Unicode number to get the character's ordinal within the supplementary set, split that ordinal into two 10-bit halves, and add a fixed offset to each so the resulting surrogates land in specific ranges. Those ranges are what tell a decoder whether a character occupies two bytes or four.

3. UTF-8

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode. It encodes Unicode characters in one to four bytes, with the most commonly used characters stored in the fewest bytes and less common characters stored in relatively more.

The UTF-8 encoding rules are as follows:

For characters numbered below 128, UTF-8 is identical to ASCII: one byte, with the highest bit 0.

For the remaining number ranges, UTF-8 uses the following templates:

  • 0x0000-0x007F: 0xxxxxxx
  • 0x0080-0x07FF: 110xxxxx 10xxxxxx
  • 0x0800-0xFFFF: 1110xxxx 10xxxxxx 10xxxxxx
  • 0x10000-0x10FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

There is not much more to it than that, so let's walk through an example to see exactly how this is done.

The Unicode number of the Chinese character 杨 (Yang) is 0x6768, which is 26472 in decimal.

Since 0x6768 lies in the range 0x0800-0xFFFF, the UTF-8 template for this character is: 1110xxxx 10xxxxxx 10xxxxxx

The binary of 0x6768 is: 0110 0111 0110 1000

Fill the x positions in the template from back to front with the bits of the number, starting from its last bit.

The result is 11100110 10011101 10101000, so the corresponding hexadecimal encoding is 0xE6 0x9D 0xA8.
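
Here is the same bit-filling as a minimal Java sketch (the class name is mine, and a UTF-8 source file is assumed so the 杨 literal survives), cross-checked against the JDK's built-in UTF-8 encoder:

```java
import java.nio.charset.StandardCharsets;

public class Utf8EncodeSketch {
    public static void main(String[] args) {
        int cp = 0x6768;                                // Unicode number of 杨
        // Fill the three-byte template 1110xxxx 10xxxxxx 10xxxxxx from the bits of the number
        byte b1 = (byte) (0xE0 | (cp >>> 12));          // 1110 + top 4 bits   -> 0xE6
        byte b2 = (byte) (0x80 | ((cp >>> 6) & 0x3F));  // 10 + middle 6 bits  -> 0x9D
        byte b3 = (byte) (0x80 | (cp & 0x3F));          // 10 + low 6 bits     -> 0xA8
        System.out.printf("manual:  %02X %02X %02X%n", b1, b2, b3);
        // Cross-check with the JDK's UTF-8 encoder
        byte[] lib = "杨".getBytes(StandardCharsets.UTF_8);
        System.out.printf("library: %02X %02X %02X%n", lib[0], lib[1], lib[2]);
    }
}
```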

To summarize: the UTF-8 standard groups all Unicode numbers into ranges, and the smaller the number, the fewer bytes are used for storage. Each range of Unicode numbers has its own UTF-8 template; to obtain a character's UTF-8 encoding, you fill the template with the binary form of its number according to the corresponding rule.

Conversely, when a computer decodes a UTF-8 file, it works byte by byte. If the highest bit of the current byte is 0, reversing the steps above yields the character's Unicode number, which can then be looked up to get the character.

If the current byte instead begins with several 1 bits, the count of leading 1s tells how many bytes the character's encoding occupies, and that many bytes are read in sequence. The same reverse operation then yields the corresponding character.
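
A minimal Java sketch of this byte-by-byte decoding (the class name and sample bytes are mine; error handling for truncated or invalid sequences is omitted):

```java
public class Utf8DecodeSketch {
    public static void main(String[] args) {
        byte[] data = {0x41, (byte) 0xE6, (byte) 0x9D, (byte) 0xA8};   // "A杨" in UTF-8
        int i = 0;
        while (i < data.length) {
            int first = data[i] & 0xFF;
            if ((first & 0x80) == 0) {                     // 0xxxxxxx -> single ASCII byte
                System.out.printf("U+%04X%n", first);
                i += 1;
            } else {
                // Count the leading 1 bits of the first byte to get the sequence length
                int len = Integer.numberOfLeadingZeros(~(first << 24));
                int cp = first & (0x7F >> len);            // keep the payload bits of byte 1
                for (int k = 1; k < len; k++) {            // append 6 payload bits per continuation byte
                    cp = (cp << 6) | (data[i + k] & 0x3F);
                }
                System.out.printf("U+%04X%n", cp);         // prints U+6768 for 杨
                i += len;
            }
        }
    }
}
```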

We have briefly introduced several common encodings here. When it comes to encoding, always remember the conclusion of this article: every encoding standard really does two things. First, it assigns a number or identifier to every character that needs to be encoded; second, it specifies a rule that uniformly maps that number or identifier to a binary string.


All the code, images, and files for this article are stored in my GitHub repository:

(https://github.com/SingleYam/overview_java)

Welcome to follow my WeChat official account, "jump on the code of Gorky"; all articles are synchronized there.