There is some knowledge you are exposed to all the time, yet once asked about it, you can hardly say anything beyond its name. Character encoding is like that. We all know UTF-8 and GBK, and perhaps that GBK contains more Chinese characters, but doesn't UTF-8 support Chinese too? Why do we use both encodings instead of just one? ASCII is the same: I learned it at the very beginning, yet I still couldn't explain its relationship to the other encodings. In this article, we will go through character encoding in detail.

 

I. ASCII code

1. The birth of ASCII

First of all, a computer is ultimately binary. Representing numbers is easy: they already have a binary form. Characters are harder, because there is no inherent numeric relationship between them, so each character has to be assigned a unique number, one to one, like the periodic table: one number, one element.

This is how ASCII was born: a character code developed in the US in the 1960s. It defines 128 characters in seven binary bits, the 128 combinations from 000 0000 to 111 1111, i.e. 0–127 in decimal.
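The one-to-one mapping is easy to see in code. Here is a minimal Java sketch (the class name is mine, for illustration): a `char` is just a number, and ASCII characters fit in 7 bits.

```java
public class AsciiDemo {
    public static void main(String[] args) {
        // 'A' is the number 65 under the hood; its binary form, 1000001,
        // fits comfortably in 7 bits
        char c = 'A';
        System.out.println((int) c);                   // 65
        System.out.println(Integer.toBinaryString(c)); // 1000001
        // the mapping works both ways: number 97 is the character 'a'
        System.out.println((char) 97);                 // a
    }
}
```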

2. Is ASCII 7 bits or 8 bits?

By the analysis above, ASCII should be 7 bits long; it never needed an eighth bit. English has only 26 letters, and squeezing 128 useful characters out of them was already hard work. So ASCII must have started as 7 bits, with the eighth bit fixed at 0.

Some say it is 8-bit. One argument is extension: different regions could use the remaining 128 code points of an 8-bit byte for their own purposes, which is why extended ASCII varies from country to country. The first 128 code points are fixed, of course, while the second 128 differ widely. Others say computers usually store ASCII in 8 bits because the eighth bit was used for parity checking, so ASCII is 8 bits.

There are many explanations, but personally I feel ASCII is still 7 bits; the extended ASCII codes are certainly not the original ASCII, not to mention that there are so many different extended versions.

 

 

II. Codes of different countries

Extending ASCII was only the first step. European countries, whose languages are not too different from English, could use the unused eighth bit and add whatever symbols they needed. Of course, different countries need different symbols, and even the same symbol can carry different meanings and pronunciations, so differences arose.

Those countries were fine; Asian countries were not. Neither 128 nor 256 code points can cover such a large number of Chinese characters, so China created its own character set: GB2312, the Chinese national standard simplified Chinese character set, compatible with ASCII. Most countries ended up with their own localized character sets.

 

1. Character encoding

Strictly speaking, a character set is just a collection of characters and is not directly tied to the computer. The characters still need to be encoded into binary in some specific way before they can be stored.

The first layer of encoding assigns a unique number to each character by some rule, one to one. In ASCII, for example, "A" is assigned the number 65.

The second layer of encoding decides the storage length (in bytes): either a fixed length, or a variable length depending on the character. ASCII, for example, is stored in a single byte.

The third layer of encoding is the storage format: big-endian versus little-endian. This layer usually doesn't matter much and often depends on the system; x86 machines, for instance, are little-endian.

2. Charset and Encoding

After this encoding process, character set and encoding should in principle be separate things: the character set merely collects the characters, and the encoding lays down the rules for the three layers above.

But that is not quite how it happened, at least from what I have read: many character sets come with a default encoding and are, strictly speaking, tied at least to the first layer of encoding. In other words, when a character set was invented, its coding rules were formulated at the same time. There was no strict distinction between character set and character encoding, and the boundary was fuzzy. Perhaps because at that time each character set and its encoding were one to one and both were intended for computer storage, no distinction was necessary.

Just like our GB2312 is a character set, but it is also a character code.

No conscious distinction was made between character sets and encodings until the advent of Unicode, which had different encodings.

Character set: charset, short for character set.

Character encoding: charset encoding, often shortened to just "encoding".

 

 

III. Unicode

1. The birth of Unicode

Since different countries had different character sets and encodings, the same sentence would come out completely differently under my scheme and yours, so problems were inevitable. When I type "Hello" in China, most machines here use the same character set and encoding, so sending messages to each other is fine. But if you are in another country and your machine doesn't have my character set installed, my message cannot be converted at all. And what if the same number in your character set happens to map to some insult? That would be awkward.

However, if every machine automatically installed the sender's character set and encoding, it would effectively have to support every character set and encoding in the world, which obviously adds too much meaningless burden.

More seriously, what about an article written in two languages? Encode half of it in scheme A and half in scheme B?

So Unicode was born. As a single character set it contains all the symbols in the world and can express any language, as long as computers support it.

 

2. The problem of Unicode

When Unicode came along, it clearly defined the first layer of encoding, but not the second; that, in fact, is where it got stuck.

That is, Unicode did not define how characters should be encoded and stored in a computer, unlike ASCII, which maps directly to 7 bits and is stored as one 8-bit byte. Unicode's range is huge: if every character were stored at a fixed maximum size, two bytes (65,536 values) would not be enough (reportedly there are already around 100,000 characters), while three bytes would be terribly wasteful. If you only ever type English, where one byte covers every character, and Unicode forced you to spend three bytes per character, you would be losing a lot.

If English were represented in one byte (as in ASCII) and other characters in two to four bytes, a new problem arises: how does the computer know how many bytes the next character occupies? Do the next three bytes represent one character, or three single-byte characters?

Besides, the Internet did not exist at the time, so Unicode could not be promoted effectively; there was no unified storage scheme, and several different encoding and storage methods coexisted.

 

3. UTF-8

Then came the Internet, and with it UTF-8, now the most widely used Unicode encoding on the Internet, alongside UTF-16 and UTF-32. They are different encodings of the same Unicode character set, so they differ at layer 2, though the others are nowhere near as widespread as UTF-8. Once Unicode had several encodings for one character set, people naturally began to distinguish character sets from character encodings.

 

UTF-8 stands out from the other encodings because it is variable-length, which means it does not waste too much space, and its encoding rules are simple:

  • ① For single-byte symbols, the first bit is set to 0 and the remaining 7 bits hold the Unicode code, which makes it compatible with ASCII.

  • ② For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1, the (n+1)th bit is set to 0, the first two bits of every following byte are set to 10, and all the remaining bits together hold the Unicode code.

The point of these two rules is to make it possible to tell whether a run of bytes is one character or several.

If you don’t understand, you can try to understand it in another way:

From the computer's point of view, decoding works like this: read the first byte and check its first bit. If it is 0, this byte must be a single-byte character, so parse its remaining 7 bits as the code. If it is 1, keep reading bits until you hit a 0: if the second bit is already 0 (the byte starts with 10), this byte is a continuation of the previous character; otherwise the number of consecutive leading 1s tells you how many bytes the character occupies.
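The leading-byte check above can be written out directly. This is a small sketch of my own (the class and method names are hypothetical), not a full decoder; it only classifies the first byte of a sequence:

```java
public class Utf8Length {
    // Number of bytes in the UTF-8 sequence that starts with this byte:
    // a leading 0 bit means a single-byte (ASCII) character; otherwise
    // the run of leading 1 bits gives the sequence length.
    static int sequenceLength(byte first) {
        int b = first & 0xFF;                            // treat as unsigned
        if ((b & 0b1000_0000) == 0)            return 1; // 0xxxxxxx
        if ((b & 0b1110_0000) == 0b1100_0000)  return 2; // 110xxxxx
        if ((b & 0b1111_0000) == 0b1110_0000)  return 3; // 1110xxxx
        if ((b & 0b1111_1000) == 0b1111_0000)  return 4; // 11110xxx
        // bytes starting with 10 are continuations, not leading bytes
        throw new IllegalArgumentException("not a leading UTF-8 byte");
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0x41)); // 'A' -> 1
        System.out.println(sequenceLength((byte) 0xE7)); // 1110 0111 -> 3
    }
}
```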

Here is a table for reference. On the left is the number of bits a symbol needs in Unicode, and on the right its UTF-8 encoding. As you can see, a character currently takes at most four bytes.

Unicode bits | UTF-8 encoding (binary)
------------ | -------------------------------------
1–7          | 0xxxxxxx
8–11         | 110xxxxx 10xxxxxx
12–16        | 1110xxxx 10xxxxxx 10xxxxxx
17–21        | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

For example:

Take the Chinese character 码 ("code"): its Unicode code is 111 1000 0000 0001, 15 bits in total. That matches the third row of the table, so it needs three bytes, and its UTF-8 encoding is 11100111 10100000 10000001. The x bits, combined, are exactly the Unicode code. So even for the same character, the Unicode code and the UTF-8 code are different; keep that in mind.

Some people may not know how to obtain the binary Unicode code and UTF-8 code of a character, so I wrote a simple Java test class for reference:

    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;

    public class EncodingDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            char a = '\u7801'; // the character 码
            // Output the binary Unicode code
            System.out.println("Binary Unicode code:" + Integer.toBinaryString(a));
            // Output the hexadecimal Unicode code
            System.out.println("Hex Unicode code:" + Integer.toHexString(a));

            // Get the UTF-8 byte array of the character. A byte prints as a
            // signed decimal number, so convert it before output
            byte[] aBytes = String.valueOf(a).getBytes("UTF-8");
            String aBinary;
            System.out.println("Binary UTF-8 code:");
            for (byte aByte : aBytes) {
                // Printing aByte directly gives three integers between -128 and 127
                // (two's complement), so mask off the sign extension first, then
                // use Integer's method to output the binary form
                aBinary = Integer.toBinaryString(aByte & 0xFF);
                System.out.print(aBinary + " ");
            }

            // URL-encoding yields the hexadecimal UTF-8 code
            String a16 = URLEncoder.encode(String.valueOf(a), "UTF-8");
            // It uses percent signs, which look a little ugly, so swap the symbol
            a16 = a16.replace("%", "-");
            System.out.println("\nHexadecimal UTF-8 code:" + a16);
        }
    }

In fact, when people discuss these codes online, most write them in hexadecimal, such as 4EA2. I found that confusing at first: the encoding topic is messy enough already, and juggling conversions between binary, hexadecimal and decimal on top of it is too convoluted, so I use binary directly here. Don't worry if you see hexadecimal elsewhere; it works out the same in binary.

Maximum number of bytes for UTF-8:

In fact, UTF-8 was originally designed for a maximum of six bytes, meaning it could represent a character in at most six bytes. However, after the specification was revised, only the Unicode-defined range U+0000 to U+10FFFF may be used, so UTF-8 is normally said to be at most 4 bytes.

So if someone says UTF-8 can be six bytes, they are not entirely wrong. In practice, though, the world's total number of symbols is nowhere near needing a 5-byte UTF-8 sequence, and 4 bytes will most likely remain the ceiling.
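The 4-byte ceiling follows directly from the U+10FFFF limit, which we can check in Java (the class name is mine; `Character.MAX_CODE_POINT` is the JDK's constant for the top of the Unicode range):

```java
import java.nio.charset.StandardCharsets;

public class MaxCodePoint {
    public static void main(String[] args) {
        // Unicode tops out at U+10FFFF...
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT)); // 10ffff

        // ...and even that last code point needs only 4 bytes in UTF-8
        String top = new String(Character.toChars(0x10FFFF));
        System.out.println(top.getBytes(StandardCharsets.UTF_8).length);   // 4
    }
}
```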

 

4. Why GBK?

Logically, with a unified encoding, shouldn't we just use Unicode (UTF-8) everywhere? Yet in daily use GBK still appears in many places: create a TXT file and the default storage format is GBK. Why?

UTF-8 is variable-length, but that doesn't mean it wastes nothing; it just wastes less. At four bytes, 11 of the bits are fixed markers used for recognition, which is the price of variable length. In UTF-8 a Chinese character usually takes three bytes, so every character in an article needs at least three bytes of storage.

GBK, on the other hand, is tailor-made for China. It covers basically every character we commonly use, and even includes characters from several nearby languages (Japanese, Russian, Korean). Most importantly, it stores a Chinese character in only two bytes, clearly saving a lot of space compared with UTF-8. You may not feel it with a little text, but with a lot of text the difference is stark.

 

We can also prove this simply locally:

  • ① Anywhere in Explorer, right-click → New → Text Document.

  • ② Randomly type ten thousand Chinese characters, click Save As, and choose ANSI encoding (in China, ANSI means GBK; the reason is explained later).

  • ③ Save As again, this time choosing UTF-8.

The result: GBK needs only 20,000 bytes of storage while UTF-8 needs 30,000. The text is identical, but GBK saves a third of the space!
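The same experiment can be reproduced in a few lines of Java instead of Notepad. This is a sketch of my own (class name hypothetical), building the 10,000-character text in code:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GbkVsUtf8 {
    public static void main(String[] args) {
        // 10,000 copies of a common Chinese character, 中 (U+4E2D)
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) sb.append('\u4E2D');
        String text = sb.toString();

        // GBK: 2 bytes per Chinese character; UTF-8: 3 bytes
        System.out.println(text.getBytes(Charset.forName("GBK")).length); // 20000
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length); // 30000
    }
}
```

(The on-disk file sizes from Notepad may differ by a few bytes because of BOM markers, but the per-character ratio is the same.)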

Not only China: other countries, especially in Asia, whose major languages are not written in English letters, tend to have their own character sets and encodings. If it is clear that internationalization will not be involved, they generally choose their own national encoding, which is convenient and saves space.

Of course, the main reason is simply that Windows picks a default encoding per locale. Had Windows committed to UTF-8 everywhere from the start, the world's computers would have seen far fewer garbled characters.

But these decisions need not be second-guessed now, for they are done.

 

 

IV. History of Coding in China

We briefly mentioned GB2312 and GBK earlier; both, as you can see, are Chinese coding standards, and nowadays GBK is the one in common use. Here is a short overview of the development of China's own coding standards. Their names differ, but all begin with GB, the initials of "Guobiao", the Chinese term for "national standard", which is how you can tell an encoding is Chinese.

The real history is far more complicated and rigorous than these few paragraphs; look it up yourself if you are interested. I can only give a brief, non-rigorous summary based on material found online.

1. The GB2312 character set

It started with GB2312, the Chinese national standard simplified Chinese character set, which did not distinguish set from encoding. The first 128 characters keep ASCII; the Chinese characters follow, each stored in two bytes.

It contains 6,763 commonly used Chinese characters, plus 682 full-width characters including Latin, Greek, Japanese katakana and Russian Cyrillic letters. The two-byte characters are full-width, whereas those below 128 are half-width.

Additional thoughts: full-width and half-width punctuation marks

Anyone who has used an input method knows about full-width and half-width characters, and perhaps that half-width characters are the English symbols while full-width ones are the Chinese symbols. When writing code, we must use half-width punctuation: the syntax requires a half-width comma [,], and accidentally typing a full-width comma [，] produces a compile error.

This shows that half-width and full-width punctuation marks are entirely different characters. Full-width punctuation marks are characters newly defined in the Chinese encodings, essentially special Chinese characters, whereas the half-width marks are among the first 128 ASCII symbols.

Being different symbols and different characters, they have different character numbers and are of course not interchangeable. The Chinese comma and the English comma are therefore unrelated; at best they mean the same thing in the abstract.
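A quick Java check makes the point concrete (class name is mine, for illustration): the two commas have completely different numbers, and in GBK the full-width comma is stored like a Chinese character, in two bytes.

```java
import java.nio.charset.Charset;

public class CommaDemo {
    public static void main(String[] args) {
        char half = ',';       // the half-width (ASCII) comma, U+002C
        char full = '\uFF0C';  // the full-width comma from the Chinese standards

        System.out.println((int) half); // 44, inside the ASCII range
        System.out.println((int) full); // 65292, a completely different character

        // In GBK the full-width comma is a two-byte character
        Charset gbk = Charset.forName("GBK");
        System.out.println(String.valueOf(half).getBytes(gbk).length); // 1
        System.out.println(String.valueOf(full).getBytes(gbk).length); // 2
    }
}
```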

 

2. GB13000

GB13000 appeared mainly to make it easier to process many kinds of characters at once. It contains 20,902 Chinese characters, a large expansion over GB2312, and defines a brand-new coding system. But it never seemed to catch on; honestly, I could find little information about it online, and almost thought GBK followed GB2312 directly. Perhaps it was well designed but difficult to implement.

 

3. GBK

One account online says GBK was originally Microsoft's extension of GB2312, introduced with Windows 95. As it became popular and widely used, the state later made it official, though no concrete proof of this claim has been found. What can be confirmed is that GBK is not a national standard but a guiding technical-specification document. Whatever its origin, it is now the most common encoding on Windows systems in China.

GBK extends GB2312 and supports GB13000. It grows from GB2312's repertoire to 21,003 Chinese characters (including traditional ones) and 883 symbols, far beyond the original six thousand odd, and its encoding method also changed. Although it includes all the Chinese characters of GB13000, its encoding differs from GB13000's, so it can only be regarded as a transition scheme from GB2312 toward GB13000. (Of course, to this day the transition never quite happened; everyone just uses GBK.)

 

4. GB18030

GB18030 extends GBK, so it is compatible with GBK and GB2312. This time it adds the characters of China's ethnic minorities, bringing their scripts online too, plus some Korean characters and various others: 27,484 Chinese characters in total, covering most of East Asia. It is also a variable-length encoding, with single-byte, double-byte and four-byte forms.

It is currently our national standard, which means it is mandatory; it just has not been promoted into Windows. From what I found online, there is still a long way to go... but that is a job for the professionals.

At present Windows still commonly uses GBK, but don't forget that China's national coding standard is GB18030; GBK can only be called an industry specification. And unlike GB13000, GB18030's encoding is actually implemented in software: IDEA, which I use most frequently, can switch a file to GB18030, for example.

 

5. Summary of GB coding

So Chinese encoding today basically means these three: GB2312, GBK and GB18030. There are actually many more GB character sets, but I don't know much about them; as an ordinary programmer there is no need to dig too deep here. I just hadn't expected GBK's development to carry so many stories. Nor is it certain when everything will finally converge on GB18030, so it is still worth knowing a little about it.

 

 

V. Other coding sets (optional)

The important encodings have all been covered; the rest deserve only a sentence or two each, if only to show how they differ. These codes are neither globally unified nor universal within a country, so feel free to skim or skip them.

1. ISO-8859

ISO 8859 is not one standard but a series, which is why you see ISO-8859-1, ISO-8859-2, ISO-8859-3 and so on as different character sets. The sets in the series cover different countries' characters, i.e. target different countries, and differ considerably from one another.

Each is a single-byte character set that uses the eighth bit, undefined in ASCII, to add characters. Right after 128, 32 code points are reserved for extended control codes, so the new characters occupy only 0xA1–0xFF (161–255). It works for countries with few extra characters (China obviously cannot use it).
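The single-byte limitation is easy to demonstrate in Java (class name is mine): a character that ISO-8859-1 covers fits in one byte, while a Chinese character has no code point there at all, so `getBytes` substitutes a question mark.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Demo {
    public static void main(String[] args) {
        // 'é' (U+00E9) is inside ISO-8859-1: one byte, value 0xE9
        byte[] e = "\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(e.length);    // 1
        System.out.println(e[0] & 0xFF); // 233

        // 中 (U+4E2D) is not representable, so the encoder emits '?' (0x3F)
        byte[] zh = "\u4E2D".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println((char) zh[0]); // ?
    }
}
```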

Code points:

We sometimes see "code point", "number" or "position"; they all mean roughly the same thing: the position of a character within a character set, like a seat number, and are usually written in hexadecimal, such as 0xA1.

 

2. ISO-2022

Obviously ISO 8859 only works for countries with Latin scripts or small character repertoires; for China, Japan and Korea there is no way. Hence ISO 2022, which, like 8859, denotes a series of standards: ISO-2022-CN, ISO-2022-JP and ISO-2022-KR are the Chinese, Japanese and Korean variants.

It is also a variable-length encoding and supports GB2312, but I won't go into detail here; hardly anyone uses it anyway.

 

Additional knowledge: What is ISO

ISO is the International Organization for Standardization. It doesn't just deal with character sets; it is the body that produces global standards across most of today's fields. The universal character set (UCS), discussed later, is also its work.

 

3. BIG5

BIG5 is the encoding used in Taiwan, China. It contains 13,053 characters, almost all traditional, and is quite limited: never mind rare characters, even some common ones are missing.

Hong Kong, which also writes traditional Chinese, began using it in the 1990s, then found that many characters it needed were missing, and so produced the Hong Kong Supplementary Character Set (HKSCS) to fill the gaps.

 

4. UTF-16 and UTF-32

Finally, a few words on Unicode's other two encodings besides UTF-8: UTF-16 and UTF-32.

① UTF-32

UTF-32, like the original ASCII, is fixed-length: 32 bits, i.e. 4 bytes. There is almost no symbol that 4 bytes cannot express, but it wastes space: typing "hello" costs only 5 bytes in UTF-8 but 20 bytes in UTF-32.

This, conversely, illustrates the point of UTF-8's variable-length design.
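The 5-versus-20 comparison can be checked directly; a minimal sketch (class name mine), using the JDK's UTF-32BE charset so the byte count is unambiguous:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf32Demo {
    public static void main(String[] args) {
        String s = "hello";
        // UTF-8: one byte per ASCII letter
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 5
        // UTF-32: a fixed four bytes per character
        System.out.println(s.getBytes(Charset.forName("UTF-32BE")).length); // 20
    }
}
```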

And because it is fixed-length and larger than a single byte, the processing unit is four bytes, which raises the big-endian/little-endian problem: different systems differ, so how does a reader know whether the bytes run from most significant to least significant or the other way around?

So the encoding comes as UTF-32 BE (big-endian) and UTF-32 LE (little-endian), and the byte order can be told directly from the name.

What are big-endian and little-endian:

An example makes it easier. Suppose a value is represented by the four bytes 0x12345678 (hexadecimal notation; every two hex digits are one byte). Then:

Big-endian:

low address ----------------> high address
0x12 | 0x34 | 0x56 | 0x78

Little-endian:

low address ----------------> high address
0x78 | 0x56 | 0x34 | 0x12

It's just a difference in order.
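Java's `ByteBuffer` lets us observe both orders for the same value; this sketch (class name mine) writes 0x12345678 under each byte order and prints the bytes from low address to high:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    public static void main(String[] args) {
        ByteBuffer big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN);
        big.putInt(0x12345678);
        // most significant byte first: 12 34 56 78
        System.out.printf("%02x %02x %02x %02x%n",
                big.get(0), big.get(1), big.get(2), big.get(3));

        ByteBuffer little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        little.putInt(0x12345678);
        // least significant byte first: 78 56 34 12
        System.out.printf("%02x %02x %02x %02x%n",
                little.get(0), little.get(1), little.get(2), little.get(3));
    }
}
```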

 

② UTF-16

UTF-16 is variable-length too, except it has only two sizes: 2 bytes and 4 bytes, and most characters take 2. Reportedly it began as a fixed 2-byte encoding (16 bits = 2 bytes), but the code points ran out, so it was extended to 4 bytes.

So characters numbered U+0000 to U+FFFF (the commonly used range) are represented in two bytes, and characters numbered U+10000 to U+10FFFF in four.

It has the same byte-order issue, so you will also see UTF-16 BE and UTF-16 LE.
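The two sizes are easy to observe, since Java strings themselves use UTF-16 code units internally. A short sketch (class name mine), using the BOM-free UTF-16BE charset so only the character bytes are counted:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Demo {
    public static void main(String[] args) {
        // A character in U+0000..U+FFFF: two bytes
        System.out.println("A".getBytes(StandardCharsets.UTF_16BE).length);     // 2

        // A character in U+10000..U+10FFFF, e.g. the emoji U+1F600:
        // stored as a surrogate pair, four bytes
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(emoji.getBytes(StandardCharsets.UTF_16BE).length);   // 4
    }
}
```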

 

 

VI. Other issues

Here are some other character-related questions someone may want answered. This section may be updated later.

 

1. What is ANSI?

When Notepad saves, the default is ANSI encoding, which we now know means GBK. So does ANSI = GBK?

Actually, no. ANSI effectively means "the local encoding": GBK in mainland China, Big5 in Taiwan, JIS in Japan. No need to overthink it; on a Chinese Windows system, just treat it as GBK.

Note also that ANSI has nothing to do with ASCII, however similar the names look.

 

2. What is UCS and how does it relate to Unicode?

① What is UCS?

We sometimes see the "UCS character set". Its full name is "Information Technology — Universal Coded Character Set (UCS)", standardized as ISO 10646; it is ISO's own project to unify character encoding worldwide. Unicode, by contrast, was created by a body called the Unicode Consortium. UCS and Unicode were thus born with the same goal in mind: one unified global code.

② Its connection to Unicode

Later it became clear that the world did not need two different unified encodings, so the two bodies cooperated, and since Unicode 2.0 the two have been kept essentially in sync. UCS-2 and UCS-4 are the concrete encodings of UCS, just as UTF-8 is of Unicode.

In addition, UCS-2 can be regarded as the parent of early UTF-16, fixed at two bytes; later UTF-16 introduced the supplementary-plane characters that may take four bytes, so the two are no longer equivalent.

UCS-4 is, at least for now, basically the same as UTF-32.

 

 

 

References:

1. Baidu Encyclopedia

2. Character Encoding Notes: ASCII, Unicode and UTF-8 (recommended reading; from 2007 but still worth a look):
www.ruanyifeng.com/blog/2007/1…

3. Character Set and Encoding (Part 1): Charset vs Encoding:
my.oschina.net/goldenshaw/…

4. Comparison of encoding methods, and choosing between UTF-8 and GB2312:
www.cnblogs.com/shanwater/p…

5. GB2312, GB13000, GBK, GB18030 introduction and specification documents:
wenku.baidu.com/view/057033…

6. What is ANSI encoding?:
www.cnblogs.com/malecrab/p/…