Once upon a time, a group of people decided to represent everything in the world with eight transistors that could each be switched on or off. They decided that eight switches in combination made a good atomic unit, so they called it a “byte.”

Then they built machines that could process those bytes. The machines ran on, combining bytes into ever more states, and the states kept changing. They saw that it was good, so they called the machine a computer.

At first computers were used only in America. An eight-bit byte can represent a total of 256 (2^8) different states.

They assigned special uses to the 32 states numbered from 0: whenever a terminal or printer encountered one of these agreed-upon bytes, it had to perform some agreed-upon action. On encountering 0x0A, the terminal starts a new line; on 0x07, the terminal beeps at people; on 0x1B, the printer prints inverse (white-on-black) text, or the terminal displays letters in color. They saw that this was good, so they called these byte states below 0x20 (decimal 32) “control codes.”
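As a tiny illustration of control codes in action (a minimal sketch; the exact effect depends on your terminal):

    # Sketch: byte values below 0x20 trigger actions rather than print glyphs.
    import sys
    sys.stdout.write("ding\x07\n")       # 0x07 (BEL): many terminals beep
    sys.stdout.write("over\x0dnext\n")   # 0x0D (CR): back to line start, "next" overwrites "over"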

They then encoded all the spaces, punctuation marks, digits, and upper- and lower-case letters into consecutive byte states, up through number 127, so that computers could store English words as sequences of bytes. Everyone felt so good about this that they called the scheme ANSI’s “ASCII” encoding (American Standard Code for Information Interchange). All the computers in the world used the same ASCII scheme to store English text.

Later, just as with the Tower of Babel, people all over the world began to use computers, but many countries do not use English, and many of their letters are not in ASCII. To store their own languages in the computer, they decided to use the states after 127 to represent new letters and symbols, and they also added many horizontal lines, vertical lines, and crosses for drawing tables, numbering the states all the way up to 255. The characters from 128 to 255 became known as the “extended character set.” From then on, greedy humanity had no new states left to use; the American imperialists probably never imagined that people in third-world countries would want to use computers too!

By the time the Chinese got their hands on computers, there were no byte states left to represent Chinese characters, and more than 6,000 commonly used characters needed to be stored. But this did not stump the resourceful Chinese people: we unceremoniously scrapped the strange symbols after 127 and decreed that a byte below 128 keeps its old meaning, while two bytes both above 127, joined together, represent one Chinese character, with the first byte (called the high byte) running from 0xA1 to 0xF7 and the second byte (the low byte) running from 0xA1 to 0xFE. This gave us room for about 7,000 simplified Chinese characters. Into these codes we also packed mathematical symbols, Greek and Roman letters, and Japanese kana; we even re-encoded the digits, punctuation, and letters already in ASCII as new two-byte-long codes. These are the so-called “full-width” characters, while the originals below 128 became “half-width” characters.

The Chinese people saw that this was good, so they called the character scheme “GB2312”. GB2312 is a Chinese extension of ASCII.
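As a quick check of those byte ranges, here is a minimal sketch using Python’s built-in gb2312 codec:

    # Sketch: GB2312 encodes a hanzi as two bytes, high 0xA1-0xF7, low 0xA1-0xFE.
    hi, lo = "汉".encode("gb2312")
    print(hex(hi), hex(lo))                        # 0xba 0xba
    print(0xA1 <= hi <= 0xF7, 0xA1 <= lo <= 0xFE)  # True True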

But there are so many characters in Chinese that we soon found we couldn’t type the names of many people, which was especially troublesome for state leaders (such as the “róng” (镕) in Zhu Rongji’s name). So we had to dig out the code points that GB2312 had left unused and honestly put them to work.

When even that wasn’t enough, they dropped the requirement that the low byte be greater than 127: as long as the first byte was greater than 127, it marked the beginning of a Chinese character, regardless of whether the following byte belonged to the extended character set. The resulting extended scheme is called the GBK standard; it includes all the contents of GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols.
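A minimal sketch of the difference, using Python’s gbk and gb2312 codecs (镕 is exactly the character mentioned above):

    # Sketch: 镕 is encodable in GBK but was missing from GB2312.
    b = "镕".encode("gbk")
    print(b.hex())            # its two-byte GBK code
    try:
        b.decode("gb2312")
    except UnicodeDecodeError:
        print("not valid GB2312: a GBK-only character")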

Later, ethnic minorities wanted to use computers too, so the scheme was expanded yet again, adding thousands of new minority-script characters; GBK was extended into GB18030. From then on, the culture of the Chinese nation could be passed down in the computer age.

Chinese programmers saw that this series of standards for encoding Chinese characters was good, so they called it “DBCS” (Double Byte Character Set). The defining feature of the DBCS family is that two-byte Chinese characters and one-byte English characters coexist in the same encoding scheme, so programs supporting Chinese text processing had to watch the value of every byte in a string: if a byte’s value was greater than 127, a double-byte character had begun. In those days, blessed programmer-monks would recite the following mantra hundreds of times a day:

“One Chinese character is two English characters! One Chinese character counts as two English characters…”
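In modern terms, the mantra encodes a scanning rule. Here is a minimal sketch of it (assuming GBK-style bytes):

    # Sketch of the old DBCS scanning rule: any byte above 127 is the
    # lead byte of a two-byte character; everything else is one-byte ASCII.
    def dbcs_char_count(data: bytes) -> int:
        count, i = 0, 0
        while i < len(data):
            i += 2 if data[i] > 127 else 1
            count += 1
        return count

    text = "中文ABC".encode("gbk")   # two hanzi + three ASCII letters
    print(len(text))                # 7 bytes: one hanzi "is" two English characters
    print(dbcs_char_count(text))    # 5 characters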

Because at that time every country, like China, devised its own encoding standard, nobody could read anyone else’s codes and nobody supported anyone else’s encodings. Even the mainland and Taiwan, a mere 150 nautical miles apart and using the same language, adopted different DBCS schemes, brothers though they were. In those days, to make a computer display Chinese you had to install a “Chinese character system” dedicated to handling the display and input of Chinese characters; but a fortune-telling program written by some benighted feudal gentleman in Taiwan would instead require the “Yitian Chinese system,” which understood the BIG5 encoding. Install the wrong character system and the display turned to garbage! What then? And among the nations of the world there were poor people who had no access to computers at all. What of their scripts?

What a Tower of Babel of a problem for computers!

Just in time, the Archangel Gabriel appeared: an international organization called ISO decided to tackle the problem. Their solution was simple: scrap all the regional coding schemes and create a single new encoding covering every culture and every letter and symbol on the planet! They called it the “Universal Multiple-Octet Coded Character Set,” or UCS, commonly known as UNICODE.

By the time UNICODE was being developed, computer memory had grown so much that space was no longer an issue, so ISO simply decreed that every character must be represented in two bytes, that is, 16 bits. For the “half-width” ASCII characters, UNICODE keeps the original code values unchanged but widens them from 8 bits to 16, while the characters of other cultures and languages are re-encoded entirely. Since a half-width English symbol needs only the low eight bits, its high eight bits are always zero, so this grand scheme takes twice the space to store English text.
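You can see the zero high byte directly. A sketch, using Python’s utf-16-be codec as a stand-in for the fixed two-byte scheme described here (they coincide in the two-byte range):

    # Sketch: ASCII keeps its value with a zero high byte, while
    # other scripts are re-encoded entirely.
    print("A".encode("utf-16-be").hex())   # 0041: high byte wasted on zero
    print("汉".encode("utf-16-be").hex())  # 6c49: a fully re-encoded character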

At this point, programmers from the old days noticed a strange phenomenon: their strlen function could no longer be trusted; one Chinese character no longer counted as two characters, but as one! Yes, from UNICODE onward, whether half-width English letter or full-width Chinese character, each is uniformly “one character”, and likewise uniformly “two bytes”. Note the difference between the terms: a “byte” is an 8-bit physical storage unit, while a “character” is a culturally meaningful symbol. In UNICODE, one character is two bytes. The era of one Chinese character counting as two English characters was nearly over.
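A small sketch of the shift, contrasting the old byte-oriented count with the new character count:

    # Sketch: one hanzi and one letter are each "one character" of the same
    # two-byte width under the fixed two-byte scheme; under GBK they differ.
    s = "汉A"
    print(len(s))                       # 2 characters
    print(len(s.encode("utf-16-be")))   # 4 bytes: two per character
    print(len(s.encode("gbk")))         # 3 bytes: the old unequal world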

Back when multiple character sets coexisted, companies building multilingual software had plenty of trouble: to ship the same software in different countries, they had to produce localized versions cobbled together on top of that blessed double-byte character set, taking care not only to avoid mistakes but also to keep switching the software’s text between character sets. For them, UNICODE was a good package solution, so starting with Windows NT, Microsoft took the opportunity to overhaul the operating system and make all the core code work in UNICODE. From then on, WINDOWS could finally display the scripts of every culture in the world without any need for a native-language character system.

However, UNICODE was not designed to be compatible with any existing encoding scheme, so GBK and UNICODE arrange the internal codes of Chinese characters completely differently. There is no simple arithmetic for converting text from UNICODE to another encoding; the conversion must be done by looking up tables.
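A small sketch makes the point: the two code values for the same character bear no arithmetic relation (Python’s codecs carry the lookup tables internally):

    # Sketch: the same character's UNICODE code point vs. its GBK bytes.
    ch = "汉"
    print(hex(ord(ch)))            # 0x6c49: UNICODE code point
    print(ch.encode("gbk").hex())  # baba:   GBK internal code 0xBABA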

As mentioned earlier, UNICODE uses two bytes for a single character, which can combine into 65,536 (2^16) different values, probably enough to cover the symbols of all the world’s cultures. If that ever proves insufficient, ISO has the UCS-4 scheme ready, which in essence uses four bytes per character, letting us combine about 2.1 billion different characters (the highest bit is reserved for other purposes); that should last until roughly the founding of the Galactic Federation.

UNICODE arrived alongside the rise of computer networks, and how to transfer UNICODE over a network was a problem that also had to be considered, so the UTF (UCS Transformation Format) standards for transmission emerged. As the names suggest, UTF-8 transfers data 8 bits at a time and UTF-16 transfers 16 bits at a time. For the sake of transmission reliability, there is no direct byte-for-byte correspondence between UNICODE and UTF; the conversion follows certain algorithms and rules.

Computer monks trained in network programming know that one crucial issue when transmitting data over a network is byte order: how the high and low bytes are to be interpreted. Some computers send the low byte first, such as the Intel architecture in our PCs; others send the high byte first. When exchanging data over the network, a simple way to check that both parties agree on byte order is to send an identifier at the start of the text stream: “FEFF” if the high byte comes first, “FFFE” if it does not. Try opening a UTF-X file in binary mode and see whether its first two bytes are exactly this marker.
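A minimal sketch of the marker, built explicitly so the byte order is unambiguous:

    import codecs

    # Sketch: a UTF-16 stream announces its byte order with a leading mark.
    big    = codecs.BOM_UTF16_BE + "汉".encode("utf-16-be")
    little = codecs.BOM_UTF16_LE + "汉".encode("utf-16-le")
    print(big.hex())     # feff6c49: FE FF means high byte first
    print(little.hex())  # fffe496c: FF FE means low byte first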

Here are the rules for converting between Unicode and UTF-8:

    Unicode (hex)    UTF-8 (binary)
    0000 – 007F      0xxxxxxx
    0080 – 07FF      110xxxxx 10xxxxxx
    0800 – FFFF      1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode code for “汉” (Han) is 6C49. Since 6C49 lies between 0800 and FFFF, the three-byte template applies: 1110xxxx 10xxxxxx 10xxxxxx. Write 6C49 in binary: 0110 1100 0100 1001. Divide this bit stream per the three-byte template into 0110 110001 001001, fill in the x’s of the template from high to low, and you get 1110-0110 10-110001 10-001001, i.e. E6 B1 89: its UTF-8 encoding.
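The same steps as a minimal sketch of the three-byte template (an illustrative helper, not a full UTF-8 encoder):

    # Sketch: the 3-byte UTF-8 template for code points 0800-FFFF.
    def utf8_3byte(cp: int) -> bytes:
        assert 0x0800 <= cp <= 0xFFFF
        b1 = 0b11100000 | (cp >> 12)           # 1110xxxx: top 4 bits
        b2 = 0b10000000 | ((cp >> 6) & 0x3F)   # 10xxxxxx: middle 6 bits
        b3 = 0b10000000 | (cp & 0x3F)          # 10xxxxxx: low 6 bits
        return bytes([b1, b2, b3])

    print(utf8_3byte(0x6C49).hex())    # e6b189
    print("汉".encode("utf-8").hex())  # e6b189, the same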

While we’re at it, let’s mention a famous oddity: open a new file in Windows Notepad, type the two characters “联通” (the name of China Unicom), save, close, and open it again, and you’ll find the two characters have vanished, replaced by a few garbled characters! Ha, some say this is why Unicom can’t compete with China Mobile.

In fact, this happens because of an encoding collision between GB2312 and UTF-8.

When a piece of software opens a text file, the first thing it must do is decide which character set and encoding the text uses. Software generally determines this in one of three ways: detect a file-header identifier, prompt the user to choose, or guess according to certain rules.

The most standard approach is to detect the first few bytes of the text, which identify the charset/encoding as shown in the following table:

    First bytes     Charset/encoding
    EF BB BF        UTF-8
    FF FE           UTF-16/UCS-2, little endian
    FE FF           UTF-16/UCS-2, big endian
    FF FE 00 00     UTF-32/UCS-4, little endian
    00 00 FE FF     UTF-32/UCS-4, big endian
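A minimal sniffing sketch based on the table above (BOM detection only; note the four-byte marks must be checked before the two-byte ones, since FF FE 00 00 begins with FF FE):

    # Sketch: guess the encoding from the leading byte-order mark, if any.
    BOMS = [
        (b"\x00\x00\xfe\xff", "UTF-32/UCS-4, big endian"),
        (b"\xff\xfe\x00\x00", "UTF-32/UCS-4, little endian"),
        (b"\xef\xbb\xbf",     "UTF-8"),
        (b"\xfe\xff",         "UTF-16/UCS-2, big endian"),
        (b"\xff\xfe",         "UTF-16/UCS-2, little endian"),
    ]

    def sniff(data: bytes) -> str:
        for bom, name in BOMS:
            if data.startswith(bom):
                return name
        return "no BOM: prompt the user or guess"

    print(sniff(b"\xef\xbb\xbf\xe6\xb1\x89"))  # UTF-8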

When you create a text file, Notepad’s default encoding is ANSI. If you type Chinese characters under ANSI, what is actually used is the GB-series encoding. Under this encoding, the internal code of “联通” is:

    c1    1100 0001
    aa    1010 1010
    cd    1100 1101
    a8    1010 1000

Notice anything? The first and second bytes, and likewise the third and fourth, begin with “110” and “10” respectively, which exactly matches the UTF-8 two-byte template.

So when we open the file again, Notepad mistakes it for a UTF-8 encoded file. Strip the leading 110 from the first byte and the leading 10 from the second, and we get “00001 101010”; align the bits and pad with leading zeros, and we get “0000 0000 0110 1010”. Sorry, that is UNICODE 006A, the lowercase letter “j”; the next two bytes, decoded as UTF-8, give 0368, which displays as nothing meaningful. This is why a file containing only the characters “联通” won’t display properly in Notepad.
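Here is a sketch of the collision. Note that a strict modern UTF-8 decoder actually rejects C1 AA as an over-long sequence; old Notepad’s lenient decoder accepted it, so the sketch applies the two-byte template by hand:

    # Sketch: the GB bytes of "联通" happen to fit the UTF-8 two-byte
    # template 110xxxxx 10xxxxxx, so a lenient decoder misreads them.
    gb = "联通".encode("gbk")
    print(gb.hex())  # c1aacda8

    def lenient_2byte(b1: int, b2: int) -> str:
        # keep the payload bits, dropping the 110 and 10 template bits
        return chr(((b1 & 0x1F) << 6) | (b2 & 0x3F))

    print(lenient_2byte(0xC1, 0xAA))             # 'j' (U+006A)
    print(hex(ord(lenient_2byte(0xCD, 0xA8))))   # 0x368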

If you type a few more characters after “联通”, the bytes of the other characters will most likely not all fit the 110 and 10 patterns, so when you open the file again, Notepad will no longer insist that it is a UTF-8 file; it will read it as ANSI instead, and the garbled characters will not appear.