
This article explains what character encoding and character sets are, introduces the common character sets ASCII and Unicode in detail, and walks through how UTF-8 encoding works, so that you can understand character encoding thoroughly.

Most of the following content comes from the article In-depth understanding of character encodings (ASCII, Unicode, UTF-8, UTF-16, UTF-32). On that foundation I add how the Go language deals with character encoding (if you only care about character encoding itself, you can go straight to that article; thanks to its author for the excellent write-up).

Character encoding

As we all know, all the information in a program is ultimately stored in binary form inside the computer. A char or an int that we define in code is converted into binary and stored; this process is called encoding. The reverse process of turning the computer's underlying binary into meaningful characters on the screen (such as "Hello world") is called decoding.

Character encoding and decoding in computers involves the concept of a character set, which is essentially a mapping table that pairs each character with an integer, one to one. Common character sets include ASCII, Unicode, and so on.

We often confuse a character set with the encoding of that character set, but they are not the same concept: a character set is just a set of characters with their assigned numbers, while encoding is a more involved process. Why these two concepts are so often mentioned together, how they are related, and what the UTF-8 we use every day actually is, is what this article is about.

Note: looking up the number that corresponds to a character in the character set is not, by itself, the encoding process.

ASCII code

For a long period of computer history, computers were used only in a handful of developed countries, so programs only contained the Latin alphabet (a, b, c, d, and so on) and Arabic numerals that those users understood. The only encoding and decoding problem to solve was how to turn these characters into binary numbers the computer could understand, and the ASCII character set came into being for exactly that. Encoding was simply a lookup against ASCII: whenever the character 'a' appeared in a program, its ASCII value 97 was found and saved in binary form.

The following figure shows the ASCII character set mapping table, including control characters (carriage return, backspace, line feed, etc.) and printable characters (upper- and lower-case letters, Arabic numerals, and Western punctuation).

This encoding is known as ASCII encoding. As the reference table shows, the ASCII character set contains 128 characters, all of which can be represented with only 7 bits, that is, the low 7 bits of a byte, with the highest bit set to 0 (for example, 0110 0001 represents 'a').
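As a quick illustration (a minimal Go sketch; the rest of this article uses Go for its examples), 'a' is stored as the single byte 97, i.e. 0110 0001:

```go
package main

import "fmt"

func main() {
	b := byte('a')          // the ASCII value of 'a'
	fmt.Println(b)          // 97
	fmt.Printf("%08b\n", b) // 01100001
}
```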

That was good enough for the early computer world, but it left speakers of many other languages unable to use their own writing systems on computers. With the rise of the Internet, data in a multitude of languages became commonplace. How do you deal with this linguistic complexity and still stay efficient? The answer is Unicode.

Unicode

Unicode includes all the symbols in the world and gives each of them a unique number, so that every number identifies a symbol used in at least one language. The Unicode character set now includes more than 130,000 characters (the 100,000th character was adopted in 2005). It is worth noting that Unicode remains compatible with ASCII: code points 0 to 127 are unchanged.

The standard number assigned to each symbol in the Unicode character set is called a Unicode code point. In Go terminology, these characters are called runes. The natural data type for holding a single character is int32, and that is exactly what Go uses: the rune type is an alias for int32.
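A minimal sketch of this (the character 世 is just an example): a rune literal is simply an int32 holding the character's code point.

```go
package main

import "fmt"

func main() {
	var r rune = '世' // rune is an alias for int32
	var i int32 = r   // assignable without any conversion
	fmt.Println(r, i)           // 19990 19990
	fmt.Printf("%c %d\n", r, r) // 世 19990
}
```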

Code points

Unicode is a character set, which is distinct from encodings such as UTF-8 and UTF-16. The number described in this section plays the same role as the ASCII value in ASCII: it is the unique identifier of a character within the Unicode character set. In Unicode these numbers are called code points; for example, the code point U+0061, where 61 is the hexadecimal form of 97, represents the character 'a' in the Unicode character set.

A code point is written in the form U+[XX]XXXX, where each X is a hexadecimal digit, generally 4 to 6 digits long: values shorter than 4 digits are padded with leading zeros, and values longer than 4 digits are written with as many digits as needed. According to the Unicode standard, the code point range runs from U+0000 to U+10FFFF and will not be expanded in the future; over a million possible code points is plenty, and only about 110,000 characters are currently defined.
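Go's fmt package can print this notation directly with the %U verb, for example:

```go
package main

import "fmt"

func main() {
	// %U prints the code point in U+XXXX form (more digits when needed)
	for _, r := range "a世😀" {
		fmt.Printf("%c -> %U\n", r, r)
	}
	// a -> U+0061, 世 -> U+4E16, 😀 -> U+1F600
}
```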

In the whole coding process, the code point acts as an intermediate transition layer, as shown in the figure below:

As you can see from this diagram, the entire coding process can be divided into two processes

  • A character in the program is digitized: it is mapped to a specific numeric value, its number in the character set
  • That number is then stored in the computer in a specific way

Clearly, the number itself is not what ends up in the computer. We said earlier that encoding turns a character into a binary number to be stored, but that is not the whole story: real encoding also has to decide how many bytes each number occupies and whether the representation is fixed length or variable length.
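A small sketch of that distinction in Go: a character has a single code point, but the bytes actually stored depend on the encoding (here Go's native UTF-8):

```go
package main

import "fmt"

func main() {
	r := '世'
	fmt.Printf("code point: %U (%d)\n", r, r)                   // U+4E16 (19990)
	fmt.Printf("stored UTF-8 bytes: % x\n", []byte(string(r)))  // e4 b8 96
}
```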

Unicode encodings

Three encoding schemes are derived from the Unicode character set: UTF-32, UTF-16, and UTF-8. This sets Unicode apart from earlier schemes such as ASCII and GBK, where the character set and the encoding correspond one to one, whereas Unicode has three encoding implementations. This is one reason we need to distinguish between character sets and encodings: saying "Unicode" does not by itself tell you whether UTF-8 or UTF-32 is meant.

The following figure explores how code points are converted under the various encoding schemes.

The table above lists the code points of four characters and shows how each is encoded in UTF-32, UTF-16, and UTF-8. The conversion from a code point to UTF-32 is the simplest: pad with leading zeros until it fills 4 bytes. The conversion from a code point to UTF-8 only matches the code point value for the smallest of the four; the other three bear no obvious relation to it. The conversion from a code point to UTF-16 is the most irregular: the first three characters are identical to their code points, but the larger code point (more precisely, any code point above U+FFFF) is quite different, becoming four bytes long with a very different value.

This is where fixed-length versus variable-length encoding comes in. UTF-32 is a fixed-length encoding: it always uses 4 bytes to store a code point. UTF-8 and UTF-16 are variable-length: UTF-8 uses 1 to 4 bytes depending on the code point, and UTF-16 uses 2 or 4 bytes.
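In Go, utf8.RuneLen reports how many bytes UTF-8 needs for a given code point, which makes the 1-4 byte range easy to observe:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	for _, r := range []rune{'a', 'é', '世', '😀'} {
		// 1, 2, 3 and 4 bytes respectively
		fmt.Printf("%U needs %d byte(s) in UTF-8\n", r, utf8.RuneLen(r))
	}
}
```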

Fixed length and variable length

Why are there two forms of encoding, fixed length and variable length? In Chinese there is the classic problem of sentence segmentation: break a sentence in the wrong place and you convey the wrong meaning. Take this unpunctuated note from a fortune teller:

Great wealth and fortune no disaster beware

If the customer breaks the sentence like this:

Great wealth and fortune, no disaster, beware

he concludes that he is destined for riches and long life with no disasters, so he can live recklessly. But before long the man dies, and the fortune teller sighs that he read it wrong; the sentence was actually meant like this:

Great wealth and fortune: none; beware of disaster

In other words, he was never destined for wealth and should have been careful whenever he went out.

This is exactly why computers need fixed-length or variable-length rules when decoding. The computer's underlying binary stream is just as unpunctuated as the fortune teller's note, so we need an agreed set of rules to break it up correctly.

UTF-32

In UTF-32, fixed-length encoding means a break every 4 bytes. The code point U+0041 of the character 'A' (1000001 in binary) is therefore encoded by UTF-32 and stored in the computer as follows:

00000000 00000000 00000000 01000001

UTF-32 fills all the high-order bits of the four bytes with zeros. The biggest disadvantage of this representation is wasted space: no matter how small the code point is, it takes 4 bytes to store. How do we break through this bottleneck? With a variable-length scheme.
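The Go standard library has no UTF-32 encoder (golang.org/x/text provides one), but since UTF-32 is just the code point zero-padded to 4 bytes, a minimal big-endian sketch is enough to show the waste; utf32BE below is a hypothetical helper:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// utf32BE writes a code point as a 4-byte big-endian integer,
// which is exactly what UTF-32BE stores.
func utf32BE(r rune) []byte {
	buf := make([]byte, 4)
	binary.BigEndian.PutUint32(buf, uint32(r))
	return buf
}

func main() {
	fmt.Printf("% x\n", utf32BE('A')) // 00 00 00 41
	fmt.Printf("% x\n", utf32BE('世')) // 00 00 4e 16
}
```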

UTF-8

UTF-8 is a variable-length encoding: a character may occupy 1, 2, 3, or 4 bytes, and the high-order bits of each byte are reserved to mark how long the sequence is. The encoding rules are:

  1. For a symbol that needs only one byte, the highest bit of that byte is set to 0 and the remaining seven bits hold the symbol's Unicode code point. For English letters, UTF-8 encoding is therefore identical to ASCII encoding.
  2. For a symbol that needs n bytes (n > 1), the first n bits of the first byte are set to 1, the (n+1)-th bit is set to 0, and the first two bits of every following byte are set to 10. The remaining bits hold the symbol's Unicode code point, as shown in the following table (a Go sketch of these rules follows the table):
| Unicode code point range (hex) | UTF-8 encoding (binary) | Bytes |
| --- | --- | --- |
| 0000 0000 ~ 0000 007F | 0xxxxxxx | 1 byte |
| 0000 0080 ~ 0000 07FF | 110xxxxx 10xxxxxx | 2 bytes |
| 0000 0800 ~ 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx | 3 bytes |
| 0001 0000 ~ 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 4 bytes |
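As referenced above, here is a minimal Go sketch that applies the table directly; encodeUTF8 is a hypothetical helper, and the real implementation is utf8.EncodeRune in the unicode/utf8 package:

```go
package main

import "fmt"

// encodeUTF8 follows the table above: choose the pattern by code point range,
// then scatter the code point's bits into the x positions.
func encodeUTF8(r rune) []byte {
	switch {
	case r <= 0x7F:
		return []byte{byte(r)}
	case r <= 0x7FF:
		return []byte{0xC0 | byte(r>>6), 0x80 | byte(r&0x3F)}
	case r <= 0xFFFF:
		return []byte{0xE0 | byte(r>>12), 0x80 | byte((r>>6)&0x3F), 0x80 | byte(r&0x3F)}
	default:
		return []byte{0xF0 | byte(r>>18), 0x80 | byte((r>>12)&0x3F), 0x80 | byte((r>>6)&0x3F), 0x80 | byte(r&0x3F)}
	}
}

func main() {
	fmt.Printf("% x\n", encodeUTF8('a')) // 61
	fmt.Printf("% x\n", encodeUTF8('世')) // e4 b8 96
}
```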

For example, the code point of the Chinese character 丑 ("ugly") is 0x4E11 (0100 1110 0001 0001), which falls in the range 0000 0800 ~ 0000 FFFF in the third row of the table above, so 丑 must be encoded in the three-byte form:

The three leading 1s in the first byte indicate that the character occupies three bytes. Starting from the last bit of 丑's code point and working backwards, the bits are filled into the 16 x positions of the format, and any remaining x positions are filled with 0. The result is that the UTF-8 encoding of 丑 is 11100100 10111000 10010001, or E4 B8 91 in hexadecimal.
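This is easy to verify in Go, since Go strings are simply UTF-8 encoded byte sequences:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	fmt.Printf("% x\n", []byte("丑")) // e4 b8 91

	// The standard library encoder produces the same three bytes
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, '丑')
	fmt.Printf("% x\n", buf[:n]) // e4 b8 91
}
```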

Decoding UTF-8 is just as straightforward. If the highest bit of a byte is 0, the byte is a single character on its own. If the highest bit is 1, the number of consecutive leading 1s tells you how many bytes the current character occupies: 丑 starts with three 1s, so it occupies three bytes, and the significant bits are then extracted from those bytes.
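A rough sketch of that rule in Go (runeLenFromLeadByte is a hypothetical helper; the real decoder, utf8.DecodeRune, also validates the continuation bytes):

```go
package main

import (
	"fmt"
	"math/bits"
)

// runeLenFromLeadByte counts the leading 1 bits of the first byte:
// 0 leading 1s means a single ASCII byte, otherwise the count is the
// number of bytes the character occupies.
func runeLenFromLeadByte(b byte) int {
	n := bits.LeadingZeros8(^b) // leading zeros of ^b == leading ones of b
	if n == 0 {
		return 1
	}
	return n
}

func main() {
	s := "丑" // bytes e4 b8 91
	fmt.Println(runeLenFromLeadByte(s[0])) // 3
}
```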

UTF-8 in Go

Go source files are always encoded in UTF-8, and the text strings that Go programs manipulate are by convention UTF-8 encoded as well. The unicode package provides functions for working with individual characters (such as distinguishing letters from digits, or converting between upper and lower case), while the unicode/utf8 package provides functions for encoding and decoding characters as UTF-8.
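For instance, a small sketch using functions from these two packages:

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

func main() {
	// unicode: classification and case conversion of single runes
	fmt.Println(unicode.IsLetter('世'), unicode.IsDigit('7')) // true true
	fmt.Printf("%c\n", unicode.ToUpper('g'))                  // G

	// unicode/utf8: encoding-level operations
	fmt.Println(utf8.RuneLen('世'))             // 3
	fmt.Println(utf8.ValidString("hello, 世界")) // true
}
```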

Many Unicode characters are hard to type on a keyboard, some look too similar to others to distinguish, and some are even invisible. In Go string literals, escape sequences let us specify Unicode characters by their code point value, in two forms:

  • \uhhhh for a 16-bit code point value
  • \Uhhhhhhhh for a 32-bit code point value

Each h is a hexadecimal digit. The 32-bit form is rarely needed. Both forms denote the UTF-8 encoding of the given code point. Thus the following string literals all represent the same six-byte string:

"World" "\xe4\xb8\x96\xe7\x95\x8c" "\u4e16\u754c" "\U00004e16\U0000754c" "\u4e16\u754c" "\U00004e16\U0000754c" "\u4e16\u754c" "\U00004e16\U0000754cCopy the code

A character whose code point is below 256 can be written with a single hexadecimal escape, such as '\x41' for 'A', but higher code points must use a \u or \U escape. Consequently, '\xe4\xb8\x96' is not a legal rune literal, even though those three bytes are a valid UTF-8 encoding of a single code point.

Thanks to a convenient property of UTF-8, many string operations do not require decoding at all. We can determine whether one string is a prefix of another directly on the bytes:

// Is prefix a prefix of s?
func HasPrefix(s, prefix string) bool {
	return len(s) >= len(prefix) && s[:len(prefix)] == prefix
}

// Is suffix a suffix of s?
func HasSuffix(s, suffix string) bool {
	return len(s) >= len(suffix) && s[len(s)-len(suffix):] == suffix
}

// Is substr a substring of s?
func Contains(s, substr string) bool {
	for i := 0; i < len(s); i++ {
		if HasPrefix(s[i:], substr) {
			return true
		}
	}
	return false
}
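These are essentially how strings.HasPrefix, strings.HasSuffix, and strings.Contains from the standard library behave, and the byte-wise comparison is safe for non-ASCII text too, because UTF-8 is designed so that one character's encoding never appears inside another's. A quick usage check with the standard library versions:

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	s := "hello, 世界"
	fmt.Println(strings.HasPrefix(s, "hello")) // true
	fmt.Println(strings.HasSuffix(s, "世界"))   // true
	fmt.Println(strings.Contains(s, "世"))      // true
}
```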

If we really do need to process Unicode characters one by one, we must decode them. Consider the following case:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "hello, 世界"
	fmt.Println(len(s))                    // 13 bytes
	fmt.Println(utf8.RuneCountInString(s)) // 9 characters

	// UTF-8 decoding is needed to process the characters one by one
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		fmt.Printf("%d\t%c\n", i, r)
		i += size
	}
}

Each call to DecodeRuneInString returns r (the character itself) and the number of bytes r occupies in its UTF-8 encoding; that size is used to advance the index i to the next character in the string. Loops of this shape come up constantly, but fortunately Go's range loop over a string performs the UTF-8 decoding implicitly. The following figure shows the output of the loop (for non-ASCII characters, the index increases by more than 1).
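A minimal sketch of that range form, equivalent to the explicit loop above:

```go
package main

import "fmt"

func main() {
	s := "hello, 世界"
	// range decodes UTF-8 implicitly: i is the byte index of each character,
	// so it jumps by 3 when it passes the Chinese characters
	for i, r := range s {
		fmt.Printf("%d\t%c\n", i, r)
	}
}
```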

Thanks to the following excellent articles:

In-depth understanding of character encodings (ASCII, Unicode, UTF-8, UTF-16, UTF-32)

Golang Character encoding, UTF-8, and Unicode

Go string encoding, Unicode and UTF-8

A Go language problem with Unicode, Rune, UTF-8, and String

Go Programming Language — Alan A. A. Donovan