How to convert a string with special characters into []byte?

Preface

Some time ago I published "HTTP requests in Go: HTTP/1.1 request flow analysis", so over the last couple of days I had planned to study the HTTP/2 request source code. It turned out to be too complex, so I wandered over to Zhihu instead and found a very interesting question: "How do you convert a golang string with special characters into []byte?" This article is the result of that change of pace.

The problem

I won't retype the code from the original question; the screenshot from the question shows it:

My first reaction was that ASCII values should range from 0 to 127, so how can a value exceed 127? The special character in the image is U+0081 (it may not display; just remember that its Unicode code point is \u0081). It is not an English full stop.

The tangled relationship between Unicode and UTF-8

Baidu Encyclopedia covers Unicode and UTF-8 in detail, so I will not elaborate here and only excerpt the definitions relevant to this article:

Unicode assigns every character a single, unique numeric code (a code point). The excerpt describes this as typically two bytes per character, which covers the commonly used range; the full range runs up to U+10FFFF.
UTF-8 is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, and it uses encodings of different lengths for different ranges of characters. For characters in the range 0x00-0x7F, the UTF-8 encoding is exactly the same as the ASCII encoding.
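To make "variable length" concrete, here is a minimal sketch that asks the standard library how many bytes UTF-8 needs for a few sample characters. The helper name encodedLen is hypothetical and only wraps utf8.RuneLen; the sample characters are chosen for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen reports how many bytes UTF-8 needs for the rune r.
// It is a hypothetical helper that simply wraps utf8.RuneLen.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// One sample from each row of the UTF-8 length ranges.
	for _, r := range []rune{'a', '\u0081', '新', '\U0001F600'} {
		fmt.Printf("U+%04X needs %d byte(s)\n", r, encodedLen(r))
	}
	// Characters in the ASCII range keep their ASCII value as a single byte.
	fmt.Println([]byte("a")) // [97]
}
```

The four sample characters need 1, 2, 3 and 4 bytes respectively.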

Characters in Go

byte and rune are defined as type byte = uint8 and type rune = int32, respectively.

The uint8 range is 0 to 255, so a byte can represent only a small number of Unicode characters. A 4-byte rune, on the other hand, is large enough to hold any Unicode code point.

We verify this with the following code:

	var (
		c1 byte = 'a'
		c2 byte = '新' // does not compile: 26032 does not fit in a byte
		c3 rune = '新'
	)
	fmt.Println(c1, c2, c3)

The program does not compile: the declaration of c2 is rejected, and VS Code shows a very detailed message: '新' (untyped rune constant 26032) overflows byte.

Next, we verify that a character and its Unicode code point are simply integers:

	c := '\u0081' // the special character, written as an escape because it does not display
	fmt.Printf("0x%x, %d\n", c, c)                // Output: 0x81, 129
	fmt.Println(0x81 == c, '\u0081' == c, 129 == c) // Output: true true true
	// \u0081 does not display on screen, so print the capital letter A instead
	fmt.Printf("%c\n", 65) // Output: A

The three true values above show that a character, its Unicode code point, and the corresponding integer are the same thing, and that an integer can be converted back to its character representation.

Strings in Go are UTF-8 encoded

According to the official Go blog post at blog.golang.org/strings:


Go source code is always UTF-8.
A string holds arbitrary bytes.
A string literal, absent byte-level escapes, always holds valid UTF-8 sequences.


In short, this boils down to two points:

Go source code is always encoded in UTF-8, and a string can store arbitrary bytes.
Absent byte-level escapes, a string literal is always a valid UTF-8 sequence.
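The two points can be checked with a small sketch. The helper isValidUTF8 is a hypothetical name that wraps utf8.ValidString, and the sample strings are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isValidUTF8 is a hypothetical helper that wraps utf8.ValidString.
func isValidUTF8(s string) bool {
	return utf8.ValidString(s)
}

func main() {
	plain := "新"          // no byte-level escapes: a valid UTF-8 sequence
	escaped := "\xff\xfe" // byte-level escapes: arbitrary bytes

	fmt.Println(isValidUTF8(plain), len(plain))     // true 3
	fmt.Println(isValidUTF8(escaped), len(escaped)) // false 2
}
```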

With these basics in mind, and knowing that a string literal is a valid UTF-8 sequence, we can now manually encode the string "\u0081" (the character itself may not display, so it is written here as its Unicode escape).

Unicode to UTF-8 encoding table:

Unicode (hexadecimal)	UTF-8 byte stream (binary)
000000-00007F	0xxxxxxx
000080-0007FF	110xxxxx 10xxxxxx
000800-00FFFF	1110xxxx 10xxxxxx 10xxxxxx
010000-10FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The binary representation of the character \u0081 is 10000001 and its hexadecimal representation is 0x81.

According to the Unicode to UTF-8 table, 0x7F < 0x81 < 0x7FF, so this special character needs two bytes, and the UTF-8 template to apply is 110xxxxx 10xxxxxx.

We convert 10000001 into its UTF-8 binary sequence with the following steps:

Step 1: Pad the character with leading zeros to match the number of x positions. The template has 11 x positions and the character has 8 bits, so three zeros are added at the high end, giving 00010000001.

Step 2: The x positions form two groups of length 5 and 6. Cut 00010000001 from the low end to the high end into 6 bits and then 5 bits, which gives 000001 and 00010.

Step 3: Fill 000001 and 00010 into the template 110xxxxx 10xxxxxx from low to high. The resulting UTF-8 binary sequence is 11000010 10000001.
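The three steps above can be sketched with bit operations. The function encodeTwoByte is a hypothetical helper written for this illustration and only handles the two-byte row of the table (0x80 to 0x7FF); the result is cross-checked against utf8.EncodeRune.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeTwoByte applies the 110xxxxx 10xxxxxx template by hand.
// It is only meaningful for code points in the range 0x80-0x7FF.
func encodeTwoByte(r rune) [2]byte {
	low := byte(r & 0x3F)         // lowest 6 bits: 000001 for 0x81
	high := byte((r >> 6) & 0x1F) // next 5 bits:   00010 for 0x81
	return [2]byte{0xC0 | high, 0x80 | low}
}

func main() {
	manual := encodeTwoByte(0x81)
	fmt.Printf("%08b %08b\n", manual[0], manual[1]) // 11000010 10000001

	// Cross-check against the standard library encoder.
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, 0x81)
	fmt.Println(buf[:n], manual[:]) // [194 129] [194 129]
}
```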

We let Go convert the binary values to integers:

	fmt.Printf("%d, %d\n", 0b11000010, 0b10000001)
	// Output: 194, 129

To conclude: when a character is converted to a byte, the result is the integer value of the character itself. When a string is converted to []byte, the result is its sequence of UTF-8 bytes (strings in Go store UTF-8 bytes). If we now look back at the original problem, the output is exactly what we should expect.
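The contrast in this conclusion can be shown side by side. The helpers charToByte and stringToBytes are hypothetical names used only for this sketch; each wraps a plain Go conversion.

```go
package main

import "fmt"

// charToByte converts a single character to a byte by truncating its code point.
func charToByte(r rune) byte { return byte(r) }

// stringToBytes returns the UTF-8 bytes stored in s.
func stringToBytes(s string) []byte { return []byte(s) }

func main() {
	fmt.Println(charToByte('\u0081'))    // 129: the code point itself
	fmt.Println(stringToBytes("\u0081")) // [194 129]: the UTF-8 encoding

	// For ASCII characters the two views coincide.
	fmt.Println(charToByte('a'), stringToBytes("a")) // 97 [97]
}
```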

rune in Go

My guess is that the questioner expected "converting a string to a byte slice" to give the same result as "converting a character to a byte". This is where rune comes in handy. Let's look at the effect of using rune:

	fmt.Println([]rune("\u0081"))
	// Output: [129]

As the output shows, converting a string to a []rune slice turns each character directly into its Unicode code point.

We simulate the conversion of a string to a []rune slice and []rune slice to a string with the following code:

String to rune slice:

	// Needs imports "fmt" and "unicode/utf8".
	// Convert the string directly into a []rune slice
	for _, v := range []rune("新世界杂货铺") {
		fmt.Printf("%x ", v)
	}
	fmt.Println()
	// Decode the same string one rune at a time from its bytes
	bs := []byte("新世界杂货铺")
	for len(bs) > 0 {
		r, w := utf8.DecodeRune(bs)
		fmt.Printf("%x ", r)
		bs = bs[w:]
	}
	fmt.Println()
	// Output:
	// 65b0 4e16 754c 6742 8d27 94fa
	// 65b0 4e16 754c 6742 8d27 94fa

In the code above, utf8.DecodeRune converts the UTF-8 byte sequence passed in into a rune, that is, a Unicode code point.

Rune slice to string:

	// Convert the rune slice directly into a string
	rs := []rune{0x65b0, 0x4e16, 0x754c, 0x6742, 0x8d27, 0x94fa}
	fmt.Println(string(rs))
	// Encode each rune into UTF-8 bytes by hand and join the results
	utf8bs := make([]byte, 0)
	for _, r := range rs {
		bs := make([]byte, 4)
		w := utf8.EncodeRune(bs, r)
		utf8bs = append(utf8bs, bs[:w]...)
	}
	fmt.Println(string(utf8bs))
	// Output:
	// 新世界杂货铺
	// 新世界杂货铺

In the code above, utf8.EncodeRune converts a rune into a sequence of UTF-8 bytes.

To sum up: when you cannot be sure that a string contains only single-byte characters, use rune. Each rune represents one Unicode character, and a []rune slice converts to and from a string seamlessly.
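A short sketch of why this matters in practice: byte length and character count differ for multi-byte text, and slicing by byte index can cut through a character. The helper firstN is a hypothetical name introduced for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstN returns the first n characters of s, counting runes rather than bytes.
// It is a hypothetical helper for illustration.
func firstN(s string, n int) string {
	rs := []rune(s)
	if n < 0 {
		n = 0
	}
	if n > len(rs) {
		n = len(rs)
	}
	return string(rs[:n])
}

func main() {
	s := "新世界"
	fmt.Println(len(s))                    // 9: length in bytes
	fmt.Println(utf8.RuneCountInString(s)) // 3: length in characters
	fmt.Println(firstN(s, 2))              // 新世
	fmt.Println(s[:2] == "新")              // false: a byte index cuts inside a character
}
```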

Understanding that strings in Go are really byte slices

As mentioned earlier, a string can store arbitrary bytes and is normally a valid UTF-8 byte sequence. This section reinforces that point with code.

	fmt.Println([]byte("新世界杂货铺"))
	s := "新世界杂货铺"
	// Index into the string one byte at a time
	for i := 0; i < len(s); i++ {
		fmt.Print(s[i], " ")
	}
	fmt.Println()
	// Output:
	// [230 150 176 228 184 150 231 149 140 230 157 130 232 180 167 233 147 186]
	// 230 150 176 228 184 150 231 149 140 230 157 130 232 180 167 233 147 186

The code shows that indexing into the string byte by byte gives the same result as converting the string into a byte slice, which confirms again that a string behaves like a byte slice.

Normally our strings hold valid UTF-8 bytes, but that does not mean a string can only hold UTF-8. Strings in Go can store arbitrary byte data.


	bs := []byte{65, 73, 230, 150, 176, 255}
	fmt.Println(string(bs))         // Turn arbitrary bytes into a string
	fmt.Println([]byte(string(bs))) // Turn the string back into a byte slice

	rs := []rune(string(bs)) // Turn the string into a rune slice
	fmt.Println(rs)          // Output the rune slice
	fmt.Println(string(rs))  // Turn the rune slice into a string

	for len(bs) > 0 {
		r, w := utf8.DecodeRune(bs)
		fmt.Printf("%d: 0x%x ", r, r) // Output each rune in decimal and hexadecimal
		bs = bs[w:]
	}
	fmt.Println()
	fmt.Println([]byte(string(rs))) // Turn the rune slice into a string, then into a byte slice again
	// Output:
	// AI新�
	// [65 73 230 150 176 255]
	// [65 73 26032 65533]
	// AI新�
	// 65: 0x41 73: 0x49 26032: 0x65b0 65533: 0xfffd
	// [65 73 230 150 176 239 191 189]


Reading through the code and output above, the first five lines of output should raise no questions. The sixth line, however, is probably not what you expected.

As mentioned earlier, a string can store arbitrary bytes, and this causes problems when the stored bytes are not a valid UTF-8 sequence.

We already know that utf8.DecodeRune converts bytes into runes. When it meets bytes that do not conform to the UTF-8 encoding rules, utf8.DecodeRune returns the replacement character \uFFFD (the constant utf8.RuneError), which is the 0xfffd in the output above.

The trouble with this replacement character is that, because the bytes were not valid UTF-8, no correct code point could be recovered, so \uFFFD takes the place where the correct code point should have been. When the rune slice is converted back into a string, the string stores valid UTF-8 bytes, so the sixth line shows the valid UTF-8 encoding of \uFFFD (239 191 189) instead of the original byte 255.

⚠️ Note: in everyday development, conversions between rune slices and byte slices must start from a string that is not garbled (that is, its bytes conform to the UTF-8 encoding rules). Otherwise errors like the one above are likely to occur.
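One way to follow this advice is to check the bytes before converting them. The sketch below is an assumption about how such a guard might look, not a prescribed approach; toRunes is a hypothetical helper, while utf8.Valid is the standard library check.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// toRunes is a hypothetical helper: it refuses to convert bytes that are
// not valid UTF-8 instead of silently producing U+FFFD.
func toRunes(bs []byte) ([]rune, error) {
	if !utf8.Valid(bs) {
		return nil, errors.New("input is not valid UTF-8")
	}
	return []rune(string(bs)), nil
}

func main() {
	good := []byte{65, 73, 230, 150, 176}
	bad := []byte{65, 73, 230, 150, 176, 255}

	rs, err := toRunes(good)
	fmt.Println(rs, err) // [65 73 26032] <nil>

	rs, err = toRunes(bad)
	fmt.Println(rs, err) // [] input is not valid UTF-8
}
```

Returning an error keeps the original bytes untouched, so the caller can decide whether to reject the input or handle it as raw bytes.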

Multiple representations of strings

This section is an aside. Try not to use these special representations in real code: they look sophisticated but are not very readable.

Let’s look at the code directly:

	bs := []byte("新")
	for i := 0; i < len(bs); i++ {
		fmt.Printf("0x%x ", bs[i])
	}
	fmt.Println()
	fmt.Println("\xe6\x96\xb0")
	fmt.Println("\xe6\x96\xb0世界杂货铺" == "新世界杂货铺")
	fmt.Println('\u65b0' == '新')
	fmt.Println("\u65b0世界杂货铺" == "新世界杂货铺")
	// Output:
	// 0xe6 0x96 0xb0
	// 新
	// true
	// true
	// true

So far, the only forms I have found that can be written directly inside a string are Unicode escapes and single-byte hexadecimal escapes. Readers are welcome to share other representations.

Finally, I hope this article is of some help to everyone who reads it.
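Two further forms that Go string literals accept are the 8-digit \U escape and 3-digit octal byte escapes. The sketch below lists them next to the forms already shown; the function name variants is hypothetical and exists only for this example.

```go
package main

import "fmt"

// variants returns the same one-character string written in several ways.
// The function name is hypothetical and used only for this illustration.
func variants() []string {
	return []string{
		"新",            // the character itself
		"\u65b0",       // 4-digit Unicode escape
		"\U000065b0",   // 8-digit Unicode escape
		"\xe6\x96\xb0", // hexadecimal byte escapes
		"\346\226\260", // octal byte escapes
	}
}

func main() {
	for _, v := range variants() {
		fmt.Println(v == "新", []byte(v))
	}
	// A raw string literal keeps the backslash sequence as plain text.
	fmt.Println(`\u65b0` == "新", len(`\u65b0`))
}
```

Every entry in the list compares equal to "新", while the raw string literal stays six plain characters long.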
