Translator's note

This article is translated from https://blog.golang.org/strings. Some passages may not be translated well; if you have questions, please feel free to ask and I will answer.

Preface

The previous blog post explained how slices work in Go, using a number of examples to illustrate the mechanism behind their implementation. With that background, this post discusses strings in Go. At first, strings might seem too simple a topic for a blog post, but using them well requires understanding not only how they work, but also the difference between a byte, a character, and a rune, the difference between Unicode and UTF-8, and the difference between a string and a string literal.

One way to approach this topic is to think of it as an answer to the frequently asked question, “When I index a Go string at position n, why don’t I get the nth character?” As you will see, this question leads us to many details about how text works in the modern world.

What is a string

Let’s start with some basics.

In Go, a string is in effect a read-only slice of bytes. This article assumes you already know what a slice of bytes is and how it works; if you don’t, read the previous blog post.
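To make the “read-only slice of bytes” description concrete, here is a minimal sketch. The helper name upperFirstByte is hypothetical and exists only for this illustration; it shows that changing a string’s contents requires copying the bytes into a []byte first.

```go
package main

import "fmt"

// upperFirstByte is a hypothetical helper for illustration. It returns a copy
// of s whose first byte is upper-cased when that byte is an ASCII lower-case
// letter. It must work on a []byte copy because the bytes of a string cannot
// be modified in place.
func upperFirstByte(s string) string {
	b := []byte(s) // the conversion copies the bytes
	if len(b) > 0 && b[0] >= 'a' && b[0] <= 'z' {
		b[0] -= 'a' - 'A'
	}
	return string(b) // converting back makes another copy
}

func main() {
	s := "hello"
	fmt.Println(len(s), s[0]) // length in bytes and the first byte: 5 104

	// s[0] = 'H' would not compile: strings are read-only.
	fmt.Println(upperFirstByte(s)) // Hello
	fmt.Println(s)                 // hello (the original is unchanged)
}
```

Indexing and len work on a string just as they do on a byte slice; only assignment through an index is rejected by the compiler.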

It’s important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.

Here is a string literal (more about those soon) that uses the \xNN notation to define a string constant holding some peculiar byte values. (Of course, bytes range from hexadecimal values 00 through FF, inclusive.)

const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

Printing strings

Because some of the bytes in the sample string are not valid ASCII, or even valid UTF-8, printing the string directly produces garbled output. The print statement is as follows:

fmt.Println(sample)

It produces this mess (whose exact appearance varies with the environment):

��=� ⌘

To find out what the string really holds, we need to take it apart and examine the pieces. There are several ways to do this. The most obvious is to loop over its contents and print each byte individually, as in this for loop:

for i := 0; i < len(sample); i++ {
        fmt.Printf("%x ", sample[i])
}

As mentioned earlier, indexing a string yields its bytes, not its characters. We’ll return to that topic in detail below; for now, let’s stick with just the bytes. This is the output from the byte-by-byte loop:

bd b2 3d bc 20 e2 8c 98

Notice how the individual bytes match the hexadecimal escapes that defined the string.

A shorter way to generate presentable output for a messy string is to use the %x (hexadecimal) format verb of fmt.Printf. It dumps the sequential bytes of the string as hexadecimal digits, two per byte.

fmt.Printf("%x\n", sample)

Compare its output with the output above:

bdb23dbc20e28c98

A nice trick is to use the “space” flag in that format, putting a space between the % and the x. Compare the format string used here with the one above:

fmt.Printf("% x\n", sample)

The bytes now come out with spaces between them, which makes the result a little easier to read:

bd b2 3d bc 20 e2 8c 98

There’s more. The %q (quoted) verb escapes any non-printable byte sequences in a string, so the output is unambiguous. The print statement is as follows:

fmt.Printf("%q\n", sample)

This technique is handy when much of the string is intelligible as text but there are peculiarities to root out. The output is as follows:

"\xbd\xb2=\xbc ⌘"

If we squint at that, we can see that buried in the noise is one ASCII equals sign, along with a regular space, and at the end appears the well-known Swedish “Place of Interest” symbol. That symbol has the Unicode value U+2318, encoded as UTF-8 by the bytes after the space (hexadecimal value 20): e2 8c 98.

If we are unfamiliar with or confused by strange values in the string, we can add the “plus” flag to the %q verb. This flag causes the output to escape not only non-printable sequences but also any non-ASCII bytes, all while interpreting UTF-8. The result is that it exposes the Unicode values of properly formatted UTF-8 that represents non-ASCII data in the string:

fmt.Printf("%+q\n", sample)

With that format, the Unicode value of the Swedish symbol shows up as a \u escape:

"\xbd\xb2=\xbc \u2318"

These printing techniques are good to know when debugging the contents of strings, and will be handy in the discussion that follows. It’s worth pointing out as well that all these methods behave exactly the same for byte slices as they do for strings.
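The claim about byte slices can be checked with a short sketch. The helper formatAll is a hypothetical name introduced here; it simply applies the verbs discussed in this section to any value.

```go
package main

import "fmt"

// formatAll is a hypothetical helper that renders v with the four format
// verbs discussed in this section, in the order %x, % x, %q, %+q.
func formatAll(v interface{}) []string {
	return []string{
		fmt.Sprintf("%x", v),
		fmt.Sprintf("% x", v),
		fmt.Sprintf("%q", v),
		fmt.Sprintf("%+q", v),
	}
}

func main() {
	const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

	fromString := formatAll(sample)
	fromBytes := formatAll([]byte(sample))

	// Each pair of lines should be identical.
	for i := range fromString {
		fmt.Println("string:", fromString[i])
		fmt.Println("[]byte:", fromBytes[i])
	}
}
```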

Here is the full set of printing options we’ve listed, presented as a complete program that you can edit and run.

package main

import "fmt"

func main() {
    const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

    fmt.Println("Println:")
    fmt.Println(sample)

    fmt.Println("Byte loop:")
    for i := 0; i < len(sample); i++ {
        fmt.Printf("%x ", sample[i])
    }
    fmt.Printf("\n")

    fmt.Println("Printf with %x:")
    fmt.Printf("%x\n", sample)

    fmt.Println("Printf with % x:")
    fmt.Printf("% x\n", sample)

    fmt.Println("Printf with %q:")
    fmt.Printf("%q\n", sample)

    fmt.Println("Printf with %+q:")
    fmt.Printf("%+q\n", sample)
}

UTF-8 and strings

As we saw, indexing a string yields its bytes, not its characters: a string is just a sequence of bytes. That means that when we store a character value in a string, we store its byte-at-a-time representation. Let’s look at a more controlled example to see how that happens.

Here is a simple program that prints a string constant holding a single character in three different ways: once as a plain string, once as an ASCII-only quoted string, and once as individual bytes in hexadecimal. To avoid any confusion, we create a “raw string”, enclosed in back quotes, so it can contain only literal text. (Regular strings, enclosed in double quotes, can contain escape sequences as described in the previous section.)

package main

import "fmt"

func main() {
    // A raw string literal: the bytes are exactly the UTF-8 text in the source.
    const placeOfInterest = `⌘`

    fmt.Printf("plain string: ")
    fmt.Printf("%s", placeOfInterest)
    fmt.Printf("\n")

    fmt.Printf("quoted string: ")
    fmt.Printf("%+q", placeOfInterest)
    fmt.Printf("\n")

    fmt.Printf("hex bytes: ")
    for i := 0; i < len(placeOfInterest); i++ {
        fmt.Printf("%x ", placeOfInterest[i])
    }
    fmt.Printf("\n")
}

The output is as follows:

plain string: ⌘
quoted string: "\u2318"
hex bytes: e2 8c 98

This reminds us that the Unicode character value U+2318, the “Place of Interest” symbol ⌘, is represented by the bytes e2 8c 98, and that those bytes are the UTF-8 encoding of the hexadecimal value 2318.

It may be obvious or it may be subtle, depending on your familiarity with UTF-8, but it’s worth taking a moment to explain how the UTF-8 representation of the string was created. The simple fact is: it was created when the source code was written.

Source code in Go is defined to be UTF-8 text; no other representation is allowed. That means that when we write the following text in the source code

`⌘`

the text editor used to create the program places the UTF-8 encoding of the symbol ⌘ into the source text. When we print out the hexadecimal bytes, we’re just dumping the data the editor placed in the file.

In short, Go source code is UTF-8, so the source code for a string literal is UTF-8 text. If that string literal contains no escape sequences, which a raw string cannot, the constructed string holds exactly the source text between the quotes. Thus, by definition and by construction, a raw string always contains a valid UTF-8 representation of its contents. Similarly, unless it contains UTF-8-breaking escapes like those in the previous section, a regular string literal also always contains valid UTF-8.

As mentioned earlier, a string can hold arbitrary bytes. As this section has shown, a string literal always contains UTF-8 text as long as it has no byte-level escapes.

To summarize, strings can contain arbitrary bytes, but when constructed from string literals, those bytes are (almost always) UTF-8.
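As a rough check of this summary, the sketch below asks the standard library whether a string’s bytes form valid UTF-8. The function utf8.ValidString comes from the unicode/utf8 package; the helper name classify is an assumption made for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// classify is a hypothetical helper that reports whether the bytes of s
// form valid UTF-8.
func classify(s string) string {
	if utf8.ValidString(s) {
		return "valid UTF-8"
	}
	return "not valid UTF-8"
}

func main() {
	const fromLiteral = `⌘`              // raw literal: bytes come straight from the UTF-8 source file
	const withEscapes = "\xbd\xb2=\xbc ⌘" // byte-level escapes inject arbitrary bytes

	fmt.Printf("%+q: %s\n", fromLiteral, classify(fromLiteral))
	fmt.Printf("%+q: %s\n", withEscapes, classify(withEscapes))
}
```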

Code points, characters, runes

So far we have been very careful about how we use the words “byte” and “character”. That’s partly because strings hold bytes, and partly because the idea of a “character” is a little hard to define. The Unicode standard uses the term “code point” to refer to the item represented by a single value. The code point U+2318, with hexadecimal value 2318, represents the symbol ⌘.

To pick a more prosaic example, the Unicode code point U+0061 is the lower case Latin letter ‘A’: a.

But what about the lower case grave-accented letter ‘A’, à? That’s a character, and it’s also a code point (U+00E0), but it has other representations. For example, we can use the “combining” grave accent code point, U+0300, and attach it to the lower case letter a, U+0061, to create the same character à. In general, a character may be represented by a number of different sequences of code points, and therefore by different sequences of UTF-8 bytes.

The concept of a character in computing is therefore ambiguous, or at least confusing, so we use it with care. To make things dependable, there are normalization techniques that guarantee a given character is always represented by the same code points, but that subject takes us too far from the topic of this article. A later blog post will explain how the Go libraries address normalization.
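The two representations of the accented letter can be compared directly. This is only an illustrative sketch: the helper name describe is hypothetical, and the strings are written with \u escapes so that the code points are explicit regardless of how an editor stores the accented letter.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper that summarizes a string as its bytes in
// hexadecimal, its length in bytes, and its number of code points.
func describe(s string) string {
	return fmt.Sprintf("% x bytes=%d runes=%d", s, len(s), utf8.RuneCountInString(s))
}

func main() {
	precomposed := "\u00e0" // the single code point U+00E0
	combined := "a\u0300"   // U+0061 followed by the combining grave accent U+0300

	fmt.Println(describe(precomposed)) // c3 a0 bytes=2 runes=1
	fmt.Println(describe(combined))    // 61 cc 80 bytes=3 runes=2

	// Both render as the same character, but the strings are not equal,
	// because string comparison works on bytes.
	fmt.Println(precomposed == combined) // false
}
```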

“Code point” is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears in the libraries and source code, and means exactly the same as “code point”, with one interesting addition.

The Go language defines the word rune as an alias for the type int32, so programs can be clear when an integer value represents a code point. Moreover, what you might think of as a character constant is called a rune constant in Go. The type and value of the following expression are rune and the integer value 0x2318.

'⌘'
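A small sketch can confirm both statements about the rune constant. The helper name describeRune is hypothetical; the format verbs %T, %#x, and %U are standard fmt verbs.

```go
package main

import "fmt"

// describeRune is a hypothetical helper that shows the dynamic type of a rune
// value together with its value in hexadecimal and in Unicode notation.
func describeRune(r rune) string {
	return fmt.Sprintf("%T %#x %U", r, r, r)
}

func main() {
	r := '⌘' // a rune constant; r gets the type rune

	fmt.Println(describeRune(r)) // int32 0x2318 U+2318

	// rune is an alias for int32, so this assignment needs no conversion.
	var i int32 = r
	fmt.Println(i == 0x2318) // true
}
```

Because rune is an alias rather than a distinct type, %T reports int32.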

To summarize, here are the salient points:

  1. Go source code is always UTF-8.
  2. A string holds arbitrary bytes.
  3. A string literal, absent byte-level escapes, always holds valid UTF-8 sequences.
  4. Those sequences represent Unicode code points, called runes.
  5. No guarantee is made in Go that characters in strings are normalized.

Range loops

Besides the axiomatic detail that Go source code is UTF-8, there is really only one way that Go treats UTF-8 specially: when a for range loop is used on a string.

We’ve already seen what happens with a regular for loop. A for range loop, by contrast, decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the index is the starting position of the current rune, measured in bytes, and the value is its code point. Here is an example that uses yet another handy Printf format, %#U, which shows the code point’s Unicode value and its printed representation:

const nihongo = "日本語"
for index, runeValue := range nihongo {
  fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
}

The output shows how each code point occupies multiple bytes:

U+65E5 '日' starts at byte position 0
U+672C '本' starts at byte position 3
U+8A9E '語' starts at byte position 6
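It may also help to see what a for range loop does with bytes that are not valid UTF-8. The sketch below runs the loop over the sample string from the printing section. Per the language specification, each invalid byte yields the replacement character U+FFFD and the loop advances by one byte. The helper name decodeAll is an assumption made for this example.

```go
package main

import "fmt"

// decodeAll is a hypothetical helper that records what a for range loop
// yields for s, as "byteIndex:codePoint" entries.
func decodeAll(s string) []string {
	var out []string
	for i, r := range s {
		out = append(out, fmt.Sprintf("%d:%U", i, r))
	}
	return out
}

func main() {
	const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

	// Prints: [0:U+FFFD 1:U+FFFD 2:U+003D 3:U+FFFD 4:U+0020 5:U+2318]
	fmt.Println(decodeAll(sample))
}
```

The three valid trailing bytes e2 8c 98 are decoded together as the single rune U+2318 starting at byte position 5.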

Libraries

Go’s standard library provides strong support for interpreting UTF-8 text. If a for range loop isn’t sufficient for your purposes, chances are the facility you need is provided by a package in the library.

The most important such package is unicode/utf8, which contains helper routines to validate, disassemble, and reassemble UTF-8 strings. The following program is equivalent to the for range example above, but uses the DecodeRuneInString function from that package to do the work. The function returns the rune and its width in UTF-8-encoded bytes.

const nihongo = "日本語"
for i, w := 0, 0; i < len(nihongo); i += w {
  runeValue, width := utf8.DecodeRuneInString(nihongo[i:])
  fmt.Printf("%#U starts at byte position %d\n", runeValue, i)
  w = width
}
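Decoding is only one direction. As a sketch of the reverse, the following uses utf8.EncodeRune, utf8.RuneLen, and utf8.RuneCountInString from the same package; the helper name encode is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode is a hypothetical helper that returns the UTF-8 bytes of r.
func encode(r rune) []byte {
	buf := make([]byte, utf8.UTFMax) // large enough for any single rune
	n := utf8.EncodeRune(buf, r)     // n is the number of bytes written
	return buf[:n]
}

func main() {
	fmt.Printf("% x\n", encode('⌘')) // e2 8c 98

	// RuneLen reports how many bytes a rune needs without encoding it.
	fmt.Println(utf8.RuneLen('a'), utf8.RuneLen('⌘')) // 1 3

	// Counting runes versus counting bytes.
	const nihongo = "日本語"
	fmt.Println(utf8.RuneCountInString(nihongo), len(nihongo)) // 3 9
}
```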

Conclusion

To answer the question posed at the beginning: strings are built from bytes, so indexing them yields bytes, not characters. A string might not even hold characters. In fact, the definition of “character” is ambiguous, and it would be a mistake to try to resolve the ambiguity by defining that strings are made of characters.
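For readers who do want the nth code point rather than the nth byte, one possible approach is to walk the string with a for range loop. This is a sketch, not a library function: the helper name nthRune is hypothetical, and it counts code points, which are not necessarily user-perceived characters.

```go
package main

import "fmt"

// nthRune is a hypothetical helper that returns the n-th code point of s,
// counting from zero. The boolean is false when s has fewer than n+1 code
// points or when n is negative.
func nthRune(s string, n int) (rune, bool) {
	for _, r := range s {
		if n == 0 {
			return r, true
		}
		n--
	}
	return 0, false
}

func main() {
	const nihongo = "日本語"

	fmt.Printf("%x\n", nihongo[1]) // 97: the second byte of the first rune

	if r, ok := nthRune(nihongo, 1); ok {
		fmt.Printf("%c\n", r) // 本: the second code point
	}
}
```

Converting with []rune(s) and indexing the result gives the same answer, at the cost of allocating a slice for the whole string.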

There’s much more to say about Unicode, UTF-8, and the world of multilingual text processing, but it can wait for another post. For now, we hope you have a better understanding of how Go strings behave and that, although they may contain arbitrary bytes, UTF-8 is a central part of their design.