Introduction

The string is one of Go's basic data types and is used in almost every program. This article looks at how strings work so that you can understand and use them better.

The underlying structure

The underlying structure of a string is defined in the string.go file of the runtime package:

// src/runtime/string.go
type stringStruct struct {
  str unsafe.Pointer
  len int
}
  • str: A pointer to the memory address where the string's bytes are actually stored.
  • len: The length of the string. As with slices, we can get this value in our code with the len() function. Note that len stores the number of bytes, not the number of characters, so the result can be confusing for characters that are not encoded in a single byte. More on multi-byte characters later.

For the string Hello, len is 5 and str points to the encoded bytes of its characters: 72 for H, 101 for e, and so on.

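We can print the underlying structure of a string and each byte it stores with the following code:

package main

import (
  "fmt"
  "unsafe"
)

// stringStruct mirrors the unexported runtime.stringStruct.
type stringStruct struct {
  str unsafe.Pointer
  len int
}

func main() {
  s := "Hello"
  // Reinterpret the string header as our own struct to inspect it.
  fmt.Println(*(*stringStruct)(unsafe.Pointer(&s)))

  // Print every byte stored in the string.
  for _, b := range []byte(s) {
    fmt.Println(b)
  }
}

Run output (the pointer value differs between builds and machines):

{0x8edaff 5}
72
101
108
108
111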

Because the runtime.stringStruct structure is unexported, we cannot use it directly, so the code defines its own stringStruct with fields identical to those of runtime.stringStruct.

Basic operation

Create

There are two basic ways to create a string, using var definitions and string literals:

var s1 string
s2 := "Hello World!"

Note that var s1 string gives s1 the zero value of the string type, which is the empty string "". A string can never be nil.
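The zero-value behaviour can be checked with a minimal sketch like the one below. The helper name isZeroString is hypothetical and exists only for this illustration.

```go
package main

import "fmt"

// isZeroString reports whether s equals the zero value of the string type,
// which is the empty string "". (Hypothetical helper for illustration.)
func isZeroString(s string) bool {
	return s == "" && len(s) == 0
}

func main() {
	var s1 string // declared without a value, so it holds the zero value ""
	s2 := "Hello World!"

	fmt.Println(isZeroString(s1)) // true
	fmt.Println(isZeroString(s2)) // false
	// Writing s1 == nil would not compile: a string cannot be compared with nil.
}
```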

String literals can be written with double quotes or back quotes. Special characters inside double quotes must be escaped, while text inside back quotes is taken literally:

s1 := "Hello \nWorld"
s2 := `Hello
World`

In the code above, the newline in s1 is written with the escape sequence \n, while in s2 it is typed directly. Because a back-quoted literal is exactly what we see in the source code, it is often used for large chunks of text (usually containing newlines) or text with many special characters. When using back quotes, also pay attention to the leading whitespace on every line after the first:

package main

import "fmt"

func main() {
  s := `hello
  world`

  fmt.Println(s)
}

I added two spaces before world on the second line, probably just for indentation and looks, but those spaces are actually part of the string. If that is not intended, it can cause confusion. The code above outputs:

hello
  world
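The difference between the two literal forms also shows up in their byte lengths. The following is a small sketch; the function name literalLengths is made up for this example.

```go
package main

import "fmt"

// literalLengths returns the byte lengths of the same source text written
// as a double-quoted literal and as a back-quoted literal.
// (Hypothetical helper for illustration.)
func literalLengths() (interpreted, raw int) {
	a := "a\nb" // \n is an escape sequence: one newline byte, 3 bytes in total
	b := `a\nb` // backslash and n are kept as written: 4 bytes in total
	return len(a), len(b)
}

func main() {
	i, r := literalLengths()
	fmt.Println(i, r) // 3 4
}
```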

Indexing and slicing

We can use an index to get the byte stored at a given position in a string, and the slice operator to get a substring:

package main

import "fmt"

func main() {
  s := "Hello World!"
  fmt.Println(s[0])

  fmt.Println(s[:5])
}

Output:

72
Hello

As the previous article in this series, on Go slices, also explained, slicing a string returns a string, not a slice.
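Because a substring is itself a string, its header can be inspected the same way as before. The sketch below reuses the hand-written stringStruct mirror from earlier; header and sharesStart are hypothetical helpers. It assumes the current gc toolchain, where a substring points into the memory of the original string instead of copying it.

```go
package main

import (
	"fmt"
	"unsafe"
)

// stringStruct mirrors the unexported runtime.stringStruct.
type stringStruct struct {
	str unsafe.Pointer
	len int
}

// header returns a copy of the header of s. (Hypothetical helper.)
func header(s string) stringStruct {
	return *(*stringStruct)(unsafe.Pointer(&s))
}

// sharesStart reports whether s[:n] starts at the same address as s,
// i.e. whether slicing reused the original bytes. (Hypothetical helper.)
func sharesStart(s string, n int) bool {
	sub := s[:n]
	return header(s).str == header(sub).str
}

func main() {
	s := "Hello World!"
	fmt.Println(sharesStart(s, 5))                 // whether the substring shares memory
	fmt.Println(header(s).len, header(s[:5]).len) // 12 5
}
```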

String concatenation

The simplest and most direct way to concatenate strings is the + operator, which can join any number of strings. Its drawback is that the strings to be concatenated must be known when the code is written. Another way is the Join() function from the standard library's strings package, which takes a slice of strings and a separator and joins the elements of the slice into a single string separated by that separator:

func main() {
  s1 := "Hello" + " " + "World"
  fmt.Println(s1)

  ss := []string{"Hello", "World"}
  fmt.Println(strings.Join(ss, " "))
}

The code above first concatenates strings with +, then puts the strings in a slice and joins them with strings.Join(). The result is the same. Note that when the strings are concatenated with + in a single expression, Go first calculates the total space required, allocates it once, and then copies each string into it. This differs from many other languages, so + concatenation carries no performance penalty in Go, and can even perform better thanks to this internal optimization, provided that the concatenation is done in one expression. The following code applies + repeatedly, which creates a large number of temporary strings and hurts performance:

s := "hello"
var result string
for i := 1; i < 100; i++ {
  result += s
}
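When the pieces are only known inside a loop, one common alternative (not part of the original text) is strings.Builder from the standard library. This is a minimal sketch; the function name repeatWithBuilder is hypothetical and n is assumed to be non-negative.

```go
package main

import (
	"fmt"
	"strings"
)

// repeatWithBuilder appends s to a strings.Builder n times and returns the
// result. It assumes n >= 0. (Hypothetical helper for illustration.)
func repeatWithBuilder(s string, n int) string {
	var sb strings.Builder
	sb.Grow(len(s) * n) // reserve the final size up front
	for i := 0; i < n; i++ {
		sb.WriteString(s) // appends to the internal buffer, no temporary strings
	}
	return sb.String()
}

func main() {
	fmt.Println(repeatWithBuilder("hello", 3)) // hellohellohello
}
```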

Let's test the performance difference between the approaches. First, define three functions that use repeated + concatenation, a single + expression, and Join() respectively:

func ConcatWithMultiPlus() {
  var s string
  for i := 0; i < 10; i++ {
    s += "hello"
  }
}

func ConcatWithOnePlus() {
  s1 := "hello"
  s2 := "hello"
  s3 := "hello"
  s4 := "hello"
  s5 := "hello"
  s6 := "hello"
  s7 := "hello"
  s8 := "hello"
  s9 := "hello"
  s10 := "hello"
  s := s1 + s2 + s3 + s4 + s5 + s6 + s7 + s8 + s9 + s10
  _ = s
}

func ConcatWithJoin() {
  s := []string{"hello", "hello", "hello", "hello", "hello", "hello", "hello", "hello", "hello", "hello"}
  _ = strings.Join(s, "")
}

Then define the benchmarks in the file benchmark_test.go:

func BenchmarkConcatWithOnePlus(b *testing.B) {
  for i := 0; i < b.N; i++ {
    ConcatWithOnePlus()
  }
}

func BenchmarkConcatWithMultiPlus(b *testing.B) {
  for i := 0; i < b.N; i++ {
    ConcatWithMultiPlus()
  }
}

func BenchmarkConcatWithJoin(b *testing.B) {
  for i := 0; i < b.N; i++ {
    ConcatWithJoin()
  }
}

Run the benchmarks:

$ go test -bench .
BenchmarkConcatWithOnePlus-8     11884388    170.5 ns/op
BenchmarkConcatWithMultiPlus-8    1227411    1006 ns/op
BenchmarkConcatWithJoin-8         6718507    157.5 ns/op

As you can see, a single + expression performs about the same as Join(), while repeated + concatenation takes roughly six times as long per operation as either of them. Also note that ConcatWithOnePlus() defines 10 string variables before concatenating them with +. If string literals were concatenated with + directly, the compiler would fold them into a single string literal and the results would not be comparable.

In the runtime package, concatenating strings with + is handled by the concatstrings() function:

// src/runtime/string.go
func concatstrings(buf *tmpBuf, a []string) string {
  idx := 0
  l := 0
  count := 0
  for i, x := range a {
    n := len(x)
    if n == 0 {
      continue
    }
    if l+n < l {
      throw("string concatenation too long")
    }
    l += n
    count++
    idx = i
  }
  if count == 0 {
    return ""
  }

  // If there is just one string and either it is not on the stack
  // or our result does not escape the calling frame (buf != nil),
  // then we can return that string directly.
  if count == 1 && (buf != nil || !stringDataOnStack(a[idx])) {
    return a[idx]
  }
  s, b := rawstringtmp(buf, l)
  for _, x := range a {
    copy(b, x)
    b = b[len(x):]
  }
  return s
}
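The same idea, sum the lengths first, then allocate once and copy, can be sketched in ordinary Go. This is only an illustration of the approach, not the runtime's implementation; the name concatOnce is hypothetical, and unlike the runtime version the final string conversion here copies the bytes once more.

```go
package main

import "fmt"

// concatOnce imitates the strategy of runtime.concatstrings in user code:
// compute the total length, allocate a single buffer, then copy each part.
// (Hypothetical helper for illustration.)
func concatOnce(parts ...string) string {
	total := 0
	for _, p := range parts {
		total += len(p)
	}
	if total == 0 {
		return ""
	}
	buf := make([]byte, 0, total) // one allocation sized for all parts
	for _, p := range parts {
		buf = append(buf, p...) // never reallocates: capacity is already enough
	}
	return string(buf) // this conversion copies once more, unlike the runtime
}

func main() {
	fmt.Println(concatOnce("Hello", " ", "World")) // Hello World
}
```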

Type conversion

We often need to convert a string to a []byte, or a []byte back to a string. Each conversion involves a memory copy, so take care not to convert too frequently. A string is converted to a []byte with the syntax []byte(str): a []byte with enough space is allocated, and the contents of the string are then copied into it.

func main() {
  s := "Hello"

  b := []byte(s)
  fmt.Println(len(b), cap(b))
}

Note that the output cap may not be the same as len, and the extra capacity is for the performance of subsequent appends.

A []byte is converted to a string with string(bs), and the process is similar.

What you don't know about strings

1 Encoding

In the early days of computing there were only single-byte encodings, the best known being ASCII (American Standard Code for Information Interchange). A single byte can encode at most 256 characters, which may be enough for English-speaking countries. But as computers spread around the world, it became apparent that a single byte was not enough to encode other languages, Chinese characters being a typical example. Unicode was proposed in response: it provides a unified encoding scheme for the symbols of all the world's languages. For more about Unicode, see reference 2, Every programmer must know Unicode.

Many people are unsure how Unicode relates to UTF8, UTF16, UTF32 and so on. Unicode only specifies the code value of each character, and this value is rarely stored or transferred directly. UTF8, UTF16 and UTF32 define the formats in which these code values are stored in memory or files and transferred over the network. For example, the Chinese character 中 has the Unicode code value 00004E2D, and its encodings are as follows (the UTF16 and UTF32 forms include a leading byte order mark):

UTF8: E4B8AD
UTF16BE: FEFF4E2D
UTF16LE: FFFE2D4E
UTF32BE: 0000FEFF00004E2D
UTF32LE: FFFE00002D4E0000

Go stores strings in UTF-8. UTF8 is a variable-length encoding with the advantage of being ASCII-compatible: ASCII characters take a single byte, and other characters take multi-byte sequences, with shorter sequences for the more frequently used characters to improve encoding efficiency. The drawback is that the variable-length encoding makes it impossible to tell the character length of a string directly from its byte length. Common Chinese characters are encoded with 3 bytes, such as 中 above. Rarer characters may need more: characters outside the Basic Multilingual Plane are encoded with 4 bytes, for example U+20000 is encoded as F0A08080.
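The variable length of UTF-8 can be seen by encoding individual code points with the standard library's utf8.EncodeRune. This is a small sketch; the helper name utf8Hex is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// utf8Hex returns the UTF-8 encoding of r as an uppercase hex string.
// (Hypothetical helper for illustration.)
func utf8Hex(r rune) string {
	buf := make([]byte, utf8.UTFMax) // UTFMax is 4, the longest encoding
	n := utf8.EncodeRune(buf, r)     // n is the number of bytes written
	return fmt.Sprintf("%X", buf[:n])
}

func main() {
	fmt.Println(utf8Hex('A'))     // 41        (1 byte, same as ASCII)
	fmt.Println(utf8Hex('中'))     // E4B8AD    (3 bytes)
	fmt.Println(utf8Hex(0x20000)) // F0A08080  (4 bytes)
}
```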

We use the len() function to get the encoded byte length, not the character length, which is important when working with non-ASCII characters:

func main() {
  s1 := "Hello World!"
  s2 := "你好，中国"

  fmt.Println(len(s1))
  fmt.Println(len(s2))
}

Output:

12
15

Hello World! has 12 characters, each a single byte, which is easy to understand. 你好，中国 has 5 characters (including the full-width comma), each encoded in 3 bytes, so the output is 15.

For strings that contain non-ASCII characters, we can use the RuneCountInString() function in the standard library's unicode/utf8 package to get the actual number of characters:

func main() {
  s1 := "Hello World!"
  s2 := "你好，中国"

  fmt.Println(utf8.RuneCountInString(s1)) // 12
  fmt.Println(utf8.RuneCountInString(s2)) // 5
}

For ease of understanding, consider the underlying structure of the string 中国: str points to six bytes, E4 B8 AD for 中 followed by E5 9B BD for 国, and len is 6.

2 Indexing and traversing

Indexing a string returns the byte value at that position. If the position falls inside a multi-byte encoding, the returned byte may not be a valid code value on its own:

s := "中国"
fmt.Println(s[0]) // 228

As mentioned earlier, the UTF8 encoding of 中 is E4B8AD, so s[0] yields the first byte, 228 (hexadecimal E4).

To make traversing strings easy, Go's for-range loop has special support for multi-byte encodings. The index returned in each iteration is the byte position at which the character starts, and the value is the character's code value:

func main() {
  s := "Go 语言"
  for index, c := range s {
    fmt.Println(index, c)
  }
}

So when multi-byte characters are encountered, the indexes are not contiguous. 语 occupies 3 bytes, so the index of 言 is the index of 语 (3) plus its byte count (3), which is 6. The output of the code above is as follows:

0 71
1 111
2 32
3 35821
6 35328
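What for-range does for each step can be approximated by hand with utf8.DecodeRuneInString, which returns the next character and the number of bytes it occupies. The following is a sketch under that assumption, not the compiler's actual implementation; runeStarts is a hypothetical name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeStarts returns the byte index at which each character of s begins,
// which is the same sequence of indexes a for-range loop over s yields.
// (Hypothetical helper for illustration.)
func runeStarts(s string) []int {
	var starts []int
	for i := 0; i < len(s); {
		_, size := utf8.DecodeRuneInString(s[i:]) // size is 1 to 4 bytes here
		starts = append(starts, i)
		i += size // jump to the first byte of the next character
	}
	return starts
}

func main() {
	fmt.Println(runeStarts("Go 语言")) // [0 1 2 3 6]
}
```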

We can also print it as a character:

func main() {
  s := "Go 语言"
  for index, c := range s {
    fmt.Printf("%d %c\n", index, c)
  }
}

Output:

0 G
1 o
2  
3 语
6 言

With this approach, we can write our own simple version of RuneCountInString(); let's call it utf8Count:

func utf8Count(s string) int {
  var count int
  for range s {
    count++
  }
  return count
}

fmt.Println(utf8Count("中国")) // 2

3 Garbled and non-printable characters

If a string contains invalid UTF8, a special symbol, �, is printed for each invalid byte:

func main() {
  s := "中国"
  fmt.Println(s[:5])

  b := []byte{129, 130, 131}
  fmt.Println(string(b))
}

The output of the code above:

中��
���

s[:5] takes only the first two bytes of 国. These two bytes cannot form a valid UTF8 character, so two � are printed.

Another thing to watch out for is non-printable characters. A colleague once asked me why two strings that print the same content are not equal:

func main() {
  b1 := []byte{0xEF, 0xBB, 0xBF, 72, 101, 108, 108, 111}
  b2 := []byte{72, 101, 108, 108, 111}

  s1 := string(b1)
  s2 := string(b2)

  fmt.Println(s1)
  fmt.Println(s2)
  fmt.Println(s1 == s2)
}

Output:

Hello
Hello
false

Written out as raw bytes like this, the cause is probably obvious at a glance, but it took us some time to debug, because the string was read from a file that was UTF8-encoded with a BOM. A file saved in this format has the three bytes 0xEFBBBF prepended to its contents. String comparison compares the length and every byte, so the two strings differ. To make the problem harder to debug, the BOM is not displayed when the file is viewed.
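One possible way to guard against this is to strip a leading BOM before comparing, and to print suspicious strings with the %q verb, which escapes non-printable characters. This is only a sketch; the constant bom and the helper stripBOM are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// bom is the byte order mark U+FEFF, which UTF-8 encodes as EF BB BF.
const bom = "\uFEFF"

// stripBOM removes a single leading BOM from s, if present.
// (Hypothetical helper for illustration.)
func stripBOM(s string) string {
	return strings.TrimPrefix(s, bom)
}

func main() {
	s1 := string([]byte{0xEF, 0xBB, 0xBF, 72, 101, 108, 108, 111})
	s2 := "Hello"

	fmt.Println(s1 == s2)           // false
	fmt.Println(stripBOM(s1) == s2) // true
	fmt.Printf("%q\n", s1)          // %q escapes the otherwise invisible BOM
}
```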

4 Compiler optimization

Many scenarios convert a []byte to a string. For performance reasons, if the converted string is only used temporarily, the conversion does not copy memory, and the resulting string refers directly to the memory of the slice. The compiler recognizes the following scenarios:

  • Map lookup: m[string(b)];
  • String concatenation: "<" + string(b) + ">";
  • String comparison: string(b) == "foo".

Because the string is only used temporarily, the slice cannot change during that time, so this is safe.
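The three patterns look like this in practice. The sketch only shows the shape of the code; whether the copy is actually avoided depends on the compiler version and is not verified here. The helper names lookup and wrap are hypothetical.

```go
package main

import "fmt"

// lookup finds key in m using the m[string(b)] pattern: the converted
// string is used only for the lookup. (Hypothetical helper.)
func lookup(m map[string]int, key []byte) (int, bool) {
	v, ok := m[string(key)]
	return v, ok
}

// wrap uses the concatenation pattern "<" + string(b) + ">".
// (Hypothetical helper.)
func wrap(b []byte) string {
	return "<" + string(b) + ">"
}

func main() {
	m := map[string]int{"foo": 1}
	b := []byte("foo")

	fmt.Println(lookup(m, b))       // 1 true
	fmt.Println(wrap(b))            // <foo>
	fmt.Println(string(b) == "foo") // true (comparison pattern)
}
```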

Conclusion

Strings are one of the most frequently used basic types, and knowing how they work helps you write better code and solve problems faster.

References

  1. Go Expert Programming
  2. Every programmer must know Unicode, https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
  3. You don't know Go, GitHub: https://github.com/darjun/you-dont-know-go

About me

My blog: https://darjun.github.io

You are welcome to follow my WeChat official account [GOUPUP] to learn and make progress together ~