
Introduction

String is a basic data type provided by the Go language and is used almost everywhere in day-to-day programming. This article covers string-related knowledge to help you understand and use strings better.

Underlying structure

The underlying structure of a string is defined in the file string.go in the runtime package:

// src/runtime/string.go
type stringStruct struct {
  str unsafe.Pointer
  len int
}
  • str: A pointer to the memory address where the actual string content is stored.
  • len: The length of the string. As with slices, we can use the len() function to get this value in code. Note that len stores the number of bytes, not the number of characters, so for characters that are not single-byte encoded the result may be surprising. Multi-byte characters are described in detail later.

For the string Hello , the actual underlying structure is as follows:

str points to the codes corresponding to the characters: H corresponds to the code 72, e corresponds to 101, and so on.

We can use the following code to output the underlying structure of the string and each byte stored:

package main

import (
  "fmt"
  "unsafe"
)

type stringStruct struct {
  str unsafe.Pointer
  len int
}

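func main() {
  s := "Hello"
  // Reinterpret the string header to expose its pointer and length.
  fmt.Println(*(*stringStruct)(unsafe.Pointer(&s)))

  // Index by position to read the raw bytes; for-range would yield runes instead.
  for i := 0; i < len(s); i++ {
    fmt.Println(s[i])
  }
}

Run output (the pointer value varies between builds and runs):

{0x8edaff 5}
72
101
108
108
111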

Since the runtime.stringStruct structure is not exported, we cannot use it directly. So I manually defined a stringStruct structure in the code, whose fields are exactly the same as those of runtime.stringStruct.

Basic operation

Create

There are two basic ways to create a string, using var definition and string literal:

var s1 string
s2 := "Hello World!"

Note that var s1 string gives the string its zero value, and the zero value of a string is the empty string, that is, "". A string cannot be nil.
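As a minimal sketch of the zero-value behaviour just described (the helper name isEmpty is hypothetical and only used for illustration):

```go
package main

import "fmt"

// isEmpty reports whether s equals the zero value of string, i.e. "".
// (Hypothetical helper, for illustration only.)
func isEmpty(s string) bool {
  return s == ""
}

func main() {
  var s1 string // zero value, no explicit initialisation

  fmt.Println(isEmpty(s1)) // true
  fmt.Println(len(s1))     // 0
  fmt.Printf("%q\n", s1)   // ""
  // Writing s1 == nil would not compile: a string can never be nil.
}
```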

String literals can be defined using double quotes or backquotes. Special characters that appear in double quotes need to be escaped, but they do not need to be escaped in backquotes:

s1 := "Hello \nWorld"
s2 := `Hello
World`

In the code above, the line break in s1 has to be written with the escape sequence \n, while in s2 the line break is simply typed. Because a backquoted literal is exactly what we see in the code, it is often used for large blocks of text (usually with line breaks) or text with many special characters. When using backquotes, also pay attention to leading spaces on the lines after the first one:

package main

import "fmt"

func main() {
  s := `hello
  world`

  fmt.Println(s)
}

Two spaces were added before "world" on the second line, perhaps just for indentation and aesthetics. These spaces are nevertheless part of the string. If this is not intentional, it may cause confusion. The output of the above code:

hello
  world
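The difference in escape handling between the two literal forms can be sketched as follows. The function name samples is an assumption made for this illustration:

```go
package main

import "fmt"

// samples returns the same source text written as an interpreted
// (double-quoted) literal and as a raw (backquoted) literal.
func samples() (interpreted, raw string) {
  return "a\nb", `a\nb`
}

func main() {
  interpreted, raw := samples()

  fmt.Println(len(interpreted)) // 3: 'a', newline, 'b'
  fmt.Println(len(raw))         // 4: 'a', backslash, 'n', 'b'
  fmt.Printf("%q\n", raw)       // "a\\nb"
}
```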

Indexing and slicing

You can use the index to get the byte value stored at the corresponding position of the string, and use the slice operator to get a substring of the string:

package main

import "fmt"

func main() {
  s := "Hello World!"
  fmt.Println(s[0])

  fmt.Println(s[:5])
}

Output:

72
Hello

This was also introduced in the previous article, "Go slices you don't know". Note that slicing a string returns a string, not a slice.
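To make the result types visible, here is a small sketch that prints them with %T. The helper firstN is hypothetical:

```go
package main

import "fmt"

// firstN returns the first n bytes of s as a string.
// It assumes 0 <= n <= len(s); otherwise the slice expression panics.
func firstN(s string, n int) string {
  return s[:n]
}

func main() {
  s := "Hello World!"
  sub := firstN(s, 5)

  fmt.Printf("%T %v\n", s[0], s[0]) // uint8 72
  fmt.Printf("%T %v\n", sub, sub)   // string Hello
}
```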

String splicing

The simplest and most straightforward way to splice strings is the + operator, which can splice any number of strings. Its limitation is that the strings to be spliced must be known when the code is written. Another way is the Join() function in the strings package. This function accepts a string slice and a separator, and splices the elements of the slice into a single string separated by the separator:

func main() {
  s1 := "Hello" + " " + "World"
  fmt.Println(s1)

  ss := []string{"Hello", "World"}
  fmt.Println(strings.Join(ss, " "))
}

The above code first splices the strings with +, then stores the strings in a slice and splices them with strings.Join(). The result is the same. Note that when all the strings to be spliced are written in a single expression joined by +, Go first calculates the required space, allocates it in advance, and then copies each string into it. This behavior differs from many other languages, so splicing strings with + in Go carries no performance penalty, and thanks to internal optimization it can even outperform other methods. Of course, the premise is that the splicing is done in one go. The following code uses + repeatedly, which generates a large number of temporary string objects and hurts performance:

s := "hello"
var result string
for i := 1; i < 100; i++ {
  result += s
}
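When the pieces are only known at run time, the standard library's strings.Builder is a commonly used alternative to repeated +. The article itself does not cover it, so treat the following as a sketch under that assumption; repeatWithBuilder is a hypothetical helper:

```go
package main

import (
  "fmt"
  "strings"
)

// repeatWithBuilder appends s to a strings.Builder n times.
// It assumes n >= 0.
func repeatWithBuilder(s string, n int) string {
  var sb strings.Builder
  sb.Grow(len(s) * n) // reserve the final size up front to avoid regrowth
  for i := 0; i < n; i++ {
    sb.WriteString(s)
  }
  return sb.String()
}

func main() {
  // Same number of iterations as the loop above (i runs from 1 to 99).
  result := repeatWithBuilder("hello", 99)
  fmt.Println(len(result)) // 495
}
```

The builder keeps one growing buffer instead of creating a new string on every iteration.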

Let's test the performance difference between the methods. First define 3 functions: splicing multiple times with +, splicing once with +, and splicing with Join():

func ConcatWithMultiPlus() {
  var s string
  for i := 0; i < 10; i++ {
    s += "hello"
  }
}

func ConcatWithOnePlus() {
  s1 := "hello"
  s2 := "hello"
  s3 := "hello"
  s4 := "hello"
  s5 := "hello"
  s6 := "hello"
  s7 := "hello"
  s8 := "hello"
  s9 := "hello"
  s10 := "hello"
  s := s1 + s2 + s3 + s4 + s5 + s6 + s7 + s8 + s9 + s10
  _ = s
}

func ConcatWithJoin() {
  s := []string{"hello", "hello", "hello", "hello", "hello", "hello", "hello", "hello", "hello", "hello"}
  _ = strings.Join(s, "")
}

Then define the benchmarks in the file benchmark_test.go:

func BenchmarkConcatWithOnePlus(b *testing.B) {
  for i := 0; i < b.N; i++ {
    ConcatWithOnePlus()
  }
}

func BenchmarkConcatWithMultiPlus(b *testing.B) {
  for i := 0; i < b.N; i++ {
    ConcatWithMultiPlus()
  }
}

func BenchmarkConcatWithJoin(b *testing.B) {
  for i := 0; i < b.N; i++ {
    ConcatWithJoin()
  }
}

Run the test:

$ go test -bench .
BenchmarkConcatWithOnePlus-8            11884388               170.5 ns/op
BenchmarkConcatWithMultiPlus-8           1227411              1006 ns/op
BenchmarkConcatWithJoin-8                6718507               157.5 ns/op

It can be seen that splicing once with + and using the Join() function perform similarly, while splicing multiple times with + takes roughly 6 times as long per operation as the other two methods. Also note that in ConcatWithOnePlus() I first define 10 string variables and then splice them with +. If string literals were spliced directly with +, the compiler would fold them into a single string literal, and the results would not be comparable.

In the runtime package, the concatstrings() function handles string splicing with +:

// src/runtime/string.go
func concatstrings(buf *tmpBuf, a []string) string {
  idx := 0
  l := 0
  count := 0
  for i, x := range a {
    n := len(x)
    if n == 0 {
      continue
    }
    if l+n < l {
      throw("string concatenation too long")
    }
    l += n
    count++
    idx = i
  }
  if count == 0 {
    return ""
  }

  // If there is just one string and either it is not on the stack
  // or our result does not escape the calling frame (buf != nil),
  // then we can return that string directly.
  if count == 1 && (buf != nil || !stringDataOnStack(a[idx])) {
    return a[idx]
  }
  s, b := rawstringtmp(buf, l)
  for _, x := range a {
    copy(b, x)
    b = b[len(x):]
  }
  return s
}

Type conversion

We often need to convert a string to []byte, or a []byte back to a string. Each conversion involves a memory copy, so conversions should not be performed too frequently. A string is converted to []byte with the syntax []byte(str): a []byte with enough space is created first, and then the string content is copied into it.

func main() {
  s := "Hello"

  b := []byte(s)
  fmt.Println(len(b), cap(b))
}

Note that the printed cap may be larger than len; the extra capacity benefits the performance of subsequent appends.

A []byte is converted to a string with the syntax string(bs), and the process is similar.
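A short sketch showing that both conversions work on copies: modifying the []byte leaves the original string untouched. The helper upperFirst is hypothetical and only handles an ASCII first byte:

```go
package main

import "fmt"

// upperFirst converts s to []byte, modifies the copy, and converts it back.
// Only an ASCII lowercase first byte is changed.
func upperFirst(s string) string {
  if s == "" {
    return s
  }
  b := []byte(s) // copies the bytes of s
  if b[0] >= 'a' && b[0] <= 'z' {
    b[0] -= 'a' - 'A'
  }
  return string(b) // copies again into a new string
}

func main() {
  s := "hello"
  t := upperFirst(s)
  fmt.Println(s, t) // hello Hello
}
```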

Strings you don't know

1 Encoding

In the early days of computing, there were only single-byte encodings, the best known being ASCII (American Standard Code for Information Interchange). A single-byte encoding can encode at most 256 characters, which may be enough for English-speaking countries. However, as computers spread around the world, this was clearly not enough to encode the languages of other countries (Chinese characters being a typical example). To this end, the Unicode encoding scheme was proposed. Unicode provides a unified encoding scheme for the written symbols of all languages in the world. For Unicode-related knowledge, see the reference "Unicode knowledge that every programmer must know".

Many people don't know how Unicode relates to UTF8, UTF16, and UTF32. Unicode only specifies the code value corresponding to each character, and that value is rarely stored or transmitted directly. UTF8/UTF16/UTF32 define how these code values are stored in memory or files and how they are formatted for transmission over the network. For example, for the Chinese character "中" the Unicode code value is 00004E2D, and its encodings are as follows (the UTF16 and UTF32 values include a leading byte order mark):

UTF8 encoding: E4B8AD
UTF16BE encoding: FEFF4E2D
UTF16LE encoding: FFFE2D4E
UTF32BE encoding: 0000FEFF00004E2D
UTF32LE encoding: FFFE00002D4E0000

Strings in Go are stored in UTF-8 encoding. UTF8 is a variable-length encoding, which has the advantage of being compatible with ASCII. Multi-byte encodings are used for non-English characters, and shorter encodings are used for frequently used characters to improve encoding efficiency. The disadvantage is that the variable-length encoding prevents us from directly and intuitively determining the character length of a string. Common Chinese characters are encoded in 3 bytes, such as "中" above. Characters outside the Basic Multilingual Plane need 4 bytes; for example, "𠀀" (U+20000) is encoded as F0A08080. Being rare does not by itself imply a longer encoding: the UTF-8 encoding of "魋" is still 3 bytes, E9AD8B.

What the len() function returns is the encoded byte length, not the character length. This is very important when using non-ASCII characters:

func main() {
  s1 := "Hello World!"
  s2 := "你好，中国" // note the full-width comma, which also takes 3 bytes

  fmt.Println(len(s1))
  fmt.Println(len(s2))
}

Output:

12
15

That "Hello World!" has 12 characters is easy to understand. "你好，中国" has 5 characters (four Chinese characters and one full-width comma), each occupying 3 bytes, so the output is 15.

For strings that contain non-ASCII characters, we can use the RuneCountInString() function in the unicode/utf8 package of the standard library to get the actual number of characters:

func main() {
  s1 := "Hello World!"
  s2 := "你好，中国" // with a full-width comma

  fmt.Println(utf8.RuneCountInString(s1)) // 12
  fmt.Println(utf8.RuneCountInString(s2)) // 5
}

For ease of understanding, the underlying structure diagram of the string "中国" is given below:

2 Indexing and traversal

Indexing a string yields the byte value at the corresponding position. If the position falls in the middle of a multi-byte encoding, the returned byte value may not be a legal encoding on its own:

s := "中国"
fmt.Println(s[0])

As mentioned earlier, the UTF8 encoding of "中" is E4B8AD, so s[0] takes the first byte value, and the result is 228 (the value of hexadecimal E4).
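Whether a byte index falls on the first byte of a character can be checked with utf8.RuneStart from the standard library. A minimal sketch; the helper atRuneStart is hypothetical:

```go
package main

import (
  "fmt"
  "unicode/utf8"
)

// atRuneStart reports whether byte index i in s is the first byte of a
// UTF-8 encoded character. It assumes 0 <= i < len(s).
func atRuneStart(s string, i int) bool {
  return utf8.RuneStart(s[i])
}

func main() {
  s := "中国"
  for i := 0; i < len(s); i++ {
    // Prints the index, the byte value, and whether a character starts here.
    fmt.Println(i, s[i], atRuneStart(s, i))
  }
}
```

For "中国" only indices 0 and 3 are character starts; the other four bytes are continuation bytes.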

To make traversing strings convenient, the for-range loop in Go has special support for multi-byte encodings. The index returned on each iteration is the byte position where the character begins, and the value is the code value of the character:

func main() {
  s := "Go 语言"

  for index, c := range s {
    fmt.Println(index, c)
  }
}

So when multi-byte characters are encountered, the indexes are not contiguous. The character "语" above occupies 3 bytes, so the index of "言" is the index of "语", 3, plus its byte count of 3, which gives 6. The output of the above code is as follows:

0 71
1 111
2 32
3 35821
6 35328

We can also output in character form:

func main() {
  s := "Go 语言"

  for index, c := range s {
    fmt.Printf("%d %c\n", index, c)
  }
}

Output:

0 G
1 o
2 
3 语
6 言

According to this method, we can write a simple RuneCountInString() function, call it Utf8Count :

func Utf8Count(s string) int {
  var count int
  for range s {
    count++
  }
  return count
}

fmt.Println(Utf8Count("中国")) // 2
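When characters need to be addressed by position rather than by byte offset, one option is to convert the string to []rune first. This is a sketch with a hypothetical helper runeAt; note that the conversion copies and decodes the whole string:

```go
package main

import "fmt"

// runeAt returns the i-th character of s, counting characters, not bytes.
// It assumes 0 <= i < number of characters in s.
func runeAt(s string, i int) rune {
  return []rune(s)[i]
}

func main() {
  s := "Go 语言"
  fmt.Printf("%c\n", runeAt(s, 3)) // 语
  fmt.Println(len([]rune(s)))      // 5
}
```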

3 Garbled and unprintable characters

If an illegal UTF-8 encoding appears in a string, a specific symbol (�) is output for each illegally encoded byte when printing:

func main() {
  s := "中国"
  fmt.Println(s[:5])

  b := []byte{129, 130, 131}
  fmt.Println(string(b))
}

The output above:

中��
���

Because the encoding of "国" has 3 bytes and s[:5] only takes the first two of them, these two bytes cannot form a legal UTF8 character, so two � symbols are output.

In addition, we need to be wary of non-printable characters. A colleague once asked me about the following problem: two strings print the same content, but they are not equal:

func main() {
  b1 := []byte{0xEF, 0xBB, 0xBF, 72, 101, 108, 108, 111}
  b2 := []byte{72, 101, 108, 108, 111}

  s1 := string(b1)
  s2 := string(b2)

  fmt.Println(s1)
  fmt.Println(s2)
  fmt.Println(s1 == s2)
}

Output:

Hello
Hello
false

Here I wrote out the internal bytes of the strings directly, so the cause may be obvious at a glance. But when we ran into this problem, it took some effort to debug, because the string was read from a file that was saved as UTF8 with BOM. A file in this format starts with the 3-byte header 0xEF 0xBB 0xBF. String comparison compares the length and every byte. What made the problem harder to debug is that the BOM header is not displayed when viewing the file.
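One way to guard against this is to strip a leading BOM before comparing. A minimal sketch; stripBOM and utf8BOM are hypothetical names introduced here:

```go
package main

import (
  "fmt"
  "strings"
)

// utf8BOM is U+FEFF, whose UTF-8 encoding is the bytes EF BB BF.
const utf8BOM = "\uFEFF"

// stripBOM removes a single leading UTF-8 BOM from s, if present.
func stripBOM(s string) string {
  return strings.TrimPrefix(s, utf8BOM)
}

func main() {
  s1 := string([]byte{0xEF, 0xBB, 0xBF, 72, 101, 108, 108, 111})
  s2 := "Hello"

  fmt.Println(len(s1), len(s2))   // 8 5
  fmt.Println(s1 == s2)           // false
  fmt.Println(stripBOM(s1) == s2) // true
}
```

Printing the lengths, or printing with the %q verb, is a quick way to reveal hidden bytes like these while debugging.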

4 Compilation optimization

Converting []byte to string normally copies memory, but for performance reasons the compiler optimizes certain cases. If the converted string is only used temporarily, the conversion does not perform a memory copy, and the resulting string points directly at the memory of the slice. The compiler recognizes the following scenarios:

  • map search: m[string(b)] ;
  • String splicing: "<" + string(b) + ">" ;
  • String comparison: string(b) == "foo" .

Because the string is only used temporarily and the slice does not change during that time, there is no problem with this usage.
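The recognised forms look like this in ordinary code. The sketch only shows how the conversions are written inline; it does not measure allocations, and whether the copy is actually elided depends on the compiler version. The helper names lookup and isFoo are hypothetical:

```go
package main

import "fmt"

// lookup finds the value for a []byte key in a map keyed by string.
// The conversion is written inline in the index expression, which is the
// form the map search scenario refers to.
func lookup(m map[string]int, key []byte) (int, bool) {
  v, ok := m[string(key)]
  return v, ok
}

// isFoo compares a []byte with a constant string via an inline conversion.
func isFoo(b []byte) bool {
  return string(b) == "foo"
}

func main() {
  m := map[string]int{"foo": 1}
  fmt.Println(lookup(m, []byte("foo"))) // 1 true
  fmt.Println(isFoo([]byte("bar")))     // false
}
```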

Summary

String is one of the most frequently used basic types. Being familiar with it helps us write better code and solve problems.

References

  1. "Go Expert Programming"
  2. Unicode knowledge that every programmer must know: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
  3. You don't know Go, GitHub: https://github.com/darjun/you-dont-know-go

About me

My blog: https://darjun.github.io

Welcome to follow my WeChat public account [GoUpUp], learn together and make progress together~

