Preface

What are characters? What is a string?

// Output a sentence
fmt.Println("Ab to eat")
/* Output result: ab eat */

/ / character
a := 'a'
Copy the code

Let’s take a look at the sentence “AB eats”. This sentence is a string. This string consists of four characters: ‘A’, ‘b’, ‘eat’, ‘eat’; So when taken separately, each word is a character, the character type is quoted in single quotes, and contains only one character; And strings, as the name implies, can be multiple strings together, double quotes; (We can think of a string as a container of double quotes, and think of it as a rectangular groove; You can think of characters as ping-pong balls. A character is a ping-pong ball. We can put ping-pong balls in the grooves.

[‘a’, ‘b’, ‘eat ‘,’ meal ‘]

  • Character: single quotes, holding exactly one character (it is the ping-pong ball itself, so it must have a value; an empty literal is a compile error)
  • String: double quotes (it is only the container, so it does not matter whether any ping-pong balls are in it; an empty string is valid and no error is reported)
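To make the two bullets concrete, here is a minimal sketch that prints the Go type behind each kind of literal. The helper name typeName is made up for this illustration; the int32 it reports for a character is the type that the rune section later in this chapter explains.

```go
package main

import "fmt"

// typeName is a hypothetical helper for this sketch: it returns the Go type
// of whatever value it is given.
func typeName(v interface{}) string {
	return fmt.Sprintf("%T", v)
}

func main() {
	c := 'a'    // single quotes: exactly one character (one ping-pong ball)
	s := "ab"   // double quotes: a container holding any number of characters
	empty := "" // an empty container is fine; an empty '' would not compile

	fmt.Println(typeName(c)) // int32
	fmt.Println(typeName(s)) // string
	fmt.Println(len(empty))  // 0
}
```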

The byte character type

As we said in the previous chapter, all of our data is stored in memory in binary form, and the character types are no exception. uint8 is an unsigned 8-bit (1-byte) integer type with 256 possible values, ranging from 0 to 255. So what does uint8 have to do with byte?

Remember floating-point numbers from the last chapter? Although what we see is a decimal number, the computer actually uses the IEEE 754 standard to convert it to binary before storing it in memory. In the same way, byte data is converted to binary by some rule and then stored. So before we talk about byte, rune and string, let's look at how a character is converted to binary; after that, these three types become easy to understand.

Encoding is the process of converting information from one form or format into another; in computing, the result is also simply called a code. What does that mean?

Our computer holds many "dictionaries". Take the ASCII dictionary as an example: the 26 uppercase letters on our keyboard, the 26 lowercase letters, the punctuation symbols and the digits 0-9 are all in it, and each character has its own position. The character 'a' sits at position 97, so to store it we convert 'a' to 97, convert 97 to binary, and put that in memory. To read it back, we take the binary from memory, convert it to 97, look up position 97 in the ASCII dictionary, and get the character 'a'. Simple, isn't it? This is what we call encoding (the process of converting information from one form or format to another).
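The round trip described above (character to position, position to binary, and back) can be sketched in a few lines of Go. The function names position, bits and charAt are invented for this example and are not part of any library.

```go
package main

import "fmt"

// position returns the character's place in the ASCII table.
// (Hypothetical helper: in Go a byte already is that number.)
func position(c byte) int {
	return int(c)
}

// bits shows the eight binary digits that end up in memory.
func bits(c byte) string {
	return fmt.Sprintf("%08b", c)
}

// charAt looks the position up again and returns the character.
func charAt(pos int) string {
	return string(rune(pos))
}

func main() {
	fmt.Println(position('a')) // 97
	fmt.Println(bits('a'))     // 01100001
	fmt.Println(charAt(97))    // a
}
```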

ASCII (American Standard Code for Information Interchange) is one of the older dictionaries, designed for modern English. Standard ASCII defines 128 characters, numbered 0 to 127, so every ASCII code fits in a single byte; 8-bit extensions of ASCII that add Western European letters use the full range 0 to 255, the same range as uint8. As mentioned above, uint8 and byte have the same size and storage: in Go, byte is simply another name (an alias) for uint8. The reason byte is treated as a character type is that the number stored in it corresponds to a position in the dictionary. To express that intent more clearly, Go provides the name byte for values that represent characters, even though the type is really uint8.

In short, a byte used as a character stores that character's position in the ASCII table.

// Create a byte holding the character 'a'
var a byte = 'a'
fmt.Println(a)
/* Output: 97 */

var b byte = ''
/* Compile error: empty rune literal or unescaped ' in rune literal */

var c byte = '措'
/* Compile error: constant 25514 overflows byte */

Our fmt.Println call prints the value of a, which is 97 rather than 'a'. So the value stored in a really is 97, and the 'a' we see on screen is just the symbol the computer looks up in the code table for display.

About the errors: a byte stores a character's number (position) in the ASCII table.

A byte is an unsigned integer with 256 possible values. The character '措' belongs to the Unicode standard (described below), where its number (position) is 25514. The largest number a byte can hold is 255, so trying to store 25514 in a byte fails with the error "constant 25514 overflows byte".
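The compile error only appears for constants. As a rough sketch of the same limit at run time, the hypothetical helper fitsInByte below checks whether a character number fits into a byte, and the last line shows that converting a variable to byte does not fail but silently keeps only the low 8 bits. The parameter type rune is the wider character type introduced in the next section.

```go
package main

import (
	"fmt"
	"math"
)

// fitsInByte is a hypothetical helper: it reports whether a character
// number lies in the range 0 to 255 that a byte can hold.
func fitsInByte(r rune) bool {
	return r >= 0 && r <= math.MaxUint8
}

func main() {
	r := '措' // character number 25514

	fmt.Println(fitsInByte('a')) // true
	fmt.Println(fitsInByte(r))   // false

	// Converting a variable (not a constant) compiles, but the value is
	// truncated to its low 8 bits: 25514 % 256 = 170.
	fmt.Println(byte(r)) // 170
}
```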

At this point we should understand the byte type, so let’s do a little summary:

  • byte: 1 byte, unsigned, 256 values ranging from 0 to 255
  • ASCII table: 128 characters, each of which fits in a byte; designed for English as used in the United States
  • Converting characters to binary through a code table is called encoding

The rune character type

There is one important point that you may already have noticed: ASCII is designed for English. Common symbols, letters and digits are covered by ASCII, but what about our Chinese characters? There are far more of them than 256 positions can hold.

There is no exact figure for the number of Chinese characters; it is roughly 100,000, of which only a few thousand are in daily use. According to statistics, the 1,000 most common characters cover about 92% of written material, 2,000 characters cover more than 98%, and 3,000 characters reach 99%; the results for simplified and traditional characters do not differ much. (Baidu Encyclopedia)

So ASCII is only suitable for English-speaking countries. To use a computer and display Chinese, we need an encoding standard that includes Chinese characters; this is why we said there are many dictionaries in a computer. To let computers display Chinese, China published the GB2312-80, GBK and GB18030 standards. But if every country with its own script publishes its own standard, a problem appears: if a program in our country uses GBK, its output and data turn into garbled text when run on a computer in another country, because the two computers use different encoding standards.

To solve this problem, an international organization created the Unicode standard (also translated as unified code, universal code or single code). Countries around the world add their characters to this one standard. It has 1,114,112 code points, enough to accommodate the characters of every country, and needs at most four bytes to store one character. Its first 128 code points are identical to ASCII, and the characters of other scripts follow. When Unicode is encoded as UTF-8, a common Chinese character takes 3 bytes, and Go uses UTF-8 by default. If a program only ever handles Chinese characters, Chinese punctuation, English letters and English punctuation (that is, text used by Chinese speakers), GBK is also an option: GBK encodes a Chinese character in two bytes, which takes less space than UTF-8. (What we usually call UTF-8 is, strictly speaking, the UTF-8 transformation format of the Unicode character set.)

UTF-8 (a UCS transformation format) works in units of 1 byte and is variable-length, so characters in different ranges are encoded with different numbers of bytes. For example, the ASCII characters at the start of Unicode are represented in a single byte each, while a Chinese character that needs three bytes gets three bytes. Does that make intuitive sense? Earlier, Unicode text was encoded as UCS-2 or UCS-4, with a fixed two or four bytes per character; the UTF formats are what made Unicode practical as a universal code. This is only a brief mention. If you are interested in character encoding, you can look up more material on your own; I will try to find time to write a dedicated article on it.

Back to the point: the rune character type

The byte type is essentially uint8 under a different name, storing the same data. What about our rune character type: 4 bytes, signed, an integer; isn't that the same as int32? Indeed, rune is an alias for int32. A rune stores a character's number (code point) in the Unicode standard. Unicode needs at most four bytes to represent a character and a rune is four bytes wide, so a rune can store and represent every character in Unicode.

// rune
var a rune = 'a'
fmt.Println(a)
/* Output: 97 */

var b rune = '措'
fmt.Println(b)
/* Output: 25514 */

So a rune can store the character '措', and it can also store 'a', which needs only 1 byte. Like byte, a rune literal cannot be empty: there must be something between the single quotes. If we only need characters from the ASCII standard, byte is enough; a rune takes four bytes, four times as much memory.

In plain terms, rune is a larger container that can hold a much larger range of characters. The character '措' has the number 25514, and a rune, with 4 bytes (32 bits), has more than enough room for it.

With byte as background, rune should be easy to follow, so here is a quick summary:

  • rune: 4 bytes (32 bits), an alias for int32
  • Unicode: uses at most 4 bytes to represent a character and comes in several encoding formats (UTF-8, UTF-16, UTF-32)
  • UTF-8: Go's default encoding; it uses three bytes to represent one Chinese character

The string type

A string is, by definition, multiple characters concatenated in double quotation marks.

// string type
var a string = "a措"
fmt.Println(a)
fmt.Println("Length:", len(a))
/* Output:
a措
Length: 4 */

Printing the content of a makes sense, but is the length of 4 a bit confusing? With byte and rune in mind, let's come back to string and it becomes easy to understand.

UTF-8 uses different encoding lengths for characters in different ranges. The string variable a contains two characters, 'a' and '措'. 'a' falls in the range that fits in 1 byte, and, as mentioned above, a Chinese character takes 3 bytes in UTF-8. Think of the two characters as ping-pong balls: the 'a' ball is 1 byte and the '措' ball is 3 bytes. The double quotes are the groove: we put in the 'a' ball, then the '措' ball. len() in our code returns the length in bytes: 1 byte for 'a' plus 3 bytes for '措', so it prints 4. Seen this way, it is clear why strings are called strings: they are characters strung together.

The groove container we keep talking about is, in effect, an array. Arrays are the topic of the next article, so for now just think of it as a container.

"a措" → ['a', '措'] → [b1, b2, b3, b4]

  • "a措": the content of the string as we see it, after decoding
  • ['a', '措']: the way we picture it, as ping-pong balls stored in a groove
  • [b1, b2, b3, b4]: the actual value of the string variable a, where b1 is the byte that stores 'a' and b2, b3, b4 are the bytes that store '措'. b1 to b4 are the elements of the array. There are four byte elements, so its length is 4; len() returns the number of elements. Because the elements are of type byte, we call it a byte array, written []byte. This is only a brief introduction to arrays; the next chapter covers them in detail.
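The three views in the list above can be printed directly. This is only an illustrative sketch; byteValues is a hypothetical helper name, and the range loop is the standard Go way to walk a string character by character instead of byte by byte.

```go
package main

import "fmt"

// byteValues is a hypothetical helper: it exposes the bytes b1..bn
// that make up a string.
func byteValues(s string) []byte {
	return []byte(s)
}

func main() {
	s := "a措"

	// The byte view: four elements, so len(s) is 4.
	fmt.Println(len(s), byteValues(s)) // 4 [97 230 142 170]

	// The character view: range decodes UTF-8 and yields one rune per
	// character together with the byte offset where it starts.
	for offset, r := range s {
		fmt.Printf("byte offset %d: %c\n", offset, r)
	}
}
```

The loop prints offset 0 for 'a' and offset 1 for '措', which matches b1 holding 'a' and b2 to b4 holding '措'.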

The code above prints 4 as the length of variable a, which we now understand. But how do we get the number of characters?

a := "a措"

// Use a function from the unicode/utf8 package
b := utf8.RuneCountInString(a)
fmt.Println(b)
/* Output: 2 */

// Convert the string to a rune array
c := []rune(a)
fmt.Println(len(c))
/* Output: 2 */
  • utf8.RuneCountInString(a) counts the characters in a according to the UTF-8 rules and assigns the result, 2, to b, which is then printed: 2
  • c := []rune(a) converts the string to a rune array (more precisely, a slice) and assigns it to c
  • len(c) returns the number of elements in c, a rune array with two elements, so the printed result is 2

What we see as a string is the decoded content. The actual value of a string is a sequence of bytes ([]byte) holding the UTF-8 encoding of each character, and it is this sequence that is stored in memory. In the other direction, we take the bytes, decode them as UTF-8 to recover the characters, and display those characters as the string.
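The decoding direction can be sketched as follows: starting from raw bytes, utf8.DecodeRune from the standard library peels off one character at a time. The function name decodeAll is an assumption of this example, not something from the original text.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll is a hypothetical helper: it walks a UTF-8 byte sequence
// and returns the characters it contains.
func decodeAll(b []byte) []rune {
	var out []rune
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b) // size is how many bytes r occupied
		out = append(out, r)
		b = b[size:]
	}
	return out
}

func main() {
	raw := []byte{0x61, 0xE6, 0x8E, 0xAA} // the four bytes of "a措"

	chars := decodeAll(raw)
	fmt.Println(len(chars))    // 2
	fmt.Println(string(chars)) // a措

	// The built-in conversion performs the same decoding for display.
	fmt.Println(string(raw)) // a措
}
```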

Now that we’re done with strings, let’s make a quick summary:

  • string: multiple characters strung together
  • The actual value of a string is a byte array ([]byte), so an empty string (a := "") is allowed: its value is [], an array with no elements, and no error is reported
  • UTF-8 converts each character's number into bytes, and those bytes are stored in the array; a []byte and a string can be converted to each other using UTF-8

That covers the three types in this chapter: byte, rune and string. (Characters and strings in Java work in essentially the same way; before Java 9, String stored its contents in a char[] array, and from Java 9 on it uses a byte[] array.)

The bool (Boolean) type

This is the simplest data type. It has only two fixed values, the predeclared constants true and false. The default (zero) value is false.

// Boolean variables
t := true
f := false
fmt.Println(t)
fmt.Println(f)
/* Output: true false */

// Create a variable and assign 0 to it
var a int32 = 0

// a == 1 does not hold, so t = false
t = a == 1
// a == 0 holds, so f = true
f = a == 0
fmt.Println(t)
fmt.Println(f)
/* Output: false true */

a == 1, with two equals signs, tests whether a is equal to 1. a = 1, with a single equals sign, is an assignment, as we said before. The exclamation mark ! means negation: 0 != 1 means "0 is not equal to 1", so this condition is true. This is only a quick mention of these symbols; they will be covered in more detail later, and for now you just need to know how to use them.

A bool can only be true or false. A condition that holds (1 == 1) gives true, and a condition that does not hold (0 > 1) gives false.

In memory, false is stored as 0000 0000 and true as 0000 0001. Although these binary values correspond to the decimal numbers 0 and 1, Go's bool only accepts true and false and does not let us compare a bool with an integer (t == 0 is a compile error). In this respect Go differs from some other languages, where a Boolean can be compared directly with a number (for example, true == 1 is allowed in JavaScript).

Summary of the bool type:

  • Storage: 1 byte, holding the value 0 or 1
  • Values: only true (1) and false (0); although the stored values are 0 and 1, a bool cannot be compared with an integer
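Because Go offers no conversion between bool and integers, any mapping to 0 and 1 has to be written out by hand. The sketch below shows one way to do that; boolToInt is a hypothetical helper, not a standard library function.

```go
package main

import "fmt"

// boolToInt is a hypothetical helper: Go has no built-in conversion
// from bool to a number, so the mapping is spelled out explicitly.
func boolToInt(b bool) int {
	if b {
		return 1
	}
	return 0
}

func main() {
	var zero bool // the zero value of bool
	t := 1 == 1   // a comparison produces a bool

	fmt.Println(zero)         // false
	fmt.Println(t)            // true
	fmt.Println(boolToInt(t)) // 1

	// fmt.Println(t == 1) would not compile: a bool cannot be compared
	// with an integer.
}
```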

Now that we are done with the basic data types, the next chapter moves on to arrays.