Lesson 16 — Common Methods of Strings (2)



0x00 Review and opening

Last time we looked at the common methods for modifying Rust strings. This lesson covers the methods for accessing them. This is the fourth installment in this series on Rust strings; later, if I have time, I'll go into more detail about strings.

0x01 Unicode and UTF-8

The most common character encoding in computing is probably ASCII, but ASCII only covers the range 0x00 to 0x7F and cannot represent Chinese characters, the scripts of minority languages, and so on. Encodings such as GB2312 and GB18030 appeared to fill that gap. To unify character encoding, the International Organization for Standardization (ISO) developed the Universal Coded Character Set, commonly referred to as the Unicode character set. It aims to cover the characters of all the world's writing systems. The code points usable for characters lie in two ranges, 0x0000 to 0xD7FF and 0xE000 to 0x10FFFF; the gap between them is reserved for surrogates.
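As a small illustration of those two ranges, the sketch below checks a few values with the standard library's char::from_u32, which returns None for anything a Rust char cannot hold. The helper name is_scalar_value is made up for this example.

```rust
/// Hypothetical helper: true if `code` is a value that a Rust `char` can hold.
fn is_scalar_value(code: u32) -> bool {
    char::from_u32(code).is_some()
}

fn main() {
    // The end points of both ranges are accepted.
    for code in [0x0000u32, 0xD7FF, 0xE000, 0x10FFFF] {
        println!("{:#X}: {}", code, is_scalar_value(code));
    }
    // Values in the gap (0xD800 to 0xDFFF) and beyond 0x10FFFF are rejected.
    for code in [0xD800u32, 0xDFFF, 0x110000] {
        println!("{:#X}: {}", code, is_scalar_value(code));
    }
}
```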

Storing every Unicode code point at a fixed width would take four bytes per character, which wastes space for text that is mostly ASCII. UTF-8 is a simple, space-efficient encoding that avoids this. The String and &str types in Rust use UTF-8 to represent text. UTF-8 is a variable-length encoding whose code unit is 1 byte; it encodes each code point into 1 to 4 bytes according to the rules in the following table:

UTF-8 encoding (1 to 4 bytes)          Code point bits            Unicode range
0xxxxxxx                               0bxxxxxxx                  0x00~0x7F
110xxxxx 10aaaaaa                      0bxxxxxaaaaaa              0x80~0x7FF
1110xxxx 10aaaaaa 10bbbbbb             0bxxxxaaaaaabbbbbb         0x800~0xFFFF
11110xxx 10aaaaaa 10bbbbbb 10cccccc    0bxxxaaaaaabbbbbbcccccc    0x10000~0x10FFFF

This explains why common Chinese characters take up three bytes while English letters and Arabic numerals take up one byte: the former fall in the range 0x800~0xFFFF, the latter in 0x00~0x7F.
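The bit layout in the table can be written out by hand. The following is only a sketch for illustration (the function name encode_utf8_by_hand is made up here); in real code the standard library's char::encode_utf8 does this job, and main compares the two.

```rust
/// Hypothetical helper: encodes one character into UTF-8 bytes by following
/// the bit layout of the table (1 to 4 bytes depending on the code point).
fn encode_utf8_by_hand(c: char) -> Vec<u8> {
    let cp = c as u32;
    if cp <= 0x7F {
        // 0xxxxxxx
        vec![cp as u8]
    } else if cp <= 0x7FF {
        // 110xxxxx 10aaaaaa
        vec![0b1100_0000 | (cp >> 6) as u8, 0b1000_0000 | (cp & 0x3F) as u8]
    } else if cp <= 0xFFFF {
        // 1110xxxx 10aaaaaa 10bbbbbb
        vec![
            0b1110_0000 | (cp >> 12) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ]
    } else {
        // 11110xxx 10aaaaaa 10bbbbbb 10cccccc
        vec![
            0b1111_0000 | (cp >> 18) as u8,
            0b1000_0000 | ((cp >> 12) & 0x3F) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ]
    }
}

fn main() {
    for c in ['a', '©', '汉', '😃'] {
        let by_hand = encode_utf8_by_hand(c);
        // Reference result from the standard library.
        let mut buf = [0u8; 4];
        let std_bytes = c.encode_utf8(&mut buf).as_bytes();
        assert_eq!(by_hand.as_slice(), std_bytes);
        println!("{} -> {:02X?}", c, by_hand);
    }
}
```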

The following is an example of UTF-8 encoding:

UTF-8 encoding (1 to 4 bytes)          Character    Code point
01100001                               a            0b1100001 == 0x61
11000010_10101001                      ©            0b00010_101001 == 0xA9
11100110_10110001_10001001             汉           0b0110_110001_001001 == 0x6C49
11110000_10011111_10011000_10000011    😃           0b000_011111_011000_000011 == 0x1F603

The example code is as follows:

    println!("***************1. Encoding*****************");
    let a = "a";
    let b = "©";
    let c = "汉";
    let d = "😃";
    // size_of_val on a &str reports the number of UTF-8 bytes it holds
    println!("a takes {} byte(s)", std::mem::size_of_val(a));
    println!("b takes {} byte(s)", std::mem::size_of_val(b));
    println!("c takes {} byte(s)", std::mem::size_of_val(c));
    println!("d takes {} byte(s)", std::mem::size_of_val(d));

    println!("\n***************1. Encoding (binary)*****************");
    // Print every byte of each string as 8 binary digits
    for x in a.bytes() {
        print!("{:08b}_", x);
    }
    println!();
    for x in b.bytes() {
        print!("{:08b}_", x);
    }
    println!();
    for x in c.bytes() {
        print!("{:08b}_", x);
    }
    println!();
    for x in d.bytes() {
        print!("{:08b}_", x);
    }

    println!("\n***************1. Encoding (Unicode code points)*****************");
    // Casting a char to u32 yields its code point
    println!("{:X}", 'a' as u32);
    println!("{:X}", '©' as u32);
    println!("{:X}", '汉' as u32);
    println!("{:X}", '😃' as u32);

Code run result:

    ***************1. Encoding*****************
    a takes 1 byte(s)
    b takes 2 byte(s)
    c takes 3 byte(s)
    d takes 4 byte(s)

    ***************1. Encoding (binary)*****************
    01100001_
    11000010_10101001_
    11100110_10110001_10001001_
    11110000_10011111_10011000_10000011_
    ***************1. Encoding (Unicode code points)*****************
    61
    A9
    6C49
    1F603

The encoding and decoding rules themselves are not covered in detail here; interested readers can look them up. If enough comments ask for it, I'll write an extra chapter on encoding and decoding.

0x02 Access to a string

When accessing strings in Rust, keep the following two points in mind:

1. A string is a sequence of bytes encoded in UTF-8, which is a variable-length encoding, so characters cannot be accessed directly by index.

2. String operations come in two flavors: byte-wise and character-wise. The bytes() method returns an iterator over the string's bytes, and the chars() method returns an iterator over its characters.
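To make the two points concrete, here is a minimal sketch. It collects both views of a short string and shows that slicing by byte range only succeeds on character boundaries. The helper slice_bytes is a made-up wrapper around the standard str::get, used here only so the behavior is easy to check.

```rust
/// Hypothetical helper: returns the slice for a byte range, or None if the
/// range is out of bounds or does not fall on character boundaries.
fn slice_bytes(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let s = "汉a";
    // `s[0]` would not compile: a `str` cannot be indexed with an integer.

    // Byte view: three bytes for '汉', then one byte for 'a'.
    let bytes: Vec<u8> = s.bytes().collect();
    // Character view: two characters.
    let chars: Vec<char> = s.chars().collect();
    println!("bytes: {:?}", bytes);
    println!("chars: {:?}", chars);

    // Bytes 0..3 cover exactly '汉', so this slice is valid.
    println!("{:?}", slice_bytes(s, 0, 3));
    // Bytes 0..1 would cut '汉' apart, so no slice is returned.
    println!("{:?}", slice_bytes(s, 0, 1));
    // is_char_boundary reports whether a byte offset starts a character.
    println!("{}", s.is_char_boundary(1));
}
```

Using get instead of the s[0..1] syntax avoids a panic when the range does not line up with character boundaries.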

Length of string

The len() method returns the length in bytes, that is, the total number of bytes of all the characters in the string. To get the length in characters, use chars().count().

The example code is as follows:

Let string_length = "I'm learning Rust~"; println! Number of bytes (" "{}" : {} ", string_length, string_length. Len ()); println! Characters in length (" "{}" : {} ", string_length, string_length. Chars (). The count ());Copy the code

Code run result:

"I am learning Rust~" length of bytes: 20" I am learning Rust~" length: 10Copy the code
Accessing string elements

Because Rust strings are UTF-8 encoded, indexing directly into a string to get a single character is not allowed, so elements are accessed through iterators. The bytes() and chars() methods return byte and character iterators respectively, and the iterator's nth() method accesses an element by zero-based index. It returns an Option, which is None when the index is out of range.

The example code is as follows:

Let string_nth = "Rust Programming Fundamentals "; // Access the fifth character DBG! (string_nth.chars().nth(5)); // Access the fifth byte DBG! (string_nth.bytes().nth(5));Copy the code

Code run result:

[SRC \main.rs:45] string_nth.chars().nth(5) = Some(' procedure ',) [SRC \main.rs:47] string_nth.bytes().nth(5) = Some(188,)Copy the code

0x03 Summary

It took four articles just to briefly introduce Rust's strings, and it will take a lot of practice to really get comfortable with the string methods. There is much more to Rust strings and their methods, but given the limited space, I will leave strings here for now.

0x04 Resources

Unicode code ranges | Unicode Symbol Library (fuhaoku.net)

0x05 Source code for this section

016 · StudyRust – Code Cloud – Open Source China (gitee.com)

Next lesson: control flow.