Chapter 7 - Strings (Code point view)

This article is my summary of the original chapter. Because I have condensed and paraphrased the material, corrections are welcome if anything is inaccurate, and so are questions and discussion about the content.

Sometimes we need to work directly with the underlying code points rather than with Character values, for several reasons.

First, sometimes code points are what we really need, such as when rendering UTF-8 encoded web pages or interacting with non-Swift APIs. For example, let's look at how NSCharacterSet can be combined with strings in Swift. Earlier we said that NSString works in UTF-16 code units, so if you want to split strings using NSCharacterSet, you need to do it in the UTF-16 view:

import Foundation

extension String {
    // Split wherever a UTF-16 code unit is not a member of the given set.
    func words(splitBy: NSCharacterSet = .alphanumericCharacterSet()) -> [String] {
        return self.utf16.split {
            !splitBy.characterIsMember($0)
        }.flatMap(String.init)
    }
}

let s = "Wow! This contains _all_ kinds of things like 123 and \"quotes\"?"
print(s.words())

// Output result:
// ["Wow", "This", "contains", "all", "kinds", "of", "things", "like", "123", "and", "quotes"]

Let's look at how this code works. It calls split to break the string's UTF-16 view into pieces, splitting wherever a code unit is not a letter or a digit. The result is an array of String.UTF16View slices, which are then converted back to String values using flatMap and String.init. That initializer can fail, because a slice boundary may fall in the middle of a character, in which case it returns nil. Using flatMap therefore also filters out all nil elements for us, as detailed in The Optional Type Technique Tour.
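The split-then-convert pattern described above can be sketched for a current Swift toolchain as follows. This is only an illustration under stated assumptions: `isASCIIAlphanumeric` and `asciiWords` are hypothetical names that do not appear in the original, only ASCII letters and digits are recognized (so no Foundation is needed), and the slices are decoded with `String(decoding:as:)`, which substitutes U+FFFD for invalid input instead of returning nil.

```swift
// Hypothetical helper: true for the UTF-16 code units of ASCII digits and letters.
func isASCIIAlphanumeric(_ unit: UInt16) -> Bool {
    switch unit {
    case 0x30...0x39, 0x41...0x5A, 0x61...0x7A: return true
    default: return false
    }
}

extension String {
    // Hypothetical counterpart of words(): split the UTF-16 view on every
    // code unit that is not ASCII alphanumeric, then decode each slice.
    func asciiWords() -> [String] {
        return utf16
            .split { !isASCIIAlphanumeric($0) }
            .map { String(decoding: $0, as: UTF16.self) }
    }
}

let sample = "Wow! This contains _all_ kinds of things like 123 and \"quotes\"?"
print(sample.asciiWords())
```

Because every non-ASCII code unit acts as a separator here, a slice can never end inside a surrogate pair, which is why the non-failable decoding step is enough for this sketch.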

If you replace self.utf16 with another view, such as self.utf8 or self.unicodeScalars, the code will not compile, because characterIsMember expects a UTF-16 code unit.

Another reason to use code points instead of characters is that code points are much faster to process than characters. This is because characters need to combine multiple code points, which requires constantly looking ahead to see if there are any that can be combined. We’ll show this speed difference later in the “Performance” section.
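To make the "combining" point concrete, here is a small sketch (assuming a current Swift toolchain) that counts the same string in each view. The flag is a single Character built from two code points, so the character view has to look past the first code point before it knows where the character ends.

```swift
// One flag character made of two regional indicator code points.
let flag = "\u{1F1EF}\u{1F1F5}"

print(flag.count)                 // number of Characters
print(flag.unicodeScalars.count)  // number of code points
print(flag.utf16.count)           // number of UTF-16 code units
print(flag.utf8.count)            // number of UTF-8 code units
```

The four counts are 1, 2, 4 and 8 respectively: each regional indicator lies outside the Basic Multilingual Plane, so it needs a surrogate pair in UTF-16 and four bytes in UTF-8.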

Finally, the UTF-16 view has an advantage that no other view has: it supports random access. String.UTF16View.Index has been extended to conform to the RandomAccessIndexType protocol. This is possible because, as we saw in the previous section, the String type stores its contents internally as UTF-16. Random access means that the Nth UTF-16 code unit is always at the Nth position of the buffer, regardless of whether the string contains non-ASCII characters.
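The "Nth code unit at the Nth position" idea can be illustrated with a short sketch. As an assumption made only for this example, the code units are copied into an Array, which gives constant-time positional access on any toolchain without relying on the view's own index conformance.

```swift
// "a", an emoji outside the Basic Multilingual Plane, then "b".
let mixed = "a\u{1F600}b"

// Copy the UTF-16 code units into an array for O(1) positional access.
let units = Array(mixed.utf16)

print(units.count)                      // 4: "a", two surrogates, "b"
print(String(units[1], radix: 16))      // d83d (high surrogate)
print(String(units[2], radix: 16))      // de00 (low surrogate)
print(units[3] == UInt16(ascii("b")))   // true

// Tiny hypothetical helper to get the ASCII value of a one-character literal.
func ascii(_ s: String) -> UInt8 { return s.utf8.first! }
```

The emoji occupies positions 1 and 2 as a surrogate pair, yet "b" is still found directly at position 3 with no scanning.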

You might think that random access is rarely useful, and that most of the time strings only need linear access. But some algorithms rely on random access to be efficient. For example, the Boyer-Moore string search algorithm relies on random access so that it can skip multiple characters at once. You can also take advantage of this in your own algorithms, such as:

// The search method is not implemented, so this code will not compile
let greeting = "Hello, world!"
if let idx = greeting.utf16.search("world".utf16)?.samePositionIn(greeting) {
    // print(greeting[idx..
}
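Since search is left unimplemented in the text, here is one possible sketch of it for a current Swift toolchain. It is an assumption, not the original author's implementation: it uses a naive scan rather than Boyer-Moore, and it uses the newer spelling samePosition(in:) to convert the index.

```swift
extension String.UTF16View {
    // Hypothetical helper: returns the index of the first occurrence of
    // `pattern`, compared code unit by code unit, or nil if there is none.
    func search(_ pattern: String.UTF16View) -> String.Index? {
        if pattern.isEmpty { return startIndex }
        var start = startIndex
        while start != endIndex {
            // Compare the remainder of the view against the pattern.
            if self[start...].starts(with: pattern) { return start }
            formIndex(after: &start)
        }
        return nil
    }
}

let greeting = "Hello, world!"
if let idx = greeting.utf16.search("world".utf16)?.samePosition(in: greeting) {
    print(greeting[idx...])  // prints "world!"
}
```

A real Boyer-Moore version would precompute a skip table from the pattern and jump ahead by several code units at a time, which is where constant-time index movement pays off.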

However, this efficiency gain or convenience comes at a cost. Your code is no longer guaranteed to handle Unicode correctly, so the search in the following assertion finds nothing:

let text = "Look up your Pok\u{0065}\u{0301}mon in a Pokedex."
assert(text.utf16.search("Pokémon".utf16) == nil)

In theory, the string Pok\u{0065}\u{0301}mon is exactly the same as the string "Pokémon", but the search method here returns nil. The text spells é as an e followed by a combining accent, which does not match a precomposed é when the two are compared code unit by code unit.
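The gap between string equality and code-unit equality can be checked directly. The following is a minimal sketch for a current Swift toolchain; the variable names are only for illustration.

```swift
// The same word spelled with a combining accent and with a precomposed é.
let decomposed = "Pok\u{0065}\u{0301}mon"
let precomposed = "Pok\u{00E9}mon"

// String comparison uses Unicode canonical equivalence.
print(decomposed == precomposed)                              // true

// The UTF-16 views differ in both length and content.
print(decomposed.utf16.count, precomposed.utf16.count)        // 8 7
print(decomposed.utf16.elementsEqual(precomposed.utf16))      // false
```

Any algorithm that compares raw code units, including a UTF-16 search, sees these as two different sequences even though the strings compare equal.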

Unicode classifies the diacritics that combine with letters as alphanumeric too, so the following line of code fares better:

print(text.words())

// Output result:
// ["Look", "up", "your", "Pokémon", "in", "a", "Pokedex"]
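The same behavior can be reproduced on a current Swift toolchain with Foundation's CharacterSet. Note that this sketch swaps in a different technique from the one above: it splits the Unicode scalar view rather than the UTF-16 view, because CharacterSet.contains takes a Unicode scalar. The name modernWords is hypothetical.

```swift
import Foundation

extension String {
    // Hypothetical counterpart of words(): split on every Unicode scalar
    // that is not in CharacterSet.alphanumerics (letters, marks, numbers).
    func modernWords() -> [String] {
        return unicodeScalars
            .split { !CharacterSet.alphanumerics.contains($0) }
            .map { String(String.UnicodeScalarView($0)) }
    }
}

let pokeText = "Look up your Pok\u{0065}\u{0301}mon in a Pokedex."
print(pokeText.modernWords())
```

The combining accent U+0301 is a mark, so it stays attached to the preceding e and the accented word comes out in one piece.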
