This article is a summary written after reading the original chapter. Since I have condensed and refined the material, corrections are welcome if anything is off. If you have any questions about the content, feel free to discuss.

Variable-Length Characters

At first, string encoding was simple. ASCII is a set of integers from 0 to 127; since 128 = 2^7, a character fits into seven bits, so storing one character per 8-bit byte even leaves a bit to spare. Because every character occupies exactly one byte, any character in a string can be accessed at random [1].

But non-English speakers need more than the 128 ASCII characters (think of Chinese characters). The ISO/IEC 8859 standards take advantage of the free eighth bit and add many symbols, but still not enough. Once all eight bits are used and some symbols still cannot be represented, we can either add more bits, say 16 bits per character, or make the number of bits per character variable. Unicode originally used a fixed length of 2 bytes, which allows 2^16 = 65,536 characters; that is still not enough, but growing to a fixed 4 bytes would be wasteful in the common case.

Before going any further, it is important to clarify a few concepts in Unicode encoding:

  • Character: A character is the smallest abstract unit of text. It has no fixed shape and no numeric value (e.g. A and € are both characters).

  • Character set: A character set is a collection of characters. For example, all Chinese characters form the Chinese character set; likewise there are English, Japanese, and other character sets.

  • Coded character set: A special kind of character set that assigns a unique number to each character. At the heart of the Unicode standard is the Unicode coded character set. For example, the character A is assigned the number 0041 (numbers in Unicode are always hexadecimal).

  • Code point: A code point is one of the numbers a coded character set can assign. The code point U+0041 corresponds to the character A. The coded character set defines the range of valid code points, but not every code point in that range has a character assigned to it.

  • Encoding form: An encoding form maps a code point to one or more code units. Common encoding forms are UTF-32, UTF-16, and UTF-8.

  • Code unit: The code unit is the basic unit of each encoding form. In UTF-32 a code unit is 32 bits; since the hexadecimal 00000041 is exactly 32 bits, UTF-32 is very simple: each code point maps to a single code unit with the same value. In UTF-16 a code unit is 16 bits, but that does not mean 00000041 necessarily maps to the two units 0000 and 0041; UTF-16 has its own mapping rules, as does UTF-8.

Take the letter A as an example. A is a character in the English character set. Its code point is U+0041; its code unit is 00000041 under UTF-32, 0041 under UTF-16, and 41 under UTF-8.

𐐀 is a character with code point U+10400. Its UTF-32 code unit is 00010400, its UTF-16 code units are D801 and DC00, and its UTF-8 code units are F0, 90, 90, and 80.
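We can check these counts in Swift. A minimal sketch (Swift 3-era syntax; the radix formatting is only there to show the code units in hex):

let deseret = "\u{10400}"   // 𐐀
print(deseret.unicodeScalars.count)                  // 1: a single code point
print(deseret.utf16.map { String($0, radix: 16) })   // ["d801", "dc00"]
print(deseret.utf8.map { String($0, radix: 16) })    // ["f0", "90", "90", "80"]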

Unicode currently uses a variable-width format for two reasons:

  1. A code point maps to a variable number of code units. As the example above shows, a single code point can map to anywhere from one to four code units under UTF-8.
  2. The number of code points that makes up a character is variable. Multiple code points can combine into a single character, as we will see in concrete examples below.

A Unicode scalar is any code point except the surrogate code points that UTF-16 uses in pairs to encode characters outside the basic plane. In Swift, a scalar is written in string literals as "\u{xxxx}", where xxxx is a hexadecimal number.
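For instance, a scalar literal can initialize Swift's UnicodeScalar type directly (a small sketch; note that a surrogate code point is rejected at compile time):

let eAcute: UnicodeScalar = "\u{E9}"   // é as a single Unicode scalar
print(eAcute.value)   // 233, i.e. 0xE9
// let surrogate: UnicodeScalar = "\u{D800}"   // does not compile: surrogates are not scalars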

As we said earlier, the number of code points that makes up a character is variable: a single character the user sees on screen may be composed of several code points. Most string-handling code is oblivious to this variable-width nature of Unicode, which leads to bugs. Swift takes great pains to handle Unicode as correctly as possible, or at least to surface the problem to the developer when correctness is at stake. This comes at a cost: rather than being a single collection, String offers several views of its contents. You can treat a string as a collection of Characters, as a collection of UTF-8 or UTF-16 code units, or as a collection of Unicode scalars. The Character view differs from the others in that it groups one or more code points into a single "grapheme cluster."

None of the views other than UTF-16 supports random access by subscript. In this chapter, we'll explore why some views are faster than others at handling large amounts of text, and look at some techniques for manipulating text and improving performance.
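As a taste of what that means in practice, here is a minimal sketch (Swift 3 syntax) of reaching the fourth character without an integer subscript:

let str = "Pokémon"
// str[3] does not compile; the character view must be walked by index:
let idx = str.characters.index(str.characters.startIndex, offsetBy: 3)
print(str.characters[idx])   // é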

Grapheme Clusters and Canonical Equivalence

To show how Swift and NSString handle Unicode differently, let's look at the character é. As a single code point it is U+00E9, but it can also be written as the letter e (U+0065) followed by the combining acute accent ́ (U+0301). Whichever representation you choose, the rendered result is é: to the user it is the same string, with the same length, 1. The two forms are "canonically equivalent" in Unicode.

Let’s take a concrete example in Swift where the two strings display exactly the same:

let single = "Pok\u{00E9}mon"
let double = "Pok\u{0065}\u{0301}mon"

print(single, double)
// Output: "Pokémon Pokémon"

We can also verify that the two strings compare equal and contain the same number of characters:

print(single == double)    // Output result: true
print(single.characters.count == double.characters.count)    // Output result: true

However, if you switch to the UTF-16 view, you can see the difference:

print(single.utf16.count)	// The output is 7
print(double.utf16.count)	// The output is 8

If you use NSString, not only do the character counts differ, the strings themselves compare as unequal:

let nssingle = NSString(characters: [0x00E9], length: 1)
let nsdouble = NSString(characters: [0x0065, 0x0301], length: 2)

print(nssingle == nsdouble)                         // Output: false
print(nssingle.isEqual(to: nsdouble as String))     // Output: false

The == operator, which compares two objects of type NSObject, is defined in terms of isEqual(_:):

func ==(lhs: NSObject, rhs: NSObject) -> Bool {
    return lhs.isEqual(rhs)
}

This is because NSString's comparison only checks the UTF-16 code units literally, without considering the "canonical equivalence" of combining character sequences. If you really want a canonical comparison, you need to use NSString's compare(_:) method. Didn't know that? Then you can expect plenty of subtle bugs in iOS development and in code that stores strings in databases.
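For example, with the nssingle and nsdouble values from above, compare(_:) reports the two spellings as equal (Swift 3 naming):

// compare(_:) takes canonical equivalence into account:
print(nssingle.compare(nsdouble as String) == .orderedSame)   // Output: true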

The advantage of comparing code units directly is speed; it is much faster than comparing characters. For example:

print(single.utf16.elementsEqual(double.utf16))     // The output is false

Not only can two code points be combined into one character; more than two can combine as well. For example, ọ̀ is the letter o combined with a grave accent "̀" (which looks like the fourth-tone mark in Chinese) and a dot below "̣". It can be expressed in four ways:

  1. The letter o precomposed with one of the two marks, followed by the other mark. That gives two ways.
  2. The three code points written out separately, starting with o; the two combining marks can come in either order. That gives two more ways.

Expressed in code:

// U+6F is the letter o, U+300 is the grave accent (the "fourth tone"), U+323 is the dot below
let chars: [Character] = [
    "\u{1ECD}\u{300}",       // U+1ECD is o precomposed with the dot below: (o + dot) + accent
    "\u{F2}\u{323}",         // U+F2 is o precomposed with the accent: (o + accent) + dot
    "\u{6F}\u{323}\u{300}",  // o + dot + accent
    "\u{6F}\u{300}\u{323}"   // o + accent + dot
]

for char in chars {
    print(char)
}

/* Output: ọ̀ ọ̀ ọ̀ ọ̀ */
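Since all four spellings are canonically equivalent, Swift's String also reports them equal to one another (a quick check using the chars array above):

let first = String(chars[0])
for char in chars {
    print(String(char) == first)   // Output: true, four times
}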

In fact, combining marks can be appended to this character indefinitely, and its length is still 1:

let many = "\u{1ECD}\u{300}\u{300}\u{300}\u{300}"
print(many.characters.count)   // Output: 1
print(many.utf8.count)   // Output: 11. U+1ECD takes 3 UTF-8 code units and each U+300 takes 2: 3 + 2 * 4 = 11
print(many)

/* Output: ọ̀̀̀̀ */

Emoji

Emoji aren't important, but they are fun, and the following question helps us understand how Unicode scalars combine into characters:

let emoji1 = "🇩🇪🇺🇸🇩🇪🇺🇸🇩🇪🇺🇸"
let emoji2 = "😂😂😂"

print(emoji1.characters.count)
print(emoji2.characters.count)

If you think the printed results are 6 and 3, you have been fooled: the answer is 1 and 3. Recall the ọ̀ character from before, which could be composed in four ways. A careful reader may wonder why "\u{300}\u{6F}\u{323}" (the accent placed first) is not a fifth way.

This is because Unicode designates some characters as base characters, and only a base character can be extended by the marks that follow it. The grapheme cluster we mentioned earlier is defined as "a base character followed by zero or more combining characters."
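We can verify this: a combining mark at the start of a string has no base character to attach to, so it forms its own degenerate cluster (a quick sketch):

let stranded = "\u{300}\u{6F}\u{323}"   // accent first, then o + dot below
print(stranded.characters.count)   // Output: 2, the leading accent stands alone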

So the reason the output is 1 instead of 6 is that flags are built from regional indicator symbols, and under the Unicode segmentation rules Swift implemented at the time, a run of regional indicators forms a single grapheme cluster: six flags joined together still count as one character. 😂 does not combine with its neighbors, so it is correctly recognized as three characters.
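Looking at the scalar view makes the grouping visible (assuming the emoji1 and emoji2 values above):

print(emoji1.unicodeScalars.count)   // Output: 12, six flags of two regional indicator scalars each
print(emoji2.unicodeScalars.count)   // Output: 3, each 😂 is a single scalar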


[1] For example, to read the fifth character you can jump directly to bit offset (5 - 1) * 8 = 32. This is the meaning of "random access" in the original text: if the width of each character were variable, you would have to iterate from the beginning of the string.