Why are Swift strings so hard to use?

The String problem in Swift has always been a frequent point of ridicule. While the Swift team sent out a new proposal se-0265 offset-based Access to Indices, Elements, and Slices to improve String usage the other day, I wanted to share my understanding.

The content of proposal SE-0265 is not difficult to understand, mainly adding APIS to simplify the use of several collection. subscript functions, but this proposal has a lot of background story, so this time I want to talk about the design of collection. Index.

Before we look at collection.index, let’s take a look at common String usage scenarios:

let str = "Why is String Index so hard to use?"
let targetIndex = str.index(str.startIndex, offsetBy: 4)
str[targetIndex]
Copy the code

The above code is confusing in several ways:

whytargetIndexTo be calledStringInstance method to generate?
Why is it needed herestr.startIndexRather than0?
whyString.IndexA custom type is used instead of being used directlyInt?

These issues also make String API calls tedious. In other languages, a single statement can solve multiple problems in Swift, but these are all intentional by Swift……

Element of unequal length

When we use arrays, we assume that each element of the array is of equal length. For example, in C, the position of the NTH element in the array would be the array pointer + n * the length of the element, and this formula would get us the NTH element in O(1) time.

This is not always true in Swift. The best example is String, where each element may consist of 1 to 4 UTF8 code points. This means that when retrieving an element through an index, it is not possible to simply calculate the position of the element using the above formula. Instead, it is necessary to traverse all the way to the element corresponding to the index to obtain its actual position (offset).

When Int is used directly as an index, as Array does, operations such as iteration incur more performance costs because each iteration requires recalculating the offset of the code point:

// If String is Int as Index
// The following code complexity will be O(n^2)
// O(1) + O(2) + ... + O(n) = O(n!) ~= O(n^2)
let hello = "Hello"
for i in 0..<hello.count {
    print(hello[i])
}
Copy the code

How is string.index designed?

How does Swift String solve this problem? We can internally record the offset of the corresponding element, and reuse it to calculate the next Index during iteration:

// The following code complexity will be O(n)
// O(1) + O(1) + ... + O(1) = O(n)
let hello = "Hello"
var i = hello.startIndex
whilei ! = hello.endIndex {print(hello[i])
    hello.formIndex(after: i)
}
Copy the code

String.index: String Index: String Index: String Index: String Index: String Index

String 的 IndexThe memory layout is as follows: ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ╥ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ╥ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ │ b63: b16 │ b15: b14 ║ bl3: b8 │ b7: b1 ║ b0 │ ├ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┼ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ╫ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┼ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ╫ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┤ │ position │ transcoded offset ║ Grapheme Cache │ reserved ║ Scalar aligned │ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ╨ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ╨ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘ position - aka ` encodedOffset ` : a48Bit value, used to record code point offsets - transcoded offset: one2The value of bit, which is used to record the number of code points used for a character6The value of bit used to record the boundary of a character. - reserved:7Bit reserved field - Scalar aligned: one1The value of bit, which records whether the scalar has been aligned (?).Copy the code

However, since the offset of the code point is recorded in the Index, and the offset of the Index of each String will be different, we must use the String instance to calculate when generating the Index:

let str = "Why are String slices so hard to use?"
let k = 1
let targetIndex = str.index(str.startIndex, offsetBy: k) // STR must be used to generate Index
print(str[targetIndex])
Copy the code

The interesting thing about this implementation is that the most performance consuming aspect of using Index is the generation of Index. Once an Index is generated, the operation complexity of using Index is only O(1).

And because of the nature of this implementation, indexes generated by different String instances should not be mixed:

/ / C | | | | | language words
// | U+0043 | U+0020 | U+8BED | U+8A00 |
// | 43 | 20 | E8 | AF | AD | E8 | A8 | 80 |
let str = "C"

// | C | l | a | n | g |
// | U+0043 | U+006C | U+0061 | U+006E | U+0067 |
// | 43 | 6C | 61 | 6E | 67 |
let str2 = "Clang"

// I.E ncodedOffset == 2 (offset)
// i.ranscodedoffset == 3
let i = str.index(str.startIndex, offsetBy: 2) 

print(str[i])  / / language
print(str2[i]) // ang
Copy the code

The Swift development team has stated that Index mixing is an undefined behavior that could be thrown as an error at run time in the future.

Going through all the trouble of supporting elements of unequal length?

If you don’t need to support unequal length elements in a Collection, then everything is very simple. Instead of using an Index abstraction, a Collection can use an Int, and there is only a String for the Collection of unequal length elements in the library. Special treatment of it is also a feasible solution.

The Swift team was presented with two options:

We will continue to improveCollectionProtocol to better support unequal element lengths.
Or for that matterStringBuild a mechanism that operates independently of Collection’s architecture.

The devs actually had a bit of a wobble on this issue:

1 in the Swift,StringIs to followCollection.
This Conformance is removed at Swift 2~3, and plans are gradually abandonedIndexThis layer of abstraction is used directlyInt.
But it changed back after Swift 4.

The main benefit of doing this is to ensure the correctness of API and improve the reuse of code. When extending some set-related functions in Swift 2~3, the same code needs to be implemented in String and Collection respectively.

While it is true that the Index layer of abstraction is needed to express arrays of elements of unequal length like String, it is undeniable that it imposes a certain amount of burden on API calls. (Swift favors API correctness over ease of use)

Index doesn’t have to start at 0

When using a subset of sliced collections, such as ArraySlice, you might see some unexpected behavior, such as saying:

let a = [0.1.2.3.4]
let b = a[1.3]

print(b[1]) / / 1
Copy the code

Here we expect a result of 2 rather than 1, because we call b[1] with a presupposition that all subscripts for collections start at 0. But this doesn’t hold true for the collection type in Swift:

print(b.startIndex)          / / 1
print((10..<100).startIndex) / / 10
Copy the code

Collection.index is an absolute Index

In other words, the Index in a Collection is an absolute Index, but for us, there is no difference between Array and ArraySlice calls except for life cycle processing, and there should not be. It would be better to mask the differences between arrays and slicing with relative indexes, so why design it the way it is?

This issue has been hotly discussed in the forums, and the core development team has simply stated that while there is no difference for users, there is a much simpler and more efficient implementation of the collection type algorithm based on the existing design. And there is no constraint that Index must be Int.

For collections with Index == Int, it is convenient to set startIndex of SubSequence to 0, but this is also the biggest problem. Any code that assumes this is valid only for a Collection whose Index == Int, and for an Index! The Collection of = Int lacks a constant like 0 as startIndex, which makes it difficult to achieve a unified set algorithm at the level of abstraction.

What we want is a relative index

In fact, the current Index can be regarded as the absolute Index of a collection, and what we want is not a 0-based collection but a relative Index, but a relative Index must be converted into an absolute Index to obtain the corresponding data. However, this relative indexing means that API calls add a layer of index mapping, and there is additional implementation complexity to avoid the performance cost of multi-layer index mapping when dealing with nested calls of SubSequence SubSequence.

Whether Swift will add relative indexes or not, it will need to be implemented based on absolute indexes. The problem now is that absolute indexes are presented first as APIS, and it will be too tedious to use without our knowledge.

By tweaking our perception of the Collection abstraction and abandoning the idea that the array index must start with 0 and replacing it with a more abstract startIndex, this can become much more natural. It is not uncommon to introduce abstractions to improve performance at Swift, such as @escaping and Weak.

The distance between indexes is 1, but it’s not 1 either

The Index == Int Collection type must start at 0. The Index offset logic is abstracted, and the Index distance between collections is not necessarily “1”.

Suppose we want to implement a sampling function that takes an array value every n elements:

extension Array {
    func sample(interval: Int, execute: (Element) -> Void) {
        var i = 0
        while i < count {
            execute(self[i])
            i += interval
        }
    }
}

[0.1.2.3.4.5.6].sample(interval: 2) {
    print($0) // 0, 2, 4, 6
}
Copy the code

If we want to make it more general, so that it can be applied to most collection types, it is best to abstract it as a type, like those in the Swift standard library:

struct SampleCollection<C: RandomAccessCollection> :RandomAccessCollection {
    let storage: C
    let sampleInterval: Int

    var startIndex: C.Index { storage.startIndex }
    var endIndex: C.Index { storage.endIndex }
    func index(before i: C.Index) -> C.Index {
        if i == endIndex {
            return storage.index(endIndex, offsetBy: -storage.count.remainderReportingOverflow(dividingBy: sampleInterval).partialValue)
        } else {
            return storage.index(i, offsetBy: -sampleInterval)
        }
    }
    func index(after i: C.Index) -> C.Index { storage.index(i, offsetBy: sampleInterval, limitedBy: endIndex) ?? endIndex }
    func distance(from start: C.Index, to end: C.Index) -> Int { storage.distance(from: start, to: end) / sampleInterval }
    subscript(position: C.Index) - >C.Element { storage[position] }

    init(sampleInterval: Int, storage: C) {
        self.sampleInterval = sampleInterval
        self.storage = storage
    }
}
Copy the code

With the type encapsulated, we can add extension methods like prefix/suffix to the corresponding type, which is convenient to call:

extension RandomAccessCollection {
    func sample(interval: Int) -> SampleCollection<Self> {
        SampleCollection(sampleInterval: interval, storage: self)}}let array = [0.1.2.3.4.5.6]
array.sample(interval: 2).forEach { print($0)}// 0, 2, 4, 6
array.sample(interval: 3).forEach { print($0)}/ / 0, 3, 6
array.sample(interval: 4).forEach { print($0)}/ / 0, 4
Copy the code

SampleCollection achieves the effect of sampling by implementing methods related to indexes, which means that the Index abstraction is actually a concept interpreted by the Collection and has nothing to do with the Index itself.

For example, the distance between two indexes, 0 and 2, can actually be different for two different set types:

let sampled = array.sample(interval: 2)

let firstIdx = sampled.startIndex               / / 0
let secondIdx = sampled.index(after: firstIdx)  / / 2

let numericDistance = secondIdx - firstIdx.     / / 2
array.distance(from: firstIdx, to: secondIdx)   / / 2
sampled.distance(from: firstIdx, to: secondIdx) / / 1
Copy the code

So if we want to get the second element of the set Index == Int, it is an error to use 1 as the Index:

sampled[1]         / / 1
sampled[secondIdx] / / 2
Copy the code

A Collection will interpret the distance between two indexes in its own way, so even if we encounter a Collection with Index == Int, it is not a correct behavior to increment and decrement using Index directly. It is better to face this layer of generic abstraction. Reduce dependence on specific types.

Handling when crossing boundaries

Swift has always claimed to be a type-safe language. In the early stage, it removed the FOR loop of C and introduced a large number of “functional” apis to avoid array transgression. However, when using index or slice API, transgression will still lead to crash directly, which seems not in line with Swift’s “safe” concept.

Once in a while there have been suggestions in the community to switch to Optional return values instead of crashing, but they have all been called back, and there is even a section in The Commonly Rejected Changes section that asks people not to do this again (unless there is a very good reason).

So what does type safety mean? By safe, Swift does not mean to avoid crashes, but rather to avoid Undefined Behavior, such as when an array is read or written out of the array. In this case, Swift is more likely to terminate the program than to continue running with a memory error.

The Swift team believes that array out of bounds is a logical error, which was stated more clearly in an earlier mailing list:

On Dec 14, 2015, at 6:13 PM, Brent Royal-Gordon via swift-evolution wrote:

. In a very similar usage scenario, Dictionary returns an Optional value for subscripts. You might think this is very inconsistent with Array’s behavior. Let me put it another way. For Dictionary, isn’t it a programmer’s mistake when you use a key other than a key set to index values?

There are differences between Array and Dictionary usage scenarios.

I think the index used in 80% Array cases is generated indirectly or directly from the Array instance, for example, 0..

Because of the chayi in this usage scenario, arrays often use a non-optional subscript API and crash when using illegal indexes. Dictionary, on the other hand, has an Optional subscript API and returns nil if index is illegal.

conclusion

The core development team has two drafts to improve the String API. The basic direction is clear, adding a relative index type:

Collection Indicates the common index type. You don’t have to think about specificsIndexType, which does not need to be generated from an array instanceIndex, the new index is internally converted toCollectionIn the specificIndexType.
Simplify Index generation.
Subscript returns the Optional type.

Specific as you can see the content of the proposal, I am in the second draft was first put forward began to write this article, delete bowdlerize change finally finished, now has become a formal proposal draft in the review, hope this article can help you better understand the ins and outs of the proposal, also welcome to leave a message communication together.

Reference links:

Offset Indexing and Slicing – Swift Forums
String Essentials – Swift Forms
swift/SequencesAndCollections.rst at master
swift/StringDesign.rst at master
swift/StringManifesto.md at master
46: “A desire for simplicity and performance”, with special guest Michael Ilseman — Swift by Sundell
Strings in Swift 4 – Ole Begemann
Add Accessor With Bounds Check To Array – Swift Forums