I’m participating in nuggets Creator camp # 4, click here to learn more and sign up now!

In Swift, String and Array are structures themselves, so how do they store data and what is their structure in memory? We will explore these two questions in this article.

A, the String

1. Swift memory layout

1.1. String source code analysis

The String in Swift is a struct type, so how does it store characters? Its structure in the source code is as follows:

public struct String {
    public // @SPI(Foundation)
    var _guts: _StringGuts

    @inlinable @inline(__always)
    internal init(_ _guts: _StringGuts) {
        self._guts = _guts
        _invariantCheck()
    }

    // other
    .
}
Copy the code

Its initialization method requires passing a _StringGuts type. This _StringGuts is what we’re going to explore next. Find its definition in the source StringGuts. Swift file as follows:

// StringGuts is a parameterization over String's representations. It provides
// functionality and guidance for efficiently working with Strings.
//
@frozen
public // SPI(corelibs-foundation)
    struct _StringGuts: UnsafeSendable {
    @usableFromInline
    internal var _object: _StringObject

    @inlinable @inline(__always)
    internal init(_ object: _StringObject) {
        self._object = object
        _invariantCheck()
    }

    // Empty string
    @inlinable @inline(__always)
    init(a) {
        self.init(_StringObject(empty: ()))
    }
}
Copy the code

Notice that its initialization method requires passing a _StringObject. Let’s go ahead and see what this _StringObject is.

@frozen @usableFromInline
internal struct _StringObject {
    .
}
Copy the code

This _StringObject is also a structure, and we find the _StringObject(empty: ()) method, which implements the following:

@inlinable @inline(__always)
internal init(empty:)) {
    // Canonical empty pattern: small zero-length string
#if arch(i386) || arch(arm) || arch(arm64_32) || arch(wasm32)
    self.init(
         count: 0,
         variant: .immortal(0),
         discriminator: Nibbles.emptyString,
         flags: 0)
#else
    self._countAndFlagsBits = 0
    self._object = Builtin.valueToBridgeObject(Nibbles.emptyString._value)
#endif
    _internalInvariant(self.smallCount = = 0)
    _invariantCheck()
}
Copy the code

Here it calls init(count: Variant: Discirminator: flags:).

@inlinable @inline(__always)
init(count: Int.variant: Variant.discriminator: UInt64.flags: UInt16) {
    _internalInvariant(discriminator & 0xFF00_0000_0000_0000 = = discriminator,
    "only the top byte can carry the discriminator and small count")

    self._count = count
    self._variant = variant
    self._discriminator = UInt8(truncatingIfNeeded: discriminator & > > 56)
    self._flags = flags
    self._invariantCheck()
}
Copy the code

This method assigns a value to the current _StringObject member variable. What does _count, _variant, _discriminator, and _flags stand for? First, _count should represent the size of the current string. The type of _variant is Variant, which is an enumeration. Let’s look at its definition:

internal enum Variant {
    case immortal(UInt)
    case native(AnyObject)
    case bridged(_CocoaString)
    // function
    .
}
Copy the code

This enumeration represents three cases of a string, immortal, native, and bridged. The _variant type is Variant. Immortal (0), which represents the Swift native string type.

Ok, so let’s look at _discriminator, and what’s coming in from this _discriminator is an Attribute called Nibbles. EmptyString, so let’s see what Nibbles are:

// Namespace to hold magic numbers
@usableFromInline @frozen
enum Nibbles {}
Copy the code

It is an enumeration with nothing, so we can look for its extension at this point.

extension _StringObject.Nibbles {
    // The canonical empty string is an empty small string
    @inlinable @inline(__always)
    internal static var emptyString: UInt64 {
        return _StringObject.Nibbles.small(isASCII: true)}}Copy the code

The emptyString returns a Nibbles samll(isASCII:) method, where true means ASCII and false means not. Let’s look at the implementation of samll(isASCII:).

@inlinable @inline(__always)
internal static func small(isASCII: Bool) -> UInt64 {
    return isASCII ? 0xE000_0000_0000_0000 : 0xA000_0000_0000_0000
}
Copy the code

As you can see, when isASCII is true, the return is 0xE000_0000_0000_0000, otherwise 0xA000_0000_0000_0000.

1.2. Memory layout for small strings

What do 0xE000_0000_0000_0000 and 0xA000_0000_0000_0000 stand for? Let’s take a look:

At this point, I pass nothing, and we see the current string printed as 0xE000_0000_0000_0000, indicating that the empty string is ASCII. Next I pass in Chinese:

When Chinese characters are imported, the printed characters are 0xA000_0000_0000_0000, indicating that Chinese characters do not belong to ASCII.

What does the 3 after 0xa mean when passed into Chinese? Let’s take a look:

When empty is equal to ab, the end of 0xe is equal to 2. Then let’s look at ABC:

So this 0xe or 0xa represents the size of the string. And 636261 is the code for ABC, so 636261 is ABC.

We all know that a string is 16 bytes in size, so are these characters stored in those 16 bytes? Let’s see if abcdefgh fills the first 8 bytes.

If the string is larger than 8 bytes, let’s look at abcdefghijklmno’s memory:

The string is conveniently chosen to be exactly 15 in size, so the value of the string is stored entirely in those 16 bytes. If the string size exceeds 15, how will the string be stored?

1.3. Memory layout for large strings

Let’s look at abcdefghijklmnoabc memory:

Now those 16 bytes are different. Why? What do these 16 bytes represent? There is a comment in the source code:

A string whose size is less than 15 is a small string, for example, abcdefghijklmno. A string is large if its size is greater than 15. The official comments show that large strings can be native, shared, or foreign. The focus here is on native strings. According to the official annotation, this native string has tail-allocated storage space, which starts with the “nativeBias” offset of the object’s address.

That means that when the size of the string is greater than 15, extra storage is allocated, and that extra storage is used to store the string value. Let’s read on:

First, the value of nativeBias is 32. Next, consider the discriminator and objectAddr. According to the official annotation, this discriminator has a position between 63 and 60 digits high in the 64-bit discriminator. That 60 bits up to 0 bits down is the memory address of this extra storage space.

This objectAddr stores the memory address of the extra storage space, but it is a relative address because it requires nativeBias to get the address value of the extra storage space.

Return to abcDefGhijKLMNOabc memory layout, the first 8 bytes (0xD000000000000012) let’s look at the next 8 bytes (0x8000000100003f60). 0x8 is the value of this discriminator, followed by 100003F60, which is the relative address of the additional storage space. So let’s try to add the value of nativeBias to see if that’s the storage space for the string.

100003f60 + 32 = 100003f60 + 0x2 = 100003f80
Copy the code

Let’s format 0x100003F80:

As we guessed, 0x100003f80 is the memory address of the string. How do I know that 0x8 is the discriminator value?

// Discriminator for large, immortal, swift-native strings
@inlinable @inline(__always)
internal static func largeImmortal(a) -> UInt64 {
    return 0x8000_0000_0000_0000
}
Copy the code

This 0x8000_0000_0000_0000 represents a large string that is a native string.

Let’s take a look at the first eight bytes (0xD000000000000012). In the official comment there is a large string describing the flag bits, as shown in the figure below:

  • IsASCII: used to determine whether the current string isASCII, 63 bits high.

  • IsNFC: This defaults to 1 and is 62 bits high.

  • IsNativelyStored: Whether native stored, 61 bits high.

  • IsTailAllocated: Is it a tail allocated, in the high 60 digits.

  • TBD: Reserved for future use, in high 59 to high 48 bits.

  • Count: Size of the current string, from 47 bits high to 0 bits low.

With these implications in mind, let’s look at how 0xD000000000000012 is stored in 64-bit, as shown below:

You can see that this is consistent with the description in the official comment, so 0x12 is 18 in decimal, so the current string size is 18. So when the string is large, the first 8 bytes store countsAndFlags.

1.4. Summary of string memory layout

  • A String variable/constant is 16 bytes in size.

  • A small string is a string whose size is less than or equal to 15, and a large string is a string whose size is greater than 15.

  • When small, the first 15 bytes of the 16-byte size are used to store the string value, and the last byte is used to record whether the current string is ASCII and the string size.

  • For large strings, the first 8 bytes of the 16-byte size are used to record the size of the string and other information, such as whether it is ASCII. The remaining 8 bytes, the highest 63 to highest 60 bytes store the value of the discriminator, and the remaining storage space address value for the string.

2. Swift Index

In Swift, to get the value of a character in a string we can do Index.

let str = "Hello World"
Copy the code

For example, in the string above, I need to get the “e” character in “Hello World”, so we can use index(I :offsetBy:) to get the index of the “e” character. This method requires passing an Index of a character in the string, and then offsetting it to the Index to return, based on that Index offset.

For example, if we want to get the Index “e”, we would offset it by 1 from the Index at the beginning of the string as follows:

let index = str.index(str.startIndex, offsetBy: 1)
print(str[index])   // e
Copy the code

And then we get the “E” character through this index. So here’s the question: why can’t we just access an Int with an index like an array?

String in Swift refers to a series of characters, which can be expressed in many ways. For example, ASCII code, which we are most familiar with, defines a total of 128 characters, 128 characters is enough for English characters. But compared to other languages, this is not enough.

This means that different countries and languages need to have their own encoding format, and the same binary file can be translated into different characters. There is a code that includes all symbols, known as Unicode. But Unicode only specifies the binary code corresponding to a symbol, without specifying how the binary code should be stored.

Let’s say I have a string here: “I’m Coder”. The Unicode for this string is as follows:

I:6211Is: 662 fC: 0043O: 006f d:0064E:0065R:0072
Copy the code

As you can see, each of the above characters corresponds to a hexadecimal number, which is recognized by the computer as binary, so their binary representation is as follows:

I:0110 0010 0001 0001Is this:0110 0110 0010 1111
C: 0000 0000 0100 0011O:0000 0000 0110 1111D:0000 0000 0110 0100E:0000 0000 0110 0101R:0000 0000 0111 0010
Copy the code

In order to be able to see the difference between Chinese characters and English characters, THE above output is a multiple of 4, not enough to fill with 0. If English characters are stored in the same way as Chinese characters, that is, English characters are stored in the same step size as Chinese characters, it is bound to be a great waste.

To solve this problem, you can use UTF-8. One of the great features of UTF-8 is that it is a variable length encoding method. It can use 1 to 4 bytes to represent a character, varying the length of the byte depending on the symbol. Here are the utF-8 rules:

  • The first byte is set to 0. For English text, UTF-8 takes only one byte, the same as ASCII.

  • The first n bits of the first byte are set to 1, the n+1 bits are set to 0, and the first two bits of the following bytes are set to 10. The remainder of the n bytes will fill the Unicode code, and the high bits will be filled with zeros.

Such as:

I:11100110 10001000 10010010Is this:11100110 10011000 10101111
C: 0100 0011O:0110 1111D:0110 0100E:0110 0101R:0111 0010
Copy the code

For Swift, a String is a collection of characters, which means that each element in the String is of unequal length. So that means that we have different steps when we move memory. What does that mean? For example, if we have an Array (Int), when we iterate over the elements in the Array, the offset is 8 bytes at a time because each element has the same memory size.

But for strings, such as STR [1], is the offset of “is” determined only after the “I” field is iterated? Each iteration of the input-in sequence must be repeated to calculate the offset, which undoubtedly increases the memory consumption. That’s why you can’t access String via Int as a subscript.

Index layout of String Index layout

From the notes we get a rough idea of what the above expression means:

  • Position aka encodedOffset: A 48-bit value used to record code point offsets.

  • Transcoded offset: a 2-bit value that records the number of code points used for a character.

  • Grapheme Cache: A 6-bit value that records the boundaries of a character.

  • 3. Scalar aligned: A 1 bit value that records whether scalars are aligned or not.

So the essence of an Index for a String is to store encodedOffset and Transcoded offset. When we construct a String Index, we actually calculate encodedOffset and Transcoded offset and store them in the memory information of the Index. The Index itself is a 64-bit bit-field information.

Second, the Array

1. Memory structure of the Array

In Swift, Array is a structure whose size depends on the size of its member variables. How big is an Array variable? Is it the same as a normal structure?

Define a structure, let’s see its size, code as follows:

struct Person {
    var age = 18
    var height = 180
}

let p = Person(a)print(MemoryLayout.stride(ofValue: p)) / / 16
Copy the code

As you can see, this structure is 8 bytes in size. Let’s look at Array’s:

let nums = [1.2.3.4.5.6.7]
print(MemoryLayout.stride(ofValue: nums))  / / 8
Copy the code

The size of the array is only 8 bytes, which means that the elements of the array are not stored on numS variables. Where is the data stored in the array? Let’s start with sil code to see how NUMS allocates memory at the bottom.

In our view, mian function, when we initialize nums will call a _allocateUninitializedArray method, and a digital in the array elements. So, what does this method do internally? Let’s take a look at how it is implemented in the source code.

Find an implementation of this function in the arrayShared. swift file:

@inlinable // FIXME(inline-always)
@inline(__always)
@_semantics("array.uninitialized_intrinsic")
public // COMPILER_INTRINSIC
func _allocateUninitializedArray<Element> (_  builtinCount: Builtin.Word)- > (Array<Element>, Builtin.RawPointer) {
    let count = Int(builtinCount)
    if count > 0 {
        // Doing the actual buffer allocation outside of the array.uninitialized
        // semantics function enables stack propagation of the buffer.
        let bufferObject = Builtin.allocWithTailElems_1(
        _ContiguousArrayStorage<Element>.self, builtinCount, Element.self)

        let (array, ptr) = Array<Element>._adoptStorage(bufferObject, count: count)
        return (array, ptr._rawValue)
    }
    // For an empty array no buffer allocation is needed.
    let (array, ptr) = Array<Element>._allocateUninitialized(count)
    return (array, ptr._rawValue)
}
Copy the code

This method returns a tuple that uses count to determine whether the currently initialized array needs to be buffered. When count is greater than 0, allocWithTailElems_1 is called to allocate heap space to store the elements in the array. Then create an array with _adoptStorage. Let’s look at the internal implementation of the _adoptStorage method:

@inlinable
@_semantics("array.uninitialized")
internal static func _adoptStorage(
_ storage: __owned _ContiguousArrayStorage<Element>.count: Int )- > (Array.UnsafeMutablePointer<Element>) {

    let innerBuffer = _ContiguousArrayBuffer<Element>(
    count: count,
    storage: storage)

    return (
        Array(
        _buffer: _Buffer(_buffer: innerBuffer, shiftedToStartIndex: 0)),
        innerBuffer.firstElementAddress)
}
Copy the code

Notice that the tuple is returned with the Array and firstElementAddress, which is the address of the first element in the Array.

Why return the address of the first element? There should be more information before the address of the first element. Here is the conclusion, as shown in the figure:

The numS variable stores a memory address of a class named _ContiguousArrayStorage, which stores the metatype, reference count, number of elements, size of capacity, flag bit, and memory address of the element.

Let’s first print through LLDB to see if it is consistent with the conclusion, as shown in the figure:

As can be seen from the figure, it is basically consistent with our previous conclusion, but in fact, I did not find the code and document description related to metadata and refCount in the source code. If you are interested, you can search the implementation of the class _ContiguousArrayStorage in the source code. Or familiar with assembly debugging can try to prove these two things through assembly.

2. Splicing Array data

The call to append for Array to concatenate data is explained in the documentation, as shown below:

When you add elements to an array and the array starts to exceed its reserved capacity, the array allocates a larger area of memory and copies its elements into the new storage. The new storage is a multiple of the size of the old storage. This exponential growth strategy means that append an element in a constant amount of time, averaging the performance of many append operations. Appends that trigger reallocation have performance costs, but they occur less and less frequently as the array gets larger.

Let’s take a look at how append expands, which is as follows:

@inlinable
@_semantics("array.append_element")
public mutating func append(_ newElement: __owned Element) {
    // Separating uniqueness check and capacity check allows hoisting the
    // uniqueness check out of a loop.
    _makeUniqueAndReserveCapacityIfNotUnique()
    let oldCount = _buffer.mutableCount
    _reserveCapacityAssumingUniqueBuffer(oldCount: oldCount)
    _appendElementAssumeUniqueAndCapacity(oldCount, newElement: newElement)
    _endMutation()
}
Copy the code

First of all, it calls a _makeUniqueAndReserveCapacityIfNotUnique method, its implementation is as follows:

@inlinable
@_semantics("array.make_mutable")
internal mutating func _makeUniqueAndReserveCapacityIfNotUnique(a) {
    if _slowPath(!_buffer.beginCOWMutation()) {
        _createNewBuffer(bufferIsUnique: false,
                         minimumCapacity: count + 1,
                         growForAppend: true)}}Copy the code

If the buffer’s storage is uniquely referenced, place the buffer in a mutable state and create a new buffer. We then look at the _createNewBuffer (bufferIsUnique: minimumCapacity: growForAppend) the implementation of the code is as follows:

@_alwaysEmitIntoClient
@inline(never)
@_semantics("optimize.sil.specialize.owned2guarantee.never")
internal __consuming func _consumeAndCreateNew(bufferIsUnique: Bool.minimumCapacity: Int.growForAppend: Bool ) -> _ArrayBuffer {
    let newCapacity = _growArrayCapacity(oldCapacity: capacity,
                                         minimumCapacity: minimumCapacity,
                                         growForAppend: growForAppend)
    .
}
Copy the code

In this method, it calls the _growArrayCapacity method to expand the capacity. Let’s continue to look at the implementation of _growArrayCapacity, as shown in the figure below:

So, in general, when an array is expanded, it is essentially doubled over the old capacity.