Overview

Through this example, you’ll learn how to manage data dependencies and avoid processor waits between the CPU and GPU.

This example renders a row of triangles arranged along a sine wave. Each frame updates the positions of the triangle vertices and renders a new image. This dynamically updated data creates the illusion of motion, with the triangles appearing to move along the sine wave.

This example stores the triangle vertices in a buffer shared by the CPU and GPU. The CPU writes data to the buffer, and the GPU reads it.

Data dependencies and processor waits

Resource sharing creates a data dependency between processors: the CPU must finish writing to a resource before the GPU reads it. If the GPU reads a resource before the CPU writes to it, the GPU reads undefined data. If the GPU reads a resource while the CPU is writing to it, the GPU reads incorrect data.

These data dependencies cause processor waits between the CPU and GPU: each processor must wait for the other to finish its work before starting its own.

However, because the CPU and GPU are separate processors, you can make them work simultaneously by using multiple instances of a resource. Each frame must supply the shaders with the same arguments, but that doesn't mean every frame must reference the same resource object. Instead, create a pool of multiple instances of a resource and use a different instance each time you render a frame. For example, the CPU can write position data to the buffer used for frame N+1 while the GPU reads position data from the buffer used for frame N. By using multiple instances of the buffer, the CPU and GPU can work continuously and avoid stalls for as long as you keep rendering frames.

Initialize data with the CPU

Define a structure AAPLVertex to represent each vertex, including position and color properties:

typedef struct
{
    vector_float2 position;
    vector_float4 color;
} AAPLVertex;

Create a custom AAPLTriangle class that provides an interface to get a triangle consisting of three vertices:

+(const AAPLVertex *)vertices
{
    const float TriangleSize = 64;
    static const AAPLVertex triangleVertices[] =
    {
        // Pixel positions, RGBA colors.
        { { -0.5*TriangleSize, -0.5*TriangleSize }, { 1, 1, 1, 1 } },
        { {  0.0*TriangleSize, +0.5*TriangleSize }, { 1, 1, 1, 1 } },
        { { +0.5*TriangleSize, -0.5*TriangleSize }, { 1, 1, 1, 1 } }
    };
    return triangleVertices;
}

Initialize multiple triangles with positions and colors, and store them in an array of triangles (_triangles):

NSMutableArray *triangles = [[NSMutableArray alloc] initWithCapacity:NumTriangles];

// Initialize each triangle.
for(NSUInteger t = 0; t < NumTriangles; t++)
{
    vector_float2 trianglePosition;

    // Determine the starting position of the triangle in a horizontal line.
    trianglePosition.x = ((-((float)NumTriangles) / 2.0) + t) * horizontalSpacing;
    trianglePosition.y = 0.0;

    // Create the triangle, set its properties, and add it to the array.
    AAPLTriangle * triangle = [AAPLTriangle new];
    triangle.position = trianglePosition;
    triangle.color = Colors[t % NumColors];
    [triangles addObject:triangle];
}
_triangles = triangles;
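The starting x-position formula in the loop above can be checked in isolation. The following is a minimal sketch in plain C (which is also valid Objective-C). The helper name triangle_start_x is hypothetical, and the spacing of 16 used in the note below is an assumed value, because the excerpt doesn't show the value of horizontalSpacing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper that mirrors the x-position expression in the loop
 * above: triangles are laid out left to right, roughly centered on x = 0. */
static float triangle_start_x(size_t index, size_t count, float spacing)
{
    assert(index < count);
    return ((-((float)count) / 2.0f) + (float)index) * spacing;
}
```

With 50 triangles and an assumed spacing of 16, the first triangle starts at x = -400, the triangle at index 25 sits at x = 0, and the last one starts at x = 384.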

Allocate data storage

Calculate the total storage size of the triangle vertices. The app renders 50 triangles, each with 3 vertices, for a total of 150 vertices. Each vertex is the size of the AAPLVertex structure:

const NSUInteger triangleVertexCount = [AAPLTriangle vertexCount];
_totalVertexCount = triangleVertexCount * _triangles.count;
const NSUInteger triangleVertexBufferSize = _totalVertexCount * sizeof(AAPLVertex);
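As a worked check of this arithmetic, here is a minimal sketch in plain C. The types sketch_float2, sketch_float4, and SketchVertex are hypothetical stand-ins for the simd types and AAPLVertex; they assume the usual simd layout, where a 2-float vector is 8-byte aligned and a 4-float vector is 16-byte aligned. Under that assumption each vertex occupies 32 bytes, including 8 bytes of padding after the position.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for vector_float2 and vector_float4 (assumed layout, not the
 * real simd headers): same size and alignment as the simd types. */
typedef struct { _Alignas(8)  float v[2]; } sketch_float2;
typedef struct { _Alignas(16) float v[4]; } sketch_float4;

/* Mirrors the AAPLVertex layout: position, then color. */
typedef struct {
    sketch_float2 position;  /* bytes 0..7, followed by 8 bytes of padding */
    sketch_float4 color;     /* bytes 16..31 */
} SketchVertex;

static_assert(sizeof(SketchVertex) == 32, "expected a 32-byte vertex");

/* Total bytes needed for one copy of the vertex data. */
static size_t vertex_buffer_size(size_t triangle_count,
                                 size_t vertices_per_triangle)
{
    return triangle_count * vertices_per_triangle * sizeof(SketchVertex);
}
```

For 50 triangles with 3 vertices each, this gives 150 × 32 = 4800 bytes per buffer instance.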

Initialize multiple buffers to store multiple copies of vertex data. For each buffer, allocate exactly enough memory to store 150 vertices:

for(NSUInteger bufferIndex = 0; bufferIndex < MaxFramesInFlight; bufferIndex++)
{
    _vertexBuffers[bufferIndex] = [_device newBufferWithLength:triangleVertexBufferSize
                                                       options:MTLResourceStorageModeShared];
    _vertexBuffers[bufferIndex].label = [NSString stringWithFormat:@"Vertex Buffer #%lu",
                                         (unsigned long)bufferIndex];
}

At initialization, the contents of the buffer instance in the _vertexBuffers array are empty.

Update data with the CPU

In each frame, the CPU updates the contents of one buffer instance in the updateState method, at the start of the draw(in:) render loop:

// Vertex data for the current triangles.
AAPLVertex *currentTriangleVertices = _vertexBuffers[_currentBuffer].contents;

// Update each triangle.
for(NSUInteger triangle = 0; triangle < NumTriangles; triangle++)
{
    vector_float2 trianglePosition = _triangles[triangle].position;

    // Displace the y-position of the triangle using a sine wave.
    trianglePosition.y = (sin(trianglePosition.x/waveMagnitude + _wavePosition) * waveMagnitude);

    // Update the position of the triangle.
    _triangles[triangle].position = trianglePosition;

    // Update the vertices of the current vertex buffer with the triangle's new position.
    for(NSUInteger vertex = 0; vertex < triangleVertexCount; vertex++)
    {
        NSUInteger currentVertex = vertex + (triangle * triangleVertexCount);
        currentTriangleVertices[currentVertex].position = triangleVertices[vertex].position + _triangles[triangle].position;
        currentTriangleVertices[currentVertex].color = _triangles[triangle].color;
    }
}

After the CPU updates a buffer instance, it doesn't access that instance's data again for the rest of the frame.

Note: All CPU writes to a buffer instance must be complete before you commit a command buffer that references it. Otherwise, the GPU might start reading the buffer instance while the CPU is still writing to it.

Encoding GPU commands

Next, encode the commands that reference the buffer instance in a render pass:

[renderEncoder setVertexBuffer:_vertexBuffers[_currentBuffer]
                        offset:0
                       atIndex:AAPLVertexInputIndexVertices];

// Set the viewport size.
[renderEncoder setVertexBytes:&_viewportSize
                       length:sizeof(_viewportSize)
                      atIndex:AAPLVertexInputIndexViewportSize];

// Draw the triangle vertices.
[renderEncoder drawPrimitives:MTLPrimitiveTypeTriangle
                  vertexStart:0
                  vertexCount:_totalVertexCount];

Commit and execute GPU commands

At the end of the rendering loop, call the commit() method of the command buffer to commit the work to the GPU:

[commandBuffer commit];

The GPU reads the vertex data in the vertexShader vertex function, which returns RasterizerData and takes the buffer instance as an input argument:

vertex RasterizerData
vertexShader(const uint vertexID [[ vertex_id ]],
             const device AAPLVertex *vertices [[ buffer(AAPLVertexInputIndexVertices) ]],
             constant vector_uint2 *viewportSizePointer [[ buffer(AAPLVertexInputIndexViewportSize) ]])

Reuse multiple buffer instances

When both processors have finished their work, the work of a complete frame is done. For each frame, perform the following steps:

  1. Write data to a buffer instance.
  2. Encode commands that reference the buffer instance.
  3. Commit the command buffer that contains the encoded commands.
  4. Read data from the buffer instance.

When a frame's work is complete, the CPU and GPU no longer need the buffer instance used in that frame. However, discarding a used buffer instance and creating a new one for every frame is expensive and wasteful. Instead, the app sets up its buffer instances (_vertexBuffers) as a circular first-in, first-out (FIFO) queue so that they can be reused. The maximum number of buffer instances in the queue is defined by the value of MaxFramesInFlight, which is set to 3:

static const NSUInteger MaxFramesInFlight = 3;

In each frame, at the start of the render loop, the app updates the next buffer instance in the _vertexBuffers queue. It cycles through the queue sequentially and updates only one buffer instance per frame. After every third frame, it returns to the start of the queue:

// Iterate through the Metal buffers, and cycle back to the first when you've written to the last.
_currentBuffer = (_currentBuffer + 1) % MaxFramesInFlight;

// Update buffer data.
[self updateState];

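The index arithmetic behind this circular queue can be sketched on its own in plain C. The names kMaxFramesInFlight and next_buffer_index are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Same value as MaxFramesInFlight in the article. */
enum { kMaxFramesInFlight = 3 };

/* Returns the slot to write in the next frame, wrapping back to slot 0
 * after the last slot has been used. */
static size_t next_buffer_index(size_t current)
{
    assert(current < (size_t)kMaxFramesInFlight);
    return (current + 1) % (size_t)kMaxFramesInFlight;
}
```

Starting from slot 0, successive frames use slots 1, 2, 0, 1, and so on, so a slot is only written again three frames after it was last used.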
Note: Core Animation provides optimized displayable resources, commonly called drawables, for rendering content and displaying it on screen. Drawables are efficient but expensive system resources, so Core Animation limits the number of drawables an app can use at the same time. The default limit is 3, but you can set it to 2 with the maximumDrawableCount property (2 and 3 are the only supported values). Because the maximum number of drawables is 3, this example creates 3 buffer instances. There is no need to create more buffer instances than the maximum number of drawables available.

Manage the rate of CPU and GPU work

When you have multiple buffer instances, the CPU can start work on frame N+1 with one instance while the GPU finishes work on frame N with another. This makes the app more efficient, because the CPU and GPU work simultaneously. However, you need to manage the rate at which the app works so that it doesn't need more buffer instances than are available.

To manage the app's rate of work, use a semaphore to wait for full frame completions in case the CPU is working much faster than the GPU. A semaphore is a non-Metal object that controls access to a resource shared across multiple processors (or threads). The semaphore has an associated count that is decremented or incremented to indicate whether a processor has started or finished accessing the resource. In this app, a semaphore controls CPU and GPU access to the buffer instances. Initialize the semaphore with a count of MaxFramesInFlight to match the number of buffer instances. This value means the app can work on up to 3 frames at any given time:

_inFlightSemaphore = dispatch_semaphore_create(MaxFramesInFlight);

At the start of the render loop, decrement the semaphore's count by 1 to indicate that you're ready to work on a new frame. When the count falls below 0, the semaphore makes the CPU wait until the value is incremented:

dispatch_semaphore_wait(_inFlightSemaphore, DISPATCH_TIME_FOREVER);

At the end of the render loop, register a completion handler on the command buffer. When the GPU finishes executing the command buffer, it calls this handler, which increments the semaphore's count by 1. This indicates that all work for the given frame is complete and that the buffer instance used in that frame can be reused:

__block dispatch_semaphore_t block_semaphore = _inFlightSemaphore;
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer)
 {
     dispatch_semaphore_signal(block_semaphore);
 }];

The addCompletedHandler(_:) method registers a block of code that is called immediately after the GPU finishes executing the associated command buffer. Because the app uses only one command buffer per frame, receiving the completion callback indicates that the GPU has completed the frame.
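The counting behavior described in this section can be modeled without Metal or Grand Central Dispatch. The following is a single-threaded sketch in plain C. The type FrameSlots and its functions are hypothetical, and the model is deliberately simplified: where dispatch_semaphore_wait would block the CPU, frame_slots_try_acquire returns false instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the in-flight frame semaphore (not a real semaphore:
 * it never blocks and is not thread-safe). */
typedef struct {
    int available;  /* buffer instances free for the CPU to write */
    int capacity;   /* total buffer instances, e.g. 3 */
} FrameSlots;

static FrameSlots frame_slots_make(int capacity)
{
    assert(capacity > 0);
    FrameSlots slots = { capacity, capacity };
    return slots;
}

/* Models the wait at the start of the render loop. Returns false in the
 * situation where the real semaphore would make the CPU wait. */
static bool frame_slots_try_acquire(FrameSlots *slots)
{
    if (slots->available == 0) {
        return false;
    }
    slots->available--;
    return true;
}

/* Models the signal in the completion handler: the GPU has finished a
 * frame, so one buffer instance becomes reusable. */
static void frame_slots_release(FrameSlots *slots)
{
    assert(slots->available < slots->capacity);
    slots->available++;
}
```

With a capacity of 3, the CPU can start three frames back to back. A fourth attempt fails until the GPU completes a frame and releases a slot, which is the point where the real app would wait in dispatch_semaphore_wait.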

Set buffer mutability

The app performs all per-frame rendering setup on a single thread. First, the CPU writes data to a buffer instance. Next, it encodes the render commands that reference the buffer instance. Finally, it commits a command buffer for the GPU to execute. Because these tasks always run in this order on a single thread, the app guarantees that it finishes writing data to a buffer instance before it encodes a command that references that instance.

This order lets you mark the buffer instances as immutable. When you configure the render pipeline descriptor, set the mutability property of the vertex buffer at the buffer instance index to MTLMutabilityImmutable:

pipelineStateDescriptor.vertexBuffers[AAPLVertexInputIndexVertices].mutability = MTLMutabilityImmutable;

Metal can optimize the performance of immutable buffers, but not mutable buffers. For best performance, use immutable buffers whenever possible.

Conclusion

This article explained why sharing resources creates data dependencies between the CPU and GPU, and how using multiple instances of a resource avoids waits between CPU and GPU work.

Download the sample code for this article