This article was first published in the RT-Thread community and shall not be republished without authorization.

Preface

The GPU, or graphics processing unit, is the core of a modern graphics card. In the pre-GPU era, all drawing was done by the CPU, which had to calculate the boundaries, colors and other data of each shape and write the results to video memory. Simple graphics were not a problem, but as computers (and especially games) needed to display ever more complex graphics, the CPU became increasingly overloaded. The GPU was introduced to relieve the CPU of this heavy graphics workload, and it greatly increased display speed.

Microcontrollers (MCUs) have a similar history. In early MCU applications there was little demand for graphical display. Where a display was needed, it was usually a simple 12864 module; the amount of computation was small and the MCU's CPU handled it well. As embedded graphics developed, the MCU had to take on more and more drawing and display work, and the resolution and color depth of embedded displays climbed steadily. Gradually the CPU could no longer keep up with these calculations. So, starting with the STM32F429, ST added a GPU-like peripheral to its STM32 microcontrollers. ST calls it the Chrom-ART Accelerator, also known as DMA2D (the name used in this article). DMA2D accelerates many 2D drawing operations, playing a role similar to that of the GPU in a modern graphics card.

This “GPU” only provides 2D acceleration and its functionality is very simple, so it is not comparable to the GPU in a PC. It is nevertheless enough to meet most graphics acceleration needs in embedded development. Used well, DMA2D lets us build smooth, attractive UI effects on an MCU.

Using examples, this article introduces the role DMA2D can play in embedded graphics development. The goal is to help the reader quickly grasp the basic concepts of DMA2D and learn its basic usage. To keep things accessible, the article does not go into DMA2D's advanced features or give a detailed description of its architecture and every register. If you want to study DMA2D in more depth, refer to the “STM32H743 Chinese Programming Manual” after reading this article.

Before reading this article, it is necessary to have some understanding of the TFT LCD controller (LTDC) in STM32 and basic graphics knowledge (such as framebuffer, pixel, color format, etc.).

Besides ST, other manufacturers also include peripherals with similar functions in their MCUs (for example, the PXP in NXP's RT series), but these are outside the scope of this article; interested readers can look into them on their own.

Preparation

Hardware preparation

You can use any STM32 development board whose MCU has the DMA2D peripheral, such as the STM32F429, STM32F746 or STM32H750, to verify the examples in this article. The board used here is the ART-Pi, the official development board produced by RT-Thread. It pairs a 480MHz STM32H750XB with 32MB of SDRAM and carries an onboard debugger (ST-Link V2.1), which makes it convenient for verifying technical solutions and well suited as the hardware demonstration platform for this article.

The display can be any color TFT display; one with a 16-bit or 24-bit RGB interface is recommended. This article uses a 3.5" TFT LCD with an RGB666 interface and a resolution of 320×240 (QVGA). The color format configured in the LTDC is RGB565.

image.png

Development environment preparation

The content and code presented in this article can be used in any development environment you like, such as RT-Thread Studio, MDK, IAR, etc.

To begin the experiment in this article you will need a basic project that uses framebuffer technology to drive LCD displays. You need to pre-enable DMA2D before running all of the code in this article.

DMA2D can be enabled with the following macro (call it once during hardware initialization):

// Enable the DMA2D peripheral
__HAL_RCC_DMA2D_CLK_ENABLE();

Overview of DMA2D

Let's first look at how ST describes DMA2D:

image.png

It may seem a little obscure at first, but here’s what it does:

Color fill (rectangular area)
Image (memory) copy
Color format conversion (such as YCbCr to RGB or RGB888 to RGB565)
Transparency blending (alpha blend)

The first two are memory-specific operations, and the last two are computation-accelerated operations. Among them, transparency blending, color format conversion can be done together with the image copy, which brings greater flexibility.

As you can see, ST positions DMA2D, as its name suggests, as a DMA enhanced for image processing. In the actual development process, we will find that the use of DMA2D is very similar to the traditional DMA controller. DMA2D can even be used in place of traditional DMA in some non-graphics processing applications.

Note that the DMA2D accelerators in ST's different product lines differ slightly. For example, the DMA2D in STM32F4 series MCUs cannot convert between ARGB and ABGR color formats. It is therefore best to check the manual for your device first to see whether the function you need is supported.

This article covers only the DMA2D features common to all platforms.

How DMA2D works

Just as a traditional DMA controller has three operating modes (peripheral-to-memory, memory-to-peripheral and memory-to-memory), DMA2D, being a DMA, has the following four operating modes:

Register to memory
Memory to memory
Memory to memory with pixel color format conversion
Memory to memory with pixel color format conversion and transparency blending

As you can see, the first two modes are simple memory operations, while the last two perform color format conversion and/or transparency blending as needed while copying memory.

DMA2D and HAL

In most cases, using the HAL library simplifies coding and improves portability. DMA2D is the exception. The biggest problems with the HAL library are its deeply nested calls and numerous safety checks, which make it inefficient. With other peripherals the efficiency lost to the HAL library matters little. DMA2D, however, is a peripheral that exists for computation and acceleration, and its operations are called many times while drawing a single screen, so using the HAL library here seriously reduces the acceleration DMA2D can deliver.

So, most of the time, we do not use the HAL library's DMA2D functions. For efficiency, we manipulate the registers directly to get the most out of the accelerator.

Because we change the DMA2D working mode so frequently in most use cases, the graphical DMA2D configuration in CubeMX is of little use.

DMA2D usage examples

  1. Color fill

Here is a simple bar chart:

image.png

So let's think a little bit about how we can draw this.

First, we need to fill the screen with white as the background. This step cannot be skipped, otherwise whatever was previously on the screen will interfere with our drawing. The bar chart itself consists of four blue rectangles and one line segment, and the line segment can be regarded as a special rectangle with a height of 1. The drawing of this figure can therefore be decomposed into a series of “rectangle fill” operations:

Fill a rectangle the size of the screen with white
Fill the four data bars with blue
Fill a line segment of height 1 with black

The essence of drawing a rectangle of any size at any location in the canvas is to set the data of the corresponding pixel position in the memory region to the specified color. However, because framebuffer storage in memory is linear, an area that looks like a contiguous rectangle is not contiguous in memory unless the width of the rectangle is exactly the same as the width of the display area.

The figure below shows a typical memory distribution, where the numbers represent the memory address of each pixel in the frame buffer (offset from the first address, ignoring multiple bytes of a pixel), and the blue areas are the rectangles we want to fill. It can be seen that the memory addresses of the rectangular region are discontinuous.

image.png
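The discontinuity described above can be checked with a little arithmetic. The following is a minimal host-side sketch, not code from the original project; the helper names `pixel_offset` and `row_gap` are hypothetical, and offsets are counted in pixels rather than bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Linear offset (in pixels) of screen coordinate (x, y) in a framebuffer
 * that is screen_w pixels wide. */
static inline size_t pixel_offset(size_t screen_w, size_t x, size_t y)
{
    assert(x < screen_w);
    return y * screen_w + x;
}

/* Number of pixels between the end of one rectangle row and the start of
 * the next; it is 0 only when the rectangle spans the full screen width. */
static inline size_t row_gap(size_t screen_w, size_t rect_w)
{
    assert(rect_w <= screen_w);
    return screen_w - rect_w;
}
```

For a rectangle that starts at x = 80 and is 20 pixels wide on a 320-pixel-wide screen, row 80 ends at offset 25699 and row 81 starts at offset 26000, so 300 pixels that do not belong to the rectangle lie between the two rows.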

This nature of the framebuffer makes it impossible to simply use efficient operations such as memset to fill rectangular areas. In general, we will use the following double loop to fill any rectangle, where xs and ys are the screen coordinates of the top left corner of the rectangle, width and height are the width and height of the rectangle, and color is the color to fill:

for(int y = ys; y < ys + height; y++){
    for(int x = xs; x < xs + width; x++){
        framebuffer[y][x] = color;
    }
}

Although the code is simple, in actual execution a large number of CPU cycles are spent on comparisons, address calculation, increments and similar operations, while only a small fraction of the time goes to writing memory. As a result, efficiency is low.

This is where DMA2D's register-to-memory mode comes in handy. DMA2D can fill rectangular memory areas at a very high rate, even if those areas are actually discontinuous in memory.

Let’s take a look at how it works, again using the case illustrated in this figure:

image.png

First, since we are only filling memory and not copying it, we want DMA2D to work in register-to-memory mode. This is done by setting bits [17:16] of the DMA2D CR register to 11, as follows:

DMA2D->CR = 0x00030000UL;
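The magic number above is easier to read when the mode field is given names. The sketch below is a host-side illustration only: the `DEMO_MODE_*` names and the `dma2d_cr_for_mode` helper are hypothetical (they are not HAL or CMSIS symbols), and the 01 encoding for the format-conversion mode is taken from ST's reference manual rather than from this article.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the MODE field, bits [17:16] of DMA2D->CR. */
typedef enum {
    DEMO_MODE_M2M       = 0x0u, /* memory to memory                          */
    DEMO_MODE_M2M_PFC   = 0x1u, /* memory to memory with format conversion   */
    DEMO_MODE_M2M_BLEND = 0x2u, /* memory to memory with conversion + blend  */
    DEMO_MODE_R2M       = 0x3u  /* register to memory (color fill)           */
} DemoDma2dMode;

/* Build the CR value that selects a mode, leaving all other CR bits at 0. */
static inline uint32_t dma2d_cr_for_mode(DemoDma2dMode mode)
{
    assert((unsigned)mode <= 0x3u);
    return (uint32_t)mode << 16;
}
```

With these names, `dma2d_cr_for_mode(DEMO_MODE_R2M)` yields 0x00030000, the same value written to CR above.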

Then, we tell DMA2D the properties of the rectangle to fill: where the region starts, how many pixels wide it is, and how many pixels high it is.

The starting address of the region is the memory address of the pixel in the upper left corner of the rectangle (the address of the red pixel in the figure), and it is held in DMA2D's OMAR register. The width and height of the rectangle are in pixels and are held in the upper 16 bits (width) and lower 16 bits (height) of the NLR register. The code is as follows:

DMA2D->OMAR = (uint32_t)(&framebuffer[ys][xs]);            // Set the memory address of the first pixel of the fill area
DMA2D->NLR  = (uint32_t)(width << 16) | (uint16_t)height;  // Set the width and height of the rectangle

Then, because the rectangle’s address in memory is discontinuous, we tell DMA2D how many pixels to skip (that is, the length of the yellow area in the picture) after filling a row of data. This value is managed by the OOR register. There is a simple way to count the number of pixels skipped, which is to subtract the width of the rectangle from the width of the display area. The specific implementation code is as follows:

DMA2D->OOR = screenWidthPx - width; // Set the line offset, i.e. the number of pixels to skip
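To see what actually lands in the three geometry registers, the calculation can be tried out on a PC first. This is a minimal sketch under stated assumptions: the `FillGeometry` type and `fill_geometry` function are hypothetical helpers that only compute numbers and touch no hardware, and the framebuffer base address used in the example is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Values destined for OMAR, NLR and OOR (hypothetical helper type). */
typedef struct {
    uint32_t omar; /* byte address of the top-left pixel             */
    uint32_t nlr;  /* width in the upper half, height in the lower   */
    uint32_t oor;  /* pixels skipped at the end of every row         */
} FillGeometry;

static inline FillGeometry fill_geometry(uint32_t fb_base, uint32_t screen_w,
                                         uint32_t bytes_per_px,
                                         uint32_t xs, uint32_t ys,
                                         uint32_t width, uint32_t height)
{
    assert(xs + width <= screen_w);
    FillGeometry g;
    g.omar = fb_base + (ys * screen_w + xs) * bytes_per_px;
    g.nlr  = (width << 16) | (height & 0xFFFFu);
    g.oor  = screen_w - width;
    return g;
}
```

For a 20×120 rectangle at (80, 80) on a 320-pixel-wide RGB565 screen, with an assumed base address of 0xC0000000, this gives OMAR = 0xC000C8A0, NLR = 0x00140078 and OOR = 300.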

Finally, we need to tell DMA2D which color to fill with and what the color format is. These are held in the OCOLR and OPFCCR registers respectively, and the color format is given by one of the LTDC_PIXEL_FORMAT_XXX macros, as follows:

DMA2D->OCOLR  = color;       // Set the color used for filling
DMA2D->OPFCCR = pixelFormat; // Set the color format; for RGB565, for example, use the macro LTDC_PIXEL_FORMAT_RGB565

Now that everything is set up, DMA2D has all the information it needs to fill the rectangle. Next, we need to turn on the DMA2D transfer by setting bit 0 of the DMA2D CR register to 1:

DMA2D->CR |= DMA2D_CR_START; // Start the DMA2D transfer; DMA2D_CR_START is a macro with a value of 0x01

Once the DMA2D transfer has started, we just need to wait for it to finish. Bit 0 of the CR register is automatically cleared to 0 when the transfer completes, so we can wait for completion with the following code:

while (DMA2D->CR & DMA2D_CR_START) {} // Wait for the DMA2D transfer to complete

Tips0: If you are using an OS, you can enable the DMA2D transfer-complete interrupt. We can then create a semaphore, wait on it after starting the transfer, and release it in the DMA2D transfer-complete interrupt service routine. This allows the CPU to do something else while DMA2D is working instead of waiting around.

Tips1: In practice, DMA2D fills memory so quickly that an OS task switch takes longer than the transfer itself, so we still choose to busy-wait even when an OS is used :).

To keep the function general, the starting address and line offset are calculated outside the function and passed in. The complete function is as follows:

static inline void DMA2D_Fill( void * pDst, uint32_t width, uint32_t height, uint32_t lineOff, uint32_t pixelFormat, uint32_t color) {
    /* DMA2D configuration */
    DMA2D->CR     = 0x00030000UL;                                // Configure register-to-memory mode
    DMA2D->OCOLR  = color;                                       // Fill color; its format must match the color format set below
    DMA2D->OMAR   = (uint32_t)pDst;                              // Start address of the fill area
    DMA2D->OOR    = lineOff;                                     // Line offset, i.e. the pixels to skip (in pixels)
    DMA2D->OPFCCR = pixelFormat;                                 // Set the color format
    DMA2D->NLR    = (uint32_t)(width << 16) | (uint16_t)height;  // Width and height of the fill area, in pixels

    /* Start the transfer */
    DMA2D->CR |= DMA2D_CR_START;

    /* Wait for the DMA2D transfer to complete */
    while (DMA2D->CR & DMA2D_CR_START) {}
}
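Before running on the board, the parameters passed to a fill can be sanity-checked with a software model. The function below is a sketch, not part of the original project: `model_r2m_fill` is a hypothetical name, it only handles 16-bit pixels, and it imitates the "write a row, skip the line offset" behaviour described above on an ordinary array.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of a register-to-memory fill for 16-bit pixels:
 * each row is `width` pixels long and rows are `width + line_off`
 * pixels apart, which is the role the OOR register plays. */
static void model_r2m_fill(uint16_t *dst, uint32_t width, uint32_t height,
                           uint32_t line_off, uint16_t color)
{
    assert(dst != 0);
    uint32_t stride = width + line_off;
    for (uint32_t row = 0; row < height; row++) {
        for (uint32_t col = 0; col < width; col++) {
            dst[row * stride + col] = color;
        }
    }
}
```

Filling a 2×2 rectangle at (1, 1) of a 4×3 test buffer with a line offset of 4 - 2 = 2 changes only indices 5, 6, 9 and 10, which is the pattern the hardware fill is expected to produce with the same parameters.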

To make it easier to use, we wrap it in a rectangle fill function for the screen coordinate system we are using:

void FillRect(uint16_t x, uint16_t y, uint16_t w, uint16_t h, uint16_t color){
    void* pDist = &(((uint16_t*)framebuffer)[y*320 + x]);
    DMA2D_Fill(pDist, w, h, 320 - w, LTDC_PIXEL_FORMAT_RGB565, color);
}

Finally, let’s try to draw the sample diagram at the beginning of this section in code:

// Fill the background color
FillRect(0, 0, 320, 240, 0xFFFF);
// Draw the data bars
FillRect(80, 80, 20, 120, 0x001F);
FillRect(120, 100, 20, 100, 0x001F);
FillRect(160, 40, 20, 160, 0x001F);
FillRect(200, 60, 20, 140, 0x001F);
// Draw the X axis
FillRect(40, 200, 240, 1, 0x0000);
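The color constants used above (0xFFFF, 0x001F, 0x0000) are RGB565 values. As a small sketch, assuming the usual layout with red in bits [15:11], green in bits [10:5] and blue in bits [4:0], a helper such as the hypothetical `rgb565` below derives them from ordinary 8-bit channel values.

```c
#include <assert.h>
#include <stdint.h>

/* Pack 8-bit-per-channel RGB into RGB565 by dropping the low bits
 * of each channel (5 bits red, 6 bits green, 5 bits blue). */
static inline uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

With this helper, white is `rgb565(255, 255, 255)`, the blue of the bars is `rgb565(0, 0, 255)` and the black of the axis is `rgb565(0, 0, 0)`.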

Code running effect:

image.png

  2. Image display (memory copy)

Let's say we're developing a game and want to show a flickering flame on the screen. Typically, the artist draws each frame of the flame and then places all the frames in a single image asset, as shown in the picture below:

fire

We then display each frame at regular intervals to create a “flickering flame” effect on the screen.

Let's skip loading the asset file into memory and assume the asset image is already there. Now consider how to display one of its frames on the screen. In general, we first calculate the address of the frame's data in memory and then copy that data to the corresponding location in the framebuffer. The code looks something like this:

#include <string.h>   // memcpy

/**
 * Copies one frame of the asset to the corresponding location in the framebuffer
 * index is the index of the frame in the frame sequence
 */
static void General_DisplayFrameAt(uint16_t index) {
    // Macro description:
    // #define FRAME_COUNTS     25   // Number of frames
    // #define TILE_WIDTH_PIXEL 96   // Width of each frame (equal to its height)
    // #define TILE_COUNT_ROW   5    // Number of frames per row of the asset

    // Calculate the frame starting address
    uint16_t *pStart = (uint16_t *) img_fireSequenceFrame;
    pStart += (index / TILE_COUNT_ROW) * (TILE_WIDTH_PIXEL * TILE_WIDTH_PIXEL * TILE_COUNT_ROW);
    pStart += (index % TILE_COUNT_ROW) * TILE_WIDTH_PIXEL;

    // Calculate the asset line offset
    uint32_t offlineSrc = (TILE_COUNT_ROW - 1) * TILE_WIDTH_PIXEL;
    // Calculate the framebuffer line offset (320 is the screen width)
    uint32_t offlineDist = 320 - TILE_WIDTH_PIXEL;

    // Copy the data to the framebuffer
    uint16_t* pFb = (uint16_t*) framebuffer;
    for (int y = 0; y < TILE_WIDTH_PIXEL; y++) {
        memcpy(pFb, pStart, TILE_WIDTH_PIXEL * sizeof(uint16_t));
        pStart += offlineSrc + TILE_WIDTH_PIXEL;
        pFb += offlineDist + TILE_WIDTH_PIXEL;
    }
}
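The start-address calculation is the part that is easiest to get wrong, so it helps to check it in isolation. The sketch below repeats the same arithmetic on the host; the macro and function names (`TILE_SIZE_PX`, `TILES_PER_ROW`, `TILE_COUNT`, `tile_offset_px`) are hypothetical stand-ins that reuse the values 96, 5 and 25 from the macro description above.

```c
#include <assert.h>
#include <stdint.h>

#define TILE_SIZE_PX   96u  /* width and height of one frame (square)  */
#define TILES_PER_ROW  5u   /* frames per row of the asset image       */
#define TILE_COUNT     25u  /* total number of frames                  */

/* Pixel offset of frame `index` from the start of the asset image. */
static inline uint32_t tile_offset_px(uint32_t index)
{
    assert(index < TILE_COUNT);
    uint32_t tile_row = index / TILES_PER_ROW;
    uint32_t tile_col = index % TILES_PER_ROW;
    uint32_t sheet_w  = TILE_SIZE_PX * TILES_PER_ROW;   /* 480 pixels */
    return tile_row * TILE_SIZE_PX * sheet_w + tile_col * TILE_SIZE_PX;
}
```

Frame 1 starts 96 pixels into the image, frame 5 (the first frame of the second row) starts at 96 × 480 = 46080 pixels, and frame 7 starts at 46080 + 2 × 96 = 46272 pixels.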

You can see that a lot of memory copying is required to achieve this effect. In embedded systems, hardware DMA is the most efficient way to copy large amounts of data. However, hardware DMA can only move data at contiguous addresses. Here, the data to be copied is contiguous in neither the source image nor the framebuffer. This incurs additional overhead (the same problem we met in the first section) and prevents us from using hardware DMA for efficient copying.

So, while we achieved our goal, the efficiency is not high (or at least not as high as it could be).

In order to move a piece of the material image into the frame buffer as quickly as possible, let’s look at how to do this using DMA2D.

First of all, since we are copying data in memory this time, we will set the DMA2D working mode to “memory to memory mode” by setting the [17:16] bit of the CR register of DMA2D to 00. The code is as follows:

DMA2D->CR = 0x00000000UL;

Then we set the source and destination memory addresses. Unlike in the first section, the source data also has a line offset, so we need to set the line offsets of both the source and the destination:

DMA2D->FGMAR = (uint32_t)pSrc; // Source address
DMA2D->OMAR  = (uint32_t)pDst; // Destination address
DMA2D->FGOR  = OffLineSrc;     // Source line offset (pixels)
DMA2D->OOR   = OffLineDst;     // Destination line offset (pixels)

Then set the width and height of the image to copy, as well as the color format, just as in the first section:

DMA2D->FGPFCCR = pixelFormat;

DMA2D->NLR = (uint32_t)(xSize << 16) | (uint16_t)ySize;

In the same way, we turn on the transfer for DMA2D and wait for the transfer to complete:

// Start the transfer
DMA2D->CR |= DMA2D_CR_START;

// Wait for the DMA2D transfer to complete
while (DMA2D->CR & DMA2D_CR_START) {}

Finally, the function we extracted is as follows:

static void DMA2D_MemCopy(uint32_t pixelFormat, void * pSrc, void * pDst, int xSize, int ySize, int OffLineSrc, int OffLineDst)
{
    /* DMA2D configuration */
    DMA2D->CR      = 0x00000000UL;
    DMA2D->FGMAR   = (uint32_t)pSrc;
    DMA2D->OMAR    = (uint32_t)pDst;
    DMA2D->FGOR    = OffLineSrc;
    DMA2D->OOR     = OffLineDst;
    DMA2D->FGPFCCR = pixelFormat;
    DMA2D->NLR     = (uint32_t)(xSize << 16) | (uint16_t)ySize;

    /* Start the transfer */
    DMA2D->CR |= DMA2D_CR_START;

    /* Wait for the DMA2D transfer to complete */
    while (DMA2D->CR & DMA2D_CR_START) {}
}
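As with the fill, the two line offsets can be verified on a PC with a small software model before involving the hardware. This is a sketch only: `model_m2m_copy` is a hypothetical name, it handles 16-bit pixels, and it mirrors the roles of FGOR (source line offset) and OOR (destination line offset) described above.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of a memory-to-memory copy of a width x height block.
 * Source rows are width + src_line_off pixels apart (the FGOR role),
 * destination rows are width + dst_line_off pixels apart (the OOR role). */
static void model_m2m_copy(const uint16_t *src, uint16_t *dst,
                           uint32_t width, uint32_t height,
                           uint32_t src_line_off, uint32_t dst_line_off)
{
    assert(src != 0 && dst != 0);
    uint32_t src_stride = width + src_line_off;
    uint32_t dst_stride = width + dst_line_off;
    for (uint32_t row = 0; row < height; row++) {
        for (uint32_t col = 0; col < width; col++) {
            dst[row * dst_stride + col] = src[row * src_stride + col];
        }
    }
}
```

Copying the right-hand 2×2 block of a 4×2 source into position (1, 1) of a 3×3 destination uses a source line offset of 2 and a destination line offset of 1; only the four destination pixels inside the block change.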

For convenience, we wrap a function that calls it:

static void DMA2D_DisplayFrameAt(uint16_t index){
    // pDist (destination address in the framebuffer) and offlineDist
    // (320 - TILE_WIDTH_PIXEL) are assumed to be defined outside this function
    uint16_t *pStart = (uint16_t *)img_fireSequenceFrame;
    pStart += (index / TILE_COUNT_ROW) * (TILE_WIDTH_PIXEL * TILE_WIDTH_PIXEL * TILE_COUNT_ROW);
    pStart += (index % TILE_COUNT_ROW) * TILE_WIDTH_PIXEL;
    uint32_t offlineSrc = (TILE_COUNT_ROW - 1) * TILE_WIDTH_PIXEL;

    DMA2D_MemCopy(LTDC_PIXEL_FORMAT_RGB565, (void*) pStart, pDist, TILE_WIDTH_PIXEL, TILE_WIDTH_PIXEL, offlineSrc, offlineDist);
}

Then play each frame in turn. Here the frame interval is set to 50 milliseconds, and the destination address is set to the center of the framebuffer:

while(1){
    for(int i = 0; i < FRAME_COUNTS; i++){
        DMA2D_DisplayFrameAt(i);
        HAL_Delay(FRAME_TIME_INTERVAL);
    }
}

Final running effect:

Fire.gif

  3. Image fade transition

Let's say we're developing an image viewer. Switching abruptly between two images would look awkward, so we want to add a dynamic effect to the switch; a fade (fade in and out) is a very common effect that looks good.

I’ll just use these two images:

image.png

Here we need to understand the basic concept of alpha blending. Transparency blending needs a foreground and a background. The blended result is what you see when looking at the background through the foreground. If the foreground is completely opaque, the background is not visible at all; if the foreground is completely transparent, only the background is visible. If the foreground is translucent, the two are mixed according to the transparency of the foreground.

If 1 means fully transparent and 0 means fully opaque, then the blending formula is as follows, where A is the background color and B is the foreground color:

X(C) = (1 - alpha) * X(B) + alpha * X(A)

Since the color has three channels of RGB, we need to calculate all three channels, and then combine them after the calculation:

R(C) = (1 - alpha) * R(B) + alpha * R(A)

G(C) = (1 - alpha) * G(B) + alpha * G(A)

B(C) = (1 - alpha) * B(B) + alpha * B(A)

For the sake of efficiency (floating-point calculation is slow on the CPU), we do not actually use values in the 0 to 1 range. Typically transparency is stored as an 8-bit number ranging from 0 to 255. Note that the higher this number is, the more opaque the color: 255 is completely opaque and 0 is completely transparent (which is why it is called opacity). This gives the final formula:

outColor = ((int) (fgColor * alpha) + (int) (bgColor) * (256 - alpha)) >> 8;
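A quick numeric check makes the integer formula concrete. The function below is a minimal sketch for a single 8-bit channel; the name `blend_channel` is hypothetical, and the arithmetic is exactly the shift-by-8 formula given above.

```c
#include <assert.h>
#include <stdint.h>

/* Blend one 8-bit channel: alpha = 0 keeps the background, larger alpha
 * gives more weight to the foreground. The divide by 256 is done as >> 8. */
static inline uint8_t blend_channel(uint8_t fg, uint8_t bg, uint8_t alpha)
{
    return (uint8_t)((fg * alpha + bg * (256 - alpha)) >> 8);
}
```

With fg = 200 and bg = 100, alpha = 0 gives 100 and alpha = 128 gives 150. Because the weights are alpha and 256 - alpha rather than fractions of 255, alpha = 255 does not reproduce the foreground exactly: fg = 255 over bg = 0 yields 254. That small error is the price of replacing a division with a shift.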

The code that implements transparency blending for a pixel in RGB565 format:

typedef struct{
    uint16_t r:5;
    uint16_t g:6;
    uint16_t b:5;
}RGB565Struct;

static inline uint16_t AlphaBlend_RGB565_8BPP(uint16_t fg, uint16_t bg, uint8_t alpha) {
    RGB565Struct *fgColor = (RGB565Struct*) (&fg);
    RGB565Struct *bgColor = (RGB565Struct*) (&bg);
    RGB565Struct outColor;

    outColor.r = ((int) (fgColor->r * alpha) + (int) (bgColor->r) * (256 - alpha)) >> 8;
    outColor.g = ((int) (fgColor->g * alpha) + (int) (bgColor->g) * (256 - alpha)) >> 8;
    outColor.b = ((int) (fgColor->b * alpha) + (int) (bgColor->b) * (256 - alpha)) >> 8;

    return *((uint16_t*)&outColor);
}
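The bit-field version above depends on how the compiler lays out bit-fields and on casting between pointer types. As an alternative sketch (not the author's code), the same per-channel formula can be written with shifts and masks, which behaves the same on any compiler; the name `blend_rgb565` is hypothetical, and red is assumed to occupy bits [15:11].

```c
#include <assert.h>
#include <stdint.h>

/* Blend two RGB565 pixels channel by channel using shifts and masks.
 * alpha = 0 returns the background; larger alpha favours the foreground. */
static inline uint16_t blend_rgb565(uint16_t fg, uint16_t bg, uint8_t alpha)
{
    uint32_t inv = 256u - alpha;
    uint32_t r = (((fg >> 11) & 0x1Fu) * alpha + ((bg >> 11) & 0x1Fu) * inv) >> 8;
    uint32_t g = (((fg >> 5)  & 0x3Fu) * alpha + ((bg >> 5)  & 0x3Fu) * inv) >> 8;
    uint32_t b = ((fg & 0x1Fu) * alpha + (bg & 0x1Fu) * inv) >> 8;
    return (uint16_t)((r << 11) | (g << 5) | b);
}
```

Blending white (0xFFFF) over black (0x0000) at alpha = 128 gives 0x7BEF, a mid gray, and alpha = 0 returns the background unchanged.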

Now that you understand the concept of transparency blending and have implemented it for a single pixel, let's look at how to fade from one image to another.

Assuming the fade takes 30 frames, we first create a buffer in memory the size of the image. We take the first image (currently displayed) as the background and the second image (to be displayed next) as the foreground, set an opacity for the foreground, blend every pixel, and temporarily store the blended result in the buffer. After blending, the data in the buffer is copied into the framebuffer to complete the display of one frame. Then we proceed to the second frame, the third frame and so on, gradually increasing the opacity of the foreground until it is fully opaque. This completes the fade transition.
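The "gradually increase the opacity" step needs a mapping from frame number to opacity. A minimal sketch, assuming a linear ramp from 0 on the first frame to 255 on the last, is shown below; the function name `fade_opacity` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Opacity of the foreground for a given frame of the fade:
 * frame 0 maps to 0 (fully transparent) and the last frame,
 * frame_count - 1, maps to 255 (fully opaque). */
static inline uint8_t fade_opacity(uint32_t frame, uint32_t frame_count)
{
    assert(frame_count >= 2 && frame < frame_count);
    return (uint8_t)(255u * frame / (frame_count - 1u));
}
```

For a 30-frame fade this yields 0 for the first frame, 131 for frame 15 and 255 for frame 29. Dividing by `frame_count - 1` rather than `frame_count` is what makes the last frame fully opaque.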

Because each frame requires the blending of every pixel in the two images, it takes a lot of computation. It would be unwise to hand it over to the CPU, so let’s hand it over to DMA2D.

This time DMA2D's blending function is used, so we need to enable DMA2D's memory-to-memory mode with blending, which corresponds to the value 10 in bits [17:16] of the CR register:

DMA2D->CR = 0x00020000UL; // Set the operating mode to memory to memory with color blending

Then set the memory address of foreground, background and output data, data transmission offset, and the width and height of the transmitted image respectively:

DMA2D->FGMAR = (uint32_t)pFg; // Set the foreground data memory address
DMA2D->BGMAR = (uint32_t)pBg; // Set the background data memory address
DMA2D->OMAR  = (uint32_t)pDst; // Set the data output memory address

DMA2D->FGOR = offlineFg;   // Set the foreground data line offset
DMA2D->BGOR = offlineBg;   // Set the background data line offset
DMA2D->OOR  = offlineDist; // Set the data output line offset

DMA2D->NLR = (uint32_t)(xSize << 16) | (uint16_t)ySize; // Set the width and height of the image data in pixels

Next, set the color formats. Take care when setting the foreground color format: if a format with an alpha channel such as ARGB is used, the alpha channel in the color data itself affects the blended result. So here we configure the blend to ignore the foreground's own alpha channel and force a fixed opacity for the blend instead.

The background and output color formats are set as well:

DMA2D->FGPFCCR = pixelFormat            // Set the foreground color format
        | (1UL << 16)                   // Ignore the alpha channel in the foreground color data
        | ((uint32_t)opa << 24);        // Set the foreground opacity

DMA2D->BGPFCCR = pixelFormat;           // Set the background color format
DMA2D->OPFCCR  = pixelFormat;           // Set the output color format
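The value written to FGPFCCR packs three fields into one word, which is easy to verify on a PC. The sketch below is illustrative only: the helper name `fg_pfccr_fixed_alpha` is hypothetical, and the field positions in the comments (color mode in the low bits, alpha mode at bit 16, alpha value in bits [31:24]) follow the shifts used in the code above.

```c
#include <assert.h>
#include <stdint.h>

/* Compose an FGPFCCR value that selects a color format, replaces the
 * pixel's own alpha, and applies a constant opacity instead. */
static inline uint32_t fg_pfccr_fixed_alpha(uint32_t color_mode, uint8_t opa)
{
    assert(color_mode <= 0xFu);
    return color_mode               /* color format in the low bits       */
         | (1UL << 16)              /* ignore the pixel's own alpha       */
         | ((uint32_t)opa << 24);   /* constant opacity in bits [31:24]   */
}
```

With a color mode of 2 (the value of the RGB565 format macro in ST's headers, stated here as an assumption to check against your own HAL version) and an opacity of 128, the register value is 0x80010002.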

Tips0: Sometimes you do want to overlay a picture that has its own transparency channel on the background; in that case the alpha channel of the color data should not be ignored.

Tips1: In this mode we can not only blend colors but also convert the color format at the same time. The foreground, background and output color formats can each be set as needed.

Finally, start the transmission:

// Start the transfer
DMA2D->CR |= DMA2D_CR_START;

// Wait for the DMA2D transfer to complete
while (DMA2D->CR & DMA2D_CR_START) {}

The complete code is as follows:

void _DMA2D_MixColors(void* pFg, void* pBg, void* pDst,
        uint32_t offlineFg, uint32_t offlineBg, uint32_t offlineDist,
        uint16_t xSize, uint16_t ySize,
        uint32_t pixelFormat, uint8_t opa) {

    DMA2D->CR    = 0x00020000UL;            // Set the operating mode to memory to memory with color blending

    DMA2D->FGMAR = (uint32_t)pFg;           // Set the foreground data memory address
    DMA2D->BGMAR = (uint32_t)pBg;           // Set the background data memory address
    DMA2D->OMAR  = (uint32_t)pDst;          // Set the data output memory address

    DMA2D->FGOR  = offlineFg;               // Set the foreground data line offset
    DMA2D->BGOR  = offlineBg;               // Set the background data line offset
    DMA2D->OOR   = offlineDist;             // Set the data output line offset

    DMA2D->NLR   = (uint32_t)(xSize << 16) | (uint16_t)ySize; // Set the width and height of the image data (pixels)

    DMA2D->FGPFCCR = pixelFormat            // Set the foreground color format
            | (1UL << 16)                   // Ignore the alpha channel in the foreground color data
            | ((uint32_t)opa << 24);        // Set the foreground opacity

    DMA2D->BGPFCCR = pixelFormat;           // Set the background color format
    DMA2D->OPFCCR  = pixelFormat;           // Set the output color format

    /* Start the transfer */
    DMA2D->CR |= DMA2D_CR_START;

    /* Wait for the DMA2D transfer to complete */
    while (DMA2D->CR & DMA2D_CR_START) {}
}

Now write the test code. This time there is no need to wrap the function a second time:

void DMA2D_AlphaBlendDemo(){
    const uint16_t lcdXSize = 320, lcdYSize = 240;
    const uint8_t cnvFrames = 60;  // Number of frames per transition
    const uint32_t interval = 33;  // Frame interval in milliseconds
    uint32_t time = 0;

    // Calculate the memory address of the output position (image centered on the screen)
    uint16_t distX = (lcdXSize - DEMO_IMG_WIDTH) / 2;
    uint16_t distY = (lcdYSize - DEMO_IMG_HEIGHT) / 2;
    uint16_t* pFb = (uint16_t*) framebuffer;
    uint16_t* pDist = pFb + distX + distY * lcdXSize;  // One row is lcdXSize pixels long
    uint16_t offlineDist = lcdXSize - DEMO_IMG_WIDTH;

    uint8_t nextImg = 1;
    uint16_t opa = 0;
    void* pFg = 0;
    void* pBg = 0;
    while(1){
        // Swap the foreground and background images
        if(nextImg){
            pFg = (void*)img_cat;
            pBg = (void*)img_fox;
        }
        else{
            pFg = (void*)img_fox;
            pBg = (void*)img_cat;
        }

        // Perform the transition
        for(int i = 0; i < cnvFrames; i++){
            time = HAL_GetTick();
            opa = 255 * i / (cnvFrames-1);
            _DMA2D_MixColors(pFg, pBg, pDist,
                             0, 0, offlineDist,
                             DEMO_IMG_WIDTH, DEMO_IMG_HEIGHT,
                             LTDC_PIXEL_FORMAT_RGB565, opa);
            time = HAL_GetTick() - time;
            if(time < interval){
                HAL_Delay(interval - time);
            }
        }
        nextImg = !nextImg;
        HAL_Delay(5000);
    }
}

End result:

GIF.gif

Performance comparison

The three sections above each presented an embedded graphics example implemented both in the traditional way and with DMA2D. At this point some readers will surely ask: how much faster is the DMA2D implementation than the traditional approach? Let's actually measure it.

Common test conditions are as follows:

The framebuffer is placed in SDRAM, 320x240, RGB565
SDRAM operating frequency 100MHz, CL2, 16-bit bus width
MCU is the STM32H750XB, main frequency 400MHz, I-Cache and D-Cache enabled
Code and resources are in internal Flash, 64-bit AXI bus, speed 200MHz
GCC compiler (version: arm-atollic-eabi-gcc-6.3.1)

Rectangle filling

Test method:

Draw the chart from Section 1 of the previous chapter 10,000 times and record the time taken

Test results:

Drawing method             Time (-O0)    Time (-O3)
Software implementation    39641 ms      9930 ms
DMA2D                      9827 ms       9817 ms

Memory copy

Test method:

Draw 10,000 frames of the frame sequence from Section 2 of the previous chapter and record the time taken

Test results:

Drawing method             Time (-O0)    Time (-O3)
Software implementation    68787 ms      48654 ms
DMA2D                      26201 ms      26160 ms

Transparency blending

Test method:

Fade between the two images from Section 3 of the previous chapter 100 times, 30 frames each time, for a total of 3,000 frames. The blended result is output directly to the framebuffer without going through the intermediate buffer.

Test results:

Drawing method             Time (-O0)    Time (-O3)
Software implementation    20824 ms      2617 ms
DMA2D                      681 ms        681 ms

Performance test summary

As can be seen from the above test results, DMA2D has at least two advantages:

First, it is faster: in some of the tests, the DMA2D implementation is up to 30 times faster than the pure software implementation! And this is on a 400MHz STM32H750 platform with L1 cache. On an STM32F4 platform with no cache and a lower clock frequency, the difference would be even larger.

Second, performance is more stable: the results show that the DMA2D implementations are almost unaffected by the compiler optimization level, which means you get the same performance with DMA2D whether you use IAR, GCC or MDK. The same piece of code will not show vastly different performance after being ported.
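The speed-up figures behind the first point can be recomputed from the measured times in the test results. The helper below is a trivial sketch (the name `speedup` is hypothetical) that divides the software time by the DMA2D time.

```c
#include <assert.h>

/* Ratio of software time to DMA2D time; values above 1 mean DMA2D is faster. */
static inline double speedup(double software_ms, double dma2d_ms)
{
    assert(dma2d_ms > 0.0);
    return software_ms / dma2d_ms;
}
```

At -O0 this gives roughly 4.0x for rectangle filling (39641 / 9827), 2.6x for memory copy (68787 / 26201) and 30.6x for transparency blending (20824 / 681). At -O3 the ratios shrink to about 1.01x, 1.9x and 3.8x, so the "30 times" figure applies to the unoptimized blending case.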

In addition to these two results that are visible in the numbers, there is actually a third advantage: the code is simpler to write. DMA2D has few registers and is relatively intuitive, and in some cases it is more convenient to use than a software implementation.

Conclusion

The three examples in this article are all situations that I have frequently encountered in embedded graphics development myself. DMA2D has many more uses. If you are interested, refer to the relevant content in the “STM32H743 Chinese Programming Manual”; with this article as a foundation, reading it should be much easier.

Given the limits of the author's knowledge, the content of this article cannot be guaranteed 100% correct. If you find any mistake, please point it out. Thank you.

Original link: https://club.rt-thread.org/as…