Abstract: The importance of the small amount of cache built into the CPU is self-evident; size, cost, efficiency and other factors gave rise to the memory hierarchy used in today's computers.

1. Introduction

2. Structure of CPU cache

3. Cache access and consistency

4. Code design considerations

5. Conclusion

CPU clock speeds are so high that the processor runs much faster than storage media can be read and written. This mismatch between I/O speed and CPU computing speed wastes CPU resources and has to be addressed. Chip-level caching greatly reduces that latency. Advances in CPU manufacturing allow billions of transistors to be packed into a smaller space than ever before, leaving more room to place cache as close to the cores as possible.

The importance of the small amount of cache built into the CPU is self-evident: size, cost and efficiency considerations are what produced the memory hierarchy of today's computers.

1. Introduction

Computer memory has a speed-based hierarchy. CPU caches, which are blocks of static memory (SRAM) between the CPU cores and physical memory (dynamic memory, DRAM), sit at the top of that hierarchy and are the fastest. The cache is also the storage closest to the processing cores and is part of the CPU itself, usually integrated directly on the CPU chip.

CPU computation: A program is designed as a set of instructions that are ultimately run by the CPU.

When instructions and data are loaded, the CPU first reads from the nearest L1 cache; on a hit, the data is returned directly. Otherwise it reads level by level until the data is loaded from main memory or external storage, and the data is filled into each cache level along the way.

Writing data from the cache back to main memory does not happen immediately. It is typically triggered in two cases:

1. When the cache is full, lines are written back in FIFO or least-recently-used order;

2. A LOCK signal or the cache coherence protocol explicitly requires the data to be synchronized back to main memory as soon as the computation completes.

2. Structure of the CPU cache

Modern CPU cache architectures involve multiple processors, multiple cores, and multiple cache levels.

2.1 Multi-level cache structure

There are three main levels: L1, L2 and L3. The closer a cache is to the CPU, the faster it is, the smaller its capacity, and the higher its cost per byte.

The L1 cache is the fastest memory in the system. In terms of priority, the L1 cache holds the data the CPU is most likely to need to complete a particular task. It is typically up to 256KB in size, with some high-end CPUs approaching 1MB; some server chipsets, such as Intel's high-end Xeon CPUs, have 1-2MB. The L1 cache is usually split into an instruction cache and a data cache. The instruction cache holds the instructions the CPU must execute, while the data cache holds the data those instructions operate on. This reduces cache contention and improves processor performance.

The L2 cache is slower than L1 but larger, typically between 256KB and 8MB, and powerful CPUs tend to exceed this size. The L2 cache holds data that the CPU is likely to access next. In most CPUs, the L1 and L2 caches reside in the CPU core itself, and each core has its own.

The L3 cache is the largest high-speed storage unit and also the slowest, with sizes ranging from 4MB to more than 50MB. Modern CPUs reserve dedicated space on the CPU die for the L3 cache, and it takes up a large portion of that space.

2.2 Multi-processor cache structure

Computers have long since entered the multi-core era, and software runs in multi-core environments. One processor occupies one physical socket and contains multiple cores (each core has its own registers, L1 cache and L2 cache). The cores share the L3 cache, and multiple processors are connected via the QPI bus.

The L1 and L2 caches are private to a single CPU core, while the L3 cache is shared by all cores in the socket.

The L1 cache is split into a separate 32KB data cache and 32KB instruction cache, while the L2 cache acts as a buffer between L1 and the shared L3 cache. It is 256KB in size and serves primarily as an efficient memory access queue between L1 and L3, holding both data and instructions. The L3 cache contains the data from all L1 and L2 caches in the same socket. This design consumes space, but it intercepts requests that would otherwise hit L1 and L2, reducing the burden on each core's private caches.

2.3 The Cache Line

The Cache stores data in units of a fixed size, called Cache lines or Cache blocks. Given the capacity and the Cache Line size, the number of lines that can be stored is fixed. On x86, the Cache Line size matches the amount of data DDR memory can deliver in a single fetch, which is 64B. Older ARM architectures had 32B Cache lines, so it was often necessary to fill two at a time. The CPU fetches data from the Cache in bytes, the Cache fetches data from memory in lines, and memory fetches data from disk in 4KB pages.

The Cache is organized into sets, and each set contains several Cache lines (set-associative mapping).

On Linux, the cache information can be viewed with the commands below, or with the lscpu command.
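For example, the per-core cache parameters are exposed under sysfs and can be read directly (paths shown for cpu0):

cat /sys/devices/system/cpu/cpu0/cache/index0/level
cat /sys/devices/system/cpu/cpu0/cache/index0/type
cat /sys/devices/system/cpu/cpu0/cache/index0/size
cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size

lscpu prints the L1d/L1i/L2/L3 sizes in a single summary.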

3. Cache access and consistency

The access characteristics of different storage media are summarized below for reference.

Access speed: register > cache(L1~L3) > RAM > Flash > hard disk > network storage

For a 2.2GHz CPU, for example, each clock cycle is about 0.5 nanoseconds.

3.1 Reading data from memory

According to the CPU's hierarchical cache structure, data is fetched from the caches first and then from main memory. (If the data is already in a register, it is simply read and returned.)

1) If the data the CPU wants to read is in the L1 cache, the line is locked, read, and then unlocked.

2) If the data is in the L2 cache, the line is locked in L2, copied to L1, and then read from L1 as above.

3) If the data is in the L3 cache, it is copied from L3 to L2, then from L2 to L1, and finally read from L1 into the CPU.

4) If the CPU has to read from main memory, it first notifies the memory controller to occupy bus bandwidth; the memory is locked, a read request is issued, and when the response arrives the data is stored into L3, L2 and L1 in turn. The bus lock is released after the data reaches the CPU from L1.

3.2 Cache hits and latency

Because of the locality of data, the CPU often needs to read the same data repeatedly within a short period of time, and memory speeds lag far behind the CPU's processing speed, which is why the cache matters so much. When the CPU can obtain the data it needs from the cache without going to main memory, that is called a hit. L1 is very fast but has very little capacity; the probability of a hit in L1 is about 80%. L2 and L3 work the same way. As a result, only about 5%-10% of reads have to go to main memory; the rest hit in L1, L2 or L3, greatly reducing the system's response time.

Caching is designed to speed up the transfer of data between main memory and the CPU. The time required to access data from memory is called latency, with L1 having the lowest latency and being closest to the core, and L3 having the highest latency. When the cache misses, the latency is even greater because the CPU has to fetch data from the main memory.

3.3 Cache replacement policy

The data in the Cache is a copy of frequently used data in memory. When a new item must be stored after the Cache is full, an old item needs to be removed; this process is called eviction. The Cache management unit uses an algorithm, called the replacement policy, to decide which data to remove. The simplest policy is LRU. Replacement policies are usually refined during CPU design, and almost every chip uses a different one.
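As a rough illustration only (real CPUs use more elaborate and often undocumented policies), the sketch below models a single set of a hypothetical 8-way cache with plain LRU replacement; the structures and names are invented for this example.

#include <stdbool.h>
#include <stdint.h>

#define WAYS 8

/* One set of a hypothetical 8-way set-associative cache with LRU replacement. */
struct way {
    bool     valid;
    uint64_t tag;
    uint64_t last_used;            /* pseudo-timestamp used for LRU bookkeeping */
};

struct cache_set {
    struct way ways[WAYS];
    uint64_t   tick;
};

/* Returns true on a hit; on a miss, evicts the LRU (or an empty) way and fills it. */
bool access_set(struct cache_set *set, uint64_t tag)
{
    set->tick++;

    for (int i = 0; i < WAYS; i++) {
        if (set->ways[i].valid && set->ways[i].tag == tag) {
            set->ways[i].last_used = set->tick;   /* refresh recency on a hit */
            return true;
        }
    }

    /* Miss: pick the first empty way, otherwise the least recently used one. */
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!set->ways[i].valid) { victim = i; break; }
        if (set->ways[i].last_used < set->ways[victim].last_used)
            victim = i;
    }

    set->ways[victim].valid = true;               /* eviction + fill */
    set->ways[victim].tag = tag;
    set->ways[victim].last_used = set->tick;
    return false;
}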

3.4 MESI cache consistency

In a multi-CPU system, each CPU has its own local Cache, so multiple copies of the same memory address can exist in the local caches of several CPUs. For programs to execute correctly, every CPU must see the same value for the same variable; in other words, the data held in each CPU's local Cache must reflect the actual data in memory.

If a variable has a copy in both CPU0’s and CPU1’s local caches, CPU0 must somehow notify CPU1 when it changes the variable so that CPU1 can update its local Cache copy in time to keep the data in sync between the two CPUs. This synchronization between CPUs is costly.

To ensure cache consistency, modern CPUs implement a multi-core, multi-level cache coherence protocol (MESI), which works at the granularity of individual cache lines.

MESI: Modified, Exclusive, Shared, Invalid

1) States in the protocol

Each cache line is marked with one of four states (encoded in two extra bits):

M: Modified

The cache line is cached only in this CPU's cache and is dirty, i.e. inconsistent with the data in main memory. Its contents need to be written back to main memory at some point in the future, before other CPUs are allowed to read the corresponding memory location. Once written back, the line's state becomes Exclusive.

E: Exclusive

The cache line is cached only in this CPU's cache, has not been modified, and is consistent with the data in main memory. If another CPU reads the corresponding memory at any point, the line becomes Shared. Likewise, the state changes to Modified when the line's contents are modified.

S: Shared

This means the cache line may be cached by multiple CPUs, and each copy is consistent with main memory. When one CPU modifies the line, the copies in the other CPUs become invalid.

I: Invalid, the cache line is not valid (another CPU may have modified it)

2) State transitions

The figure below shows how the cache ensures its data consistency.

For example, consider a block of data that the current core wants to read: in the current core's cache the line is Invalid, while in another core it is Modified. From the other core's perspective, its line is Modified and another core wants to read the block (the dotted "Modified to Shared" transition in the figure): it first writes the modified data back to memory (before the other core's read completes) and then updates its line's state to Shared. From the current core's perspective, its line is Invalid and it wants to read the block (the solid line from Invalid to Shared): it reloads the data from memory and the line's state is updated to Shared.

For your information, the following table lists all cases from both perspectives:

3) How caches operate under MESI

A typical system has several caches (one per core) sharing the main-memory bus, and each CPU issues read and write requests on it. The purpose of the caches is to reduce how often the CPUs read and write the shared main memory.

A cache can satisfy a CPU’s read request except in the Invalid state. Cached lines in the Invalid state must be read from main memory to satisfy a CPU’s read request.

- A write request can only be carried out when the cache line is in the M or E state. If the line is in the S state, the copies in the other caches must first be invalidated (different CPUs are not allowed to modify the same cache line at the same time, even if they modify different locations within the line); this invalidation is usually done by broadcast.

- A cache may discard (invalidate) a line that is not in the M state at any time. A line in the M state must first be written back to main memory. A cache holding a line in the M state must snoop any attempt by other caches to read that line from main memory, and the read must be deferred until this cache has written the line back to main memory and changed its state to S.

- A cache holding a line in the S state must listen for requests from other caches to invalidate or take exclusive ownership of the line, and invalidate its own copy when such a request occurs.

- A cache holding a line in the E state must also snoop reads of that line from main memory by other caches; if such a read occurs, the line must change to the S state.

- The M and E states are always precise and match the true situation of the cache line. The S state may be imprecise: if another cache discards a line it held in the S state, this cache may in fact be the only one left holding the line, but it will not promote the line to the E state, because caches do not broadcast notifications when they discard a line, and a cache does not track how many copies of a line exist, so it cannot tell whether it holds the line exclusively.

In this sense, the E state is a speculative optimization: modifying a cache line in the S state requires a bus transaction to invalidate all other copies of that line, whereas modifying a line in the E state needs no bus transaction.
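To make the transitions concrete, here is a minimal, simplified sketch of how one cache line's MESI state changes in response to local and remote reads and writes. It is illustrative only, not a hardware implementation, and the event and function names are invented for this example.

/* Simplified MESI state machine for a single cache line (illustrative only). */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

mesi_t mesi_next(mesi_t state, event_t ev, int other_copies_exist)
{
    switch (ev) {
    case LOCAL_READ:
        if (state == INVALID)              /* read miss: load from memory or a peer */
            return other_copies_exist ? SHARED : EXCLUSIVE;
        return state;                      /* M, E and S all satisfy a local read */
    case LOCAL_WRITE:
        return MODIFIED;                   /* S must first invalidate peers; E/M write directly */
    case REMOTE_READ:
        if (state == MODIFIED)             /* write back first, then share */
            return SHARED;
        if (state == EXCLUSIVE)
            return SHARED;
        return state;
    case REMOTE_WRITE:
        return INVALID;                    /* another core took ownership of the line */
    }
    return state;
}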

4. Code design considerations

It helps to understand the impact of the computer memory hierarchy on application performance. If the data a program needs is in a CPU register, it can be accessed within 1 cycle when the instruction executes. If it is in the CPU cache, it takes 1-30 cycles; in main memory, 50-200 cycles; on disk, on the order of tens of thousands of cycles. In addition, cache line access patterns are something the code designer needs to pay attention to in order to avoid false-sharing scenarios. Making full use of the cache's structure and mechanisms can therefore effectively improve program performance.

4.1 Locality

Once the CPU wants to access data from memory or disk, there will be a large delay and the program performance will be significantly reduced. Therefore, we have to increase the Cache hit ratio, which is to take full advantage of the locality principle. In general, a program with good locality will run faster and perform better than a program with less locality.

Locality means that when a storage device is accessed, the data or instructions accessed tend to cluster in a contiguous region. A well-designed program usually exhibits good locality, both temporal locality and spatial locality.

1) Time locality

If a data/information item is accessed once, there is a good chance that it will be accessed again in a very short period of time. Examples include loops, recursion, repeated method calls, etc.

2) Spatial locality

A 64-byte Cache Line is loaded in a single fetch; if the data the program will access next is already in that line, the hit rate increases (no further fetch is needed). If one piece of data is accessed, data located near it is likely to be accessed soon as well. Examples include code that executes sequentially, consecutively created objects, and arrays. An array is a data structure that takes the principle of locality to the extreme.

3) Code samples

Example 1, (C language)

// array1.c
char array[10240][10240];

int main(int argc, char *argv[])
{
    int i = 0;
    int j = 0;
    for (i = 0; i < 10240; i++) {
        for (j = 0; j < 10240; j++) {
            array[i][j] = 'A';   // access by row
        }
    }
    return 0;
}

In the second program (array2.c), the highlighted assignment is changed to array[j][i] = 'A'; // access by column

Compiling and running results are as follows:

The test results show that the first program runs in 0.265 seconds and the second in 1.998 seconds, which is 7.5 times longer than the first program.

Case reference: https://www.cnblogs.com/wangh…

Results analysis

Array elements are stored in contiguous memory, and multidimensional arrays are stored row by row. When the first program accesses an element row by row, the neighbouring elements that share its cache line are loaded into the Cache along with it. When the next element is accessed, it is therefore already in the Cache and does not have to be loaded from memory. In other words, accessing the array row by row gives better spatial locality and a higher Cache hit ratio.

The second program accesses an element in a column. Elements of the size of a Cache Line adjacent to the element are also loaded into the Cache, but the next element to be accessed is not the element immediately adjacent to it, so there is a risk of another Cache miss and having to load the data from memory. In addition, although the Cache will try to store the most recently accessed data, the Cache size is limited, and when the Cache is full, some of the data will have to be replaced. This is also one of the important reasons why programs with poor spatial locality are more likely to generate Cache misses.

Example 2, (Java)

In the following code, the row and column arrays of length 16 (16 ints = 64 bytes) each fit within a Cache Line's 64-byte data block and can be loaded into the Cache in one go, so accessing the arrays has a high hit rate.

public int run(int[] row, int[] column) {
    int sum = 0;
    for (int i = 0; i < 16; i++) {
        sum += row[i] * column[i];
    }
    return sum;
}

The variable i exhibits temporal locality: it is frequently used as a counter, is kept in a register, and is accessed from the register each time rather than from the cache or main memory.

4.2 Lock contention on cache lines

In a multi-processor system, a cache coherence protocol keeps each processor's cache consistent. Each processor snoops the data propagated on the bus to check whether its own cached value is stale; when a processor finds that the memory address backing one of its cache lines has been modified, it marks that cache line as invalid, and the next time it accesses the data it is forced to re-read it from system memory into its cache.

When multiple threads access the same cache line, for example when one thread locks the line and operates on it, the other threads cannot operate on that line. This situation should be avoided as much as possible in program design.
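For example (a minimal sketch; the thread and iteration counts are arbitrary), when several threads perform atomic updates on one shared variable, they all contend for the single cache line that holds it, and the updates are effectively serialized:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define THREADS 4

atomic_long shared_counter;                     /* one cache line, fought over by every thread */

static void *worker(void *arg)
{
    for (long i = 0; i < 10000000; i++)
        atomic_fetch_add(&shared_counter, 1);   /* each add takes the line in Modified state */
    return NULL;
}

int main(void)
{
    pthread_t tid[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(tid[i], NULL);
    printf("counter = %ld\n", atomic_load(&shared_counter));
    return 0;
}

Giving each thread its own counter and merging the results at the end removes this contention (see section 4.3).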

4.3 Avoiding false sharing

Contiguous, compact memory allocation generally delivers high performance, but not always. False sharing is a silent performance killer. False Sharing occurs when threads running on different CPUs simultaneously modify data that resides on the same Cache Line. Write contention on cache lines is the most important factor limiting the scalability of parallel threads in SMP systems, and in general it is difficult to tell from the code whether false sharing is occurring.

Each CPU modifies a different variable, but because these variables sit next to each other in memory, they share the same Cache Line. When one CPU modifies the line, it invalidates the other CPU's local copy, causing a Cache miss and a reload of the modified value from memory. If multiple threads frequently modify data in the same Cache Line, the result is a large number of Cache misses and a significant drop in performance.

The following diagram shows two threads with different cores updating different information items on the same cache row:

The figure above illustrates the false sharing problem. The thread running on Core1 wants to update variable X, while the thread running on Core2 wants to update variable Y, but both variables live in the same cache line. Each thread has to compete for ownership of the line to update its variable. If Core1 takes ownership, the caching subsystem invalidates the corresponding line in Core2; when Core2 then takes ownership and performs its update, Core1's copy is invalidated. Bouncing the line back and forth through the L3 cache hurts performance considerably, and the problem is even worse if the competing cores sit in different sockets, because the traffic must also cross the inter-socket interconnect.

1) How to avoid it

- Increase the spacing between array elements so that elements accessed by different threads fall in different cache lines, trading space for time (see the sketch after this list);

- Create a thread-local copy of each global array element a thread works on, and write the results back to the global array when it finishes.
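A minimal sketch of the first approach (the struct name, the 64-byte line size and the counts are assumptions for illustration): each thread gets its own array slot, padded out to a full cache line so that neighbouring slots never share a line.

#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64
#define THREADS    4

/* Each counter occupies its own cache line, so the threads do not false-share. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
} __attribute__((aligned(CACHE_LINE)));

static struct padded_counter counters[THREADS];

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    for (long i = 0; i < 100000000; i++)
        counters[id].value++;          /* touches only this thread's cache line */
    return NULL;
}

int main(void)
{
    pthread_t tid[THREADS];
    for (long i = 0; i < THREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < THREADS; i++)
        pthread_join(tid[i], NULL);
    printf("first counter = %ld\n", counters[0].value);
    return 0;
}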

2) Code examples

Example 3, (Java)

From a code design perspective, consider which fields in a class are constant, which change frequently, which change completely independently of each other, and which change together. In one business scenario, the following object has several such characteristics:

public class Data{

    long modifyTime;

    boolean flag;

    long createTime;

    char key;

    int value;

}

- When the value field changes, modifyTime always changes with it;

- The createTime and key fields never change after creation;

- flag also changes frequently, but independently of modifyTime and value.

When this object is accessed by multiple threads at the same time, then from the Cache's point of view, if we do nothing about it, all of the Data object's fields will most likely be loaded into a single L1 Cache Line. Under highly concurrent access, the following problem occurs:

As shown in the figure above, every time value changes, the corresponding Cache lines on the other CPUs are invalidated according to the MESI protocol. When another processor then wants to access data that did not change (key and createTime), it must fetch it from memory again, increasing the cost of data access.

Effective Padding

The correct approach is to group the object's fields: put fields that change together in one group, fields that change independently of others in another group, and fields that never change in a third group. That way, a change to the object does not force all fields to be reloaded into the cache, which improves read efficiency. Before JDK 1.8 it was common to insert long padding fields between the groups, so that the bytes of each group of fields, plus the padding fields before and after it, add up to no less than one cache line.

public class DataPadding {
    long a1, a2, a3, a4, a5, a6, a7, a8; // prevents false sharing with the previous object
    int value;
    long modifyTime;
    long b1, b2, b3, b4, b5, b6, b7, b8; // separates the unrelated flag field
    boolean flag;
    long c1, c2, c3, c4, c5, c6, c7, c8; // separates the fields that never change
    long createTime;
    char key;
    long d1, d2, d3, d4, d5, d6, d7, d8; // prevents false sharing with the next object
}

The cache layout after taking the above measures is shown below:

In Java 8, byte padding to avoid false sharing is supported directly: the @Contended annotation (in sun.misc) marks fields so that the JVM pads them automatically, and the JVM flag -XX:-RestrictContended must be set to allow its use outside the JDK.

Example 4, (C language)

// thread1.c
#include <stdio.h>
#include <pthread.h>

struct {
    int a;
    // char padding[64];   /* uncomment this line to get thread2.c */
    int b;
} data;

void *thread_1(void *arg)
{
    int i = 0;
    for (i = 0; i < 1000000000; i++) {
        data.a = 0;
    }
    return NULL;
}

void *thread_2(void *arg)
{
    int i = 0;
    for (i = 0; i < 1000000000; i++) {
        data.b = 0;
    }
    return NULL;
}

Uncommenting the char padding[64]; line in thread1.c gives the second program, thread2.c.

The main() function is simple: it creates the two threads and runs them, as follows:

pthread_t id1;
int ret = pthread_create(&id1, NULL, thread_1, NULL);

Compiling and running results are as follows:

The first program took three times as long as the second program

Case reference: https://www.cnblogs.com/wangh…

Results analysis

This example demonstrates false sharing of a Cache Line. The only difference between the two programs is that the second one has a 64-byte character array between fields a and b. In the first program, a and b are on the same Cache Line; when the two threads modify the two fields at the same time, the false sharing problem is triggered, causing a large number of Cache misses and degrading performance.

In the second program, we add a 64-byte array between fields A and B, ensuring that they are on different Cache lines. In this way, even if two threads modify the two fields at the same time, the two cache lines will not affect each other. The cache hit rate is very high, and the performance of the program will be greatly improved.

Example 5, (C language)

When designing data structures, try to separate read-only data from read-write data, and try to combine data accessed at the same time. This allows the CPU to read in the required data at once. For example, the following data structure is not good.

struct __a {
    int id;          // rarely changes
    int factor;      // changes frequently
    char name[64];   // rarely changes
    int value;       // changes frequently
};

On x86, you can try adjusting it as follows:

#define CACHE_LINE_SIZE 64   // cache line size

struct __a {
    int id;          // rarely changes
    char name[64];   // rarely changes
    char __align[CACHE_LINE_SIZE - (sizeof(int) + 64 * sizeof(char)) % CACHE_LINE_SIZE];
    int factor;      // changes frequently
    int value;       // changes frequently
    char __align2[CACHE_LINE_SIZE - 2 * sizeof(int) % CACHE_LINE_SIZE];
};

The padding expression CACHE_LINE_SIZE - (sizeof(int) + 64 * sizeof(char)) % CACHE_LINE_SIZE looks awkward. CACHE_LINE_SIZE is the cache line size (64 bytes here), and the __align arrays perform explicit padding so that each group of fields is padded out to a whole number of cache lines, keeping the read-mostly fields (id, name) and the frequently modified fields (factor, value) on different cache lines.

4.4 Cache and memory alignment

1) Byte alignment

__attribute__((packed)) tells the compiler to drop the default alignment padding when laying out the structure and to pack it according to the actual number of bytes occupied; this is GCC-specific syntax.

__attribute__((aligned(n))) indicates that variables defined are n-byte aligned;

struct B { char b; int a; short c; };   // aligned to 4 bytes by default

Its members occupy only 7 bytes (1 + 4 + 2), yet sizeof(struct B) is 12.

The details of byte alignment depend on the compiler’s implementation, but in general, three criteria are met:

1) The starting address of a struct variable is divisible by the size of its widest primitive-type member;

2) The offset of each member relative to the start of the struct is a multiple of that member's size; if necessary, the compiler inserts internal padding bytes between members;

3) The total size of the struct is a multiple of the size of its widest primitive-type member; if necessary, the compiler adds trailing padding after the last member.
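These rules can be checked directly with offsetof and sizeof; the output in the comment is what a typical x86-64 GCC build prints for the struct B above, though the exact padding is implementation-defined.

#include <stdio.h>
#include <stddef.h>

struct B { char b; int a; short c; };

int main(void)
{
    printf("b at %zu, a at %zu, c at %zu, sizeof = %zu\n",
           offsetof(struct B, b), offsetof(struct B, a),
           offsetof(struct B, c), sizeof(struct B));
    /* Typically prints: b at 0, a at 4, c at 8, sizeof = 12
     * - 3 padding bytes after b so that a is 4-byte aligned
     * - 2 trailing padding bytes so the total is a multiple of 4 */
    return 0;
}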

2) Cache line alignment

Data that spans two cache lines requires two loads or two stores. If a data structure is cache-line aligned, a read or a write can potentially be saved. Aligning the starting address of a data structure to a cache line may waste memory (especially for data structures such as arrays), so there is a trade-off between space and time. The ordinary malloc() function, for example, returns an address that is already 8-byte aligned in order to give most programs good performance.

In C, to avoid false sharing, the compiler pads and aligns structures automatically; ideally the alignment is the cache line size. In general, a structure instance is aligned to its widest member, because that is the easiest way for the compiler to ensure all members are self-aligned for fast access.

For static alignment, declare the structure with the aligned attribute:

struct syn_str {
    int s_variable;
} __attribute__((aligned(CACHE_LINE_SIZE)));   // CACHE_LINE_SIZE as defined above, e.g. 64

Example 6, (C language)

struct syn_str { int s_variable; };

void *p = malloc(sizeof(struct syn_str) + cache_line);
struct syn_str *align_p =
    (struct syn_str *)(((uintptr_t)p + (cache_line - 1)) & ~(uintptr_t)(cache_line - 1));
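On current systems the manual rounding above can also be replaced by the standard aligned allocators; a minimal sketch, assuming a 64-byte cache line:

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

struct syn_str { int s_variable; };

int main(void)
{
    struct syn_str *p = NULL;

    /* posix_memalign: alignment must be a power of two and a multiple of sizeof(void *) */
    if (posix_memalign((void **)&p, 64, sizeof(struct syn_str)) != 0)
        return 1;
    p->s_variable = 0;
    free(p);

    /* C11 aligned_alloc: the requested size should be a multiple of the alignment */
    struct syn_str *q = aligned_alloc(64, 64);
    if (q != NULL) {
        q->s_variable = 0;
        free(q);
    }
    return 0;
}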

4.5 CPU branching prediction

Code is laid out sequentially in memory and can be fetched sequentially, which improves cache hits. For branching code, if the code that follows the branch statement has a higher probability of being executed, the jump can be avoided. CPUs generally prefetch instructions, so this increases the prefetch hit rate. Static branch prediction uses hints such as the likely/unlikely macros, which require compiler support. Many modern CPUs internally record the results of executed branch instructions (a branch cache), so static branch prediction matters less than it used to.
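The likely/unlikely hints mentioned above are typically thin wrappers around GCC's __builtin_expect; the Linux kernel defines them roughly as follows (reproduced here as a reference sketch, with a hypothetical process() function for usage):

#include <stddef.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* The error path is marked as unlikely, so the hot path falls through sequentially. */
int process(int *ptr)
{
    if (unlikely(ptr == NULL))
        return -1;
    return *ptr + 1;
}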

Example 7, (C language)

int testfun(int x)
{
    if (__builtin_expect(x, 0)) {
        /* we tell the compiler the "else" branch is more probable */
        x = 5;
        x = x * x;
    } else {
        x = 6;
    }
    return x;
}

In this example, it is more likely that x will be 0. After compiling, look at the assembly instruction and the results are as follows:

Disassembly of section .text:

00000000 <testfun>:
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   8b 45 08                mov    0x8(%ebp),%eax
   6:   85 c0                   test   %eax,%eax
   8:   75 07                   jne    11 <testfun+0x11>
   a:   b8 06 00 00 00          mov    $0x6,%eax
   f:   c9                      leave
  10:   c3                      ret
  11:   b8 19 00 00 00          mov    $0x19,%eax
  16:   eb f7                   jmp    f <testfun+0xf>

As you can see, the compiler uses the jne instruction, and the code for the else block follows it immediately:

8:  75 07              jne   11 <testfun+0x11>
a:   b8 06 00 00 00          mov   $0x6,%eax

4.6 Monitoring the cache hit rate

The program design should pursue the better utilization of the CPU cache, to reduce the inefficiency of reading data from memory. Cache hit ratio is a metric you usually need to look at when your program is running.

Monitoring method (Linux): query the number of CPU cache misses and cache references, and compute the cache hit rate:

perf stat -e cache-references -e cache-misses
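For example (the binary name is just a placeholder), running the row-major program from Example 1 under perf:

perf stat -e cache-references -e cache-misses ./array1

The hit rate is then 1 - cache-misses / cache-references.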

4.7 Summary

When a program reads data from memory, it does not fetch only the bytes it needs; it loads a contiguous block of memory into the cache, sized according to the cache line. The following optimization patterns can be used as a reference in program design.

- When traversing a collection, consider using an array, which occupies contiguous memory;

- Define fields with the smallest type that suffices; if an int is enough, do not use a long, so that more data fits into the cache at a time;

- If possible, make the size of an object or structure an integer multiple of the cache line size. With 64-byte cache lines, reading a 50-byte object may require two loads from memory in the worst case, and a 70-byte object may require three;

- Multiple fields of the same object or structure may sit in the same cache line and cause false sharing. Group fields by whether they change and whether their changes are independent or related, and align to cache lines, to avoid cache invalidation and mutual interference under multi-threaded, highly concurrent access;

- The CPU performs branch prediction; with if/else, case/when and similar conditional logic, arranging the likely path so that execution stays sequential effectively improves cache hits.

5. Conclusion

Beyond the examples described in this section, the impact of CPU caches on program performance can be seen everywhere in a system.
