1. Introduction

When the M1 chip was announced, I felt I had to say something, and I ended up writing more than 4,000 words about x86 and ARM. Given the length, I split the article into two parts.

So without further ado, in this article we'll talk about what exactly makes this chip different, and why so many people say Apple "isn't fighting fair."

Before that, I would like to make one thing clear: I am just a student majoring in computer science, and I have never claimed to be an expert or scholar. I write this blog to do some simple science popularization and share what I know. If there are any mistakes in the article, please point them out.

2. M1 chip ≠ CPU

First, let's clear up a common misconception: the M1 chip is not a CPU. It is a SoC designed for the Mac, and the CPU is just one component of it.

SoC stands for System on a Chip: an integrated circuit that puts a complete computer or other electronic system onto a single chip. A SoC typically integrates processors (CPU, GPU, etc.), a baseband, various interface-control modules, interconnect buses, and so on; mobile phone chips are the typical example. For instance, a CPU company may license its CPU design to other companies, which then add their own peripheral controllers around that CPU — the result is a SoC.

So, to put it simply: a SoC is a chip that integrates a CPU, a GPU, and other components. Do not mistake the M1 chip for a mere CPU.

There is a picture on Apple's website illustrating the components of the M1 chip. As shown below, Apple calls this design the Unified Memory Architecture: the CPU, GPU, Neural Engine, cache, and DRAM are all connected via the Fabric high-speed bus. The strength of the M1 chip therefore does not come from a powerful CPU alone, but from many powerful components working together under Apple's excellent design.

3. Unified Memory Architecture (UMA)

From the previous section, we know that the M1's power cannot come from a powerful CPU alone. After all, Apple cannot break the laws of physics and multiply CPU performance several times over through design alone. Apple clearly uses other tricks to overtake on the curve, and the unified memory architecture is one of them.

As we know, when a processor works on a task, all it really does is fetch things and compute things — the "receiving instructions + calculating data" mentioned in the previous article.

As in the previous article, let's use the construction-worker analogy: fetching things is moving bricks, and computing things is laying bricks. No matter how fast and well you lay bricks, if the bricks arrive slowly, the wall will not go up quickly. Conversely, if bricks are laid slowly but delivered quickly, bricks pile up.

So ideally, bricks are moved at the same speed as they are laid, and neither side waits for the other. The same is true for CPUs: mismatched speeds waste performance.
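To make the bottleneck concrete, here is a trivial Python sketch (the rates are made up, purely for illustration) showing that overall throughput is capped by the slower of the two stages:

```python
# Toy model of the fetch/compute mismatch: the slower stage sets the pace.
def throughput(fetch_rate: float, compute_rate: float) -> float:
    """Bricks per second actually laid."""
    return min(fetch_rate, compute_rate)

print(throughput(fetch_rate=2.0, compute_rate=10.0))  # 2.0 — the bricklayer starves
print(throughput(fetch_rate=10.0, compute_rate=2.0))  # 2.0 — bricks pile up
print(throughput(fetch_rate=5.0, compute_rate=5.0))   # 5.0 — balanced, no waiting
```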

So, to solve the above problem, Apple provided a solution: unified memory architecture.

3.1 What does UMA do?

So what does UMA do?

Our computers contain many processing units (PUs): the Central Processing Unit (CPU), the Graphics Processing Unit (GPU), and the Neural Processing Unit (NPU). They all need to fetch data and compute on it, but before UMA, data could only be distributed through the CPU, which had to fetch it from memory first. Obviously, this way of working is inefficient; and since different processors handle data at different speeds, some hardware performance is sacrificed just to keep their timing in sync.

Moreover, each processing unit is an independent entity with its own packet format. Different PUs speak different "languages" to one another, and unifying the data format takes time, which hurts communication efficiency. It is like people from different countries working together, each speaking their own language through interpreters — clearly inefficient.

Of course, these are just simple examples, and the real situation is more complicated. Even so, we can see there are many steps that shouldn't exist and that take a lot of time. To solve these problems, Apple came up with several solutions.

3.1.1 PUs directly access memory

Before UMA, data had to be fetched from memory and distributed by the CPU first, as shown in the following figure 👇

With UMA, these processing units can access memory directly; there is no need to fetch data through the CPU, as shown below 👇

By doing this, a PU no longer needs to synchronize its pace with the CPU, and no longer needs to go through the CPU for every piece of data, which saves a lot of time.
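To see why this saves time, here is a minimal Python sketch (not real driver code — the function names and data are hypothetical) contrasting the two data paths: before UMA, the CPU stages a private copy for the accelerator; with UMA, every unit reads the same memory:

```python
memory = {"frame": b"...pixel data..."}  # stand-in for shared system memory

def process(data: bytes) -> int:
    return len(data)  # stand-in for real GPU work

def gpu_work_before_uma() -> int:
    staging = dict(memory)            # CPU first copies the data out for the GPU
    return process(staging["frame"])  # GPU works on its private copy

def gpu_work_with_uma() -> int:
    return process(memory["frame"])   # GPU reads the shared memory directly

print(gpu_work_before_uma(), gpu_work_with_uma())
```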

3.1.2 Apple-designed package

Direct access solves the timing problem, but the communication problem between processing units remains. Apple's answer to that is the Apple-designed package.

With the Apple-designed package, all units process data in a unified packet format, so communication between them needs no translation. Even if everything said were "abba abba," each unit would understand what was meant, which saves more time.
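As a loose illustration (the packet layout below is entirely hypothetical, not Apple's actual format), a shared format turns a hand-off into a simple pass, with no conversion step in the middle:

```python
from dataclasses import dataclass

@dataclass
class Packet:
    source: str     # which unit produced the data
    payload: bytes  # same layout no matter which unit reads it

def hand_off(packet: Packet, to: str) -> Packet:
    # With one shared format, there is no translate_cpu_to_gpu() step here.
    return packet

p = Packet(source="CPU", payload=b"\x01\x02")
print(hand_off(p, to="GPU"))
```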

3.1.3 High integration

Both the teardown photos and Apple's diagrams show that, thanks to this high level of integration, the memory sits directly next to the processor. This greatly reduces the physical distance between memory and processor, so fetching data is naturally faster.

Do not underestimate this reduction in physical distance. Today's computers follow the von Neumann architecture, and one of its biggest bottlenecks is this: as memory capacity grows exponentially, the data-transfer bandwidth between CPU and memory becomes the limiting factor, and one reason is that the physical distance between them is too large. CPUs and memory keep getting faster, but the distance between them doesn't change — and neither does the speed of light.

We can do a simple calculation. The i9-7980XE is an 18-core, 36-thread consumer CPU with a maximum boost frequency of 4.4 GHz. Assume the CPU executes one instruction per clock cycle. One cycle then takes about 0.000000000227 seconds, or roughly 0.227 ns (nanoseconds). In that time, light travels about 0.068 meters — roughly 7 centimeters. So if the distance between the CPU and memory exceeded 7 cm, the CPU would have to wait extra time for every access. And that is just one instruction at a time — what about billions?
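You can reproduce these back-of-the-envelope numbers yourself:

```python
# Rough numbers from the paragraph above: how far light travels in one cycle.
FREQ_HZ = 4.4e9              # i9-7980XE maximum boost clock
C = 299_792_458              # speed of light, m/s

cycle = 1 / FREQ_HZ
print(f"{cycle * 1e9:.3f} ns")      # ≈ 0.227 ns per clock cycle
print(f"{cycle * C * 100:.1f} cm")  # ≈ 6.8 cm of light travel per cycle
```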

Therefore, reducing the physical distance naturally lets the CPU get data faster, and this is a key point of the M1 chip's performance improvement.

3.2 Large Cache

All of the above only makes it faster for the CPU to fetch data. But the CPU-memory bandwidth becomes a bottleneck not just because of physical distance; the CPU is simply too fast for memory to keep up. Therefore, frequently used data is kept in a cache, so the computer doesn't have to ask memory for it every time — let alone the hard disk.

3.2.1 What is a cache?

A cache is a data-exchange buffer: a temporary place to store frequently used data. Caching is used in many places. For example, when a user queries data, the system first looks in the cache and returns the result directly if it is found; only if it is not found does the system query the database.

The CPU has the same kind of component, designed to reduce the average time the processor spends accessing memory. It works like this: when the processor makes a memory request, it first checks whether the requested data is in the cache. If it is (a hit), the data is returned without touching memory; if not (a miss), the data is loaded from memory into the cache and then returned to the processor.
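In code, that hit/miss logic looks roughly like this (a minimal Python sketch, with dicts standing in for both the cache and main memory):

```python
cache = {}                        # small and fast
main_memory = {"x": 42, "y": 7}   # large and slow (stand-in)

def load(address: str) -> int:
    if address in cache:          # hit: return straight from the cache
        return cache[address]
    value = main_memory[address]  # miss: go to memory...
    cache[address] = value        # ...and fill the cache on the way back
    return value

print(load("x"))  # miss, fills the cache
print(load("x"))  # hit
```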

Back to the brick analogy to deepen the idea: if I am laying bricks, it doesn't help me much that the brick-mover is fast if he stands far away — I'd still wait between deliveries. So instead, he keeps a pile of bricks at my feet (the cache). Now I grab each brick from the pile at my feet instead of running to the brick stack every time.

So caches exist mainly to compensate for the gap between CPU and memory read/write speeds, and in theory, the bigger the cache, the better.

3.2.2 Cache design of the M1 chip

The M1 chip is designed the same way, but Apple gave it a huge cache. How big? Let's do a simple comparison. (The following data comes from Wikipedia and CPU-Z.)

L1 means level 1 cache, and L2 means level 2 cache — that is, the cache behind the level 1 cache. The L1 cache is further split into an L1 data cache (D-cache, L1d) and an L1 instruction cache (I-cache, L1i), used respectively for storing data and for storing instructions to be decoded and executed; the CPU can access both at the same time. This reduces contention for the cache among multiple cores and threads and improves processor efficiency. On many CPUs, L1i and L1d have the same capacity.

The comparison is particularly striking for the L2 cache: the M1's L2 is shared, but at 16 MB it is still far larger than the i9-10900K's L2. The i9-10900K does have a 20 MB L3, but that is only 4 MB more than the M1's L2, and L3 is much slower than L2.

Also, don't forget that thanks to Apple's high level of integration, the DRAM and the processor are connected directly via the Fabric high-speed bus, which makes the unified memory behave almost like a large L3 cache. Apple's strategy of sacrificing expandability for throughput brings higher bandwidth and lower latency to the M1 chip. Of course, a bigger cache is not always better: for one, large caches are hard to build; for another, hit ratio is a key metric of cache performance, and an oversized cache brings higher lookup latency and diminishing returns in hit rate, at which point it is no longer worth the cost.
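The diminishing-returns point is easy to see in a toy simulation (an assumed skewed access pattern and a textbook LRU eviction policy — nothing measured on real hardware): hit rate rises with cache size, but each doubling of size buys less than the one before:

```python
import random
from collections import OrderedDict

def hit_rate(cache_size: int, trace: list) -> float:
    cache, hits = OrderedDict(), 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace)

random.seed(0)
# Skewed (Zipf-like) trace: a few "hot" addresses dominate, as in real programs.
trace = random.choices(range(100), weights=[1 / (i + 1) for i in range(100)], k=10_000)
for size in (8, 16, 32, 64):
    print(f"cache size {size:3d}: hit rate {hit_rate(size, trace):.2f}")
```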

In fact, the reason the M1 can cram in such a large cache has a lot to do with its manufacturing process. Compared with 10 nm and 14 nm chips, the M1 is made on TSMC's most advanced 5 nm process. The smaller the transistor, the more transistors fit into a unit of area, which allowed Apple to design a larger cache for the M1 chip. We'll come back to this in the next section.

Let's go back to what we said at the beginning: "the unified memory architecture connects the CPU, GPU, Neural Engine, cache, and DRAM together via the Fabric high-speed bus." This is not just about wiring units together; it is the payoff of Apple's years of experience with mobile SoCs. It is Apple's highlight moment.

Therefore, even if the CPU inside the M1 chip had no raw performance improvement and were not the most powerful, the UMA design alone would still improve the M1's overall performance.

Besides, who says the M1's CPU is not good?

4. Process & number of transistors

The M1 chip is introduced on Apple’s official website as follows:

M1 is also Apple's first PC chip built using the advanced 5nm process, and it packs a staggering 16 billion transistors, the most of any Apple chip.

There are two important numbers here. The first is the 5-nanometer process just mentioned, and the second is 16 billion transistors.

4.1 What does the 5 nm process refer to?

When reading articles about chips, we often see terms like 5 nm, 7 nm, and 14 nm. For example, Huawei's "last" Kirin chip, the Kirin 9000, is built on a 5 nm process. So what does this 5 nm refer to?

To be honest, this topic runs deep, and I don't work in the semiconductor field, so it is hard to explain fully; I can only give a brief sketch here.

To borrow a diagram from Zhihu: in the transistor structure shown above, current flows from the Source to the Drain, and the Gate acts as a gate controlling whether current can flow between the two ends. Some current is lost as it passes through, and the gate width determines how much — this loss shows up as the heat and power consumption familiar from phones. The narrower the gate, the lower the loss. Historically, the minimum gate width (gate length) was the "XX nm" in the process name, though on modern nodes the figure is more of a marketing label than a physical measurement.

In simple terms, the narrower the gate, the lower the current leakage and, in general, the lower the power consumption. At the macro level, smaller transistors can be packed closer together, more of them fit into the same area, and overall computing performance rises.
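The classic dynamic-power relation P ≈ C·V²·f gives a feel for why shrinking transistors (lower capacitance, lower operating voltage) cuts power. The numbers below are illustrative placeholders, not measurements of any real chip:

```python
def dynamic_power(c_farads: float, v_volts: float, f_hz: float) -> float:
    """Classic switching-power estimate: P ≈ C * V^2 * f."""
    return c_farads * v_volts**2 * f_hz

old = dynamic_power(c_farads=1.0e-9, v_volts=1.2, f_hz=3e9)  # made-up "old node"
new = dynamic_power(c_farads=0.7e-9, v_volts=1.0, f_hz=3e9)  # made-up "new node"
print(f"{new / old:.2f}x the power at the same clock")       # ≈ 0.49x
```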

AMD's recent run of "AMD, yes!" moments has a lot to do with its process improvements, and the M1 chip uses the most advanced 5 nm process currently on the market, so strong performance is to be expected.

4.2 Why do more transistors improve performance?

A transistor can be thought of as a tiny switch with two states: on and off. Treat on as 1 and off as 0, and each transistor switching on or off provides one bit of data: a 0 or a 1. Any data can be represented by a long enough sequence of 0s and 1s. This is why information in the electronic age is called "digital": all information is turned into numbers, and numbers are what computers can process — computers cannot process human information directly. That is why computers represent data in binary.
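A few lines of Python make the switch-to-number mapping concrete:

```python
# Each bit is one "switch": n switches can represent 2**n distinct values.
for n in (1, 2, 8):
    print(n, "bits ->", 2 ** n, "distinct values")

x = 42
print(bin(x))            # '0b101010' — 42 as a row of on/off switches
print(int("101010", 2))  # 42 — and back again
```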

So understand this: it was this two-state characteristic of circuits that led earlier generations to choose binary as the language of machines — binary wasn't chosen simply because it is simple. Two small questions: why do humans use the decimal system? Are there other examples of number bases in your life? Feel free to leave your thoughts in the comments section.

A single transistor can represent only one 0 or one 1 at a time. What about a whole bunch of transistors working at once?

Think of the chip as a huge factory of switches: each transistor is a switch, off for 0 and on for 1. The more transistors, the more switches, and the more circuit paths available for the same problem. It is like the parallel circuits from junior-high physics: the more branches, the more current can flow at once. Likewise, the more transistors a CPU has, the more current paths can operate per unit of time, which at the macro level means more data processed simultaneously — and a faster machine.
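Here is a loose analogy in Python ("lanes" stand in for parallel circuit paths; real CPUs parallelize very differently): doing the same total work four elements at a time takes a quarter of the steps:

```python
data = list(range(1_000))

# One lane: one element per step -> 1000 steps.
total, steps = 0, 0
for x in data:
    total += x
    steps += 1
print(total, steps)

# Four lanes: four elements per step -> 250 steps of the "outer clock".
total, steps = 0, 0
for i in range(0, len(data), 4):
    total += sum(data[i:i + 4])  # these four adds happen "side by side"
    steps += 1
print(total, steps)
```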

That said, more transistors do not automatically mean better chip performance. Relatively speaking, more transistors mean more design headroom; the rest depends on whether the manufacturer can make good use of that space.

5. Is M1 really perfect?

So here's the question: with so many advantages — the UMA architecture and the most advanced process — is the M1 a perfect chip? I don't think so.

5.1 Scalability

With the introduction above, you now have some understanding of the M1's unified memory architecture, and you know how much such an architecture helps performance.

But soldering memory directly onto the SoC this way makes it impossible for users to expand their hardware later.

In addition, although the M1's GPU is very strong for integrated graphics, it still lags far behind desktop discrete GPUs; the gap is simply too large for users who want to drive high-end monitors or play demanding games. External graphics cards are not an option either — though I don't think many people game on a Mac anyway.

Of course the Mac can play games. I'll leave that thread hanging here; next time we'll peek into the future — cloud gaming.

5.2 Compatibility

Apple's move from x86 to ARM was long in the making. To keep users from worrying that the app ecosystem would change drastically, Apple offered three paths: Universal apps, Rosetta 2-translated apps, and native ARM apps. A Universal app is a single build that runs on both ARM and x86. Some developers are already switching their software to Universal; for example, Adobe Lightroom and Photoshop are due for updates next year. Here I have to marvel at Apple's pull: once Apple silicon arrived, the major vendors followed up quickly — the company next door must be green with envy.

If an app has no Universal build yet, it can be translated by Rosetta 2: natively compiled x86 applications are translated to run directly on the ARM platform — with some performance loss, but greatly improved compatibility. Judging from the compatibility tests so far, Rosetta 2 is quite polished, unlike the half-finished equivalent from the company next door.
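If you're curious whether a given process is running under Rosetta 2, Apple documents the sysctl key sysctl.proc_translated for exactly this purpose. A small Python sketch that shells out to the standard macOS sysctl tool:

```python
import subprocess

def running_under_rosetta() -> bool:
    """1 = translated x86_64 process, 0 = native arm64; key absent on Intel Macs."""
    try:
        out = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip() == "1"
    except (subprocess.CalledProcessError, FileNotFoundError):
        return False  # key (or tool) missing: not an Apple silicon Mac

print(running_under_rosetta())
```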

If these two solutions still don't cut it for you, there is Apple's App Store ecosystem: native ARM apps can run directly across macOS, iOS, and iPadOS, effectively bringing small-screen apps to the big screen.

Even if macOS's software ecosystem was once incomplete, iOS has almost no such problem, so Macs with M1 chips are unlikely to run short of apps.

The new macOS Big Sur also shows that Apple intends to unify the look of all three platforms: both the system interface and the icon style now match the iPad and iPhone.

So why does the same software run into compatibility problems when migrating? That has little to do with the M1 itself, so for the sake of space we'll cover it another time.

So which software is and isn't compatible with the M1 MacBook?

Testing every piece of software is a huge job, and software keeps updating. Fortunately, there is an M1 MacBook compatibility project on GitHub called "Does It ARM". In this project you can find compatibility reports for all kinds of productivity software — development tools, audio tools, graphics tools, video-editing tools, and more — with each application classified into one of several compatibility categories:

The project's address is 👉: github.com/ThatGuySam/…

Interested readers can keep an eye on the project over time.

6. Conclusion

With its deep pockets, Apple will no doubt keep making large, long-term investments to iterate on the M-series chips, and its own ecosystem ensures that revenue can feed back into M-series research and development. I hope domestic companies can one day have chips of their own like Apple's — the future is worth looking forward to!

If you found this article helpful, a like or a follow would be a great encouragement and will push me to keep sharing. Thank you for your support 🙏.