Overview: The DPU craze can only be understood clearly once you understand cloud computing.

Among the hot tech concepts of late, the DPU is definitely on the list.

It is Nvidia’s new growth story, the hottest startup track in the SoC field in 2021, and another “pillar” of the data center after the CPU and GPU.

Although it has been nurtured in cloud computing for many years, the DPU is not an easy concept for outsiders to grasp, and product definitions and architectural designs vary from player to player.

Generally speaking, a DPU is a data processing unit that integrates hardware and software, and it usually exists in the form of an architecture. A DPU can offload work from the CPU, address some of the CPU’s shortcomings in data processing, and provide hardware-accelerated network, storage, security, and infrastructure management services.

Tracing the DPU back to its origins, only two cloud computing giants worldwide have truly achieved large-scale commercial deployment of a DPU architecture: Amazon’s AWS in the West and Alibaba Cloud in the East.

In October 2017, Alibaba Cloud’s Shenlong (X-Dragon) architecture was born. Just one month later, AWS Nitro made its debut. These two innovations, both created to solve virtualization problems, are regarded by the industry as the two most successful DPUs to date.

Zhang Xiantao, who helped bring the Shenlong architecture into being, is also one of the people in China who knows DPUs best.

Today, the fourth generation of Alibaba Cloud’s Shenlong supports Alibaba Cloud’s large-scale cloud business and has reached the highest level in the industry on four key indicators: compute, storage, network, and security.

Recently, Xindongxi spoke exclusively with Zhang Xiantao (alias Xu Qing), an Alibaba Group researcher and head of Alibaba Cloud’s elastic computing product line, about changing cloud business demands, his experience of continuous innovation in research and development, and his distinctive take on the DPU boom.

In his opinion, the DPU is not a kind of chip suited to a general-purpose route. For cloud vendors, a DPU is a tightly coupled hardware and software technology stack, and a software-defined computing architecture. It must be mainly self-developed, with the relevant hardware and software stack fully under the vendor’s control and verified at large scale. Companies that make general-purpose DPUs can hardly meet cloud vendors’ needs, and being acquired may be their best outcome.

01 The wind rises from cloud computing

The DPU boom seemed to come out of the blue.

In October 2020, at the NVIDIA GTC 2020 conference, NVIDIA founder and CEO Jensen Huang announced a new kind of data processor, the DPU. Billed as “one of the three pillars of future computing”, this processing unit burst into the view of the public and of investors.

At this point, it has been four years since Alibaba Cloud’s integrated hardware and software virtualization architecture, “Shenlong”, was born. Known today as Alibaba Cloud’s DPU, this groundbreaking architecture was originally created to solve the cost, performance, quality-of-service, and security problems that arise when traditional virtualization technology is applied to cloud computing.

Zhang Xiantao is the key figure in charge of Alibaba Cloud’s Shenlong. When he joined Alibaba Cloud in 2014, he had already spent about 10 years researching virtualization technology, and Alibaba Cloud had just entered its fifth year. The shortcomings of traditional virtualization architecture were increasingly hampering the company’s ability to cut costs, raise efficiency, and improve service quality.

Virtualization technology is the foundation of cloud computing. It abstracts otherwise indivisible hardware resources into a shared resource pool, so that compute, storage, and network resources can be allocated and shared on demand.

Managing those resources, however, consumes some of the CPU and memory that would otherwise run business workloads. For example, if a factory has 100 workers and all of them work on the assembly line, resource utilization is 100%. If 10 of them are assigned to overall management, only 90 are left on the assembly line, and utilization drops to 90%.
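The factory analogy amounts to a one-line calculation. A minimal sketch in Python follows; the function name `utilization` and its parameters are illustrative only and are not taken from any Alibaba Cloud tooling.

```python
def utilization(total_units: int, overhead_units: int) -> float:
    """Fraction of resources left for business workloads after management overhead."""
    if total_units <= 0:
        raise ValueError("total_units must be positive")
    if not 0 <= overhead_units <= total_units:
        raise ValueError("overhead_units must be between 0 and total_units")
    return (total_units - overhead_units) / total_units


# The factory example from the text: 100 workers, 10 of them assigned to management.
print(f"{utilization(100, 10):.0%}")
```

The same function applies to a server if the units are read as CPU cores, with the cores reserved for virtualization counted as overhead.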

As cloud computing continued to scale, problems such as resource contention, lost computing power, and performance bottlenecks became increasingly serious, and finding a solution became urgent.

After two years of exploration, the stability of Alibaba Cloud was no longer a problem. In 2016, Zhang Xiantao began to think about the next generation of virtualization technology: what kind of solution could support Alibaba Cloud’s long-term development?

The X-Dragon was born.

It was the result of team brainstorming: create an architecture dedicated to virtualization, and the CPU’s computing resources are freed up to focus on running cloud services.

The first-generation Shenlong project was officially approved on April 1, 2017. With the problem defined, Zhang Xiantao began building a team to cover system architecture design, chip and hardware development, server development, and the adaptation of system software. The initial team of just over 20 people worked for half a year and successfully launched Shenlong in October 2017.

Since then, as a master at processing high-speed data streams, the Shenlong architecture has taken over the heavy burden of virtualization from the CPU and, along the way, has brought key performance gains in storage, networking, and security.

02 From small-scale trial to large-scale deployment

At first, Zhang Xiantao did not accept the name “DPU”.

The “D” in DPU is commonly defined in several ways, including “data”, “data center”, and “data-centric”. But strictly speaking, which of the main chips in the data center, such as the CPU and GPU, does not fit these descriptions?

Alibaba Cloud therefore describes the Shenlong architecture as a technical architecture that is truly born for the cloud and integrates hardware and software. In his view, the future is the era of the cloud, and such an architecture is needed to solve problems of cost, performance, security, and so on in a comprehensive way. For now, the DPU appears to be trying to do something similar, and the market takes the view that Alibaba Cloud and AWS are both building DPUs.

Alibaba Cloud and AWS launched their DPUs almost in sync. The most direct reason is that cloud computing had developed to a stage where researchers realized that such a data processing architecture would cut costs sharply while improving performance and that, combined with a cloud vendor’s scale advantage, it would let them offer customers more competitive, cost-effective services.

Both Alibaba Cloud’s Shenlong and AWS’s Nitro focused early on the performance loss, resource waste, and cost of virtualization.

Zhang Xiantao calculated that Alibaba Cloud’s revenue had reached tens of billions of yuan at that time. If storage and networking took up about 10% of CPU resources, that would mean an annual loss of more than 1 billion yuan. Developing Shenlong was imperative, both to improve performance and to optimize costs.
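The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch under a simplifying assumption that is not stated in the article, namely that the value of CPU capacity scales linearly with revenue. The function name and the revenue figure are placeholders: the article says only “tens of billions of yuan”, so 10 billion is used as a lower bound.

```python
def annual_overhead_cost(annual_revenue_yuan: float, overhead_fraction: float) -> float:
    """Rough yearly value of CPU capacity consumed by infrastructure tasks.

    Simplifying assumption (not from the article): the value of CPU
    capacity scales linearly with revenue.
    """
    if not 0.0 <= overhead_fraction <= 1.0:
        raise ValueError("overhead_fraction must be between 0 and 1")
    return annual_revenue_yuan * overhead_fraction


# Placeholder revenue: 10 billion yuan, the lower bound of "tens of billions".
revenue = 10_000_000_000
loss = annual_overhead_cost(revenue, 0.10)
print(f"{loss / 1e9:.1f} billion yuan per year")
```

Any revenue above the 10 billion placeholder pushes the result past 1 billion yuan, which is consistent with the “more than 1 billion yuan” quoted in the text.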

A technological breakthrough was only the first step. After Shenlong launched, applying it at scale became the new challenge.

Alibaba Cloud first tested the waters with internal business, deploying 1,000 units to support Tmall’s Double 11 promotion in 2017, which verified that the system worked without problems. SAIC was the first external customer to take the plunge, proposing to try the new product and share the risk with Alibaba Cloud. The two sides worked together for nearly two months and finally resolved all the stability and performance problems as the Spring Festival approached.

With the support of such seed customers, Alibaba Cloud’s Shenlong polished its foundation for large-scale cloud deployment and began to grow into one of Alibaba Cloud’s core competitive strengths.

Starting in 2019, all of Alibaba Group’s businesses, including Alibaba Cloud’s computing services, were migrated to Shenlong. In October 2021, the fourth generation of Shenlong was released, setting the industry’s highest marks with 3 million storage IOPS, 50 million network PPS, and network latency of 5 microseconds.

According to the 2021 assessment of global cloud vendors’ overall capabilities released by Gartner, a well-known international market research firm, Alibaba Cloud’s IaaS infrastructure capability surpassed AWS to rank first in the world, with the highest scores in compute, storage, network, and security.

▲Gartner Solution Scorecard 2021 report shows that Alibaba Cloud exceeds AWS in four capabilities

03 Cloud vendors must develop their own DPUs

The cloud computing market is expanding rapidly. If every cloud server needs a DPU, whoever leads the field stands to enjoy an enormous market dividend.

In 2021 alone, at least seven Chinese DPU companies raised new financing, including Huzhou Xinqiyuan, Beijing Dayu Zhixin, Zhuhai Xingyun Zhilian, Shanghai Yisxin Technology, Shenzhen Yunbao Intelligent, Shanghai Yunmai Xinlian, and Beijing Zhongkeyu Digital.

Most of them raised hundreds of millions of yuan in a single round, and many of the investors are well-known technology companies. For example, Meituan was the sole investor in Xingyun Zhilian’s Series A round, Tencent invested in Yunbao Intelligent, and Yunmai Xinlian’s investors include ByteDance and Biren Technology.

But capital sees only the heat and may not see the pitfalls inside.

In Zhang’s opinion, the DPU should not be seen as the successor to the smart NIC. A smart NIC only solves the problem of network acceleration, while a DPU does far more.

Some DPU startups are still making smart NICs, and some want to evolve from a smart NIC base. But he said: “You can’t just patch a smart NIC, because the design concepts are inconsistent.” Architecturally, a DPU plugs a server into the DPU system to handle data processing acceleration, security, and control for the entire server, whereas a smart NIC plugs a network card into the server to handle network acceleration. The two are fundamentally different.

They resemble each other in form but not in spirit.

Since Shenlong’s launch in 2017, Zhang Xiantao’s impression has been that almost all DPU companies have designed their architectures, interfaces, functional modules, and capabilities after the architecture Shenlong disclosed.

But why is it still hard to build a good DPU by imitating Shenlong’s design?

The core problem is understanding the cloud business. Zhang Xiantao said that third-party manufacturers can only get a one-sided understanding of cloud business needs through communication with customers and engineers, so the final results are difficult to meet customer needs.

He firmly believes that cloud vendors must develop their own DPU architectures. “If you are not familiar with the software architecture and the system software stack, and with where the bottlenecks are in your own technology stack, it will be difficult to design it well, and this is technical information that external DPU companies can hardly get.”

From another angle, a cloud vendor can be a responsible one only when it develops everything itself, from hardware architecture to firmware to software stack, so that the whole technology chain is under its control.

Cloud vendors’ moves in recent years bear out Zhang Xiantao’s judgment. JD Cloud has developed a virtualization architecture based on its self-developed Jinggang smart chip, Google Cloud and Intel are cooperating on the IPU infrastructure processing chip, and ByteDance has announced that its self-developed DPU will serve external customers through Volcano Engine cloud products.

“Judging from the endgame, there is no good way out for today’s DPU startups. The best way out is to sell the relevant business to a cloud computing company that needs it and cash out through acquisition.” Zhang Xiantao said that without sufficient knowledge of the cloud computing business, it is hard to succeed by simply trying to make the DPU a general-purpose architecture. DPU companies will ultimately need to build products and technology together with cloud vendors to improve their chances of success.

04 The DPU is not suited to the general-purpose route

“The DPU companies the industry has invested in all want to make a universal DPU, and some even want to promote their supporting software stack as the industry standard. In fact, the starting point is questionable.”

This is because the DPU is an entirely software-defined architecture, driven by customer requirements or business development patterns and tightly integrated with the customer’s entire back-end software stack, which makes it hard to generalize.

In Zhang Xiantao’s opinion, building a DPU and getting customers to use it at scale is actually harder than doing the same with an AI chip.

The key difficulty is that its software ecosystem is almost impossible to cultivate: each company’s software stack has been developed over many years, and it is hard to scrap it in favor of a stack recommended by an outside vendor that the company cannot control. In three years’ time, therefore, there is bound to be consolidation in the DPU space, and some companies may disappear or be sold.

DPU users are usually cloud computing companies or virtualization software companies. A DPU built for only one software stack cannot be universal. If it aims to be highly universal, “because the software stacks and the design of the entire security mechanism differ, it will be difficult to adapt to cloud vendors”.

This differs from the GPU + CUDA logic. It took Nvidia more than a decade of research and development, plus the explosion of deep learning, to establish that ecosystem as the industry standard.

In the DPU field, however, each company’s software stack already exists and differs from the others. Forced standardization is hard to implement: development cycles are long, firmware is hard to open up, and interface definitions are inconsistent.

“If you want to make a unified standard, a universal standard or software ecosystem, it’s very difficult.” Zhang Xiantao explained that because every stack is different, the data formats a DPU processes differ as well, so it is hard to harden them into one unified design.

Zhang Xiantao, Alibaba Group researcher and head of Alibaba Cloud’s elastic computing product line

05 Updating security and trust functions in step with software iteration

After four years of experience, what are Shenlong’s advantages? How does it achieve performance that exceeds AWS Nitro?

Zhang Xiantao first mentioned “rapid iteration”.

Integrating software and hardware requires the architecture to be upgraded as the software iterates. The development cycle for an ASIC is about 24 months, which is too long for the pace of Internet software iteration.

Alibaba Cloud’s Shenlong therefore uses FPGAs and has achieved full online, real-time hot upgrades of the FPGA and its supporting system software. This allows upgrades every week and, through flexible, continuous optimization, ever more extreme performance.

“To this day, programmable and scalable FPGAs are the best fit for the DPU.” Zhang Xiantao also spoke about the limitations of choosing FPGAs. As more functions move into the DPU, the number of logic units in the FPGA may constrain the DPU’s development, which requires engineers not to waste a single logic unit and to push the necessary functions and performance to the limit.

Building a good DPU also requires a solid “understanding of hardware/software fusion design”. This is an iterative process that runs from the hardware and software through to the corresponding firmware and upper-layer systems.

The design of the interfaces, and even the registers, between software and hardware needs to be fully integrated with the software. If a company has a high degree of mastery over its own software and a deep understanding of hardware/software co-design, architecture, and the related protocols, it can improve performance step by step.

The first generation of Shenlong forwarded 6 million packets per second, while Nitro was at around 3 million. After more tasks on the data path were moved to hardware acceleration, the third generation reached 24 million, and the latest generation reached 50 million. Traditional RDMA networks usually scale to a few thousand units, while the fourth-generation Shenlong’s eRDMA can scale to several hundred thousand, making RDMA truly universal and accessible and serving the needs of high-performance computing and today’s popular cloud-native software architectures.
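One way to read these packet rates is as an average time budget per packet. The sketch below converts the figures quoted above into nanoseconds per packet. It treats forwarding as a single serial stream purely for illustration, which real hardware pipelines do not do, and the names `PACKET_RATES` and `ns_per_packet` are hypothetical.

```python
# Packet rates quoted in the article, in packets per second.
PACKET_RATES = {
    "generation 1": 6_000_000,
    "generation 3": 24_000_000,
    "generation 4": 50_000_000,
}


def ns_per_packet(pps: int) -> float:
    """Average time budget per packet, in nanoseconds, at a given packet rate."""
    if pps <= 0:
        raise ValueError("pps must be positive")
    return 1e9 / pps


for name, pps in PACKET_RATES.items():
    print(f"{name}: {ns_per_packet(pps):.1f} ns per packet")
```

At 50 million packets per second the average budget works out to 20 ns per packet, which gives a sense of why more of the data path had to be moved into hardware between generations.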

In addition, the new generation of Shenlong adds support for “trusted computing and cryptographic computing”, making the system trusted and tamper-proof and keeping data usable but not visible, in order to meet customers’ requirements for security.

In the future, since all data links pass through the Shenlong architecture, Alibaba Cloud plans to do more preprocessing there, greatly improving the DPU’s computing efficiency. Where 10,000 pieces of data once had to be computed, each landing in memory one by one, after preprocessing perhaps only 50 need to be computed, so efficiency improves several times over.
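The idea of filtering data before it reaches host memory can be illustrated with a small software analogy. This is only a sketch and says nothing about how Alibaba Cloud implements preprocessing in hardware: the `prefilter` function and the “keep every 200th record” predicate are invented for illustration, chosen so that 10,000 inputs shrink to the 50 mentioned above.

```python
from typing import Callable, Iterable, Iterator


def prefilter(records: Iterable[int], keep: Callable[[int], bool]) -> Iterator[int]:
    """Yield only the records the host needs; the rest are dropped before being stored."""
    for record in records:
        if keep(record):
            yield record


# Hypothetical workload: 10,000 records, of which the host only needs every 200th.
incoming = range(10_000)
forwarded = list(prefilter(incoming, lambda r: r % 200 == 0))

print(len(forwarded))            # records the host still has to process
print(10_000 // len(forwarded))  # reduction factor in records handled by the host
```

Because `prefilter` is a generator, rejected records are never collected into a list, which loosely mirrors the point that only useful data should land in memory.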

According to Zhang Xiantao, beyond faster speeds, higher bandwidth, lower latency, and more I/O operations per second, the Shenlong architecture will keep improving performance, stability, and security, and will be promoted as a carrier for cryptographic computing.

06 Conclusion: the future trend is cracking the memory wall

As the DPU continues to grow in popularity, cloud computing companies have entered this track through in-house research or investment, and a number of innovative DPU design companies have begun to emerge.

“In 2017, we made the Shenlong architecture public, and everyone is following this standard. We are very pleased to see it develop to this extent today.” Zhang Xiantao believes that DPU development is on the right track, and that growing awareness of its importance is good for the industry and will improve the efficiency of cloud computing as a whole.

A DPU is essentially a system built on foundational hardware and software co-design. It takes two or three years for seed users to put it into use. He believes the DPU will stay hot for the next two to three years, but at a certain stage the field will converge, as AI chips have today, and some incorrect ideas will gradually be eliminated.

There is still a lot to be done in DPU for the future.

Take the emerging field of in-memory computing, which in essence solves the same problem as the DPU: how to reduce data movement so as to improve computing efficiency and reduce power consumption. All data passing through the DPU can be filtered by in-memory computation, so that only valid data enters the main CPU’s memory and the performance of the whole computing system improves several times over.

“If you look into the future, especially with the trend of heterogeneous computing today, almost all of the DPU efforts are aimed at getting rid of the memory wall that drags down the efficiency of data processing.” Zhang Xiantao believes that the DPU’s future is worth looking forward to, and that it will be combined ever more closely with specific businesses.

(This article is from Xindong, xinyuan)

This article is the original content of Aliyun and shall not be reproduced without permission.