Abstract: To gain a deeper understanding of the 100-billion-parameter Pangu model, the Huawei Cloud Community interviewed Xie Lingxi, senior researcher on the Huawei Cloud EI Pangu team. In a very accessible way, Dr. Xie walked us through the “past and present” of the Pangu large model’s development, as well as the hard road behind it.

This article was shared from the Huawei Cloud Community post “Xie Lingxi, Senior Researcher at Huawei: Where Will the Next Generation of AI Go? An Exploration of the Pangu Large Model”; original author: Huawei Cloud Community Selection.

“Everyone lives in a certain era, and within that era each person’s life takes a different path. Some lament being born at an unfortunate time, while others only wish to live in peace…” This is the opening of an essay on the prompt “On the Timing of Birth” from the 2021 Beijing college entrance examination.

The essay was written by a special examinee, one that has never attended primary school, junior high, or senior high. It simply studied a large number of People’s Daily articles in a short time, and then relied on its reading comprehension, text association, and language generation abilities to produce this seemingly “decent” college entrance examination essay.

Yes, it is an AI: the Huawei Cloud Pangu large model, which was just voted the “treasure of the exhibition hall” at the World Artificial Intelligence Conference 2021 (WAIC 2021)! On site, the audience could interact with the large model and challenge it directly, for example with a sentence like “Mingming clearly likes him, but he just won’t say it; he is very aloof.” Here “Mingming” appears first as a person’s name and then as the adverb “clearly,” so the whole sentence has to be segmented correctly to be understood. Yet when the reporter asked the large model who it was that liked him, it quickly replied “Mingming.” Correct answer!

Although Pangu never spent ten-plus years studying hard by a cold window, it has gone through its own “learning” across hundreds of billions of parameters.

Let’s look at another example with two sentences:

  1. Xiao Ming was studying. Through constant persistence, he overcame all kinds of difficulties and finally finished the book.
  2. Xiao Hong met many difficulties while she was painting, and finally she finished the painting.

Although the characters and events in the two sentences are different, Pangu can extract the common meaning from them, just as we humans do: perseverance. This capability was demonstrated at Huawei Developer Conference (Cloud) 2021, and we can’t help but wonder how the Pangu model became so “smart.”

To understand the hundreds of billions of parameters behind the Pangu large model more deeply, the Huawei Cloud Community interviewed Xie Lingxi, a senior researcher on the Huawei Cloud EI Pangu team. Since some of the techniques involved in large models are rather obscure, Dr. Xie explained the “past and present” of the Pangu large model’s research and development, and the hard road behind it, in a very accessible way.

Xie Lingxi, senior researcher of Huawei Cloud EI Pangu team

What is a large model: the only way for AI to land in thousands of industries

In myth and legend, Pangu created the world, and the universe went from chaos to order. When talking about the Pangu model, Xie Lingxi started from the birth of artificial intelligence.

“In the 1950s, when the concept of AI was first proposed, people defined it through rules designed by humans. In the 1980s, in the wave of big data, people realized AI by training models on data. Later, as data scale expanded and computing power grew, deep learning set off a new wave, and all kinds of AI models kept emerging.”

“It wasn’t until the last two years that we began to integrate cross-domain knowledge into AI models, and various large models based on the Transformer architecture emerged, including OpenAI’s GPT-3 and the Pangu large model. They have pushed the scale and performance of deep learning models to new heights,” Xie Lingxi said.

Over the past decade, the computing resources demanded by AI algorithms have increased 400,000-fold, and the evolution of neural networks from small models to large models has become an inevitable trend. Large models can solve the problems of AI model customization and fragmented application development: they can absorb massive amounts of knowledge, improve a model’s generalization ability, and reduce its dependence on domain data annotation.

On the one hand, large models unlock the self-supervised learning ability of deep neural networks on large-scale unlabeled data; on the other, they place high demands on the deep optimization and parallelism capabilities of the AI framework, representing the culmination of what a deep learning framework can deliver. “It was a big jump from traditional methods to deep learning, and large models are now at the forefront, waiting for the next jump.”

The current Pangu series of ultra-large-scale pre-trained models includes an NLP large model, a CV large model, a multimodal large model, and a scientific computing large model. “Large” means the model absorbs massive amounts of data and knowledge. The Pangu NLP large model, for example, learned from 40 TB of Chinese text data, and the Pangu CV large model contains more than 3 billion parameters. This data improves the large models’ generalization ability and the algorithms’ adaptability to new samples, allowing them to learn the underlying rules behind the data and reducing reliance on domain data annotation.

Xie Lingxi further explained that, on the one hand, a large model can transfer knowledge from unlabeled data to the target task in a more general way, improving task performance. On the other hand, pre-training learns a better initialization of the parameters, so the model can achieve good results on the target task with only a small amount of data.
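To make the “better initial point, less labeled data” idea concrete, here is a minimal sketch of generic pre-train-then-fine-tune transfer learning in PyTorch. It is not Pangu’s code; the pretrained ResNet backbone, the CIFAR-10 stand-in dataset, and the 100-sample subset are all illustrative assumptions.

```python
# Minimal sketch of "pre-train, then fine-tune on little data" (not Pangu's actual code).
# The 100-sample subset stands in for a data-scarce target task.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, models, transforms

# Start from a pretrained backbone: its weights are the "better parameter initial point".
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False                      # keep the pretrained knowledge frozen
model.fc = nn.Linear(model.fc.in_features, 10)   # small task-specific head, trained from scratch

transform = transforms.Compose([transforms.Resize(224), transforms.ToTensor()])
train_set = datasets.CIFAR10("data", train=True, download=True, transform=transform)
small_set = Subset(train_set, range(100))        # only 100 labeled samples on the target task
loader = DataLoader(small_set, batch_size=16, shuffle=True)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(3):                           # a few passes are often enough with a good init
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```

Only the small task head is trained here, which is why even a handful of labeled samples can yield a usable model when the initialization is good.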

When large models can learn more from small data samples, the door to general-purpose AI opens, and the problems of AI model customization and fragmented application development can be solved.

Xie Lingxi did some accounting for us. In his view, the difficulty of putting AI algorithms into practice is not that they cannot solve real problems, but that the application scenarios are too narrow: every pain point requires customized development, which leads to high costs and heavy manpower investment.

Once the scenario changes, the entire model may need to be redeveloped. The large model offers a new mode of industrial AI development: it solves the customization problem of small models, lets a single model serve multiple scenarios, and enables AI to truly land in thousands of industries.

Therefore, as an inevitable product of this era, the large model is worth the effort to dig into, to explore what deep learning can become, and even what the next stage of AI will be.

Before we can do that, we need to understand how big models are made.

More than parameters, Pangu NLP and CV models have more tricks

In January, Google proposed Switch Transformer, a large model with 1.6 trillion parameters; NVIDIA, Stanford, and MSR jointly trained a GPT with 1 trillion parameters; and the Beijing Academy of Artificial Intelligence (BAAI) released WuDao 2.0, a model with 1.75 trillion parameters…

In all kinds of news reports, it is easy to attribute large-model breakthroughs simply to these enormous parameter counts.

Xie dispelled this stereotype: “Large scale and diversity are inevitable requirements, but the parameter count is not the best indicator of a model’s capability. If we store the intermediate states from training a large model and do a simple fusion, we can multiply the model’s parameter count by a very large factor. You could even claim a model with hundreds of trillions or quadrillions of parameters, but that does little for the model’s actual effectiveness. So the number of parameters is not the final measure of a large model’s strength.”
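A back-of-the-envelope sketch of the checkpoint-fusion point (purely illustrative, not Pangu’s pipeline): keeping K intermediate training states nominally multiplies the stored parameters by K, while a naive weight average collapses them right back to the original size, and neither move by itself adds real capability.

```python
# Illustrative only: inflating a "parameter count" by keeping training checkpoints.
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
n_params = sum(p.numel() for p in model.parameters())

# Pretend we saved K intermediate training states ("checkpoints").
K = 10
checkpoints = [copy.deepcopy(model.state_dict()) for _ in range(K)]

# Counting every stored copy inflates the nominal parameter count K-fold...
print(f"single model: {n_params:,} params, 'fused' ensemble: {K * n_params:,} params")

# ...while a naive weight average collapses right back to the original size.
averaged = {k: sum(ckpt[k] for ckpt in checkpoints) / K for k in checkpoints[0]}
model.load_state_dict(averaged)
```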

A large model is a complete system that integrates data preprocessing, model architecture, algorithm training, and optimization. Even with sufficient computing power, raw data, and a baseline model, there is no guarantee that the large model will actually run well; it is a severe test of technical R&D and of the team’s ability to collaborate.

But there is no doubt that the more data there is, the more a large model can learn. “As long as you give it enough data to ‘learn by rote,’ its understanding really does improve.” The data determines the model’s baseline performance. Xie Lingxi said that, on top of a huge number of parameters, the model can learn the relationships within the data, abstract logical reasoning ability, and become more intelligent.

Pangu NLP large model

On the most recent CLUE leaderboard, the Pangu NLP model ranked first overall, first in reading comprehension, and first in classification tasks, with a total score a full percentage point higher than the runner-up. To illustrate how close the Pangu NLP model comes to human-level understanding, Xie Lingxi returned to the “perseverance” example we mentioned at the beginning of the article:

  • Xiao Ming is studying. He overcomes difficulties and finally succeeds through perseverance.
  • Xiao Hong met many difficulties while she was painting, and finally she finished the painting.

Humans can easily use logical judgment to see that the two sentences express the same thing: perseverance. A large model, however, has to be fed and trained on huge amounts of data before it can capture the relationships between elements, such as the relationship between two pieces of text, or among several passages, and which two passages are closer to each other, before it can reach a logical conclusion.

Returning to the example above: if sentence 2 is changed to “Xiao Ming was reading a book; he met many difficulties and in the end failed to finish it,” then sentences 1 and 2 look very similar on the surface but in fact express completely different meanings.
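As a rough illustration of why surface similarity is not semantic similarity, the sketch below embeds the sentences with a publicly available Chinese BERT checkpoint from the Hugging Face transformers library (a generic stand-in, not the Pangu model) and compares cosine similarities; the point is how such a comparison is set up, not that this particular small checkpoint resolves the example perfectly.

```python
# Generic sketch (not the Pangu model): embed sentences with a public Chinese BERT
# checkpoint and compare cosine similarity; plain word overlap alone would judge
# the "failed to finish" variant as almost identical to sentence 1.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

sentences = [
    "小明在学习，他克服困难，坚持不懈，终于读完了这本书。",   # 1: perseverance, finished
    "小红画画时遇到很多困难，最后她完成了这幅画。",           # 2: perseverance, finished
    "小明在读书，他遇到很多困难，最终没能读完这本书。",       # 2': similar words, opposite outcome
]

def embed(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state     # (1, seq_len, hidden)
    return hidden.mean(dim=1)                          # mean-pool tokens into one vector

vecs = [embed(s) for s in sentences]
print("sim(1, 2) :", torch.cosine_similarity(vecs[0], vecs[1]).item())
print("sim(1, 2'):", torch.cosine_similarity(vecs[0], vecs[2]).item())
```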

Large models need to learn how to determine such relationships. Xie Lingxi explained: “The connection between representation (features extracted directly from text and images) and semantics is very complicated. People can understand it, but getting a computer to understand it and building a computational model for it is very difficult. Large models hope to accomplish this by piling up massive training data and huge numbers of parameters.”

If the large model is to understand our logical world, the work beyond parameters is just as critical.

First, every round of optimization of a model with hundreds of billions of parameters is expensive. Xie Lingxi and his team therefore chose to add Prompt-based tasks during the pre-training stage to reduce the difficulty of fine-tuning, solving the long-standing difficulty of fine-tuning large models for different industrial scenarios. When downstream data is plentiful, the lower fine-tuning difficulty lets the model keep improving as data accumulates; when downstream data is scarce, it significantly improves the model’s few-shot learning performance.
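The article does not spell out Pangu’s prompt design, but the general idea of Prompt-based tasks can be sketched as follows: a downstream task is rewritten as fill-in-the-blank text that resembles the pre-training objective, so fine-tuning has a smaller gap to bridge. The template and label words below are illustrative placeholders, not Pangu’s actual prompts.

```python
# Illustrative only: recast sentiment classification as a cloze-style prompt so the
# downstream task resembles the language-modeling objective seen in pre-training.
LABEL_WORDS = {"positive": "好", "negative": "差"}

def build_prompt(review: str) -> str:
    # "[review] Overall, the experience was [MASK]." -- the model fills the blank.
    return f"{review} 总的来说，这次体验很[MASK]。"

def predict_label(fill_in: str) -> str:
    # Map the word predicted for the blank back to a task label.
    for label, word in LABEL_WORDS.items():
        if word in fill_in:
            return label
    return "unknown"

print(build_prompt("菜品新鲜，上菜也快"))
print(predict_label("好"))
```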

Pangu NLP large model architecture

In addition, in terms of model structure, unlike the conventional NLP large models trained by other companies, Pangu values not only a large model’s generation ability but also a stronger understanding ability. Huawei adopts an encoder-decoder architecture to ensure that the Pangu model performs well at both generation and understanding.
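No architectural details are given in the article, but the encoder-decoder split itself is standard; a minimal sketch with PyTorch’s built-in nn.Transformer (generic, not the Pangu architecture, with made-up sizes) shows the two roles: the encoder builds a bidirectional representation for understanding, while the decoder generates tokens conditioned on it.

```python
# Generic encoder-decoder sketch using PyTorch's built-in Transformer
# (illustrating the idea only; this is not the Pangu architecture).
import torch
import torch.nn as nn

vocab_size, d_model = 32000, 512
embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6,
                             batch_first=True)
lm_head = nn.Linear(d_model, vocab_size)

src = torch.randint(0, vocab_size, (2, 64))   # source tokens: text to be understood
tgt = torch.randint(0, vocab_size, (2, 16))   # target tokens generated so far

memory_input = embed(src)                     # encoder side: bidirectional understanding
decoder_input = embed(tgt)                    # decoder side: autoregressive generation
causal_mask = transformer.generate_square_subsequent_mask(tgt.size(1))

hidden = transformer(memory_input, decoder_input, tgt_mask=causal_mask)
logits = lm_head(hidden)                      # next-token distribution over the vocabulary
print(logits.shape)                           # torch.Size([2, 16, 32000])
```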

Pangu CV large model

For the Pangu CV large model, Xie Lingxi again started with an example: how do you distinguish a picture of a white cat from a picture of a white dog? A human looking at the two images can instantly tell which is the cat and which is the dog, but how does a large model handle this?

We need to train the model to learn something genuinely robust about the relationships between samples, Xie emphasized, and one of the most important things in an image is its hierarchical information. “When judging an image, we should first grasp the hierarchical information in it and quickly locate which parts of the image are decisive, so that the algorithm can adaptively focus on the more important regions or content; that makes it easy to capture the relationships between samples. In these two images, white is clearly not the key information; the animal is the dominant information.”
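As a toy illustration of “adaptively focusing on the important regions” of an image (generic attention pooling over a feature map, not Pangu’s actual mechanism), a model can learn one relevance score per spatial location and pool features accordingly:

```python
# Toy sketch of attention pooling: learn a relevance score per spatial location
# so the model can emphasize decisive regions (generic, not Pangu's mechanism).
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # one relevance score per location

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, channels, H, W) feature map from a CNN backbone
        weights = torch.softmax(self.score(feats).flatten(2), dim=-1)   # (B, 1, H*W)
        pooled = (feats.flatten(2) * weights).sum(dim=-1)               # weighted sum over locations
        return pooled                                                    # (B, channels)

feats = torch.randn(2, 256, 14, 14)     # stand-in for backbone features of the cat/dog images
print(AttentionPool(256)(feats).shape)  # torch.Size([2, 256])
```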

Pangu CV large model architecture

Building on this, the Pangu CV large model is the first to combine image discrimination and generation abilities, meeting the needs of both low-level image processing and high-level semantic understanding, while also allowing industry knowledge to be incorporated through fine-tuning so that it can quickly adapt to various downstream tasks.

In addition, to address the low learning efficiency and weak representation performance that come with large models and large data, the Pangu CV large model is optimized mainly across three stages of pre-training: data processing, architecture design, and model optimization. At present, the Pangu CV large model has reached the industry’s best small-sample classification accuracy on the 1% and 10% labeled subsets of ImageNet.
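For context on what classification with 1% of ImageNet labels usually involves (described generically here, not as Pangu’s exact protocol): the model is pre-trained on unlabeled images and then fine-tuned or probed with only a small, class-balanced fraction of the labels. The sketch below builds such a subset; for real ImageNet one would read labels from metadata rather than iterating over decoded images.

```python
# Generic sketch of building a class-balanced 1% labeled subset for the
# "small-sample ImageNet" protocol (illustrative; not Pangu's exact pipeline).
import random
from collections import defaultdict

def labeled_fraction_indices(samples, fraction=0.01, seed=0):
    """Pick roughly `fraction` of the sample indices, balanced across classes.
    `samples` is any sequence of (example, label) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, (_, label) in enumerate(samples):
        by_class[label].append(idx)
    chosen = []
    for indices in by_class.values():
        rng.shuffle(indices)
        k = max(1, int(len(indices) * fraction))   # keep at least one labeled sample per class
        chosen.extend(indices[:k])
    return sorted(chosen)

# Tiny demo: 1000 fake samples over 10 classes -> about 10 labeled indices at 1%.
fake = [(None, i % 10) for i in range(1000)]
print(len(labeled_fraction_indices(fake, fraction=0.01)))   # 10
```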

The CV large model applies not only algorithms commonly used in the industry but also algorithms developed by Huawei itself, such as explicitly injecting hierarchical information into the vision model so that it learns better.

And behind each self-developed algorithm lies the hard-won experience the team gained by working through each difficulty.

Big models are hard to develop, but they’re good to have

There were many difficulties throughout the research and development of the Pangu large model, such as the original algorithms mentioned above, because besides the architecture and the data, the algorithms themselves are a core technology.

Xie Lingxi discussed one of the technical difficulties in detail: whether it is text or image information, things that look similar at the level of representation can be completely different at the level of semantic understanding.

“Starting from this problem, we found that visual features are captured hierarchically. Features tied to appearance are concentrated in the shallow layers, while semantics are reflected more in the deep features, so we need to align these features across levels in order to learn better. Similarly, in NLP you need to focus the model’s attention on the most appropriate place, and that key point is likewise found by a complex neural network, rather than by running a separate algorithm over a piece of text to locate it.”
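The alignment mechanism itself is not described in the article; as a generic illustration of working with “features at different levels,” the sketch below pulls shallow and deep feature maps out of a ResNet with torchvision’s feature extractor and projects them into a shared space where a cross-level alignment signal could be computed (the layer choices and projection sizes are illustrative).

```python
# Generic sketch: extract shallow and deep features from a backbone and project them
# into a shared space where cross-level alignment losses could be applied.
# (Illustrative layer choices; not Pangu's actual alignment method.)
import torch
import torch.nn as nn
from torchvision.models import resnet18
from torchvision.models.feature_extraction import create_feature_extractor

backbone = resnet18(weights=None)
extractor = create_feature_extractor(backbone, return_nodes={"layer1": "shallow", "layer4": "deep"})

proj_shallow = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 128))
proj_deep = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, 128))

x = torch.randn(4, 3, 224, 224)
feats = extractor(x)
z_shallow = proj_shallow(feats["shallow"])     # texture/edge-level information
z_deep = proj_deep(feats["deep"])              # semantic-level information
alignment = torch.cosine_similarity(z_shallow, z_deep).mean()  # one possible alignment signal
print(alignment.item())
```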

That is, of course, a very general explanation; the technical details are complex and hard to abstract. And this problem is only the tip of the iceberg. Throughout the development of the large model, Xie Lingxi and his team had to keep digging beneath surface problems to their essence and solving technical problems of this kind.

Another tricky issue is debugging the model. To obtain more knowledge from pre-training, the data for the Pangu large model inevitably keeps growing, which places higher demands on the underlying hardware platform. At that point, the effectiveness of pre-training depends not only on the model itself but on whether the infrastructure is good enough.

For example, running a large model requires enough machines to provide sufficient computing power, but a single machine can hold at most 8 GPU cards. The NLP large model needs thousands of GPU cards, and even the smaller CV large model needs 128 GPUs running simultaneously, so there must be a very good mechanism for allocating resources properly.
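A generic illustration of the resource arithmetic and of how such workers are typically wired together (this is not Huawei’s scheduling platform; the torch.distributed call and environment variables reflect a common PyTorch-style launch, used here only as an assumption for the sketch):

```python
# Generic illustration (not Huawei's platform): with 8 accelerators per machine,
# a job's accelerator budget maps directly to a node count, and each worker joins
# one process group so gradients can be synchronized across all of them.
import math
import os
import torch.distributed as dist

def nodes_needed(total_devices: int, devices_per_node: int = 8) -> int:
    return math.ceil(total_devices / devices_per_node)

print(nodes_needed(128))    # CV large model: 128 devices -> 16 machines
print(nodes_needed(2048))   # an NLP-scale job with thousands of cards -> 256 machines

def init_worker():
    # Typically launched once per device by a tool such as torchrun, which sets
    # RANK / WORLD_SIZE / MASTER_ADDR in the environment.
    dist.init_process_group(backend="nccl",
                            rank=int(os.environ["RANK"]),
                            world_size=int(os.environ["WORLD_SIZE"]))
```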

Xie Lingxi was quite anxious about this at the beginning: what would support the operation of such a large model? Practice proved that the multi-machine, multi-card parallel cloud platform Huawei Cloud provided for Pangu played a huge role. The platform can allocate resources easily, preventing infrastructure problems from holding back Pangu’s development progress, and it can store data on the servers in the most suitable format so that it can be read more efficiently during training.

In addition, the difficulty of the large model also lies in the engineering. Huawei’s CANN, the MindSpore framework, and the ModelArts platform are coordinated and optimized together to fully unleash computing power and provide strong support for the Pangu large model:

  • At the level of the underlying operators, operator quantization and operator fusion optimizations based on Huawei CANN improve single-operator performance by more than 30%.
  • Huawei MindSpore innovatively adopts multi-dimensional automatic hybrid parallelism combining pipeline parallelism, model parallelism, and data parallelism, which greatly reduces the manual coding workload and improves cluster linearity by 20%; a sketch of how these parallel degrees combine follows this list. (For a detailed look at these key technologies, see the separate article on how the first Chinese pre-trained language model with one hundred billion parameters and TB-scale memory was “refined” with the open-source MindSpore framework.)
  • The ModelArts platform provides E-level (exascale) compute scheduling and, combined with the physical network topology, dynamic routing planning, delivering optimal network communication for large-scale model training.
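The exact parallel configuration is not given in the article. As a generic illustration of how pipeline, model (tensor), and data parallelism compose, the sketch below captures the basic constraint that their degrees must multiply to the total device count, and how a global batch is then divided (all numbers are made up):

```python
# Generic illustration of how hybrid parallelism degrees compose (numbers are made up;
# this is not Pangu's or MindSpore's actual configuration).
from dataclasses import dataclass

@dataclass
class ParallelPlan:
    pipeline: int   # model split into consecutive stages across device groups
    tensor: int     # each layer's weights sharded across devices ("model parallelism")
    data: int       # replicas that each see a different slice of the batch

    def devices(self) -> int:
        return self.pipeline * self.tensor * self.data

plan = ParallelPlan(pipeline=4, tensor=8, data=32)
assert plan.devices() == 1024            # the degrees must multiply to the device count

global_batch, micro_batches = 2048, 16
per_replica = global_batch // plan.data          # samples handled by each data-parallel replica
per_micro = per_replica // micro_batches         # micro-batches keep the pipeline stages busy
print(plan.devices(), per_replica, per_micro)    # 1024 64 4
```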

However, as everyone knows, the reason a large model is “large” is that there is more data and the model itself is bigger, and this brings high training costs. In the case of GPT-3, a single training run costs $12 million. Xie Lingxi sighed, “Tuning the parameters of a large model is very difficult. Before each training run, validation has to be carried out in advance on many small scenarios, and every training run has to be done properly; you cannot afford to discover a bug only after the model has already been trained.”

Born for “applications,” Pangu empowers more users

Breakthroughs in large-scale model training have paved the way for industries that lack large amounts of data to enter the intelligent era. As Professor Tian Qi, Huawei’s chief scientist in cloud artificial intelligence and an IEEE Fellow, mentioned at the release of the Pangu large model, Pangu was born for application in all kinds of industries and has unprecedented generality, whether in 2B or 2C scenarios.

Industry knowledge comes from industry data, and the Pangu team used large amounts of industry voice and text data to fine-tune the model, greatly improving its understanding of industry-specific intent and knowledge.

Take the Pangu CV large model as an example: it has shown strong applicability in power line inspection. It uses massive amounts of unlabeled power-sector data for pre-training, combined with fine-tuning on a small number of labeled samples, an efficient development mode that saves manual labeling time. In terms of model generality, combined with Pangu’s automatic data augmentation and a class-adaptive loss function optimization strategy, the maintenance cost of the model is greatly reduced.

Xie Lingxi also said that beyond industrial applications, the Pangu large model is gradually being made available to developers through the AI Gallery, with an invitation-only beta opening later; stay tuned. On the platform, Pangu will provide common, easy-to-use workflows: if you are an advanced developer, you can do more customized development on top of a workflow and better unleash the capability of the pre-trained models; if you are a beginner AI developer who just wants to do simple AI development with a large model, Pangu will offer a more approachable interface that lets you work by dragging and dropping. In the future, Pangu will also launch a series of courses for developers, guiding them to build applications on the Pangu model in real scenarios.

On the other hand, Pangu also wants to grow together with developers. “The large model is just a handle for bringing AI into the real world. Not only does it help users improve training schedules and shorten training time; as the number of applications built on the model increases, the cost to each user naturally decreases.” “Our team alone is not enough for Pangu’s development,” Xie said. “We also need to build the ecosystem together with developers.”

Finally

When it comes to the future of the Pangu large model, Xie has a simple, small goal: to push Pangu to the next technological breakthrough point. Large AI models are where deep learning currently stands at its highest level; from here the curve could flatten into a straight line, and everyone is waiting for the next jump. Huawei Cloud has been working to solve the problems AI developers encounter with a variety of original technologies, and the most essential purpose is to enable AI to land in thousands of fields.

The road is long and hard, but those who keep walking will arrive.

Just as the name of the Pangu model suggests, Huawei hopes to use the large model as a lever to push AI to an unprecedented height, take us toward the next generation of AI, and break open the “chaos” on the road ahead for AI.
