Abstract: To gain a deeper understanding of the 100-billion-parameter Pangu model, the Huawei Cloud community interviewed Xie Lingxi, a senior researcher on the Huawei Cloud EI Pangu team. In plain language, Dr. Xie recounted the development history of the Pangu large model and the difficulties behind it.

This article is shared from the Huawei Cloud community post “Huawei Senior Researcher Xie Lingxi: Where Will the Next Generation of AI Go? The Pangu Large Model Journey”, originally by Huawei Cloud community selection.

“Everyone lives in a certain time, and everyone takes a different path through it. In the same age, some lament being born at the wrong time, while others wish only to be content…” This is the opening of an essay on the 2021 Beijing college entrance examination topic “On Being Born at the Right Time.”

This answer was written by a special examinee that had never attended primary, middle, or high school. It simply studied a large number of People’s Daily articles in a short period of time and then produced this seemingly “decent” essay using its reading comprehension, text association, and language generation skills.

Yes, it is an AI: the Huawei Cloud Pangu large model, which was just selected as a star exhibit (the “treasure of the hall”) at the World Artificial Intelligence Conference 2021 (WAIC 2021). At the venue, the audience could interact with the large model and pose questions to it directly. For example, consider the sentence “Mingming clearly knew I liked him, yet he said nothing, acting cold.” In this sentence, “Mingming” (明明) appears first as a person’s name and then as an adverb meaning “clearly,” so the sentence must be segmented correctly to be understood. Yet when a reporter asked the model who the speaker likes, it quickly answered “Mingming.” Correct!

Although Pangu never spent ten-plus hard years studying by a cold window, it went through its own kind of schooling: the “learning” of hundred-billion-scale parameters.

Let’s look at another example: how to understand these two sentences?

1. Xiao Ming was reading a book. He overcame all kinds of difficulties and finally finished reading it.

2. Xiao Hong ran into a lot of difficulties while painting, and finally finished the painting.

Although the characters and events in the two sentences differ, Pangu can extract the same meaning from them, just as we humans do: perseverance. This capability was in fact demonstrated at Huawei Developer Conference (Cloud) 2021. How did Pangu get so “smart”?

To gain a deeper understanding of the hundred-billion-parameter Pangu large model, the Huawei Cloud community interviewed Xie Lingxi, a senior researcher on the Huawei Cloud EI Pangu team. Since some of the techniques involved in large models are rather obscure, Dr. Xie explained the research and development of the Pangu large model, and the difficult moments behind it, in plain language.

Xie Lingxi, senior researcher on the Huawei Cloud EI Pangu team

What is a large model? The only way for AI to land in thousands of industries

In Chinese mythology, Pangu created the universe, turning chaos into order. Talking about the Pangu model, Xie Lingxi began with the birth of artificial intelligence.

“In the 1950s, when the concept of AI was proposed, people defined AI with hand-crafted rules. In the 1980s, amid the wave of big data, people built AI by training models on data. Later, as data scale and computing power grew, a new wave of deep learning arrived, and new AI models kept emerging.”

“Only in the last two years have we begun to integrate cross-domain knowledge into AI models, and various large models based on the Transformer architecture have emerged, including OpenAI’s GPT-3 and the Pangu large model. They have pushed the scale and performance of deep learning models to new heights,” said Xie Lingxi.

Over the past decade, the computing demand of AI algorithms has grown roughly 400,000-fold, and the evolution of neural networks from small models to large models has become an inevitable trend. Large models can address the fragmentation of AI model customization and application development: they absorb massive amounts of knowledge, improve generalization, and reduce dependence on domain-specific data annotation.

On the one hand, large models unlock the self-supervised learning ability of deep neural networks on large-scale unlabeled data; on the other hand, they place high demands on the deep optimization and parallelism capabilities of the AI framework. They represent the culmination of AI under the deep learning paradigm. “The jump from traditional methods to deep learning was a big leap. Within deep learning, large models are already at the forefront, waiting for the next jump.”

The current Pangu series of large pre-trained models includes an NLP large model, a CV large model, a multimodal large model, and a scientific computing large model. The large scale of the models means they have absorbed massive amounts of data: the Pangu NLP model, for example, has learned from 40 TB of Chinese text, and the Pangu CV large model contains more than 3 billion parameters. This data improves the large model’s generalization and its adaptability to unseen samples, allowing it to learn the patterns hidden behind the data and reducing dependence on domain-specific annotation.

Xie Lingxi further explained that, on the one hand, a large model can transfer knowledge from unlabeled data to the target task in a general way, improving task performance; on the other hand, by learning a better parameter initialization through pre-training, the model can achieve good results on the target task with only a small amount of data.
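As an illustration of the second point (pre-training providing a good initialization so that a small amount of target-task data suffices), here is a minimal PyTorch sketch; the backbone, dataset, and hyper-parameters are placeholders, not Pangu’s actual setup:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical pre-trained backbone: in practice this would be loaded from a
# checkpoint produced by large-scale self-supervised pre-training.
backbone = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
# backbone.load_state_dict(torch.load("pretrained_backbone.pt"))  # assumed checkpoint

# Small task head trained on the downstream task.
head = nn.Linear(256, 4)  # e.g. a 4-class downstream task

# A tiny labeled dataset stands in for the "small amount of data".
x = torch.randn(64, 128)
y = torch.randint(0, 4, (64,))
loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)

# Freeze the pre-trained weights and only fine-tune the head:
# the pre-trained initialization does most of the work.
for p in backbone.parameters():
    p.requires_grad = False

opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for xb, yb in loader:
        logits = head(backbone(xb))
        loss = loss_fn(logits, yb)
        opt.zero_grad()
        loss.backward()
        opt.step()
```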

When a large model can learn more from small data samples, it helps open the door to general-purpose AI and addresses the challenges of model customization and fragmented application development.

Xie Lingxi did the math for us. In his view, AI algorithms are hard to put into practice not because they cannot solve real problems, but because application scenarios are too narrow: every pain point requires customized development, which drives up costs and manpower.

Once the scenario changes, the entire model may need to be redeveloped. Large models offer a new mode of industrial AI development that solves the customization problem of small models: one model can serve many scenarios, so AI can truly land in thousands of industries.

Therefore, as an inevitable product of this era, large models are worth our effort to explore, both for what they reveal about deep learning and for what the next stage of AI will be.

Before we can do that, we need to understand how big models are made.

Pangu NLP and CV large models have more “tricks” than parameters

In January, Google proposed the Switch Transformer with 1.6 trillion parameters; NVIDIA, Stanford, and MSR jointly trained a 1-trillion-parameter GPT; the Beijing Academy of Artificial Intelligence (Zhiyuan Research Institute) released the 1.75-trillion-parameter Wudao 2.0…

In news reports, it is easy to attribute the breakthroughs of large models simply to the sheer parameter count.

Xie Lingxi overturns this stereotype: “Large data volume and variety are necessary for large models, but parameter count is not the best measure of a model’s capability. If we stored all the intermediate checkpoints from training a large model and did a simple fusion of them, we could multiply the model’s nominal parameter count by an enormous factor. We could even claim models with hundreds of trillions or quadrillions of parameters, but this would do little for the model’s actual performance. So the number of parameters is not the ultimate criterion for judging the strength of a large model.”
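A toy illustration of this point, under the assumption that “simple fusion” means keeping several training checkpoints side by side (for instance, to average their predictions): the stored parameter count scales with the number of checkpoints, while the effective capability remains roughly that of one model.

```python
import copy
import torch
import torch.nn as nn

def num_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
print("single model parameters:", num_params(model))

# Pretend these are intermediate checkpoints saved during training.
checkpoints = [copy.deepcopy(model) for _ in range(10)]

# "Fusing" them by keeping all of them (e.g. to average their outputs)
# multiplies the stored parameter count by 10 ...
total = sum(num_params(m) for m in checkpoints)
print("nominal parameters after fusion:", total)

# ... but averaging near-identical checkpoints adds little real capability.
x = torch.randn(4, 512)
averaged_output = torch.stack([m(x) for m in checkpoints]).mean(dim=0)
```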

A large model is a complete system integrating data preprocessing, model architecture, training algorithms, and optimization. Having enough compute, raw data, and a base model does not by itself yield a truly workable large model; building one is a serious test of R&D and collaboration capability.

But there is no doubt that the more data there is, the more a large model learns. “As long as you give it enough data to ‘memorize,’ its understanding really does improve.” The data determines the model’s baseline performance; with a large number of parameters, the model can learn the relationships within the data, abstract logical capabilities, and become more intelligent, Xie said.

Pangu NLP large model

On the most recent CLUE leaderboard, the Pangu NLP model ranked first in the overall, reading comprehension, and classification rankings, with an overall score one percentage point higher than the runner-up. To illustrate how the Pangu NLP model approaches human-level comprehension, Xie Lingxi returned to the “perseverance” example mentioned at the beginning of this article:

  • Xiao Ming was reading a book; by persevering, he overcame the difficulties and finally succeeded.

  • Xiao Hong ran into many difficulties while painting, and finally finished the painting.

Humans can easily use logical judgment to see that the two sentences express the same idea: perseverance. A large model, however, needs to be fed a great deal of data and learn to capture the relationships between elements, such as the relationship between two texts, or among several passages, and which passages are closer to one another, before it can reach such logical conclusions.

In the example above, if sentence 2 were changed to “Xiao Ming reads a book, meets many difficulties, but fails to finish it in the end,” sentences 1 and 2 would be very similar in wording yet express completely different meanings.

The large model has to learn how to judge such relationships. Xie Lingxi explained: “The connection between representation (the simple features extracted directly from text or images) and semantics is very complicated. People can understand it, but making a computer understand it and building a computational model for it is very difficult. Large models hope to accomplish this through a huge number of parameters and large-scale training on big data.”

If large models are to understand our logical world, what lies beyond the parameters is also crucial.

First, every optimization of a 100-billion-parameter model is extremely costly, and touching one part affects the whole. Xie Lingxi and his team therefore chose to add prompt-based tasks in the pre-training stage, reducing the difficulty of fine-tuning and addressing the problems that previously made fine-tuning for different industry scenarios so hard. When downstream data is plentiful, the lower fine-tuning difficulty lets the model keep improving with more data; when downstream data is scarce, it markedly improves few-shot learning.
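To make the idea of prompt-based tasks concrete, here is a minimal sketch (not Pangu’s actual implementation): a sentiment-classification example is wrapped in a natural-language template with a blank, and the pre-trained language model’s job reduces to predicting which “verbalizer” word fills the blank, so the downstream task looks like the pre-training task itself. The toy vocabulary and the random “language model” below are stand-ins.

```python
import torch

# Toy vocabulary; a real system would use the pre-trained model's tokenizer.
vocab = ["[MASK]", "the", "movie", "was", "great", "overall", "it", "good", "bad", "."]
tok2id = {t: i for i, t in enumerate(vocab)}

def toy_lm_logits(token_ids):
    """Stand-in for a pre-trained language model: per-position logits over the vocab.
    Here they are random; in practice they come from the pre-trained network."""
    torch.manual_seed(0)
    return torch.randn(len(token_ids), len(vocab))

# Prompt template: wrap the input so the task becomes filling in a blank,
# which matches what the model already learned to do during pre-training.
text = ["the", "movie", "was", "great", "."]
prompt = text + ["overall", "it", "was", "[MASK]", "."]
ids = [tok2id[t] for t in prompt]

# Verbalizer: each class is represented by a label word.
label_words = {"positive": "good", "negative": "bad"}

logits = toy_lm_logits(ids)
mask_pos = prompt.index("[MASK]")
scores = {label: logits[mask_pos, tok2id[word]].item()
          for label, word in label_words.items()}
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```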

Pangu NLP large model architecture

In addition, in terms of model structure, unlike the typical NLP large models trained by other companies, Pangu values not only generation ability but also stronger understanding ability. Huawei adopts an encoder-decoder architecture to ensure the Pangu model performs well at both generation and understanding.
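As a rough structural sketch of the encoder-decoder idea (dimensions and usage are illustrative, not Pangu’s actual configuration): the encoder produces contextual representations suited to understanding tasks, while the decoder generates text step by step on top of them.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 256, 1000
embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=8,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)

src = embed(torch.randint(0, vocab_size, (1, 16)))   # input sentence
tgt = embed(torch.randint(0, vocab_size, (1, 8)))    # partially generated output

# Understanding: the encoder alone yields contextual representations,
# e.g. pooled into a sentence vector for classification.
memory = transformer.encoder(src)
sentence_vec = memory.mean(dim=1)
cls_logits = nn.Linear(d_model, 3)(sentence_vec)     # toy 3-way classifier

# Generation: the decoder attends to the encoder output and, with a causal
# mask, predicts the next token.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(8)
dec_out = transformer.decoder(tgt, memory, tgt_mask=tgt_mask)
next_token_logits = nn.Linear(d_model, vocab_size)(dec_out[:, -1])
```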

Pangu CV large model

For the Pangu CV large model, Xie Lingxi also gave an example: how do you distinguish a picture of a white cat from one of a white dog? A human can look at the two images and immediately tell which is which; how does a large model handle it?

“During training, we need the model to learn the truly strong correlations between samples.” Xie Lingxi stressed that one of the most important things in an image is its hierarchical information. “When judging an image, we should first grasp its hierarchical information and quickly locate which parts are decisive, so that the algorithm can adaptively attend to the more important regions or content, making it easier to capture the relationships between samples. In these two images, white is clearly not the most important information; the animal itself is the dominant information.”

Pangu CV large model architecture

Based on this, the Pangu CV large model is the first to combine image discrimination and generation abilities: it can simultaneously serve low-level image processing and high-level semantic understanding, and it can incorporate industry knowledge during fine-tuning to adapt quickly to various downstream tasks.

In addition, to address the low learning efficiency and weak representations that large models and large data can cause, the Pangu CV large model’s optimization focuses mainly on data processing, architecture design, and model optimization during pre-training. At present, the Pangu CV large model has reached the industry’s best small-sample classification accuracy on the 1% and 10% labeled subsets of ImageNet.
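The 1%/10% ImageNet setting is typically evaluated by fine-tuning a pre-trained backbone on only that labeled fraction. Here is a minimal sketch of that protocol; the backbone, head, and data loader are placeholders, not Pangu’s actual pipeline:

```python
import torch
import torch.nn as nn
from torchvision import models

# Placeholder backbone; in practice this would load the self-supervised
# pre-trained CV model's weights.
backbone = models.resnet50(weights=None)
backbone.fc = nn.Identity()          # drop the original classification head
head = nn.Linear(2048, 1000)         # new head for the 1000 ImageNet classes

# Assume `small_labeled_loader` iterates over only the 1% (or 10%) labeled
# subset of ImageNet; constructing it is omitted here.
def finetune_small_fraction(small_labeled_loader, epochs=10):
    opt = torch.optim.SGD(list(backbone.parameters()) + list(head.parameters()),
                          lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in small_labeled_loader:
            logits = head(backbone(images))
            loss = loss_fn(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
```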

In the CV large model, in addition to common algorithms used across the industry, there are also algorithms developed by Huawei itself, such as forcibly injecting hierarchical information into the vision model so that it learns better. Behind each self-developed algorithm lies hard-won experience from the difficulties the team solved.

Developing large models is hard; luckily, they were there

Throughout the development of the Pangu large model there were many difficulties, such as the self-developed algorithms mentioned above, because algorithms are a core technology alongside architecture and data.

Xie Lingxi talked at length about one technical difficulty: whether for text or images, things that look similar at the representation level can be completely different at the semantic level.

“Starting from this problem, we found that visual features are captured hierarchically. Appearance-level features are concentrated in the shallow layers, while semantics are reflected more in the deep layers. So we need to align these features across different levels in order to learn better. Likewise, in NLP you need to focus the model’s attention on the right place, and that key point is found through a carefully designed neural network rather than by picking an arbitrary span of text.”
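As a rough sketch of what “aligning features across levels” can look like in practice (a generic construction, not Pangu’s algorithm): intermediate feature maps are pulled from a shallow and a deep layer with forward hooks, projected to a common dimension, and encouraged to agree between two augmented views of the same image.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

backbone = models.resnet18(weights=None)
features = {}

def hook(name):
    def fn(module, inp, out):
        features[name] = out
    return fn

# Capture a shallow (appearance-level) and a deep (semantic-level) feature map.
backbone.layer1.register_forward_hook(hook("shallow"))
backbone.layer4.register_forward_hook(hook("deep"))

# Project both levels into a common embedding space.
proj_shallow = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 128))
proj_deep = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, 128))

def multi_level_alignment_loss(view_a, view_b):
    """Encourage two augmented views of the same image to agree
    at both the shallow and the deep feature level."""
    embs = []
    for view in (view_a, view_b):
        backbone(view)
        embs.append((proj_shallow(features["shallow"]), proj_deep(features["deep"])))
    loss = 0.0
    for level in range(2):
        za = F.normalize(embs[0][level], dim=1)
        zb = F.normalize(embs[1][level], dim=1)
        loss = loss + (1 - (za * zb).sum(dim=1)).mean()   # cosine-distance agreement
    return loss

# Example: two lightly perturbed "views" of a batch of images.
imgs = torch.randn(4, 3, 224, 224)
loss = multi_level_alignment_loss(imgs, imgs + 0.01 * torch.randn_like(imgs))
```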

This is a simplified explanation; the technical details are more complex and hard to summarize. And this problem is only the tip of the iceberg: throughout the development of the large model, Xie Lingxi and his team had to keep digging into the essence of such problems and solving similar technical issues.

Another tricky issue is debugging and tuning the model. To gain more knowledge from pre-training, the Pangu model’s data will only keep growing, which demands ever higher performance from the underlying hardware platform. At that point, what limits pre-training is no longer the model itself but whether the infrastructure is good enough.

For example, running a large model requires enough machines to supply sufficient compute, but a single machine can hold at most 8 GPU cards. The NLP large model needs thousands of GPUs, and even the smaller CV large model needs 128 GPUs running simultaneously, so there must be a very good mechanism for allocating resources sensibly.
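In back-of-the-envelope terms (the helper below is purely illustrative, and 2048 is just a stand-in for “thousands of GPUs”): with 8 GPUs per machine, a 128-GPU CV job already spans 16 machines and an NLP-scale job spans hundreds, which is why resource allocation has to be automated.

```python
import math

GPUS_PER_MACHINE = 8  # one machine can hold at most 8 GPU cards

def machines_needed(total_gpus: int) -> int:
    """How many machines a job needs, assuming whole machines are allocated."""
    return math.ceil(total_gpus / GPUS_PER_MACHINE)

print(machines_needed(128))    # CV large model: 16 machines
print(machines_needed(2048))   # an NLP-scale job with thousands of GPUs: 256 machines
```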

One cannot make bricks without straw, and at the beginning Xie Lingxi was also troubled: what would support the operation of such a large model? Practice has shown that the multi-machine, multi-card Yundao platform provided by Huawei Cloud played a great role for Pangu. The Yundao platform can allocate resources easily, preventing infrastructure problems from holding up Pangu’s development, and it stores data on the servers in the most appropriate format so that it can be read more efficiently during training.

Beyond that, the difficulties of large models are also engineering difficulties. Huawei’s CANN, the MindSpore framework, and the ModelArts platform are jointly optimized to fully unleash compute power and provide strong support for the Pangu large model:

  • For underlying operator performance, operator quantization and operator fusion optimizations based on Huawei CANN improve single-operator performance by more than 30%.
  • Huawei MindSpore innovatively adopts multi-dimensional automatic hybrid parallelism combining pipeline parallelism, model parallelism, and data parallelism, greatly reducing manual coding effort and improving cluster linearity by 20%. (For a detailed interpretation of these key technologies, see the MindSpore open-source framework article on how the first 100-billion-parameter, terabyte-memory Chinese pre-trained language model was “forged.”) A rough sketch of how these parallel dimensions combine is given after this list.
  • The ModelArts platform provides exascale compute scheduling and, combined with the physical network topology, offers dynamic routing planning, giving large-model training optimal network communication.
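As promised above, here is a generic sketch of the bookkeeping behind hybrid parallelism (not MindSpore’s actual API): the product of the data-, model(tensor)-, and pipeline-parallel degrees must equal the total number of devices, and every device is assigned one coordinate along each dimension.

```python
from dataclasses import dataclass

@dataclass
class ParallelPlan:
    data_parallel: int      # replicas of the model, each on a different data shard
    model_parallel: int     # ways each layer's tensors are split across devices
    pipeline_parallel: int  # stages the layer stack is cut into

def assign_devices(total_devices: int, plan: ParallelPlan):
    """Map every device rank to a (data, pipeline, model) coordinate."""
    product = plan.data_parallel * plan.model_parallel * plan.pipeline_parallel
    assert product == total_devices, "parallel degrees must multiply to the device count"
    mapping = {}
    for rank in range(total_devices):
        dp = rank // (plan.model_parallel * plan.pipeline_parallel)
        rest = rank % (plan.model_parallel * plan.pipeline_parallel)
        pp = rest // plan.model_parallel
        mp = rest % plan.model_parallel
        mapping[rank] = (dp, pp, mp)
    return mapping

# Example: 128 GPUs split as 8-way data x 4-way pipeline x 4-way model parallelism.
plan = ParallelPlan(data_parallel=8, model_parallel=4, pipeline_parallel=4)
mapping = assign_devices(128, plan)
print(mapping[0], mapping[127])   # coordinates of the first and last device
```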

However, as everyone knows, what makes large models “large” is “big data plus big models,” and that brings high training costs: GPT-3, for example, is estimated to cost about $12 million per training run. Xie Lingxi sighed, “Tuning a large model is very difficult. Before every training run, we have to do verification work on many small-scale scenarios in advance. Every time you launch training, you need to make sure there are no bugs when it starts.”
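A common form of such pre-flight verification (sketched here generically; this is not a description of the Pangu team’s exact checks) is to run the full training loop at a tiny scale first, for example confirming the model can drive the loss down on a single fixed batch before committing to an expensive full run.

```python
import torch
import torch.nn as nn

def smoke_test(model: nn.Module, batch, steps: int = 300) -> bool:
    """Cheap sanity check before an expensive run: the loss on one fixed
    batch should drop sharply if the data pipeline, loss, and optimizer
    are wired up correctly."""
    x, y = batch
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    first = None
    for _ in range(steps):
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        first = loss.item() if first is None else first
    return loss.item() < 0.2 * first   # expect a large drop when overfitting one batch

toy_model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 5))
toy_batch = (torch.randn(16, 32), torch.randint(0, 5, (16,)))
assert smoke_test(toy_model, toy_batch), "something is wrong; do not launch the big run"
```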

Born for “application”: Pangu empowers more users

The breakthroughs in every aspect of large-model training also pave the way into the intelligent era for industries that lack large amounts of data. As Professor Tian Qi, chief scientist of AI at Huawei Cloud and IEEE Fellow, said when releasing the Pangu model, Pangu was born for application across industries and has unprecedented generality, whether in 2B or 2C scenarios.

Industry knowledge comes from industry data, so the Pangu team uses large amounts of industry speech and text data to fine-tune the model, improving its understanding of industry-specific intent and knowledge.

Take the Pangu CV large model as an example: it has shown very strong applicability in power-line inspection. It is pre-trained on large amounts of unlabeled power-industry data and then fine-tuned on a small number of labeled samples, an efficient development mode that saves manual labeling time. For model generality, combining Pangu’s automatic data augmentation with a category-adaptive loss function optimization strategy greatly reduces model maintenance costs; a hedged sketch of such a loss appears below.
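The article does not spell out the category-adaptive loss, so the following is only a plausible stand-in: a cross-entropy whose per-class weights adapt to class frequencies, so rare defect categories are not drowned out by common ones.

```python
import torch
import torch.nn as nn

def class_adaptive_ce(class_counts: torch.Tensor) -> nn.CrossEntropyLoss:
    """Cross-entropy with inverse-frequency class weights; a simple example
    of adapting the loss to the category distribution (illustrative only)."""
    weights = class_counts.sum() / (len(class_counts) * class_counts.float())
    return nn.CrossEntropyLoss(weight=weights)

# Example: an inspection dataset where class 2 (a rare defect) is underrepresented.
counts = torch.tensor([5000, 3000, 50])
loss_fn = class_adaptive_ce(counts)
logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
print(loss_fn(logits, labels))
```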

Xie Lingxi also said that beyond industry applications, the Pangu model is gradually coming online in the AI asset sharing community (AI Gallery) for developers, and an invitation-based trial program will open later, so stay tuned. Pangu will provide accessible, easy-to-use workflows on the platform: a developer with some experience can start from a workflow and do more customized development, better unleashing the power of the pre-trained model; a beginner who just wants to use a large model for simple AI development will get a more approachable interface, so that things can be done in a drag-and-drop way. Later, Pangu will also launch a series of courses guiding developers to build applications on top of the Pangu model in real scenarios.

On the other hand, Pangu wants to grow together with developers. “The large model is just a lever for getting AI applied in real scenarios. It helps users speed up training and shorten training time, and as the number of applications built on the model grows, the cost to each user naturally falls. The development of Pangu is far from something our team can do alone,” said Xie Lingxi. “We also need to build the ecosystem together with developers.”

Finally

When it comes to the future of the Pangu large model, Xie Lingxi has a simple, small goal: to push Pangu to the next technological breakthrough point. The AI large model is the highest stage of deep learning so far; the road ahead may be a long, flat stretch, and everyone is waiting for the day of the next jump. Huawei Cloud has been working hard to solve the real problems AI developers encounter with its original technologies, and the most fundamental purpose is to let AI land in thousands of industries.

The road is long and hard, but those who keep walking will arrive.

Just as its name suggests, Huawei hopes that, with the Pangu large model as a starting point, AI can be pushed to unprecedented heights, carrying us toward the next generation of AI and splitting open the “chaos” on the road ahead.
