Planning editor | Debra
Compiled by | Debra
Edited by | Natalie, Vincent
AI Front introduction: Today, OpenAI released an analysis showing that the amount of computing power used in the largest AI training runs has grown exponentially since 2012, doubling on average every 3.5 months (by comparison, Moore's Law has an 18-month doubling period). Since 2012, this compute has increased more than 300,000-fold (under Moore's Law it would have grown only about 12-fold). Growing computing power has long been considered a key factor in AI's progress, and OpenAI's analysis shows just how dramatically the amount of compute consumed by AI has increased. Some, however, have pushed back: not all of AI's progress is due to increased computing power!






A controversial analysis

When OpenAI posted this analysis, many people praised it, but it also sparked debate, and some commenters even expressed concern about how quickly computing power is growing. One commenter asked:

Will the singularity come sooner than expected?

Faster than Moore’s Law. That’s scary.

Super AI won't be found in smartphone chips but in massive cloud-based computers. That means the speed at which these companies improve their computing capabilities matters more.

Some experts disagree with some of the points in the article:

Ben Recht, an associate professor at the University of California, Berkeley, mocked on Twitter what he saw as overblown interpretations of the increase in computing power, and posted a series of comments to make his point:

If you take out the research work of Alphabet, Google’s deep-pocketed parent, the trend is entirely different.

Going from VGG to ResNets did not require exponential growth in computing power.

Why not say that the complexity of translation has been reduced by moving from LSTMs with attention to attention-only models?

The applications most responsible for the FLOPS increase here (neural architecture search and Dota) aren't exactly breakthroughs.

He also thought Jeremy Howard’s comments on the article were spot on.

I think this is totally backwards. Engineers love to play this game, so they play it with whatever resources they can get their hands on. What this chart shows is that deep learning researchers have more resources available to them, that's all.

There was a lot of discussion under his tweet:

Jonathan Raiman: Isn't the point of the article that this trend shows we have more and more ways to use data-center-scale computing power for machine learning training, and that these systems are not single machines and can therefore outpace Moore's Law? This doesn't contradict Parkinson's Law, but without these advances it would be hard to train 1v1 Dota or AlphaZero.

Will Knight: OpenAI's analysis is great, but I don't think AI performance always improves with computing power. If it did, more data and GPUs would solve every problem.

Others wrote:

Smerity: I'm inclined to believe there's a downside to MML, but I think their point is: is there an underlying technique (architecture search, automation, ...) that can apply huge amounts of computing power to a specific task? And the amount of computing power such a task can absorb is astronomical.

AlphaGo/AlphaGo Zero looks so polished in papers and blog posts, yet the amount of computation required to win a game is staggering; let's hope similar technological improvements can be put to practical use...

Jason Taylor: It's a multiplier effect: Moore's Law (for a single GPU) and capital coming into the space (buying more GPUs).

So what does this controversial article actually say?

AI Front's translation of the OpenAI post follows:

AI computing power has increased by 300,000 times in six years, far surpassing Moore’s Law

[Chart: total compute used in selected AI training runs over time, viewable on logarithmic and linear scales]

The chart shows the total amount of compute, in petaflop/s-days, used to train selected results that are relatively well known, used a large amount of compute for their time, and gave enough information to estimate the compute used. A petaflop/s-day (PFS-day) consists of performing 10^15 neural network operations per second for one day, or a total of about 10^20 operations. This compute-time product serves as a convenient measure, similar to kilowatt-hours for energy. Rather than measuring the hardware's peak theoretical FLOPS, we estimate the number of operations actually performed. We count adds and multiplies as separate operations, we count any add or multiply as a single operation regardless of numerical precision (which makes "FLOP" a slight misnomer), and we ignore ensemble models (http://web.engr.oregonstate.edu/~tgd/publications/mcs-ensembles.pdf). Example calculations behind this chart are provided in the appendix at the end of this article. The doubling time for the line of best fit shown is 3.43 months.
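To make the unit concrete, here is a minimal sketch in Python of the conversion implied by the definition above (10^15 operations per second sustained for one day); the operation count at the bottom is purely illustrative, not a figure from the article:

```python
# Converting a raw operation count into petaflop/s-days (PFS-days),
# following the definition above: 10^15 neural-network operations per
# second sustained for one day, i.e. roughly 10^20 operations in total.

SECONDS_PER_DAY = 24 * 60 * 60              # 86,400 seconds
OPS_PER_PFS_DAY = 1e15 * SECONDS_PER_DAY    # ~8.64e19 operations

def to_pfs_days(total_ops: float) -> float:
    """Convert a total operation count (adds and multiplies counted as
    separate single operations) into PFS-days."""
    return total_ops / OPS_PER_PFS_DAY

if __name__ == "__main__":
    print(f"1 PFS-day ~ {OPS_PER_PFS_DAY:.3g} operations")
    # Hypothetical training run that performed 5e21 operations in total.
    print(f"5e21 operations ~ {to_pfs_days(5e21):.1f} PFS-days")
```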

Overview

AI progress is driven by three things: algorithmic innovation, data (either supervised data or interactive environments), and the computing power available for training. Algorithmic innovation and data are hard to measure, but computing power is quantifiable, which makes it one measurable driver of AI progress. Of course, using large-scale compute sometimes just exposes the shortcomings of our current algorithms. But at least in many existing domains, more computing power often leads to better performance, and it is often complementary to algorithmic advances.

In this article, we argue that the relevant number is not the speed of a single GPU, nor the capacity of the largest data center, but the amount of computation used to train a single model; this is the number most relevant to model performance. Compute per model differs greatly from total bulk compute, because limits on parallelism (http://learningsys.org/nips17/assets/slides/dean-nips17.pdf), in both hardware and algorithms, have constrained how large a model can be and how much it can usefully be trained. Of course, important breakthroughs are still made with modest amounts of compute; this article covers computing power only.

The trend amounts to roughly a tenfold increase every year. It has been driven partly by custom hardware that performs more operations per second at a given price (GPUs and TPUs), but mostly by researchers repeatedly finding ways to use more chips in parallel and being willing to pay the economic cost of doing so.
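As a quick sanity check on the figures quoted here (a sketch: the 3.43-month doubling time and the 300,000x growth come from the article, the rest is arithmetic), a 3.43-month doubling time corresponds to roughly an 11x increase per year and reaches 300,000x in a little over five years:

```python
import math

# Sanity check on the trend: a 3.43-month doubling time implies roughly
# an 11x increase per year and ~300,000x growth in a bit over five years.

DOUBLING_MONTHS = 3.43  # doubling time of the chart's best-fit line

growth_per_year = 2 ** (12 / DOUBLING_MONTHS)
months_to_300k = DOUBLING_MONTHS * math.log2(300_000)

print(f"Implied growth per year: {growth_per_year:.1f}x")  # ~11x
print(f"Months to reach 300,000x: {months_to_300k:.0f}")   # ~62 months
```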

Eras

The chart shows that the history of computing power can be roughly divided into four different eras:

  • Before 2012: Using GPUs for ML was uncommon, making any of the results in the chart difficult to achieve.

  • 2012-2014: Infrastructure for training on many GPUs was uncommon, so most results used 1-8 GPUs rated at 1-2 TFLOPS, for a total of 0.001-0.1 PFS-days.

  • 2014-2016: Large-scale results used 10-100 GPUs rated at 5-10 TFLOPS, totaling 0.1-10 PFS-days. Diminishing returns on data parallelism meant that larger training runs had limited value.

  • 2016-2017: Approaches that allow greater algorithmic parallelism (such as huge batch sizes, architecture search, and expert iteration), together with dedicated hardware (such as TPUs and faster interconnects), have significantly raised these limits, at least for some applications.

AlphaGo Zero/AlphaZero is the most visible public example of massive algorithmic parallelism, but many other applications at this scale are now algorithmically possible and may already be happening in production environments.

Looking ahead

There are good reasons to believe that the trend shown in the chart is likely to continue. Many hardware startups are developing AI-specific chips, and some claim they will dramatically increase FLOPS per watt (which correlates with FLOPS per dollar) within the next one to two years. There may also be gains from simply reconfiguring hardware to do the same number of operations at lower cost. On the parallelism side, many of the algorithmic innovations described above can in principle be combined multiplicatively, for example architecture search combined with massively parallel SGD.

On the other hand, cost ultimately limits parallelism, and physics limits chip efficiency. We know that today's largest training runs use hardware that costs millions of dollars to purchase (though the amortized cost is much lower). But most neural-network compute today is still spent on inference (deployment) rather than training, which means companies can repurpose or afford to buy many more chips for training. So, given sufficient economic incentive, we could see far more massively parallel training runs, keeping this trend going for several more years. With worldwide hardware budgets of around $1 trillion a year, absolute limits remain far away. Overall, given the data above, the precedent for exponential gains in compute, the work on ML-specific hardware, and the economic incentives at play, we do not see this trend going away anytime soon.

Past trends are not, by themselves, enough to predict how long a trend will continue or what will happen while it does. But even a reasonable expectation of rapid growth in computing power means we need to start addressing both the safety and the malicious use of AI today. Foresight is essential for responsible policymaking and technological development, and we must stay ahead of these trends rather than belatedly react to them.

Appendix: Methods

We used two methods to generate these data points. When we had enough information, we directly counted the number of FLOPs (adds and multiplies) in the described architecture per training example and multiplied by the total number of forward and backward passes during training. When we did not have enough information to count FLOPs directly, we looked at GPU training time and the total number of GPUs used, and assumed a utilization efficiency (usually 0.33). For most papers we were able to use the first method, but for some we used the second, and we checked the two for consistency whenever possible. The calculations are not exact; the goal is accuracy within a factor of about two to three. Some sample calculations follow.

Method 1 example: counting operations in the model

This method is straightforward when the authors give the number of operations in a forward pass, as in the ResNet paper (specifically, the ResNet-151 model, https://arxiv.org/pdf/1409.4842.pdf).

Operation counts can also be computed programmatically in known deep learning frameworks, or we can simply count them by hand. If a paper provides enough information for this calculation, the result can be quite accurate, but in some cases papers do not contain all the necessary information and the authors are unable to disclose it.
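As an illustration of this counting approach, here is a minimal sketch; the per-forward-pass operation count, dataset size, and epoch count below are hypothetical placeholders rather than figures from any particular paper, and the factor of 3 for the combined forward and backward pass is a rough assumption:

```python
# Sketch of Method 1: estimating training compute from the operation count
# of a single forward pass. All numeric inputs are hypothetical placeholders.

SECONDS_PER_DAY = 86_400
OPS_PER_PFS_DAY = 1e15 * SECONDS_PER_DAY   # 1 PFS-day ~ 8.64e19 operations

def training_compute_pfs_days(forward_pass_ops: float,
                              examples: int,
                              epochs: int,
                              fwd_bwd_factor: float = 3.0) -> float:
    """Total training compute in PFS-days.

    forward_pass_ops  adds + multiplies in one forward pass
    examples          number of training examples in the dataset
    epochs            number of passes over the dataset
    fwd_bwd_factor    rough multiplier for forward plus backward pass
    """
    total_ops = forward_pass_ops * fwd_bwd_factor * examples * epochs
    return total_ops / OPS_PER_PFS_DAY

# Hypothetical example: 1e10 ops per forward pass, 1.2M examples, 100 epochs.
print(f"{training_compute_pfs_days(1e10, 1_200_000, 100):.3f} PFS-days")
```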

Method 2 example: GPU time

If we cannot count operations directly, we can look at the number of GPUs used and the training duration, and estimate the compute from an assumed GPU utilization rate. We emphasize that we do not count peak theoretical FLOPS here; instead, we use an assumed fraction of theoretical FLOPS to approximate actual FLOPS. As a rule of thumb, we assume 33% utilization for GPUs and 17% for CPUs, unless we have more specific information (for example, if we spoke with the authors or the work was done at OpenAI).

For example, the AlexNet paper states that "our network takes between five and six days to train on two GTX 580 3GB GPUs". Under our assumptions, the total amount of computation can then be estimated directly from these figures; a sketch of the arithmetic follows the next paragraph.

Our goal is only to estimate the order of magnitude. In practice, when both methods are available, they usually agree well (for AlexNet, we can also count the operations directly).
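A minimal sketch of this estimate for the AlexNet example above follows; the GPU count, the roughly 5.5-day training time, and the 33% utilization come from the text, while the GTX 580's peak throughput (about 1.5e12 FLOPS single precision) is an assumed figure for illustration:

```python
# Sketch of Method 2: estimating training compute from GPU time.
# GPU count, days, and utilization follow the AlexNet example above;
# the GTX 580 peak throughput is an assumed value for illustration.

SECONDS_PER_DAY = 86_400
OPS_PER_PFS_DAY = 1e15 * SECONDS_PER_DAY   # 1 PFS-day ~ 8.64e19 operations

def gpu_time_compute_pfs_days(num_gpus: int,
                              peak_flops_per_gpu: float,
                              days: float,
                              utilization: float = 0.33) -> float:
    """Estimate training compute in PFS-days from GPU count and time."""
    total_ops = (num_gpus * peak_flops_per_gpu * utilization
                 * days * SECONDS_PER_DAY)
    return total_ops / OPS_PER_PFS_DAY

# Two GTX 580s (assumed ~1.5e12 peak FLOPS each), ~5.5 days, 33% utilization.
print(f"{gpu_time_compute_pfs_days(2, 1.5e12, 5.5):.4f} PFS-days")
```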

Selected additional calculations

  • Dropout: https://arxiv.org/abs/1207.0580

  • Visualizing and Understanding Convolutional Networks: https://arxiv.org/abs/1311.2901

  • DQN: https://arxiv.org/abs/1312.5602

  • Seq2Seq: https://arxiv.org/abs/1409.3215

  • VGG: https://arxiv.org/pdf/1409.1556.pdf

  • DeepSpeech2: https://arxiv.org/abs/1512.02595

  • Xception

  • Neural architecture search

  • Neural machine translation

Appendix: Recent notable results obtained with modest amounts of compute

Large-scale computation is certainly not necessary to produce important results, and many significant recent results have used only modest amounts of compute. Below are some examples of results that used modest compute and provided enough information to estimate it. These estimates were made less rigorously than those above; for upper bounds, we made conservative guesses about any missing information, so they carry greater overall uncertainty. They are not important to our quantitative analysis, but we think they are interesting enough to be worth sharing:

Attention Is All You Need: 0.089 PFS-days (6/2017)

https://arxiv.org/abs/1706.03762

Adam optimizer: Less than 0.0007 PFS-days (12/2014)

https://arxiv.org/abs/1412.6980

Neural Machine Translation by Jointly Learning to Align and Translate: 0.018 PFS-days (09/2014)

https://arxiv.org/abs/1409.0473

GANs: less than 0.006 PFS-days (6/2014)

https://arxiv.org/abs/1406.2661

Word2Vec: less than 0.00045 PFS-days (10/2013)

https://arxiv.org/abs/1310.4546

Variational autoencoders: less than 0.0000055 PFS-days (12/2013)

https://arxiv.org/abs/1312.6114

Katja Grace, Geoffrey Irving, Jack Clark, Thomas Anthony and Michael Page contributed to this article.

Original link:

https://blog.openai.com/ai-and-compute/