Excerpted from Google Cloud, by Kaz Sato. Compiled by Heart of Machine (siyuan, Liu Xiaokun).

Many readers may not be able to tell the difference between a CPU, a GPU, and a TPU, so in this blog post Google Cloud briefly introduces the differences and discusses why the TPU can accelerate deep learning.

The Tensor Processing Unit (TPU) is a custom ASIC chip designed from the ground up by Google specifically for machine learning workloads. TPUs provide the computing power behind Google's major products, including Translate, Photos, Search, Assistant, and Gmail. Cloud TPU offers the TPU as a scalable cloud computing resource to all developers and data scientists running cutting-edge ML models on Google Cloud. At Google Next '18, we announced that TPU v2 is now available to all users, including those on the free trial, and that TPU v3 is now available in private beta.

Third-generation Cloud TPU

The screenshot above is from tpudemo.com, where a slide deck explains the characteristics and definition of the TPU. In this article, we'll focus on a few specific properties of the TPU.


How a neural network works

Before we compare CPUs, GPUs, and TPUs, let's look at exactly what kind of computation machine learning, and neural networks in particular, require. As shown below, suppose we use a single-layer neural network to recognize handwritten digits.

If the image is a 28×28-pixel grayscale image, it can be flattened into a vector of 784 values. A neuron takes all 784 values and multiplies each of them by a parameter value (the red line above) to recognize the digit "8." The parameter values act like a "filter" that extracts features from the data, measuring how similar the input image is to an "8":

This is the most basic way to understand how a neural network classifies data: it multiplies the data by the corresponding parameters (the two colored dots above) and adds the products together (the sum is collected on the right side of the image above). The class with the highest resulting value indicates that the input data matches its parameters best, and is therefore the most likely correct answer.

In short, a neural network needs to perform a huge number of multiplications and additions between data and parameters. We usually organize these multiplications and additions into matrix operations, the kind covered in a college linear algebra course. So the key question is: how can we perform large matrix operations quickly while consuming as little energy as possible?
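To make this concrete, here is a minimal NumPy sketch of the computation just described: a flattened 784-element input multiplied by a parameter matrix, plus a bias, giving one score per digit class. The shapes and variable names are illustrative assumptions, not the actual layout used by any particular hardware.

```python
import numpy as np

# A 28x28 grayscale image flattened into a 784-element vector
# (random values stand in for real pixel intensities).
x = np.random.rand(784)

# Learned parameters for a single-layer classifier:
# one weight column and one bias per digit class (0-9).
W = np.random.rand(784, 10)   # weights act as "filters" for each digit
b = np.random.rand(10)        # biases

# The core work is just multiplications and additions:
# every score is the sum of 784 products, i.e. a matrix operation.
scores = x @ W + b

# The class with the highest score is the predicted digit.
prediction = np.argmax(scores)
print(prediction)
```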


How the CPU works

So how does a CPU handle such a large matrix computation task? A CPU is a general-purpose processor based on the von Neumann architecture, which means it works with software and memory as follows:

How the CPU works: This GIF only shows conceptual principles and does not reflect actual CPU computing behavior.

The CPU's greatest advantage is its flexibility. Thanks to the von Neumann architecture, we can load any software for millions of different applications. We can use a CPU for word processing, controlling rocket engines, executing bank transactions, or classifying images with a neural network.

However, because the CPU is so flexible, the hardware doesn't know what the next calculation will be until it reads the next instruction from the software. For every single calculation, the CPU must store the result in memory (its registers or L1 cache). This memory access becomes the downside of the CPU architecture known as the von Neumann bottleneck. Even though the huge number of operations in a neural network is completely predictable, the CPU's arithmetic logic units (ALUs, the components that hold and control the multipliers and adders) can only execute them one by one, accessing memory every time, which limits overall throughput and consumes significant energy.
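As a rough illustration of that bottleneck (a conceptual sketch, like the animation above, not how any real CPU executes instructions), here is the same kind of dot product computed one multiply-add at a time, with every step reading its operands from and writing its running total back to memory:

```python
# Conceptual sketch of a CPU computing one output score:
# a single ALU performs the multiply-adds one after another,
# touching memory (registers/cache) at every step.
def dot_product_sequential(x, w):
    total = 0.0                      # running sum held in a register
    for i in range(len(x)):          # 784 iterations, strictly one at a time
        a = x[i]                     # load input value from memory
        p = w[i]                     # load parameter from memory
        total = total + a * p        # compute, then store the result back
    return total
```

For a 784-pixel image and 10 digit classes, that is 7,840 multiply-add steps executed strictly one after another.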


How the GPU works

To achieve higher throughput than a CPU, the GPU uses a simple strategy: put thousands of ALUs in a single processor. A modern GPU typically has 2,500-5,000 ALUs in a single processor, which means it can perform thousands of multiplications and additions simultaneously.

How the GPU works: This animation is a conceptual demonstration only and does not reflect how a real processor works.

This GPU architecture works well for applications with massive parallelism, such as the matrix multiplications in a neural network. In fact, on a typical deep learning training workload, a GPU can achieve throughput an order of magnitude higher than a CPU. This is why the GPU is currently the most popular processor architecture for deep learning.

But the GPU is still a general-purpose processor that has to support millions of different applications and pieces of software. This brings us back to the fundamental problem: the von Neumann bottleneck. For every calculation in its thousands of ALUs, the GPU needs to access registers or shared memory to read operands and store intermediate results. Because the GPU performs more parallel computation on its ALUs, it also spends proportionally more energy on memory access, and the complex wiring increases the GPU's physical footprint.
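The following sketch illustrates the idea conceptually (illustrative only; real GPUs schedule threads and memory very differently): every output element of a matrix product is an independent multiply-accumulate that could run on its own ALU, yet each one still reads operands from registers or shared memory and writes its result back.

```python
import numpy as np

def matmul_gpu_style(X, W):
    """Conceptual sketch: each (i, j) output element is independent,
    so thousands of ALUs can work on them at the same time.
    Each ALU still loads operands from and stores results to memory."""
    n, k = X.shape
    _, m = W.shape
    out = np.zeros((n, m))
    # On a GPU, the iterations of these two outer loops would each be
    # assigned to a separate ALU/thread and executed in parallel.
    for i in range(n):
        for j in range(m):
            acc = 0.0                      # per-thread register
            for t in range(k):
                acc += X[i, t] * W[t, j]   # operand loads from memory each step
            out[i, j] = acc                # store the result back to memory
    return out
```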


How the TPU works

When Google designed the TPU, we built a domain-specific architecture. That means that instead of designing a general-purpose processor, we designed a matrix processor dedicated to neural network workloads. A TPU can't run word processors, control rocket engines, or execute bank transactions, but it can handle the massive numbers of multiplications and additions that neural networks require, at very high speed, while consuming far less energy and occupying a smaller physical footprint.

The main reason is a drastic reduction of the von Neumann bottleneck. Because the processor's primary task is matrix processing, the TPU's hardware designers know every step of the computation in advance. So they placed thousands of multipliers and adders and connected them to each other directly, building a physical matrix of these operators. This is called a systolic array architecture. In the case of Cloud TPU v2, there are two 128×128 systolic arrays, for a total of 2 × 128 × 128 = 32,768 ALUs operating on 16-bit floating-point values in a single processor.

Let's look at how a systolic array performs neural network calculations. First, the TPU loads the parameters from memory into the matrix of multipliers and adders.

The TPU then streams the data in from memory. As each multiplication is executed, its result is passed to the next multiplier-adder, where it is added to the running sum. The output is therefore the sum of all the products of data and parameters. During this entire massive sequence of calculations and data passing, no memory access is required at all.
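Here is a minimal, weight-stationary toy simulation of that idea in Python (an illustration of the concept only, not TPU hardware or its instruction set): the weights are loaded into the cells once, the input is streamed through, and each cell multiplies, adds the partial sum handed to it by its neighbor, and passes the new partial sum along instead of writing it back to memory.

```python
import numpy as np

def systolic_dot(x, w):
    """Toy simulation of one column of a systolic array.
    Each cell holds one weight w[i], loaded once from memory.
    The input value x[i] arrives at cell i, which multiplies it by its
    weight, adds the partial sum handed down by the previous cell, and
    passes the new partial sum to the next cell, not back to memory."""
    partial_sum = 0.0
    for i, cell_weight in enumerate(w):   # each iteration = one hardware cell
        partial_sum = partial_sum + x[i] * cell_weight
    return partial_sum

# Streaming a full input through 10 such columns reproduces the
# matrix multiplication from the earlier example.
x = np.random.rand(784)
W = np.random.rand(784, 10)
scores = np.array([systolic_dot(x, W[:, j]) for j in range(10)])
assert np.allclose(scores, x @ W)
```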

This is why the TPU can achieve high computational throughput on neural network calculations with far lower power consumption and a smaller physical footprint.


Benefit: Cost reduced to 1/5

The benefit of the TPU architecture, then, is reduced cost. Here is the pricing for Cloud TPU v2 as of August 2018 (the time of writing).

Cloud TPU v2 pricing, as of August 2018.

Stanford University has released DAWNBench, a benchmark suite for deep learning training and inference. There you can find combinations of different tasks, models, and computing platforms, along with their respective benchmark results.

DAWNBench: dawn.cs.stanford.edu/benchmark/

At the end of the DAWNBench competition in April 2018, the lowest training cost achieved with non-TPU processors was $72.40 (training ResNet-50 to 93% accuracy using spot instances). With Cloud TPU v2 preemptible pricing, you can get the same training result for $12.87, which is less than one-fifth of the non-TPU cost ($12.87 / $72.40 ≈ 0.18). This is the power of a domain-specific architecture for neural networks.

The original link: cloud.google.com/blog/produc…