The Robot Brains Podcast

Compiled by the OneFlow community

In an asset-heavy industry like chips, starting a company tests one's judgment of long-term technical trends and market demand, as well as the confidence and courage to act on that judgment.

Simon Knowles is a quintessential serial entrepreneur. His first two companies each operated for roughly ten years before being acquired, and Graphcore has now been building the IPU for almost seven years. Having developed processors for more than 30 years, Knowles knows the industry inside out.

After studying electrical engineering at Cambridge, he worked on early neural networks at a British government research laboratory. He then joined STMicroelectronics, where he led the microprocessor development group.

In the 1990s, Simon Knowles founded Element 14, a DSL chip business that was acquired in 2000 for $640 million by Broadcom, the American cable and wireless communications semiconductor company. Icera, a 3G cellular modem chip company founded by Simon Knowles and Nigel Toon in 2002, was sold to Nvidia in 2011 for $436 million.

Graphcore co-founders Nigel Toon (left) and Simon Knowles (right)

Simon Knowles and Nigel Toon had been thinking about starting another venture together, and they began raising money in 2013. In 2016, Graphcore was announced. The company develops the IPU (Intelligence Processing Unit), a new processor dedicated to AI computing, to help accelerate the development of machine intelligence products and services. Simon Knowles is co-founder, CTO, and Executive Vice President of the company.

To build a chip as capable as a GPU but more energy-efficient, they designed a processor with 1,216 cores and 24 billion transistors. In late 2018, their first IPU chip was launched, capable of processing 10,000 different images per second. In December 2020, Graphcore announced a $222 million Series E round, bringing its total funding to over $450 million.

Their goal is to enable machines to recognize data more complex than simple objects and to find their own ways to accomplish a given task.

On The Robot Brains Podcast, hosted by Pieter Abbeel, Simon Knowles described the background of AI computing, shared the founding story of Graphcore and the design philosophy behind the IPU, and offered predictions for the future of the AI chip industry.

The following conversation has been translated and edited by the OneFlow community without changing its original meaning.

1

The origin of my third business idea

Pieter: Arguably, the latest wave of AI was marked by the 2012 ImageNet image recognition challenge, in which AlexNet, a deep convolutional neural network proposed by Geoffrey Hinton and his students Alex Krizhevsky and Ilya Sutskever, won an overwhelming victory and set off a boom in neural network applications.

One could argue that this rise was largely due to Alex's use of a GPU instead of a CPU to train neural networks, which made it possible to train on more data more efficiently. Can you first explain the difference between a GPU and a CPU, and why GPUs are better suited to AI computing?

Simon: If you don't know much about processors, you may not realize how many types of microprocessors there are. Intel makes CPUs, which are highly versatile. GPUs are dedicated to graphics processing. And there are many other types of processors that work quite differently from CPUs or GPUs.

For example, network processors used in communications mainly handle tasks like packet processing and route lookup. There are also various kinds of digital signal processors (DSPs), which represent and process signals digitally. And there is the media processor my team and I developed, one of the first dedicated media processors made by STMicroelectronics; it could encode and decode Motion JPEG, MPEG-4, H.264, and other formats.

Why are these processors structured differently? There are three main differences between different types of processors:

The first is parallelism: different machines exploit different forms of parallelism to achieve higher performance. The second is the memory hierarchy, which largely depends on the kind of data structures being accessed and how they are accessed. The third is the shape and size of the arithmetic: since these workloads demand a lot of compute, you have to decide which data types and which operation widths the machine will support.

AI is probably the first application that demands high-performance processing of low-precision numbers. People like to compare AI to traditional high-performance computing (HPC), but in this respect the two are opposites.
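As a rough illustration of what that contrast means in practice, here is a minimal NumPy sketch (not tied to any particular chip) comparing HPC-style 64-bit arithmetic with AI-style 16-bit arithmetic; the matrix sizes are arbitrary assumptions for the example.

```python
import numpy as np

# HPC traditionally computes in 64-bit floats; AI workloads tolerate
# 16-bit values, often accumulating results in 32-bit.
rng = np.random.default_rng(0)
a = rng.standard_normal((256, 256))
b = rng.standard_normal((256, 256))

exact = a.astype(np.float64) @ b.astype(np.float64)                        # HPC-style precision
approx = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)  # AI-style precision

# The relative error is small compared with the noise already present in
# learned, probabilistic models, which is why low precision is acceptable.
rel_err = np.abs(approx - exact).max() / np.abs(exact).max()
print(f"max relative error with float16 inputs: {rel_err:.4f}")
```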

So what's different about GPUs? Today's GPU chips support separate threads running separate programs, and each thread has a vector data path roughly 1,000 bits wide. GPUs were originally designed to process arrays of pixels, or models of a 3D world projected onto them, so developers get exactly the kind of parallelism those tasks need. In other words, the GPU's memory hierarchy helps developers efficiently retrieve 2D or 3D objects from memory, and its arithmetic helps them efficiently process those objects. One of our design assumptions, by contrast, is that intelligent processing needs to deal with the internal structure of data, not just arrays or tensors.

Pieter: The ability to train larger neural networks on GPUs and get better results arrived suddenly in 2012. Coincidentally, I've heard that a private conversation you had at the Marlborough Tavern that year set your career on a new path. What happened?

Simon: You're talking about my meeting with Nigel Toon [co-founder of Graphcore]. He has been a good friend of mine for many years. Before that, we would get together to talk about the best way to work together in the future, and we eventually decided to start another chip company.

We actually came up with three directions: the then-fashionable Internet of Things (IoT), building heavy-duty servers with ARM cores to compete with x86, and artificial intelligence.

During that discussion at the Marlborough Tavern, we agreed that the IoT direction looked boring. What's interesting is what you do with the data you collect, and the answer to that is artificial intelligence. As for building heavy-duty servers with ARM cores, starting a business to compete with a chip giant like Intel is undoubtedly a difficult challenge. Of course, we all knew that ARM server chips were coming, and they have indeed arrived, but they're still a little hard to use.

The AI direction, however, we thought was distinctive enough, with enough market potential, to do something big. That discussion actually took place before AlexNet appeared; in other words, our idea was not inspired by that wave of neural networks. Interestingly, though, Hinton and I had talked about neural networks back in the mid-1980s.

My first job was at a UK government-funded research facility called the Royal Signals and Radar Establishment, a name that sounds odd now, which worked on things like liquid crystal displays. One of the projects I was involved in at the time was trying to translate human language using neural networks, finding key phrases and things like that.

We built a neural network in a chip. We figured that if we could build a neural network 100 times bigger, we could do what we wanted, but we never got there. Now, of course, the job can be done, although 100 times turned out not to be nearly enough.

Back in the 1960s, people speculated that neural networks would explode in the 1980s, but the technology wasn't there to make it happen. Then in the 2010s, starting with Alex Krizhevsky's famous paper, neural networks suddenly took off; by then the conditions were finally in place.

Pieter: Given the need for the next generation of efficient computing, you predicted in the pub that AI would be the next wave of technology, and then the "AlexNet moment" came. Did you say to Nigel, "Remember our conversation, we're definitely going to work on this"? Did AlexNet influence your thinking?

Simon: No, the decision we made in early 2012 was to do artificial intelligence and to start a chip company. Back then, there was hardly a less fashionable venture than a chip startup, so obviously we had to take our time, pick our moment, and work out how to get started with venture capital.

We had built several successful chip companies before, so there were certainly investors who would back us. Even so, I spent a lot of time studying how to start this company and figuring out how to do it. Around 2014, we started the project and recruited engineers. Graphcore wasn't publicly announced until 2016, but a lot of groundwork came before that. We were originally incubated inside another chip company.

You might think that's a little strange, but we did the same thing with Element 14, which started out inside Acorn. At least it's a kind of startup model we had seen before.

As Graphcore got onto the right track, we realized that CNNs were growing so fast that we had to adapt our architecture to them. Of course, that meant our architecture needed some structures that other architectures already had, such as a matrix engine built into the processor.

2

Why do we need a dedicated AI chip

Pieter: Neural networks can now be trained on data sets for many application scenarios, voice, images, autonomous driving, and you appear to be building an AI chip for those scenarios. Compared with computing in the past, what makes dedicated AI chips possible, and why do they matter?

Simon: If you're going to develop a new type of processor, you really need to take a 20-year view. That's because you don't just have to develop the processor; you also have to develop the programming tools on which people will build applications. If the direction can't be sustained for 20 years, it isn't worth it. But if you ask where the processor world will be in 20 years, frankly, nobody knows.

A few years ago, we didn't know that neural networks would be as ubiquitous as they are today; in five years' time, we may discover something new outside the realm of neural networks. As a result, processor designers can't aim only at existing needs, as some chip startups are trying to do.

Some projects target only known needs, such as Google's TPU (Tensor Processing Unit), which is essentially a lightweight package of dedicated hardware around a dedicated processor.

I'm sure Google will keep doing that, because it's what their business needs right now; if they need something else tomorrow, they'll start building something different tomorrow. But they won't try to design chips for long-term needs the way we did. From our point of view, we can't be satisfied with just the prevailing demand.

So what do we need to aim for? What are the fundamental concerns? First, everything we have now relies on a lot of compute, and the only way to provide that much compute is massive parallelism; we can't expect individual chips to simply keep getting faster.

Second, low-precision arithmetic is fundamental. Learning from data is essentially a probabilistic task, and probabilistic constructions are not very precise: a lot of very imprecise numbers combine to let you do precise things.

Finally, the internal structure of the data matters. Some useful data is easily represented as vectors or two-dimensional matrices: vectors show up as sequences, like the sequence of words in a sentence, and matrices show up as arrays of pixels. That is a very convenient form of natural data. But it is also why our company is called Graphcore.

But what does a neural network do? It starts with data that looks like something, gradually converts it into some intermediate representation (frankly, we don't know what that representation is), and eventually, step by step through many such transformations, produces a data structure that may or may not be a simple 2D or 3D array.

That intermediate representation deserves a broader view: it carries more than the information contained in a set of points in a two-dimensional matrix.

In fact, this applies not only to intermediate representations but also to other kinds of inputs and outputs, such as the recommendation graphs built by companies like Amazon, or the huge relational structures, really many interlinked graphs, at companies like Facebook.

Nature is full of graphs, and the graph is almost the mother of all data structures. So another of our principles is that we must be able to handle generalized data structures like graphs. Whatever AI application you're building, and however you're building it, you may not be working at the level of a neural network, but you are probably working on a graph.

What are the characteristics of a graph from a processor's perspective? Graphs are very high dimensional, and that has a huge impact on the processor's memory structure. If the data is low dimensional, say one dimensional, two adjacent pieces of data in the sequence can sit next to each other in memory, so you can locate them easily. You can't do that in a thousand dimensions. When you embed a high-dimensional data structure into the low-dimensional memory we actually have, the data gets scattered. If you want to operate on a neighborhood of the graph, you must first gather the pieces you need. This is a classic sparsity problem.
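To make the gather problem concrete, here is a minimal Python sketch (not Graphcore's implementation; the array sizes and the random adjacency are illustrative assumptions) showing how a node's neighborhood must be collected from scattered memory before any dense compute can run.

```python
import numpy as np

# Node features for a graph live scattered through linear memory.
num_nodes, feat_dim = 100_000, 64
features = np.random.rand(num_nodes, feat_dim).astype(np.float32)

# Hypothetical neighbourhood of one node: neighbour indices are
# effectively random addresses, so there is no locality for a cache
# or a wide vector unit to exploit.
neighbours = np.random.randint(0, num_nodes, size=32)

gathered = features[neighbours]        # the "gather" step: irregular memory access
aggregated = gathered.mean(axis=0)     # only then can the dense arithmetic run
print(aggregated.shape)                # (64,)
```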

That's the bet we're making on the future: we decided that the ultimate data structure would be a very complex graph.

We didn't expect the field of neural networks to become so ubiquitous. So far, I think there have been two "AlexNet moments": one is the rise of the AlexNet network itself, the other is the breakthrough of the Transformer. What's interesting is not only the Transformer itself but, more importantly, that it can be trained in an unsupervised fashion.

Pieter: How can we use these neural networks to develop other interesting projects? Can you predict similar needs in the future?

Simon: Let me talk a little about different types of neural networks; my ideas here are not especially new. Signal processing has a basic set of building blocks that you can combine in different ways to do useful things: finite impulse response filters, infinite impulse response filters, samplers, Fourier transforms, and so on. By composing these basic modules, you can build the signal processing architectures used in telephony or other communications infrastructure.

Neural networks work much the same way. They are layered because we find that learning works best incrementally, as a stack of transformations each of which changes the data relatively little, and different kinds of layers turn out to be useful. Some layers, like local convolutions, are useful for things like images. Some layers can pull in information from essentially anywhere in their context; those are attention layers. Some layers ignore the dimensions of their context entirely and simply do something else, such as projecting the data into a higher-dimensional feature space and then shrinking it back down, which helps to decouple things.

I think these layers are like the components we use to build a DSP system; we're just learning which parts are useful. Once we have a complete kit of components, we can bolt them together in all sorts of ways.

Right now, I'm fascinated by the migration from traditional models like CNNs to attention models like the Transformer. It's nice to see how well the two work together on images: a few CNN layers at the front and a few Transformer layers at the back, which is probably the right idea.
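As a rough sketch of the hybrid Simon describes (assuming PyTorch is available; layer counts and sizes are illustrative, not taken from any published model), convolutions extract local features first and attention layers then mix information globally:

```python
import torch
import torch.nn as nn

class ConvThenTransformer(nn.Module):
    """Illustrative hybrid: conv layers in front, Transformer layers behind."""
    def __init__(self, dim=128, num_heads=4):
        super().__init__()
        self.conv = nn.Sequential(                      # local feature extraction
            nn.Conv2d(3, dim, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=2, stride=2),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.attn = nn.TransformerEncoder(encoder_layer, num_layers=2)  # global mixing

    def forward(self, images):                          # images: (B, 3, H, W)
        x = self.conv(images)                           # (B, dim, H/8, W/8)
        tokens = x.flatten(2).transpose(1, 2)           # (B, num_patches, dim)
        return self.attn(tokens)

out = ConvThenTransformer()(torch.randn(2, 3, 64, 64))
print(out.shape)  # torch.Size([2, 64, 128])
```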

Pieter: Graphcore has generated a lot of buzz, because its technology goes deep into the basic units at the core of AI computing. You mentioned that you want to build something that will still matter in 20 years, and there's certainly a relationship between what people are doing today and what they were doing five years ago. But the world is moving so fast; how do you make sure the chips you design today won't be outdated 20 years from now? Where does your confidence come from?

Simon: To be honest, genuinely interesting areas grow very quickly, because promising fields attract a lot of smart people. When we started Element 14 to build those DSL chips [copper broadband, as it was called then], that space was growing very fast. The birth of the Internet helped fuel the tech bubble around the turn of the century, but underneath it was something fundamentally valuable.

The other thing I want to say is that we need to focus on fundamental issues, not on what's popular right now. AI is fundamentally about low-precision arithmetic and massive parallelism. Another fundamental factor has to do with chips themselves, and that's Moore's Law.

If you look at Graphcore's IPU chip, the Colossus Mark 2, and compare it to the chips we developed at Element 14, which we founded in the 1990s, it has about 10,000 times as many transistors and runs about 100 times faster. That's remarkable.

Unfortunately, that rate of improvement is running out. Chips, as we all know, ultimately run into power limits: if you want to keep improving performance, you have to manage energy consumption.

How does this show up in our chip architecture? We make sure that memory and compute sit very close to each other. In a traditional chip, most of the power goes into moving data around rather than into arithmetic or memory. So another basic principle behind the IPU architecture is distributed compute and memory: the chip carries a large number of processing units, and both the logic gates that do the arithmetic and the memory that sits next to them take a great many transistors.

Pieter: When most people think of AI chip companies, the first name that comes to mind is Nvidia, whose GPUs are very popular and provide much of the computing power in AI. Obviously Graphcore has to compete with them or carve out areas they don't cover. Besides, they are a big company and you are a startup. How does your competitive advantage play out?

Simon: I don't expect Nvidia to disappear or be defeated by any newcomer. The idea that computers learn from data is now well established, and AI will show up in almost every area of future human technology. Can all those needs be served by a single architecture? I don't think so; there will be all kinds of architectures with different strengths and weaknesses. But Nvidia is undeniably the dominant force in AI, and I don't expect that to change.

The first rule of thumb when starting a business is: don't try to build an enhanced version of a big company's existing product. Incumbents have so much market power that a better product doing the same thing isn't enough. In a way, our second startup, Icera, which Nvidia acquired in 2011, did exactly that against Qualcomm. We out-engineered Qualcomm, but it was still a very tough, competitive market.

Of course, it's inevitable that you won't perform as well in some areas, but you will still perform better in others. So we're not building a better GPU; we're building what we call an IPU, a different product with a different shape and different application scenarios, though it still enables AI applications.

In the age of AI computing, I’m totally open to other viable ways to deliver value, and I’m pretty sure there are other promising startups out there. However, I am surprised and disappointed that many companies have decided to try to clone GPU chips, which makes no sense to me.

Pieter: The IPU and the GPU each have workloads they suit best. Can you tell us more about where the IPU has the advantage?

Simon: I think it's fairly clear by now. Compared with a GPU chip, our chip has far more truly independent threads, so it can do many different pieces of work at once. Given data with an arbitrary structure, those threads can go off to different parts of that structure and do useful work efficiently, without depending on the data being homogeneous.

We have shown that where sparsity is present, whether it's conditional execution, conditional access, or simply data structures full of holes, the IPU does particularly well. I think we've also now demonstrated an energy-efficiency advantage over GPUs.

On our chip, memory logic and compute logic split roughly 50-50, and they are tightly interleaved, so data usually doesn't have to travel far to be processed. That's very different from traditional processors, even highly parallel ones like GPUs.

On models such as the Transformer in natural language processing, look at performance per watt on standard benchmarks. If your system is on a budget of 10 kilowatts or so, within that energy limit we can get more performance out of an IPU than out of a GPU, and that comparison is between chips with roughly the same peak operations per second.

Pieter: Can you talk about some specific cases that rely on sparse computation underneath and are therefore a good fit for the IPU?

Simon: The best example of a sparse data structure is a molecule. Atoms in a molecule are rarely arranged in simple lines or grids; the relationships between them look much more like a graph.

It also happens that useful molecules tend not to be very large: they can exhibit very complex behavior while staying quite small, and a molecule of a few thousand atoms is already astonishingly complex. A typical task is to study a molecule of a given structure and predict its properties, or to search for structures that express desired properties. That is a good example of this kind of workload.

This workload has two properties that make it especially suitable for the IPU. First, the data structure is small enough to fit entirely in the chip's memory, and the chip can access that on-chip memory at extremely high bandwidth. Second, because the IPU has a massively parallel architecture, it can manipulate irregular data structures faster than other architectures.

There are many related applications, such as discovering new chemicals, treating diseases like coronaviruses, or simply understanding how those molecules behave, that also promise huge benefits for society.

In addition, our chips already have considerable appeal in the financial sector. Many financial models turn out to be abstract structures, probabilistic projections of what might happen in the future, and they require a great deal of random number generation. Each of the roughly 1,500 processor cores on our chip has a powerful random number generator.
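For a flavor of that kind of workload, here is a minimal Monte Carlo sketch in Python (all figures are illustrative assumptions, not from the conversation): each simulated price path is independent, so the work spreads naturally across many cores, each drawing from its own random number stream.

```python
import numpy as np

rng = np.random.default_rng(42)
paths, steps = 100_000, 252            # one trading year of daily steps
s0, mu, sigma, dt = 100.0, 0.05, 0.2, 1 / 252

# Simulate log-normal price paths; every path is an independent projection
# of a possible future, so this parallelises trivially.
shocks = rng.standard_normal((paths, steps))
log_returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * shocks
terminal = s0 * np.exp(log_returns.sum(axis=1))   # terminal price of each path

strike = 105.0
payoff = np.maximum(terminal - strike, 0.0)       # European call payoff
print(f"estimated option value: {np.exp(-mu) * payoff.mean():.2f}")
```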

Even in domains that aren't graph-structured or sparse, such as natural language processing, the IPU has clear advantages; for example, it can outperform the GPU at processing sequences of word tokens, which is a benefit of designing from scratch. That clean-sheet start is one of the advantages a startup has over a giant like Nvidia.

Of course, we have a lot of respect for Nvidia; they've done a great job applying GPUs to AI and serving the community. But the GPU does have a downside: it's still rooted in graphics processing.

If they could build from scratch, I’m sure they could build new products like ours, but you never have that opportunity in a big company because you’re always stuck in the inertia of the business that’s already there.

The technology Graphcore has developed does assume a certain scale of compute, though it doesn't assume you'll have a whole case full of chips. The reason is that much of our capability comes from putting enough memory on the chip; if the chip is too small, or the power budget is too small, it doesn't make sense.

So you're not going to see Graphcore chips in mobile phones; that's a market for other companies, and we don't scale down to that level. IPU applications range from cloud and enterprise infrastructure down to heavily loaded edge devices, such as smart cameras with power budgets in the tens of watts.

Pieter: I'm sure you're familiar with Cerebras, whose approach is to build the biggest possible chip, which seems somewhat complementary to what you're doing. Compared with the standard approach, making chips from full wafers is quite unusual.

Simon: Actually, back in the 1990s there were attempts to build memory using what was then called wafer-scale integration (WSI), whole silicon wafers, and there was some progress toward building processors that way too.

What did Cerebras do? The rest of us make chips the normal way, cutting them off the wafer, packaging them, and attaching them to a circuit board, whereas they simply leave them on the wafer. The downside is that manufacturing isn't perfect and some of those chips will be faulty, so you have to include enough spare units and work around the ones that fail.

Another drawback is that you're packing a great many processors into the smallest possible space, which means you run up against power delivery and heat dissipation limits.

Solving the power and heat problems matters, and Cerebras does gain the advantage of linking the dies together with on-wafer wiring made in the silicon process, rather than mounting them on circuit boards. But there is also a real disadvantage: you can only get in at the edges, not at the circuits in the middle of the array, so if you want to attach more memory to the system you can only connect it near the edges. Anyway, it's a very interesting experiment.

I hesitate even to call it a chip. Each die on the wafer is about 5.5 square centimeters, and much of the wafer is structured and built in the traditional way; they then add wiring on top to connect the dies, but it remains one whole piece of wafer. As a processor designer, I think the disadvantages of such a component probably outweigh the advantages, but I'm really impressed that they stuck with it, and I wish them all the best.

3

How do you successfully develop a chip

Pieter: Let's talk about Graphcore. What happened between designing the first Graphcore chip and taping it out?

Simon: The Colossus Mark 2 is the most complex chip ever made. If this had been our first attempt at a product like this, I’m pretty sure we would have failed.

We can pull it off now because the core of the team has worked together for a very long time, much longer than Graphcore has existed; it even predates the two earlier startups. Thirty years ago, five or six of us already formed our processor team, and we have been building chips together ever since.

We have taped out custom-designed processors on just about every known process node, from 150 nanometers down through 130, 90, 65, 40, 28, 16, 8, and 7 nanometers, and all of them worked first time.

Chip technology is complicated, and you can't learn it all in college; we were fortunate to learn how to build chips at a major semiconductor company. Making a chip is like making an airplane: it's very complex and requires a very diverse set of skills.

If you gave a group of very smart PhD students a billion dollars to build an airliner, they would still need people who had done similar work before in order to succeed. There aren't many teams in the world capable of building chips like this, and we happen to be one of them.

A lot of people are trying to build chips right now, which is somewhat frustrating to me, because much of that venture capital won't pay off, and that's not good for anyone.

Many cloud computing companies have decided to follow Google's lead in designing and manufacturing their own chips, which means they may not buy our products. However, many of them won't manage to assemble a team that can actually deliver these kinds of devices, and when that happens they will turn to specialist chip companies like us to fill the gap.

Pieter: Chip work is a little different from much of the AI work we know today. Geoff Hinton, Alex Krizhevsky, and Ilya Sutskever, for example, made their 2012 breakthrough by building on prior research and their own ideas, but a small group like that couldn't simply get a chip to work.

Simon: That's interesting; it doesn't actually take that many people to build a chip. I disagree with those who say chip design is getting harder and harder and chip teams are growing exponentially. A team like ours building a state-of-the-art microprocessor needs about 30 chip designers to produce the first device, plus a similar number of software engineers to write the toolchain for programming it. Thirty years ago we could build a state-of-the-art complex system with roughly the same headcount, so it's only a few more people today.

It took us about three and a half years from the initial concept to the first generation of chips, and another year and a half to get to where we are today. That's how a startup like this works.

Simply put, if you want to make a competitive new chip, you might spend $100 million on a first-generation product and $500 million to build the business, so this is not an area for faint-hearted investors. It's wise to back people who have done it before.

Pieter: With the COVID-19 pandemic, your model of collaboration must have changed during this time. What did you do to ensure that the team stayed close together?

Simon: The COVID-19 pandemic obviously meant a lot more remote communication. We've actually found online communication to be quite efficient, although you do lose some spontaneity in the process. But a team of our size is tight enough that productivity probably didn't drop last year. We're very much looking forward to going back to the office, but I think the working model has changed forever: some people will work in the office and others remotely.

The real challenge is integrating new people. There's a special chemistry that comes from being together in person, and you need more of that chemistry in the early stages of building a team. Of course, we've been fortunate to keep growing and to raise more venture funding.

4

Inspiration from the human brain and the necessity of sparsity

Pieter: There is still a big gap between today's chips and the human brain, and I'm not even talking about software, although the brain certainly runs more interesting "software". In hardware terms, the average human brain consumes about 20 watts, which is very low compared with current chips. What do you make of that contrast?

Simon: There’s a school of thought that says we should build silicon processors that are very similar in structure to the biological nervous system. I don’t really buy that idea.

The physical properties of electronics are hugely different from biology, and a different substrate may call for a different way of building computational structures. Obviously it's interesting to study how the brain computes, and I have no doubt we will learn a lot from that research and apply it to future computers. But in the end I question why a silicon computer should have the same morphology as a wetware computer. So I'm not an advocate of neuromorphic computing.

On the question of efficiency: any computational structure that executes a program irreversibly has a fundamental limit on its energy efficiency. That's a slightly academic point.

But there is a basic, noise-based limit on how efficiently we can compute, and in practice we hit practical limits well before that theoretical one.

The brain today is perhaps two orders of magnitude more computationally efficient than an electronic chip. The axons in the brain are fairly easy to understand: they carry charge, a charge-based signaling system not unlike the semiconductors in silicon. Yet we still know very little about the computation performed inside the neurons themselves.

That computation seems to involve molecular interactions that are more efficient than the charge-based electronics we're used to. So something different may be going on: ever more powerful silicon electronics will bring us closer to the capabilities of the brain, but for fundamental reasons silicon may never reach the brain's power efficiency.

There’s another big problem, and that’s that the brain may have 100 terabytes of state, about 200 trillion synapses. I’m not a biologist, but I’ve seen some interesting studies showing that four or five bits of information can be stored in a synapse.
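A quick back-of-envelope check (assuming the figures quoted above) shows how the estimates fit together:

```python
# Roughly 200 trillion synapses at perhaps 4-5 bits each gives on the
# order of 100 terabytes of state, matching the figure quoted above.
synapses = 200e12
bits_per_synapse = 4.5          # midpoint of the 4-5 bit estimate
total_bytes = synapses * bits_per_synapse / 8
print(f"~{total_bytes / 1e12:.0f} TB of state")   # on the order of 100 TB
```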

We can already build a computer with 100 terabytes of state; in fact, 100 terabytes of data can be packed into a space the size of a matchbox. Unfortunately, if such a machine had to access that state as rapidly as the brain does, it would consume an enormous amount of electrical energy, nothing like the 20 watts the human brain uses to do useful work.

How the brain achieves such energy efficiency is still largely a mystery. There are different types of axons in the human brain, and evolution has done a good job of growing the brain's capacity while keeping its energy consumption under control. The more we learn about the inner workings of neurons, the more we realize how complex they are.

Pieter: You mentioned that our brains store around 100 terabytes of information; my MacBook Pro only has 2 terabytes. But it's not hard today to store 100 terabytes on a computer, which invites the idea of uploading a brain to a computer, though it's not at all clear how you would do that.

Simon: Absolutely. Samsung, the leader in memory, has managed to put a terabyte of memory on a silicon chip of less than one square centimeter, which is amazing. So the real problem now is not capacity but the total demand for compute.

GPT-3, for example, has 175 billion parameters, and training it today takes roughly 300 zettaFLOPs of compute. Building a computing infrastructure large enough to train it costs a great deal of money.

Now, what happens if I scale the number of parameters in such a model up toward brain scale? The model becomes a thousand times larger, and the energy needed to train it grows a thousandfold. But a bigger model can also absorb more data, so the training set grows as well. If you apply the published scaling laws, you end up needing so much compute that a single training run might cost $5 billion.
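A rough sketch of that arithmetic, using the commonly cited approximation that training compute is about 6 × parameters × tokens; the token count and the baseline training cost below are illustrative assumptions, not figures from the conversation:

```python
# GPT-3 baseline: 175B parameters, ~300B training tokens (assumed).
params_gpt3, tokens_gpt3 = 175e9, 300e9
flops_gpt3 = 6 * params_gpt3 * tokens_gpt3
print(f"GPT-3 training compute ~ {flops_gpt3 / 1e21:.0f} zettaFLOPs")  # ~315, near the quoted ~300

cost_gpt3 = 5e6          # assume a GPT-3 scale run costs a few million dollars
scale = 1000             # parameters scaled 1000x toward "brain size"
print(f"same data, 1000x parameters: ~${cost_gpt3 * scale / 1e9:.0f}B per run")

# Scaling laws also call for more training data as the model grows,
# which pushes compute and cost up even further than this simple 1000x.
```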

Obviously, then, we have to be able to process every piece of data without having it interact with every parameter in the system. In other words, sparsity is essential; otherwise the neural network paradigm will never reach that scale.

If neural networks are to become the general-purpose computational structures we hope for, massive sparsity is necessary whatever their architecture. Personally, I think a few forms of sparsity are especially promising and could deliver the next magical "GPT moment".

Is the answer gated computation? In things like the sparsely-gated Mixture of Experts used in the Switch Transformer, each token passing through the layers of the network selects a small set of weights from a very large pool. That form has been shown to be very efficient, but one way or another, some form of sparsity is absolutely inevitable.
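To make gated computation concrete, here is a minimal top-1 routing sketch in the spirit of the Switch Transformer (assuming PyTorch; sizes are illustrative, and details such as scaling by the router probability and load balancing are omitted): only one expert's weights touch any given token.

```python
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    """Illustrative sparsely-gated layer: each token is routed to a single expert."""
    def __init__(self, dim=64, num_experts=8):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)           # picks an expert per token
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_experts)])

    def forward(self, tokens):                               # tokens: (N, dim)
        scores = self.router(tokens)                         # (N, num_experts)
        choice = scores.argmax(dim=-1)                       # top-1 routing decision
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = choice == e
            if mask.any():
                out[mask] = expert(tokens[mask])             # only routed tokens use this expert
        return out

print(Top1MoE()(torch.randn(16, 64)).shape)   # torch.Size([16, 64])
```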

If neural networks can reach the potential of the human brain, then some applications of AI will only become useful once we have superhuman AI. Suppose you're interacting with an AI doctor: if you don't believe it's smarter than a human doctor, you won't pay any attention to it. For applications like that, the minimum standard for AI is to be smarter than people.

5

The future development of the AI industry

Pieter: What is your longer-term view of the future of artificial intelligence, and what role will AI computing play in it?

Simon: It's hard to imagine a bigger change of direction for computing. Simply involving a computer in the problem-solving process doesn't, in itself, create new value.

If you think of a program as just the expression of a solution to a problem, then no computer so far has fundamentally solved a problem, even though we tend to talk as if they do.

A human had to figure out a method for solving the problem and encode that method in a program; the computer can then run it far faster than a human and apply it to vast amounts of data. But strictly speaking, the computer is not part of the problem-solving itself. Only when computers can come up with algorithms on their own will we be able to say that computers solve problems.

Some things are already within reach, such as translating from English to Russian, and it's reasonable to imagine machines doing that better than humans, economically or otherwise.

Breakthroughs in AI and information technology look like the next industrial age, one that will fundamentally change how people live, mostly, I hope, for the better. Go back 250 years and the only way to get work done was by hand; then the steam engine came along, and its inventors couldn't have imagined that everyone would one day drive around in cars or board planes to go to a beach somewhere else.

So you have to be quite imaginative to see that AI has the same kind of potential. What would happen if we built something much smarter than humans is simply unknown.

This article was compiled with authorization; the translator is responsible for any errors introduced in translation. Original video link: www.youtube.com/watch?v=Bf2…

OneFlow's new-generation open-source deep learning framework: github.com/Oneflow-Inc…