Welcome to What's Next, a seminar series by IBM Research where we show you some of what we're working on. My name is David Cox, and I'm the IBM director of the MIT-IBM Watson AI Lab here in Cambridge, Massachusetts. I'm going to tell you today about some of the work we're doing in efficient AI.

Now, it goes without saying that there's a lot of excitement about AI today, and in particular it's being driven by a technology called deep learning. In 2015, Forbes declared that deep learning machine intelligence would, quote, "eat the world," and I think to a first approximation that's pretty much happened. Here's an example of the kinds of things we can do routinely now with AI. This example is actually from 2015; it's one of the first examples that really sealed it for me, as an AI researcher, that deep learning was here to stay and that it was going to achieve amazing things. This is a system that can not only identify objects in images, but can actually create a beautiful natural language caption describing what's in the image. So you can give this AI system a picture and it'll produce a caption like "a man in a black shirt is playing guitar" or "a construction worker in an orange safety vest is working on the road." It's easy to take these kinds of things for granted now because we've had these capabilities for a while, but it's really stunning, and the progress in AI, and deep learning in particular, has been phenomenal.

There aren't that many games left that humans are better at than machines, starting with Jeopardy!, which IBM's Watson won in 2011, to the game of Go, where DeepMind created a system called AlphaGo that beat the world champion. Even in games like poker, a team at Carnegie Mellon produced a system that beat the world poker champion. And even in things like debate: we now have AI systems like Project Debater, which IBM designed, that can carry on a lucid debate with an expert debater. So if you like AI systems that can argue with you, we've got you covered even for that. And even things like art, which we might have thought would never be the domain of a computer system or an AI system, are increasingly amenable to AI as a tool. So everything from style transfer, where you can take a photograph and re-render it in the style of any painter you like, to images like these, which look like ordinary high-res pictures of a dog or a bubble but are in fact not real at all. These are figments of the imagination of an AI system, completely confabulated out of thin air. So really amazing capabilities.

And then of course, in the area of text and natural language processing, you have systems like GPT-3, which you might have heard of, which is able to create spookily good, very realistic-seeming natural language text. In fact, The Guardian published an article that was entirely written by the GPT-3 system. Starting with the prompt "I am not human, I am a robot," it then produced the entire rest of the article, which you'd be hard pressed to tell wasn't written by a human. So the capabilities are getting really spectacular.

But at the same time, other trends have emerged. We used to have computers on the desk, and now, increasingly, computing is ubiquitous. It's everywhere. It's in our pockets. It's on our devices. The factories of the future increasingly have IoT, Internet of Things, devices. You have computing across many different facets of our lives, running our world, communicating.
And this trend toward mobile presents a problem for AI. For instance, this picture I showed you, this amazing model that can produce these pictures of a dog and a bubble that aren't real, is a system called BigGAN, which was created by DeepMind. If we look at how much compute and how much power it took to produce this model, it turns out it took the equivalent energy of the average US household over six months to train it just once. So an enormous amount of power to produce this AI system.

And this text system, GPT-3, same story: we're making big progress, but these are really, really big models. For instance, if we look at the number of parameters you have to fit to train one of these models, GPT-3 has 175 billion parameters. Just think about how many numbers you need to store simply to specify this model. And then it was trained with 300 billion tokens, so think roughly 300 billion words of training data. These things are really, really big. An article in The Register did some back-of-the-envelope math and estimated that this required roughly 190,000 kilowatt-hours of energy, which produced 85,000 kilograms of CO2, equivalent to driving a car from the Earth to the moon and back. So astronomical capabilities, but an astronomical price in terms of power.

And there are limits, of course. In fact, when I say there are limits, there are hard limits to how much further we can go down this road. If we look at the power budget of the entire planet Earth, how much energy comes from the sun, how much thermal energy there is in the core of the Earth, how much stuff we can dig up out of the Earth and burn, we're actually on track to exceed that budget in the next decade or two. This was an estimate from the Semiconductor Industry Association, where they looked at the year-over-year increase in computing energy usage: each year we use more and more computing, and we're on track to hit that world energy production limit somewhere in the 2030s. So we literally cannot continue down this path of building bigger and bigger models.

And if you look at the cost in terms of carbon, you can see that today's models, even larger than GPT-3, can emit half a million pounds of CO2 in training. Compare that to a car over its lifetime, or a flight between New York and San Francisco, and you can see that the power consumption of AI is actually a pretty big deal. As we do more of it, it's going to be an even bigger deal. And that's just the power cost; the size of these models also means we're paying a lot. Estimates put it at anywhere between $1 million and $3 million to train one of these models once. So this is another aspect of the problem: our models are getting bigger and bigger and bigger.

And if you look at the hardware these things run on, this is what AI runs on in the data center: this is a picture of an NVIDIA GPU. These cards are typically stacked in boxes full of GPUs, and those live in air-cooled data centers that are tremendously power-intensive and also tremendously hot. If you just take a thermal camera and point it inside a modern computing system, you can see just how much heat it's producing.
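Just to sanity-check those numbers, here's the back-of-the-envelope arithmetic in a few lines of Python; the grid carbon intensity used below is an assumed value for illustration, not a figure from the article.

```python
# Rough check of the GPT-3 energy figures quoted above.
# The 0.45 kg CO2 per kWh grid carbon intensity is an assumption for
# illustration; it is not a number taken from The Register's article.
energy_kwh = 190_000          # rough training energy reported in the article
co2_per_kwh = 0.45            # assumed grid carbon intensity, kg CO2 per kWh

co2_kg = energy_kwh * co2_per_kwh
print(f"Estimated CO2: {co2_kg:,.0f} kg")   # ~85,500 kg, in line with the ~85,000 kg quoted
```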
Someone at Intel produced a figure back in the 2000s that I like, comparing power density, the number of watts per unit area of a chip. They showed that, back in the era of the Core 2 processor, chips were already approaching and matching the power density of a nuclear reactor. The number of watts per unit area in a nuclear reactor is actually comparable to that of a modern computing system. So this is a big deal: computing is a very power-intensive thing.

That's in the data center, but we also increasingly want to run AI on the edge: on our mobile devices, on robots, on autonomous vehicles, where we're trading off compute against battery life. Think about applications on mobile devices and in factories. There are a lot of things we'd like to be able to do without always having to go up into the cloud.

So I'm going to tell you a little bit about what we're working on at IBM Research to combat these issues and push us into a new regime where we can use AI without using so much power, where we still gain the fruits of AI, but in a more sustainable way. There are three trends I'm going to tell you about today. First, I'm going to tell you about using AI to trim down AI models. Then I'm going to tell you about doing more with less, the idea that sometimes less, and in particular less data, is actually desirable. And the third thing I'll give you is a glimpse of entirely new hardware architectures we're working on that can achieve the benefits of AI with a much smaller power footprint.

So let's start by talking about using AI to trim down AI models. A deep neural network, which is really the workhorse of this new revolution we're seeing in AI, is what we call in technical jargon a nonlinear function approximator. All that really means is that it's something that takes a complex input, like an image, in this case a picture of an apple, and maps it to something we want, like an output label. So we might want to take pictures of apples and then label what each one is: we get a one-hot encoded vector that tells us it's an apple. Now, the trick with neural networks is that we don't hand-tune everything in between; we train it with data. We take lots of examples of what we have, the pictures, and pair them with lots of examples of what we want, the labels that go with the pictures. And we train it, and train it, and train it, fitting all of the weights in the units inside the network, the artificial neurons in the neural network. With lots and lots of data, we don't have to specify how to do the task; the system learns how to do it on its own.

Now, that's great. But if we want to make this smaller, there are a couple of pretty straightforward things we can do. The network consists of these units, which are the circles, connected to each other, and those connections have weights. So if we want to put the network on a diet, we can do two things. We can prune the synapses, take out connections. And we can also prune neurons, take out whole units. By doing so, we simply have less computation to do. So we can get rid of some of the units, get rid of some of the synapses, and put the network on a diet. Of course, this is not something we know how to do by hand; it's not something even an expert would know how to do well.
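To make the pruning idea a bit more concrete, here's a minimal sketch using PyTorch's built-in pruning utilities. The tiny network and the 30% / 50% pruning fractions are arbitrary choices for illustration; this is not the specific procedure any particular IBM system uses.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A tiny classifier standing in for a trained network.
net = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 10),
)
layer = net[0]

# Prune synapses: zero out the 30% of weights with the smallest magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Prune neurons: zero out whole output units (rows of the weight matrix),
# removing the 50% of rows with the smallest L2 norm.
prune.ln_structured(layer, name="weight", amount=0.5, n=2, dim=0)

# Fold the pruning masks into the weights permanently.
prune.remove(layer, "weight")

sparsity = float((layer.weight == 0).sum()) / layer.weight.numel()
print(f"Fraction of zeroed weights in the first layer: {sparsity:.2f}")
```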
But increasingly, there's a trend where we use AI to help us with the process of designing AI. We sometimes call this auto-AI or neural architecture search. This is an example; you may have seen visualizations like this before. This happens to be an example from an IBM product called NeuNetS, which does this process. Typically, what you do is train an AI system to make decisions about how many layers the network should have, how big or small those layers should be, and how many connections there should be. It automatically tunes those parameters, so you don't have to have an expert tinkering with every last detail. Typically, this is done to get the network that performs best, that has the best accuracy at making whatever decisions you want to make. But we can also use these same techniques to help us build more efficient models.

This is one example of a piece of work that's come out of the MIT-IBM Lab. It's a collaboration between Song Han, a professor at MIT, and Chuang Gan, a researcher here at IBM Research Cambridge. They created a method called Once-for-All. What Once-for-All does is allow you to take a network and progressively shrink it in such a way that you can then take that master network and pick out subnetworks from inside it for different purposes. For instance, if you're running in the cloud on a server, you might use a large subnetwork that takes advantage of the resources available there. But if you're running on a mobile phone, you might want a much smaller subnetwork than you would use on a big server with a big GPU. And furthermore, depending on whether the battery on the mobile device is full or nearly empty, you might want to use a different model again, because you might be less willing to trade your last little bit of battery for a little bit of extra performance, whereas when the battery is full, you might be willing to make that trade-off. The magic of the Once-for-All method is that you can train the thing once and then dynamically pick out these subnetworks without having to retrain the model. Think of it kind of like nested dolls, where inside the larger model exist many smaller models that can perform the task, with different characteristics in terms of how power-intensive they are on different hardware.

Now, this actually makes a huge difference in the power consumption of these models. If we look at MnasNet, a pretty standard neural architecture search result, and at how much carbon it took to train, on the order of half a million pounds of CO2 go into getting the parameters of the MnasNet model. In contrast, with Once-for-All you can do the same with just 340 pounds of carbon dioxide. And here are some other typical activities for comparison, like a US car, including fuel, over its lifetime, and a round-trip flight from New York to San Francisco. You can see these are huge reductions in the amount of power, just by being a little bit smarter and allowing AI to help us design efficient AI models.

So that was for training. We can also use similar ideas at inference time, not when we're training the models but when we're running them. There's a piece of work called MCUNetV2, which is being presented at NeurIPS.
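To give a feel for how "train once, pick many subnetworks" plays out, here's a hedged sketch in plain Python. The search space, the latency model, and the accuracy predictor below are all made-up stand-ins (a real system measures latency per device and trains a predictor); only the overall flow, sample candidate subnetworks and keep the most accurate one that fits the device's budget, mirrors the Once-for-All idea.

```python
import random

# Hypothetical search space: each subnetwork is described by depth, a width
# multiplier, and a kernel size, all carved out of one trained supernet.
DEPTHS  = [2, 3, 4]
WIDTHS  = [0.5, 0.75, 1.0]
KERNELS = [3, 5, 7]

def sample_subnet():
    """Sample one subnetwork configuration from the supernet's search space."""
    return {
        "depth":  random.choice(DEPTHS),
        "width":  random.choice(WIDTHS),
        "kernel": random.choice(KERNELS),
    }

def estimated_latency_ms(cfg):
    """Toy latency model; stands in for a latency table measured on the target device."""
    return 5.0 * cfg["depth"] * cfg["width"] * (cfg["kernel"] / 3.0)

def estimated_accuracy(cfg):
    """Toy accuracy predictor; stands in for a predictor trained on sampled subnets."""
    return 70.0 + 3.0 * cfg["depth"] + 8.0 * cfg["width"] + 0.5 * cfg["kernel"]

def pick_subnet(latency_budget_ms, trials=1000):
    """Keep the most accurate sampled subnetwork that fits the latency budget."""
    best = None
    for _ in range(trials):
        cfg = sample_subnet()
        if estimated_latency_ms(cfg) <= latency_budget_ms:
            if best is None or estimated_accuracy(cfg) > estimated_accuracy(best):
                best = cfg
    return best

# Same trained supernet, different deployment targets: no retraining, just selection.
print("cloud GPU  :", pick_subnet(latency_budget_ms=50.0))
print("phone      :", pick_subnet(latency_budget_ms=10.0))
print("low battery:", pick_subnet(latency_budget_ms=6.0))
```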
What the team has done with MCUNet is to jointly optimize the network, how the network should be shaped to get optimal performance, together with the inference engine that's going to run that network. So you can control things like memory footprint and patch size for different devices. That allows the team to get relatively sophisticated models down onto a microcontroller like that one, which costs a few dollars at scale, in contrast to a GPU, which costs thousands of dollars and lives in the cloud. So we can take relatively complicated AI workloads and shrink them down to the point where they can run on devices ubiquitously. Very exciting things are possible when we allow AI to help us design smarter AI systems.

The other theme I want to tell you about is the idea that sometimes less is more. You might have heard that data is the new oil, that data is the fuel that drives AI, and that's true. But sometimes you don't want more data; you can actually get by with less. And when you can, there are a lot of opportunities to save energy. For instance, suppose we have an AI system that's watching a video of a soccer game, maybe recognizing actions or recognizing players. The interesting thing, intuitively, about this game is that, sure, we could watch every single pixel in the video, but it turns out you don't need to process all of that data. A few key frames here and there that show important parts of the action might be sufficient. We don't necessarily need to watch all of the video in full resolution; sometimes we can get by with half resolution, or maybe even quarter resolution. And we don't have to watch it at full dynamic range: we have these wonderful eight-bit cameras with lots of gradations of brightness, but sometimes we can get by with much shallower images.

What this team has done, this is a collaboration between Rameswar Panda, a researcher here at IBM Research Cambridge, and Aude Oliva, a researcher at MIT, who is actually my counterpart: she's the MIT director of the MIT-IBM Watson AI Lab. She and Rameswar worked together to create a system that dynamically decides what data the neural network should process. So again, we're using AI to help AI, but now we're doing it in real time: deciding, do we need to look at this frame or not? Could we look at this frame at lower resolution and save on the amount of data we process? In doing so, they can get up to a 40% reduction in the power and energy consumption of these models, while not sacrificing the quality of the result at all.

And again, to plug some work that's happening this week at NeurIPS, if you have a chance: this same team has a paper called IA-RED², where they're applying the same idea to vision transformers. You might remember I told you earlier about a model called GPT-3, which worked on text; that's an example of a transformer model for natural language processing. There are also transformer models for vision, and they work in similar ways. What the team here has done is dynamically decide which parts of the image the transformer will process, and in doing so, they can greatly reduce the power consumption of these models.
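To make the "less is more" idea concrete, here's a minimal PyTorch sketch of input-adaptive video inference, using a made-up toy architecture: a cheap policy network looks at each frame and decides whether to skip it, process it at reduced resolution, or process it in full. This is an illustration of the general approach, not the team's actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveVideoClassifier(nn.Module):
    """Sketch: a tiny policy net decides, per frame, whether to skip it,
    process it at half resolution, or process it at full resolution."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.policy = nn.Sequential(            # cheap network, runs on every frame
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(3 * 8 * 8, 3),            # logits for: skip / low-res / full-res
        )
        self.backbone = nn.Sequential(           # expensive network, runs only when needed
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def forward(self, frames):                   # frames: (T, 3, H, W)
        logits = []
        for frame in frames:
            action = self.policy(frame.unsqueeze(0)).argmax(dim=1).item()
            if action == 0:                      # skip: spend no backbone compute
                continue
            if action == 1:                      # low-res: quarter the pixel count
                frame = F.interpolate(frame.unsqueeze(0), scale_factor=0.5).squeeze(0)
            logits.append(self.backbone(frame.unsqueeze(0)))
        if not logits:                           # nothing selected: fall back to first frame
            logits.append(self.backbone(frames[:1]))
        return torch.stack(logits).mean(dim=0)   # average predictions over processed frames

video = torch.randn(16, 3, 112, 112)             # 16 random frames as a stand-in video clip
print(AdaptiveVideoClassifier()(video).shape)    # torch.Size([1, 10])
```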
Interestingly, the results of this kind of input selection are also more explainable, because by narrowing down what the model looks at, it gives the human observer a clearer picture of what was important and what was used to make the decision.

Okay, finally, I want to tell you a little bit about how the very hardware we run these models on is poised to change. In particular, I want to tell you about some work that's happening not here in Cambridge, but at IBM Research in Zurich. We have global labs all around the world, and the one in Zurich is looking at the problem of analog computing. So what is analog computing? I think we're all familiar with what digital is; that's been the watchword of the last couple of decades, the idea that we can represent values as streams of zeros and ones. Digital technology has obviously been tremendously important, and it has changed the world, but it's not the only way to represent a value or a number. We can also represent a signal using analog: rather than zeros and ones, we can have continuously varying values that we represent numbers with and that we can also compute with. Think about an analog watch in contrast to a digital one. On an analog watch, the angle of the hands varies continuously in proportion to the time; there's an analogy between where the hand is pointing and the value, that abstract thing, the time. In the same way that we can represent time with a dial, we can represent values with varying voltages inside a computing device.

And you can produce systems like this. This is a system you can actually try yourself, and I would encourage you to do so; there's a link down here in the corner. You can basically train a system to recognize handwritten digits, and rather than running it with digital numbers on a conventional digital system, you can run it on an analog computer that's actually doing the computation using the physics of electronic devices. An awful lot of the power consumption in traditional digital computers, it turns out, comes from keeping the ones ones and the zeros zeros. When you perform these computations using analog voltages instead, you can achieve up to a 100- or even 1,000-fold reduction in power; there's a rough sketch of that idea in code below. So it's a fun demo, and I encourage you to try it.

And with that, I'll close. Hopefully I've given you a glimpse of how we're tackling this problem of making AI sustainable. If you want to learn more about what we're doing, particularly at the MIT-IBM Lab, you can visit our website at mitibm.mit.edu. And thanks for taking the time to watch.
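To close the loop on the analog idea, here's a toy simulation of a matrix-vector multiply done "by physics": the weights play the role of conductances, the inputs play the role of voltages, and each output is the current summed down a column (Ohm's law plus Kirchhoff's current law). The noise level is an arbitrary assumption, included to show why analog results are approximate rather than bit-exact.

```python
import numpy as np

rng = np.random.default_rng(0)

def analog_matvec(weights, inputs, noise_std=0.02):
    """Simulate a crossbar array: weights act as conductances, inputs as voltages.

    Each output is the sum of voltage * conductance down a column, so the
    multiply-accumulate happens in the physics of the devices rather than in
    digital logic. Device variability is modeled as small Gaussian noise on
    the conductances.
    """
    noisy_weights = weights + rng.normal(0.0, noise_std, size=weights.shape)
    return noisy_weights @ inputs

W = rng.normal(size=(4, 8))        # stand-in for one trained layer's weights
x = rng.normal(size=8)             # stand-in for an input activation vector

print("digital:", W @ x)
print("analog :", analog_matvec(W, x))   # close to the digital result, but not bit-exact
```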