It's a pleasure to be with you today, virtually, to talk a little bit about AI for systems and systems for AI at the edge. In this talk I'm going to describe two themes of work that have been really exciting for me. The first is how these new models, large language models and foundation models, change the systems we build. I'll describe a set of problems I call "death by a thousand cuts" problems: problems where each individual task looks relatively easy, but the sheer variety and volume of tasks made them challenging for the previous generation of machine learning systems. These new foundation models, things like GPT-3, GPT-4, and so on, are able to solve them in a very exciting way. I'll also talk about what Remy calls the coming tsunami of tiny AI devices. I'm really excited about how AI is moving out to the edge to interact with the real world. A real challenge in that scenario is how you develop models efficient enough to run on this huge number of devices, and how they can process all the data and telemetry coming into them. We'll talk about that briefly.

First, I want to share how I learned to love this new generation of foundation models. This is work due to a bunch of wonderful students in my lab: Avanika, Ines, and Laurel. Ines left after finishing her PhD to start a company called Numbers Station, and Laurel joined her. It's about using foundation models for a variety of hard data questions.

Autoregressive language models are a simple but very old idea, and they're really powerful. You take a huge corpus of data and learn the probabilities of completions: for example, for the sentence "the mouse ate the ___", you see that "cheese" is more likely than "mouse" or "house". The idea is that this probability distribution over words and sentences gets learned compactly by one of these large neural nets. We can then generate words from the distribution simply by asking: what's the most likely word that comes next? "The mouse ate the cheese." The other amazing thing is that they allow every token on the web to be a training example. A year or two ago, when we made these slides, this was not very familiar to people, but now, with ChatGPT, we've all played with these things. Its predecessor, GPT-3, is what I'll talk about for the next little bit.

These foundation models are autoregressive language models, but with a very large number of parameters and huge training corpora; one of the models I'm aware of right now trains on trillions of tokens. What's amazing is that when they're trained on trillions of tokens just to predict the next word, what people sometimes call emergent behaviors appear. For example, you can give GPT-3 a couple of examples of countries and their capitals, then ask it to complete the next token for a new country, and it actually completes it correctly. It also learned how to translate languages, how to do simple question-answering tasks, and how to do passable arithmetic. This has been studied quite a bit since the initial release in 2020.
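To make the autoregressive idea concrete, here is a toy sketch in Python. It's a bigram counter rather than a neural net, and the tiny corpus is made up purely for illustration, but it shows the core loop the talk describes: every adjacent pair of tokens in the corpus is a training example, and generation just repeatedly picks a likely next word.

```python
# Toy autoregressive "language model": a bigram counter, not a neural net.
# Every adjacent pair of tokens in the corpus is one training example.
from collections import Counter, defaultdict

corpus = "the mouse ate the cheese . the cat ate the mouse .".split()
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def p_next(prev):
    """P(next word | previous word), estimated from counts."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

print(p_next("the"))  # e.g. {'mouse': 0.5, 'cheese': 0.25, 'cat': 0.25}

# Greedy generation: repeatedly append the most likely next word.
word, out = "the", ["the"]
for _ in range(4):
    word = counts[word].most_common(1)[0][0]
    out.append(word)
print(" ".join(out))
```

A real model replaces the count table with a large neural net conditioned on the whole context, but the generate-one-token-and-repeat loop is the same.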
So the question I asked is: can foundation models help with these death by a thousand cuts problems in data? Here's a problem we face in data cleaning, and I should emphasize that when I worked on it, I wasn't sure it would get solved within the span of my career. The goal: someone gives you a table of data that contains subtle errors, and you want to clean it up so you can use it for some downstream purpose. These errors are things like typos and formatting problems. There are conflicting values: here you see one address with two different zip codes, which shouldn't be allowed; an address should have a single zip code. And there are outliers, values that don't obey the data distribution. You can imagine writing a classifier for each one of these and solving it, but data cleaning has this death by a thousand cuts feel. It's a problem people worked on for decades: each individual sub-problem may seem easy or solvable, but the problems are soft and fuzzy, and the sheer breadth and variety you needed to cover made it really challenging.

We did actually attack this. Theo, Xu, and Ihab, professors at ETH, Georgia Tech, and Waterloo respectively, visited my lab a couple of years back. They built a gigantic probabilistic inference machine and went after this problem, and they're real experts: they've been entrepreneurs and researchers on this topic for a long time. It was a huge jump over the state of the art, and Apple bought their company in 2020, which was really exciting for everyone.

Now, Avanika, a superstar then-undergrad and now PhD student, took these foundation models when they came out and ran them on these data tasks; we knew the state of the art well from our previous work. How well did they do? She just took every tuple and asked: is there an error here, yes or no? You can imagine how the model fills in the answer. The top-line numbers here are the previous state of the art; higher is better, and 100 is 100% accuracy. GPT-3 zero-shot, with no examples at all, was doing okay at the time, which is miraculous by itself: it was trained only to predict next tokens, and on some of these tasks it does quite well. But with just a little bit of work, GPT-3 few-shot was already state of the art on a problem we had worked on for decades, and in some cases dramatically better. And this model, as I mentioned, was trained to predict next words, not trained for data cleaning at all; it just figured it out.

You may say: well, OpenAI figured this whole thing out, they're brilliant. And indeed they are. But we ran another model that was available at the time, AI21's Jurassic-1, and it did about the same, which was really exciting: this is a natural phenomenon, not a quirk of one model. What's more, when I first gave this talk, people said that's only the big proprietary models. Since then there's been a huge wave of open source, and open source caught up within a year with models two to six times smaller than GPT-3. In fact, even smaller models can do this now, and that's something we'll come back to when we talk about how AI is moving to the edge: figuring out how to get these capabilities into smaller and smaller form factors.
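For a flavor of what this few-shot setup looks like, here is a minimal sketch. The prompt wording and example rows are my own illustration, not the exact prompts from the work described, and `complete` is a stand-in for whatever prompt-to-text LM API you have available (a hosted model or an open-source one).

```python
# Sketch of few-shot error detection via prompting; prompt text and rows
# are illustrative only, and `complete` is a hypothetical LM call.
def serialize(row: dict) -> str:
    """Flatten one table row into 'col: value' text for the prompt."""
    return ", ".join(f"{col}: {val}" for col, val in row.items())

FEW_SHOT = (
    "Is there an error in this record? Answer yes or no.\n"
    "name: Mia's Cafe, city: Madison, zip: 53703 -> no\n"
    "name: Mia's Cafe, city: Madsion, zip: 99999 -> yes\n"
)

def detect_error(row: dict, complete) -> bool:
    """`complete` is any prompt -> completion function (LM API)."""
    prompt = FEW_SHOT + serialize(row) + " -> "
    return complete(prompt).strip().lower().startswith("yes")

# Usage: detect_error({"name": "Joe's", "city": "Madison", "zip": "999"},
#                     complete=my_llm_call)
```

The point is how little task-specific machinery there is: a serializer, a couple of labeled examples, and a yes/no completion replace what used to be a purpose-built inference system.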
So the summary of these data plumbing applications: there are these death by a thousand cuts problems, and they're really interesting to me because I was working on them, and I watched a foundation model effectively solve a problem, almost an entire subfield, that I had been working on. As an open source advocate, I also want to point out that open source caught up really quickly. That's exciting because it means we're going to see a wide variety of use cases.

Now, you may look at this and say, well, data cleaning is not my cup of tea; I like other things. That's okay. I think this death by a thousand cuts idea extends to other problems that have been classically hard for computer science and machine learning but now have a new line of attack. Here's one I don't know much about, where I'm just an enthusiast: robotics. My kids love robots, I love robots, I think they're very cool, and my colleagues are brilliant. But there was a sense, when I would go to robotics talks, that they would tell you why it was hard, and it had that death by a thousand cuts feel: this robot has this idiosyncratic thing in grasping, and it was really in the details. Very recently I saw the Open X-Embodiment dataset paper, which some of my friends and colleagues participated in (I didn't), where they show transfer and foundation models actually getting traction. My uninformed speculation is that there's a range of these death by a thousand cuts problems out there, and maybe robotics is the next exciting place they'll be applied.

It's worth remarking how strange these things are. In computer science we always tried to narrow in, find the simple core problem, and solve that. These models suggest a new way to build systems: build a gigantic model first, and then build the specialist on top of it. That's a different shape of problem than we ever thought about before.

All right. Remy shared a wonderful article, and he's been a real pioneer in thinking about this for a long time; I've learned a ton from our conversations. He talks about this coming tsunami of tiny AI, and I've been working on something related that's really interesting to share with you. One challenge at the edge, which I think is quite clear, is that as these devices get smarter and smarter, they have more information about the world, and they need to read that information, process it, and potentially share it with each other. One of the ways we've been framing this in the research literature recently is what we call long sequence problems, and I'll highlight that next; I have longer materials online if you get excited. The other issue is compute. I mentioned that model sizes are coming down, but you have to realize these models were built as scientific prototypes, demonstrations; they weren't built for TCO. We don't know the most efficient way to build these models yet. That's really exciting, and it's going to require a combination of hardware, software, and algorithmic innovation. That's one of the things that's so exciting when I talk with Remy: being at the intersection, trying to do all of those pieces. I think industry is at the very early stages of huge growth potential here. So why might we want these long sequences? Well, we talked about telemetry from edge devices.
Telemetry is really important for us right now, but so are things like text and audio: even a minute of audio is about a million samples at a typical sampling rate. In biology we've been working on DNA, and there are video and images too. So these long contextual sequences are out there, and what I'll argue is that modern AI may need a new foundation. It's interesting to introspect here: we also worked on taking the classical foundation and scaling it up, in a project called FlashAttention, which I won't talk much about. But in this talk I'll describe why signal processing may be an interesting basis for the edge.

As an AI researcher, the reason I got interested in long sequences wasn't necessarily practical, but I want to share it because it reveals our motivation. We were interested in what people call inductive bias: how much does it matter that the machine perceives the world the way we do? There was one test we started to run. We took an image and flattened it into strips, a single sequence of pixels, which is kind of a weird thing to do, and asked whether we could still tell that this was a hippo or a car: a classification task called sequential CIFAR in the literature. You'd look at this and say that for a human it would be impossible with just a single sequence of pixels; we recognize things spatially. And in fact, for the models at the time, the gap was huge: almost 30 percentage points in quality. So I started to wonder how fundamental that is. Are there models that can look at just a sequence of pixels and recognize the image?

Others were thinking along similar lines. There's a wonderful ICLR 2021 paper by some folks at Google and DeepMind who created a benchmark called LRA, the Long Range Arena, to test exactly this; I'll show some of its tasks. It was really important for us and a very insightful benchmark. The idea was a set of tasks designed to push the dominant architecture, the transformer. I won't go through all of them for time, but two are worth zooming into. The first is the image task: this is the sequential CIFAR task I just mentioned, and there's a big gap to the state of the art. When the input is not sequential, when it's the actual 2D image, conventional models solve it with accuracy in the 90s. So it's a big gap. The other one I'd draw your attention to is Path-X: given an image, are these two dots connected by a path? The reason it's shown with X's is that effectively every model was random guessing.

I won't go deep into attention, the dominant architecture's core operation, but one thing to point out is how it works: for every output you see here, it looks back across all the inputs and attends to them, weighting the ones it thinks are important. As you can see from the animation, as it gets further along it scales quadratically: it slows down as it takes in more and more information. We want things that don't slow down that way. In the architecture diagrams you may have seen, the plan is to replace that attention block with some signal processing ideas. The one we're going to use is called S4, which was developed by two of my students, one of whom is now doing a startup based on it and the other of whom is a professor at CMU.
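Here are two quick sketches of what I just described, with toy shapes assumed for illustration: flattening an image into the pixel sequence that sequential CIFAR feeds the model, and the L-by-L score matrix that makes attention's cost grow quadratically with sequence length.

```python
# Illustrative only: CIFAR-style flattening and attention's quadratic cost.
import numpy as np

img = np.random.rand(32, 32, 3)   # a CIFAR-sized image (H, W, channels)
seq = img.reshape(-1, 3)          # -> 1024 "pixel tokens"; all 2D spatial
print(seq.shape)                  #    structure is thrown away

L, d = seq.shape[0], 16
Q = K = np.random.rand(L, d)      # toy queries/keys for self-attention
scores = Q @ K.T                  # (L, L): every output attends to every
print(scores.shape)               # input -> O(L^2) time and memory
```

At 1,024 pixels this is fine; at the million-sample sequences mentioned above, that L-squared matrix is what becomes prohibitive.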
S4 is a classical state space model, if you remember your EE, what are called linear time-invariant (LTI) systems, but tweaked for deep learning. The reason signal processing is so critical is that it gives us a theory for making these models stable: as we train them, will the weights explode off to infinity? That actually plagued a lot of the previous generation of models in this space. There's also something exciting if you're a real AI nerd: the classical models, the RNNs we used to know and love, are phenomenal for inference. In an edge setting you can operate them in streaming mode, which is exactly how we want to run these models, and they theoretically have the ability to remember infinitely far back. So they're very useful at the edge. What we showed is that this mathematical model can be viewed simultaneously as an RNN and as a convolution. That's critical because RNNs are very difficult to train: we could train it like a classical convolutional-style model and then deploy it as an RNN, which is very exciting. And it's fast, asymptotically more efficient than transformers; it doesn't have that quadratic slowdown as you go left to right.

It's very simple, and I'll just show you the quality improvement it had on LRA, which was at the outset of this line of work. S4 was a huge jump in quality, a 26-point jump over the previous benchmarks, and it took Path-X from random guessing to nearly solved. Those image results suggest that spatial bias matters less than I thought, which is really interesting, because we had been guided a lot by the intuition that spatial inductive biases were essential. At the edge, when you're sensing telemetry, you don't have those spatial biases anymore, so this is a really interesting direction in which to try these new models.

I do want to point out that there's an entire mathematical canon here. I won't bore you with the details, but it's straightforward to view these problems in the signal processing framework, and we really only use two big ideas you'd get in an undergraduate course. First, you can capture complex models with well-studied systems, those LTI systems. Second, and I can't help but highlight this for those of you from an engineering background who remember your integrals, there's an important idea in signal processing that we hypothesize a continuous object which we then sample. Those powerful tools allowed us to reason about how to make these models simple, stable, and efficient, and that stability let us train them again. As Sasha Rush, a wonderful professor at Cornell who's also affiliated with Hugging Face, put it, this resulted in a bit of an RNN renaissance in the field. All kinds of models have followed on this work, which is really exciting, and you're sure to see more. This has all happened over just the last year or two, and I believe these models, because of the properties I outlined, could be very helpful at the edge.

One other place where long sequences come up is DNA: these are really long sequences, billions of base pairs. We started working on this in a paper that will appear at this year's NeurIPS conference in just a week or so, where Eric showed that you could just predict the next token in a DNA sequence, and all of a sudden we started learning from it. That's something I find super exciting.
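Here's a minimal numerical sketch of that RNN/convolution duality. This is not the actual S4 implementation (S4 parameterizes A very carefully, e.g. with the HiPPO initialization, and computes the kernel far more efficiently); it just demonstrates that the same discrete LTI system can be run step by step as a recurrence or all at once as a convolution.

```python
# Toy state space model: x[k] = A x[k-1] + B u[k],  y[k] = C x[k].
# Not S4 itself; matrices are random just to show the two equivalent views.
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 16                        # state size, sequence length
A = 0.3 * rng.normal(size=(N, N))   # scaled down so the toy system is stable
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
u = rng.normal(size=L)              # input, e.g. one telemetry channel

# Recurrent (RNN) mode: constant work per step -- ideal for streaming
# inference at the edge.
x = np.zeros((N, 1))
y_rnn = []
for k in range(L):
    x = A @ x + B * u[k]
    y_rnn.append((C @ x).item())

# Convolutional mode: unroll the recurrence into a kernel K with
# K[i] = C A^i B, so y is a causal convolution K * u -- this is the view
# that trains in parallel like a CNN.
K = np.array([(C @ np.linalg.matrix_power(A, i) @ B).item() for i in range(L)])
y_conv = [float(np.dot(K[: k + 1][::-1], u[: k + 1])) for k in range(L)]

assert np.allclose(y_rnn, y_conv)   # both views compute the same output
```

That one assert is the whole trick: train in the fast convolutional view, then deploy the identical weights in the cheap streaming recurrent view.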
So what I hope I shared with you today is that tiny AI has the potential to bring AI into our physical world, and we're just at the outset. We're going to need new architectures and models, but with so many people excited about AI, I'm really enthusiastic that this is going to be a big and exciting area. It requires innovation not just on the algorithms, which is my specialty and what I talked about today, but on all the stuff Remy knows better than anybody: the hardware, the software, and putting it all together with folks like STMicro. Thanks so much for your time today. I really hope I get a chance to see some of you in person sometime soon.