Good afternoon, compute community, and welcome back to Denver, Colorado. We're here at Supercomputing 2023. My name is Savannah Peterson, joined here by the ineffable John Furrier. John, what a fantastic time to be alive and computing. You know, AI converging with semiconductors and cloud compute is creating a perfect storm of innovation. Our next guest is going to break down some really innovative technology. Jonathan Ross is here, the co-founder and CEO of Groq. He's also an inventor of the Tensor Processing Unit at Google, which we covered in depth at Google Next and everyone knows was a huge success, and which he did in his spare time. Jonathan, great to have you on theCUBE. Thanks for coming on, really appreciate it.

Thank you, it's great to be here.

So first of all, congratulations on the TPU project that really seeded the mission of Groq. Talk about the origination story of Groq. How did it start? And what are you guys doing today in the market?

Well, Groq started because when I was at Google and created the TPU, we realized even back then that there wasn't going to be enough compute for everyone, for AI. And we're seeing that now. We wanted to make sure that everyone had access to the AI economy. So we started Groq in 2016 and we invented what we call the LPU, the language processing unit, and it's really fast.

What was the moment you wanted to democratize it? Was there a particular point in time when you said, you know what, I've got to do this?

Yeah, it was when we played the world champion at Go at Google. It was all DeepMind software, but about 30 days before the competition there was a test game played, and we lost. Then we ported it to the TPU and we won by a wide margin. We realized, oh, compute really matters. It's the difference between losing badly and winning handily. And we did not want some people to have this and some people not.

Explain the LPU real quick, because you guys trademarked that, and there have been TPUs, data processing units, obviously the TPU, tensor for TensorFlow, and then you've got the CPU. What is the LPU? What specifically is it? What does it do?

The best way to think about it is that all of these other architectures you're hearing about are really good at parallel compute. Ours is really good at sequential compute. The thing is, you can't produce the hundredth word until you've produced the ninety-ninth. Very much like a game of Go or chess, language is the same thing; it's just a larger space. And so we're very good at sequential problems.

Ooh, that was a great analogy. I'm excited to hear what you have to say about this. So, democratization of AI, a hot topic for us this week at the show, big time. You talked about the haves and have-nots. Speaking of, did Elon steal your name?

So, he did not steal it, but he is trying to.

Was he inspired by it?

I think he was heavily inspired by us. And yeah, so, would you like me to...

I mean, I think it's great because, you know, imitation is the highest form of flattery. You were Groq first. Now we actually have a really exciting demo that shows off Groq. This is Groq versus Grok. Could you go ahead and bring up this fun demo for us?

So, famously, when Elon Musk started Tesla, he gave Peter Thiel a ride, and right before slamming the accelerator down, he said, watch this. So Elon, watch this. This is your model versus the real Groq.

That is... oh, wait, it keeps going. Oh, wait, there's more.
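As an aside on the sequential-compute point above: a minimal sketch of why generation can't simply be parallelized across the output, since each token depends on everything generated before it. The toy next-token function is a hypothetical stand-in, not Groq's implementation.

```python
# Illustrative only: a toy autoregressive decode loop. The hundredth word
# cannot be produced until the ninety-ninth exists, because each step
# consumes everything generated so far.

def toy_next_token(context: list[str]) -> str:
    # A real model would run a full forward pass here; we just echo a counter.
    return f"word{len(context) + 1}"

def generate(prompt: list[str], n_tokens: int) -> list[str]:
    output = list(prompt)
    for _ in range(n_tokens):
        output.append(toy_next_token(output))  # strictly sequential dependency
    return output

print(generate(["hello"], 5))
```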
That sounded like another famous person in tech right now.

I love that, yeah. I wish you could see my arms right now; the hair is standing up. I definitely have goosebumps. When you showed me this just a second ago, before we were live, the first word out of my mouth was wow. And what I loved is you said, wow is our brand.

We're literally getting our business cards changed to say wow on them, and we're going to hand them over when people say wow. We've never done the demo and had someone not say wow. Except sometimes people use expletives or other words instead.

Yeah, there's some of that. For once I didn't, which is rare for me. Where is the speed coming from? Is it because you guys have the chip? Is it the language model? What specifically is the secret sauce?

So it's purely the architecture, and in particular the system. Rather than building a chip, we built a system. And we started with the software. For the first six months at Groq, we worked on the compiler. We banned people from designing the chip. We got rid of all the whiteboards in the office because people kept trying to draw pictures of chips. After about six months we had a compiler working.

But what we did was build a factory. Normally when you buy chips, you'll have one chip, eight chips, or something like that working on a problem. But that's not very efficient. You wouldn't build a small factory; you'd build a big one. So that demo is running on 576 chips, and each one of the chips does a very small part of it and then hands it off to the next chip. So we're actually 10 times faster, 10 times cheaper, 10 times lower power. And because we started with a compiler, we can often get software working 20 times faster, which is crazy. If I hadn't just shown you how fast it was, you wouldn't believe me. But that's real, and we're ramping up our production.

That's wild. I mean, that's an order of magnitude or more across a lot of different axes within a business. But there's a trade-off. Let's talk about it.

The trade-off is it's a factory. Just like if you were building one car, you'd never build a factory. If you were going to compute something for one user or a small number of users, this makes no sense. If you're going to scale, this is the solution.

So what types of customers are coming to ask you to help scale in this space?

A lot of finance, a lot of government, a lot of tech companies, but actually, interestingly, a lot of startups. And the website that I showed you, that's actually running on our API. We sell tokens as a service, but we also sell the hardware, whatever people need.

What does this mean for the GPU, and what's the role of the CPU in this?

Well, GPUs and CPUs are great. You should keep buying them. CPUs, we actually have some CPUs in that system. GPUs are what we used to train that model. We're no better than a GPU at training; we're better at inference. So you should continue buying GPUs, and you should buy LPUs when you need inference performance.

We were at the KubeCon conference last week, and one of the Google engineers, Tim Hopkins, said inference is the new web app. What is the big deal about inference? A lot of people are now looking at training: okay, it's expensive, GPUs do it great. What's the big deal about inference? Why is it so important for people to get inference right, or even consider it a big part of the design?

Well, training is a cost center. You make money at inference. And training also isn't that large.
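An aside on the factory analogy above: a rough sketch of pipelined inference, where each chip owns a small slice of the model and hands its output to the next, like stations on an assembly line. The stage function and the 576-stage count are purely illustrative, not Groq's actual partitioning.

```python
# Illustrative sketch of pipelined inference: each "chip" owns a slice of the
# model's layers and passes its activations to the next chip in line.

from typing import Callable, List

def make_stage(stage_id: int) -> Callable[[float], float]:
    # Stand-in for a slice of the model's layers running on one chip.
    return lambda activation: activation + stage_id * 1e-6

def build_pipeline(num_chips: int) -> List[Callable[[float], float]]:
    return [make_stage(i) for i in range(num_chips)]

def run_token(pipeline: List[Callable[[float], float]], activation: float) -> float:
    # One token flows chip -> chip. In real hardware, the previous chip would
    # already be working on the next token while this one is in flight; that
    # overlap is the pipelining win (not modeled in this sequential sketch).
    for stage in pipeline:
        activation = stage(activation)
    return activation

pipeline = build_pipeline(576)
print(run_token(pipeline, 0.0))
```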
It scales with the number of ML researchers you have, and no one has enough ML researchers. But inference scales with the number of users. And this has always been the case. This is why we built the TPU at Google originally. They had trained a model that was better than humans at speech to text, and we were like, mission accomplished. Then: oh, we can't put it in production. We would need to double or triple our global data center footprint, and that's just for speech recognition. If we want to do anything else, it's going to cost more.

That's wild to visualize.

Yeah, it's insane. It was all over Google when we were talking about this. And that was just for one service.

That's insane. What does this do for the workflow, and for the people who are going to get democratized AI with the LPU inference engine? What do you envision is going to be enabled for them to unlock value?

It's going to eliminate a lot of drudgery. You're not going to be doing large templated work. You're not going to be doing any of that. There's some fear: is AI going to come and take my job? The reality is AI is not going to take your job. People who know how to use AI, if you don't, will take your job.

Preach louder for the people in the back on that one, yeah.

But the beautiful thing about this is it can teach you. I'm actually learning things about AI from these large language models. They're great at analogies, and you can pick your favorite analogy. You could be like, hey, how do large language models work? Give me an analogy from sports, or give me an analogy from reading, or whatever you want. And it puts it in the way that you want to understand it. You can ask it questions.

And I actually think that these large language models are going to be good for society for the following reason. It is difficult to be subtle and nuanced. It takes thought. It takes energy. These models can do that. And so when you're using these models and you're asking them questions, they're going to be like, no, no, no, it's not that simple. There's this side and there's this side. You already see them doing that. So I actually think people are going to get better at understanding subtlety and nuance, and I think we're going to be better at getting along.

My gosh, first of all, that's an outstanding case for AI, and I couldn't agree with you more. As long as we're building them thoughtfully, there's going to be empathy within that which will create new learning and new ways of even thinking about how to ask questions.

Which, yeah. You know what grok stands for?

Tell us.

To understand deeply, with intuition and empathy.

There we go. Well, now I'm shocked Elon would be playing in that game. I want to follow up a little bit here on the chipset side. We're talking about silicon. For good reason, there are a few other AI vendors in the market making chips. What makes Groq different?

Well, actually, one of the interesting things, it's the architecture, right?

Yes, it's clearly the architecture. And for the same reason that GPUs are better at AI than CPUs, even though they're using the same silicon, we're better for inference than GPUs are. Not for training, just as good. Now, that demo is running on 14 nanometer silicon. That's ancient. That's fabbed in Malta, New York. That's in the US. We've had this around for a while. We assemble it in California, and it's deployed in Washington state. So this is a fully US thing. Our next chip is going to be built in Taylor, Texas.

Wow. What does that mean for you, to be able to have production here and have control?
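An aside on the point above that training scales with researchers while inference scales with users: a hedged back-of-the-envelope cost model. Every number in it is invented purely for illustration; the shape of the comparison is the point.

```python
# Back-of-the-envelope sketch (all numbers invented for illustration):
# training spend grows with how many researchers are running experiments,
# while inference spend grows with how many users you serve, so at consumer
# scale the inference bill is what dominates.

def training_cost(num_researchers: int, runs_per_researcher: int,
                  cost_per_run: float) -> float:
    return num_researchers * runs_per_researcher * cost_per_run

def inference_cost(daily_users: int, queries_per_user: int,
                   cost_per_query: float, days: int) -> float:
    return daily_users * queries_per_user * cost_per_query * days

yearly_training = training_cost(num_researchers=50, runs_per_researcher=20,
                                cost_per_run=100_000.0)
yearly_inference = inference_cost(daily_users=100_000_000, queries_per_user=10,
                                  cost_per_query=0.002, days=365)
print(f"training  ~${yearly_training:,.0f}/yr")
print(f"inference ~${yearly_inference:,.0f}/yr")
```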
On-prem essentially, right?

It's easier when you're talking to customers. It's easier in terms of the supply chain. Just everything is easier.

So the semiconductor shortage.

That's exactly what I was thinking, John.

It doesn't apply to us, because we have a completely differentiated supply chain. If you buy a GPU from any vendor, they all use this thing called HBM, a special type of memory, and there's a finite amount of it being built. Then they put it on top of something called CoWoS, an interposer, and there's a finite amount of that too. We don't use any of it. So we have a completely differentiated supply chain, and we could actually scale to millions of chips in the next couple of years.

Amazing. The conversation we have on theCUBE is around cloud computing, semiconductors, enablement. What's your vision for the kinds of apps that are going to come out of this? Because you've got inference that's going to happen at scale, then small language models coming out, and developers coming in here. So democratization assumes some enablement. What do you envision apps looking like on top of this capability?

Everyone is going to do the obvious: a human being does this, let's automate it. But I think this is like the hammer, and we haven't yet invented the nail. The reason that you have a hammer in your house is because of nails. There are going to be a bunch of things that are possible that we haven't envisioned yet, because we haven't imagined what you could do if you had a large language model available 24/7, not limited to speaking with one person or a small group at a time, but able to speak with a large number. The applications haven't been envisioned yet. They're not even in science fiction yet.

Yeah, and it's going to be great. I think the creativity is going to be amazing. Another question I want to ask you: as the large language models get all the fanfare, we're seeing a power law of smaller specialty models come out, maybe smaller, maybe cleaner or more acute about domain expertise. How are models going to interact? How do you see inference taking advantage of the alchemy between models?

So we think there are two different paths here that people are pursuing, and we're neutral; we're good at both. One is to build a bunch of small models that are specialized, and the other is to build a couple of very large models. Now here's the thing: counterintuitively, small models are often more expensive to run than the large ones. Very counterintuitive, right? And the reason is there's this thing called a batch. You actually have to process multiple users at the same time to be cost effective, just like keeping a factory busy. If you have a very specialized model, it's hard to find enough users to keep it busy, and we've seen that very often it's more expensive to run that smaller model than one big one.

That actually makes a lot of sense.

But the bigger ones are slow. We showed that model, the 70 billion, to some people who've made some of these bigger models, and the first response I got was, I'm now going to go train a 300 billion parameter model. We got this from multiple customers, and the first time it happened I'm like, but we do inference. He said, no, no, no, you don't understand. I can train a 300 billion parameter model; that's not the problem. I can't put it in production, it's too slow. Now I can. So we can do either, because we're really good at small batch sizes. We do pipelining more than batching to get our performance, so we max out at batch size 20.
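An aside on the batching point above: a toy cost model of why a specialized small model with too few concurrent users can cost more per token than a big shared model that keeps its batch full. All the numbers are made up for illustration.

```python
# Illustrative sketch of the batching economics: the hardware cost of a step
# is roughly fixed, so cost per token depends on how full the batch is.
# A niche specialized model with little traffic leaves the batch mostly empty
# and can cost more per token than a big shared model running full.

def cost_per_token(step_cost: float, max_batch: int, concurrent_users: int) -> float:
    occupied = min(concurrent_users, max_batch)
    if occupied == 0:
        raise ValueError("no traffic, no tokens")
    return step_cost / occupied

# A big shared model with plenty of traffic fills its batch...
big_shared = cost_per_token(step_cost=1.0, max_batch=20, concurrent_users=500)
# ...a niche specialized model with two concurrent users does not.
small_niche = cost_per_token(step_cost=0.25, max_batch=20, concurrent_users=2)

print(f"big shared model:  {big_shared:.3f} per token")
print(f"small niche model: {small_niche:.3f} per token")
```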
That's where we get our best performance, whereas with GPUs it's more like 4,000. So we can do very small models. We don't care which way it goes, but there are trade-offs either way. We're going to see both.

So you get the performance and you get the enablement.

That's right.

On both sides.

That's right.

So who are your customers right now? Who's buying from you? Cloud guys?

So we're already deployed with labs, and now we're doing a bunch of benchmarking and testing with the cloud guys, with the banks, with government, and so on. All the people you would imagine would be at the front of this. And then the interesting ones are all these startups who have use cases that are not just "what would a human do" but "can we have a large language model do it." Those are really interesting.

What does a large language model mean to you size-wise? Can you scope that out for us?

Well, we think of 70 billion as kind of small. We're mostly talking to people about 180 billion and up now, 300 billion is a sweet spot, and some people want to do a trillion-parameter model on our hardware because it's the only way you can get the speed out of it. So the more the merrier there.

Yeah, and there's a little bit of confusion on this. People think the larger models are harder to train. In a lot of ways they're easier. They take more compute, but they're easier. There's a great paper on scaling laws, and you can actually see that for every token you put into a model, the larger the model, the easier it is to absorb it. So the larger the model, the easier it is to train; it just takes longer and costs more. So you're going to see a lot of large models being thrown together by people. They're going to need more GPUs to train them, but all the inference is going to be done on LPUs.

My gosh, that's exciting. Such a thrilling time. So we talked a little bit about the verticals that you're touching, and I want to go back to batches just for a second. Can you give us some real-world examples of what this means for folks? I'm thinking of, say, fraud protection within seconds versus a nightly batch run by my bank, or whatever might be an example.

Well, nightly versus seconds, that's the difference between fraud detection and fraud prevention.

Exactly.

So it changes the nature. We're not talking about making something a little bit faster; we're changing the nature of it. And one of the very interesting use cases is tutoring human beings live. People do a lot of varied tasks. You don't want to hand over the decision-making authority; no one trusts these models yet. They hallucinate, they make mistakes. That's the reality. But oh my gosh, do they do a lot of the most basic stuff for you. And I actually will often iterate with a model, and I'll be like, hmm, tell me, what is an orthonormal basis of emotion? That's a math concept, and that's a really weird question. And then it'll come up with an answer, and I'm like, oh okay, and I start playing with it. And so it triggers the creative process.

You're playing with the left brain, right brain in there a little bit.

Yeah, and I like that. It's sort of a creative aid. So you're going to start seeing people just throw stuff at it that they would never have done all the work to figure out themselves, and be like, is this an interesting thing? Does it stick? Yes? Then start working on it.

It's scaling intellect.

Yes. It lowers the barrier to entry to do something.

Reminds me of the old cloud days. You go to Amazon, put your credit card down, you're up and running. But here, it's thought.
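An aside on the scaling-law point above: a rough sketch of the power-law form those papers fit, where predicted loss falls smoothly with parameter count and training tokens, which is one way to see why a bigger model gets more out of every token. The constants below are illustrative placeholders, not fitted values from any particular paper.

```python
# Sketch of the scaling-law intuition: loss falls as a power law in
# parameters N and training tokens D, so a larger model extracts more
# from each token it sees. Constants are placeholders for illustration.

def predicted_loss(n_params: float, n_tokens: float,
                   e: float = 1.7, a: float = 400.0, b: float = 400.0,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    return e + a / n_params**alpha + b / n_tokens**beta

for n_params in (7e9, 70e9, 300e9):
    loss = predicted_loss(n_params, n_tokens=2e12)
    print(f"{n_params / 1e9:>5.0f}B params, 2T tokens -> predicted loss {loss:.3f}")
```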
Steve Jobs said computers are the bicycle for the mind. This is the motorcycle for the mind.

Ooh. Yeah, that's a great sound bite. And I'm ready to go for a ride, baby, let's go. Elon might want to OEM the chip for his brain implant, because that's something people are really talking a lot about on Reddit these days. Well, congratulations on a great thing. I want to follow up on this inference, because there are a lot of software startups out there doing, quote, inference infrastructure. How do you see that? What does the competition look like there? Are they customers of yours? Are they more just picks and shovels?

They don't need to compete. Our architecture is unique, but we're willing to license it and help other people along. There are a lot of things that we're not doing, like automotive, or 5G and 6G mobile. And we're open; the software just works. We don't have kernels. We have over 800 models compiling to our hardware, and we can just change the size. So if you want to do a mobile configuration, fine. Why are you dealing with the software? That's the hard part, and we've figured it out. If you want to build a chip, work with us.

Now, what we're seeing is, I think a lot of people misunderstood inference. I don't think they were there experiencing it, and they think it's this small thing.

Sit here for a second and define it for us, because I do think people are quick to jump over that instead of understanding what inference really means.

Everyone realized that for training you needed a large compute cluster, because otherwise it just takes too long; it takes years instead of months. But now the models have also gotten so large that individual chips or even individual servers can't solve the problems quickly enough. They can't run. And here's one thing that hasn't even been done yet. As good as these large language models are, I want you to ask some question when you go home and you're playing with one, and I want you to imagine this: if a human being gave you that answer without having access to the backspace or delete key, how amazing would that answer be? These models are operating stream of consciousness.

It is true.

And they're that good. Now, if you're lower latency and faster, you can iterate, and it's like giving them the backspace, and the quality grows. So typically when you reflect...

It's editing itself. Another great analogy. You're killing the analogy game today, Jonathan.

Thank you. Well, I've been talking to the large language models, and they've helped me quite a bit.

I was bringing it full circle. I was teeing you up for that slam dunk. It's going to run me out of business. Yeah, yeah, theCUBE.

But here's the rule of thumb: if you iterate three times, it's like a generational model improvement. So if you're on GPT-3 and you iterate three times, you're on GPT-4. You iterate three times, you're on GPT-5. We don't know if that keeps going, but...

That's cool.

Of course, that's three, nine, twenty-seven iterations, so it gets more expensive. So you do want the better model, but you can reach into the future and get a better model.

Fascinating conversation. I've got to ask you, for the rest of us out here in the real world who are transforming from the old compute, storage, networking cloud infrastructure to a kind of new infrastructure, the AI systems that are emerging, whether it's neural nets with LPUs. I can imagine maybe a whole transformation of infrastructure. How should companies, how should developers and entrepreneurs, the creative class now, configure their data?
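An aside on the iterate-and-improve point above: a minimal sketch of a draft, critique, revise loop, the kind of round-tripping that low-latency inference makes practical. The call_llm callable is a hypothetical stand-in for whatever completion API you use, and the three-iterations-per-generation figure is the speaker's rule of thumb, not a measured law.

```python
# Illustrative refinement loop: draft -> critique -> revise, repeated.
# Low latency is what makes several round trips feel instant to the user.
# `call_llm` is a hypothetical stand-in for a real completion API.

from typing import Callable

def refine(call_llm: Callable[[str], str], question: str, iterations: int = 3) -> str:
    answer = call_llm(f"Answer concisely: {question}")
    for _ in range(iterations):
        critique = call_llm(f"List weaknesses in this answer:\n{answer}")
        answer = call_llm(
            f"Question: {question}\nDraft: {answer}\n"
            f"Critique: {critique}\nRewrite the draft addressing the critique."
        )
    return answer

# Toy stand-in so the sketch runs without a real model behind it.
def fake_llm(prompt: str) -> str:
    return f"[response to {len(prompt)} chars of prompt]"

print(refine(fake_llm, "Why does pipelining help inference latency?"))
```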
How should they be thinking about taking their current setup, which will soon be old and antiquated, to a modern architecture? Because data obviously is key; language is data. What's your vision, recommendation, or view of how people here should be thinking about getting more performance? Data is involved; you've got to store it somewhere. Is there going to be a radical shift, or is it evolutionary? What's the transformation journey for the practitioners out there, whether they're entrepreneurs or in enterprise IT or the data center?

I think a lot of people are looking at AI as if it's big data, just bigger data. The reality is that the model I just showed you would fit on my phone. So it's not big data, it's big compute. That's running on 576 chips, about the size of eight refrigerators. It's the compute. So we're moving from a data gravity sort of situation to compute gravity. As an example, think of it this way. To train Llama 2, 70 billion, the GPUs you'd need would be about three to five million dollars' worth. But if I were to transfer all the data it was trained on over my terrible cellular plan, that would be about $40,000. So data is not the problem. There is no data gravity; it's compute gravity. So we're now starting to talk with people about building compute centers rather than data centers. The difference is you don't need a bunch of them all around; you can have a couple of them. The demo that I was showing you, that's 1,500 miles away, and it's just as fast.

So there's this decoupling of these systems from memory, and we're starting to see these compute fabrics. Is that why fabrics are hot right now?

I would say so, and I would say that in trying to grapple with this, you are now free. You can put your data wherever you want. It's not a big deal; you can move it around. It's two bytes of data in, 180 billion compute operations, and then two bytes of data out.

So Michael Dell started his company in a dorm room. The mainframe was a huge machine, then the mini, then the smaller PC form factor. Is it going to get smaller and smaller and smaller? That's where it's going, right? That's the next analogy, which is smaller, faster, cheaper.

I think it's going to get bigger and bigger. So the thing is...

On the device side or the compute-center side?

You can serve it to a cell phone just as fast, actually faster, because you get more compute. And since it's just a little bit of data, like text data, you can actually serve this from a couple of places on a continent and then service everyone on their mobile phones and so on. It's not a lot of data. So it's more economical, like a factory, to do it that way. Now, I do think you are going to do some of this on the phone for when you're disconnected. But when you want the high-quality answer, it's got to go back to the compute center.

Got it, got it, got it. We'll be playing at the edge, but there's going to be somewhere where we're computing. All right, I've got one last closing question for you. Everyone who watches theCUBE knows I come from a marketing background and am a sucker for swag. You actually brought some living marketing to the show. You brought a live llama.

Yes.

That is just milling around Denver, Colorado right now. Her name is Bunny.

Bunny. She's incredibly soft; you can pet her.

I love this. An excellent activation. And it makes sense, for anyone who knows anything about AI, why you would bring a llama. It's the name of the model.

Exactly.

So it makes perfect sense there.
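An aside on the compute-gravity arithmetic above: a worked back-of-the-envelope in the same spirit. The two-ops-per-parameter-per-token figure is a common rule of thumb for decoding, and the corpus size and per-gigabyte price are assumptions chosen only to show the shape of the comparison.

```python
# Back-of-the-envelope for "compute gravity": per generated token, only a few
# bytes move over the wire, but the compute is on the order of 10^11
# operations (~2 ops per parameter per decoded token is a common rule of
# thumb). Sizes and prices below are illustrative assumptions.

PARAMS = 70e9                 # a Llama-2-70B-class model
OPS_PER_PARAM_PER_TOKEN = 2   # rough decoding rule of thumb
BYTES_PER_TOKEN = 2           # a token id is a couple of bytes on the wire

ops_per_token = PARAMS * OPS_PER_PARAM_PER_TOKEN
print(f"~{ops_per_token:.1e} ops per token vs {BYTES_PER_TOKEN} bytes in/out")

# The training corpus, by contrast, is the "heavy" data (assumed ~2T tokens).
corpus_tokens = 2e12
corpus_gb = corpus_tokens * BYTES_PER_TOKEN / 1e9
price_per_gb = 10.0           # hypothetical (bad) cellular plan, $/GB
print(f"moving a ~{corpus_gb:,.0f} GB corpus at ${price_per_gb}/GB "
      f"~= ${corpus_gb * price_per_gb:,.0f}")
```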
Fortunately no one named one, you know, Tiger Shark or anything like that.

Yeah, or Grok with a CK, or something like that. I'm curious, what are you going to be bringing to Supercomputing next year to highlight the trend that's in the pipeline right now being realized?

Well, you're going to have to show up next year.

You're not going to tease us with a fun prediction?

Groq with theCUBE. You're not going to expect it.

Oh. Maybe he's going to buy Twitter for the name. Wow, Jonathan, Jonathan with a cliffhanger. Thank you so much. Wonderful to have you on theCUBE for the first time. John, fantastic to sit here geeking out and taking notes with you. I know we both got as much out of that as, hopefully, you did, our fantastic audience. Thank you so much for joining us here in the Mile High City. We are at Supercomputing 2023. My name is Savannah Peterson. You're watching theCUBE, the leading source for emerging technology news.