From around the globe, it's theCUBE, presenting Cube on Cloud, brought to you by SiliconANGLE.
As I've said many times on theCUBE, for years, decades even, we've marched to the cadence of Moore's Law, relying on the doubling of performance every 18 months or so. But that is no longer the mainspring of innovation in technology. Rather, it's the combination of data, applying machine intelligence, and the cloud, supported by the relentless reduction of the cost of compute and storage and the buildout of a massively distributed computer network. Very importantly, in the last several years, alternative processors have emerged to support offloading work and performing specific tasks. GPUs are the most widely known example of this trend, with the ascendancy of NVIDIA for certain applications like gaming and crypto mining, and more recently machine learning. But in the middle of the last decade, we saw early development focused on the DPU, the data processing unit, which is projected to make a huge impact on data centers in the coming years as we move into the next era of cloud. And with me is Pradeep Sindhu, co-founder and CEO of Fungible, a company specializing in the design and development of DPUs. Pradeep, welcome to theCUBE. Great to see you.
Thank you, Dave, and thank you for having me.
You're very welcome. Okay, my first question is, don't CPUs and GPUs process data already? Why do we need a DPU?
That is a natural question to ask. CPUs have been around in one form or another for almost 55, maybe 60 years, and this is when general purpose computing was invented. And essentially all CPUs went to x86, the x86 architecture, by and large. ARM, of course, is used very heavily in mobile computing, but x86 is primarily used in the data center, which is our focus. Now, you can understand that the architecture of general purpose CPUs has been refined heavily by some of the smartest people on the planet. For the longest time, improvements, you referred to Moore's Law, which is really the improvement in the price performance of silicon over time, combined with architectural improvements, were the thing pushing us forward. Well, what has happened is that the architectural refinements are more or less done. You're not going to squeeze much more blood out of that stone, the general purpose computer architecture. What has also happened over the last decade is that Moore's Law, which is essentially the doubling of the number of transistors on a chip, has slowed down considerably, to the point where you're only getting maybe 10, 20% improvements every generation in the speed of the transistor, if that. What's happening also is that the spacing between successive generations of technology is actually increasing, from two, two and a half years to now three, maybe even four years. And this is because we are reaching some physical limits in CMOS. These limits are well recognized, and we have to understand that they apply not just to general purpose CPUs, but also to GPUs. Now, general purpose CPUs are just that, really general; they can do lots and lots of different things. It is actually a very, very powerful engine. The problem is it's not powerful enough to handle all computations well. So this is why you ended up having a different kind of processor called the GPU, which specializes in executing vector floating point arithmetic operations much, much better than CPUs, maybe 20, 30, 40 times better.
Well, GPUs have now been around for probably 15, 20 years, mostly addressing graphics computations. But recently, in the last decade or so, they have been used heavily for AI and analytics computations. So now the question is, why do you need another specialized engine called the DPU? Well, I started down this journey almost eight years ago, while I was still at Juniper Networks, which is another company that I founded. I recognized that in the data center, as the workload shifts to addressing larger and larger corpuses of data, number one, and as people use scale out as the standard technique for building applications, number two, the amount of east-west traffic increases greatly, and you now have a new type of workload coming. Today, probably 30% of the workload in a data center is what we call data-centric. I want to give you some examples of what a data-centric workload is.
Well, I wonder if I could interrupt you for a second, because I want those examples, and I want you to tie it into the cloud, because that's the topic we're talking about today and how you see it evolving. It's a key question we're trying to answer in this program. Of course, early cloud was about infrastructure, a little compute, a little storage, a little networking, and now we have, to get to your point, all this data in the cloud. And we're seeing, by the way, the definition of cloud expand into this distributed, I think the term you use is disaggregated, network of computers. So you're a technology visionary, and I wonder how you see that evolving, and then please work in your examples of that critical workload, that data-centric workload.
Absolutely, happy to do that. If you look at the architecture of cloud data centers, the single most important invention was scale out: scale out of identical or near identical servers, all connected to a standard IP Ethernet network. That's the architecture. Now, the building blocks of this architecture are Ethernet switches, which make up the network, IP Ethernet switches, and then the servers, all built using general purpose x86 CPUs, with DRAM, with SSDs, with hard drives, all connected to the CPU. Now, the fact that you scale these server nodes, as they're called, out was very, very important in addressing the problem of how you build very large scale infrastructure using general purpose compute. But this architecture, Dave, is a compute-centric architecture. And the reason it's a compute-centric architecture is that if you open a server node, what you see is a connection to the network, typically with a simple network interface card, and then you have CPUs, which are in the middle of the action. Not only are the CPUs processing the application workload, they're also processing all of the IO workload, what we call data-centric workload. So when you connect SSDs and hard drives and GPUs and everything to the CPU, as well as to the network, you can imagine that the CPU is doing two functions: it's running the applications, but it's also playing traffic cop for the IO. Every IO has to go through the CPU, you're executing instructions, typically in the operating system, and you're interrupting the CPU many millions of times a second.
Now, general purpose CPUs and their architecture were never designed to play traffic cop, because the traffic cop function requires you to be interrupted very, very frequently. So it's critical to understand that in this new architecture, where there's a lot of data and a lot of east-west traffic, the percentage of workload which is data-centric has gone from maybe 1 to 2% to 30 to 40%. I'll give you some numbers which are absolutely stunning. If you go back to, say, 1987, which is the year in which I bought my first personal computer, the network was some 30 times slower than the CPU. The CPU was running at 50 megahertz; the network was running at three megabits per second. Well, today the network runs at 100 gigabits per second, and the clock speed of a single CPU core is about two to three gigahertz. So you've seen something like a 600x change in the ratio of IO to compute, just in raw clock speed. Now you can tell me, hey, typical CPUs have lots and lots of cores, but even when you factor that in, there's been close to two orders of magnitude change in the ratio of IO to compute. There's no way to address that without changing the architecture, and this is where the DPU comes in. The DPU actually solves two fundamental problems in cloud data centers. And these are fundamental; there's no escaping them, and no amount of clever marketing is going to get around them. Problem number one is that in a compute-centric cloud architecture, the interactions between server nodes are very inefficient. Problem number two is that these data-centric computations, and I'll give you those four examples, the network stack, the storage stack, the virtualization stack, and the security stack, are executed very inefficiently by CPUs. Needless to say, if you try to execute these on GPUs, you'll run into the same problem, probably even worse, because GPUs are not good at executing these data-centric computations. So what we set out to do at Fungible is to solve these two basic problems. And you don't solve them by just taking older architectures off the shelf and applying them to these problems, because that's what people have been doing for the last 40 years. So what we did was create this new microprocessor that we call a DPU, from the ground up. It's a clean sheet design, and it solves those two problems fundamentally.
So I want to get into that, but I just want to stop you for a second and ask a basic question. If I understand it correctly, if I just took the traditional scale out, if I scale out compute and storage, you're saying I'm going to hit diminishing returns. Not only is it not going to scale linearly, I'm going to get inefficiencies. And that's really the problem you're solving. Is that correct?
That is correct. The workloads that we have today are very data heavy. Take AI, for example, take analytics, for example; it's well known that for AI training, the larger the corpus of relevant data that you're training on, the better the result. So you can imagine where this is going to go, especially when people have figured out that, hey, the more data I collect, the more I can use those insights to make money.
Yeah, this is why I wanted to talk to you, because for the last 10 years we've been collecting all this data. Now I want to bring in some other data that you actually shared with me beforehand, some market trends that you guys cited in your research.
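Before the conversation turns to the survey data, here is a quick back-of-the-envelope check of the IO-to-compute shift Pradeep cites above (1987: roughly 3 Mb/s network versus a 50 MHz CPU; today: 100 Gb/s versus roughly 3 GHz per core). This is a minimal, purely illustrative sketch of that arithmetic, with the 3 GHz figure assumed from his "two to three gigahertz":

```python
# Figures as quoted in the interview; simple illustrative arithmetic only.
net_1987_bps = 3e6        # ~3 megabits per second
cpu_1987_hz = 50e6        # ~50 MHz CPU clock
net_now_bps = 100e9       # 100 gigabits per second
cpu_now_hz = 3e9          # assumed ~3 GHz single-core clock

ratio_1987 = net_1987_bps / cpu_1987_hz   # network bits per CPU cycle in 1987: 0.06
ratio_now = net_now_bps / cpu_now_hz      # network bits per CPU cycle today: ~33.3

print(f"Shift in IO-to-compute ratio: ~{ratio_now / ratio_1987:.0f}x")   # ~556x, i.e. the ~600x he cites
```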
And the first thing people said is they want to improve their infrastructure, and they want to do that by moving to the cloud. There was also a security angle there, which is a whole other topic we could discuss. The other stat that jumped out at me is that 80% of the customers you surveyed said they'll be augmenting their x86 CPUs with alternative processing technology. So, I know it's self-serving, but it's right on the conversation we're having. I want to understand the architecture and how you've approached this. You've clearly laid out that x86 is not going to solve this problem, and even GPUs are not going to solve this problem. So help us understand the architecture and how you do solve it.
I'll be very happy to. Remember, I used this term traffic cop, and I used it very specifically. First let me define what I mean by a data-centric computation, because that's the essence of the problem we're solving. Remember, I said two problems. One is that we execute data-centric workloads at least an order of magnitude more efficiently than CPUs or GPUs, probably 30 times more efficiently. And the second is that we allow nodes to interact with each other over the network much, much more efficiently. So let's keep those two things in mind. First, let's look at the data-centric piece. For a workload to qualify as data-centric, four things have to be true. First, it needs to come over the network in the form of packets. Well, this is all workloads, so I'm not saying anything new. Second, this workload is heavily multiplexed, in that there are many, many computations happening concurrently, thousands of them. That's number two, a lot of multiplexing. Number three is that this workload is stateful. In other words, you can't process packets out of order; you have to do them in order because you're terminating network sessions. And the last one is that when you look at the actual computation, the ratio of IO to arithmetic is medium to high. When you put all four of them together, you have a data-centric workload. And this workload is terrible for general purpose CPUs. Not only do general purpose CPUs not execute it efficiently, the application that is running on the CPU also suffers, because data-centric workloads are interfering workloads. So unless you design specifically for them, you're going to be in trouble. So what did we do? Our architecture consists of very heavily multi-threaded general purpose CPUs combined with very heavily threaded specific accelerators. I'll give you examples of some of those accelerators: DMA accelerators, erasure coding accelerators, compression accelerators, crypto accelerators, lookup accelerators, and so on. These are functions that, if you do not specialize, you're not going to execute efficiently. But you cannot just put accelerators in there; these accelerators have to be multi-threaded. We have something like a thousand different threads inside our DPU to address these many, many computations that are happening concurrently and handle them efficiently. Now, the thing that is very important to understand is that given the paucity of transistors (I know that we have hundreds of billions of transistors on a chip), the problem is that those transistors are used very inefficiently today in the architecture of a CPU or a GPU.
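As an aside, the four tests Pradeep walks through above can be restated as a toy checklist. A minimal sketch in Python, where the function name, parameters, and thresholds are illustrative assumptions rather than Fungible's definitions:

```python
def looks_data_centric(arrives_as_packets: bool,
                       concurrent_computations: int,
                       stateful: bool,
                       io_to_arithmetic_ratio: float) -> bool:
    """Toy restatement of the four criteria from the interview; thresholds are guesses for illustration."""
    return (arrives_as_packets                    # 1. work arrives over the network as packets
            and concurrent_computations >= 1000   # 2. heavily multiplexed: thousands of concurrent computations
            and stateful                          # 3. stateful: packets must be processed in order per session
            and io_to_arithmetic_ratio >= 1.0)    # 4. medium-to-high ratio of IO to arithmetic

# Example: a storage stack terminating thousands of in-order sessions would qualify.
print(looks_data_centric(True, 5000, True, 3.0))  # True
```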
What we have done is improve the efficiency of those transistors by 30 times.
So you can use the real estate much more effectively.
Much more effectively, because we were not trying to solve a general purpose computing problem. If you do that, you end up in the same bucket where general purpose CPUs are today. We were trying to solve the specific problem of data-centric computations and of improving node-to-node efficiency. So let me go to point number two, because that's equally important. In a scale-out architecture, the whole idea is that I have many, many nodes and they're connected over a high performance network. It might be shocking for your listeners to hear that these networks today run at a utilization of no more than 20 to 25%. The question is why. Well, the reason is that if I try to run them faster than that, you start to get packet drops, because there are some fundamental problems caused by congestion on the network which are unsolved as we speak today. There is only one solution, which is to use TCP. Well, TCP is well known; it's part of the TCP/IP suite. TCP was never designed to handle the latencies and speeds inside data centers. It's a wonderful protocol, but it was invented 42, 43 years ago now.
Yeah, very reliable, tested and proven. It's got a good track record, but you're right.
Very good track record. Unfortunately, it burns a lot of CPU cycles. So if you take the idea behind TCP and you say, okay, what's the essence of TCP, how would you apply it to the data center? That's what we've done with what we call FCP, a fabric control protocol, which we intend to open; we intend to publish the standard and make it open. When you do that, and you embed FCP in hardware on top of a standard IP Ethernet network, you end up with the ability to run very large scale networks where the utilization of the network is 90 to 95%, not 20 to 25%, and you solve the problems of congestion at the same time. Now, why is this important? It's all geek speak so far. The reason this stuff is important is that such a network allows you to disaggregate, pool, and then virtualize the most important and expensive resources in the data center. What are those? Compute on one side, storage on the other side, and increasingly even things like DRAM want to be disaggregated and pooled. Well, if I put everything inside a general purpose server, the problem is that those resources get stranded because they're stuck behind a CPU. Once you disaggregate those resources, and we're saying hyper-disaggregate, the hyper simply means that you can disaggregate almost all the resources.
And then you're going to re-aggregate them, right? I mean, that's obviously...
And the network is the key in helping do that. So the reason the company is called Fungible is because we are able to disaggregate, virtualize, and then pool those resources. The scale-out companies, the large ones, AWS, Google, et cetera, have been doing disaggregation and pooling for some time. But because they've been using a compute-centric architecture, their disaggregation is not nearly as efficient as we can make it; they're off by about a factor of three. When you look at enterprise companies, they are off by another factor of four, because the utilization in the enterprise is typically around 8% of overall infrastructure, while the utilization in the cloud for AWS and GCP and Microsoft is closer to 35 to 40%.
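The factor Pradeep draws next follows from the utilization figures just quoted. A quick sketch of that arithmetic in Python, taking the numbers exactly as stated in the interview; this is illustrative back-of-the-envelope math, not Fungible data:

```python
# Figures as quoted in the interview.
network_util_today = (0.20, 0.25)   # typical data center network utilization
network_util_fcp = (0.90, 0.95)     # claimed utilization with FCP embedded in hardware
enterprise_util = 0.08              # ~8% overall infrastructure utilization quoted for enterprises
cloud_util = (0.35, 0.40)           # ~35-40% quoted for AWS, GCP, Microsoft

# Network headroom: roughly 3.6x to 4.8x more usable bandwidth at the claimed utilization.
print(f"{network_util_fcp[0] / network_util_today[1]:.1f}x - {network_util_fcp[1] / network_util_today[0]:.1f}x")

# Cloud vs. enterprise infrastructure utilization: roughly 4.4x to 5x,
# the low end of the "factor of almost four to eight" cited next.
print(f"{cloud_util[0] / enterprise_util:.1f}x - {cloud_util[1] / enterprise_util:.1f}x")
```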
So there is a factor of almost four to eight which you can gain by disaggregating and pooling.
Okay, so I want to interrupt you again. These hyperscalers are smart; they have a lot of engineers, and we've seen them, you're right, using a lot of general purpose compute, but we've also seen them make moves toward GPUs and embrace things like ARM. So I know you can't name names, but with all the data that's in the cloud, again, our topic today, you would think the hyperscalers are all over this.
Well, the hyperscalers recognize that the problems we have articulated are important ones, and they're trying to solve them with the resources and all the clever people that they have. So these are recognized problems. However, please note that each of these hyperscalers has their own legacy now. They've been around for 10, 15 years, and so they're not in a position to turn on a dime all of a sudden. This is what happens to all companies at some point.
Oh, they have technical debt, you mean.
Yeah, well, I'm not going to say they have technical debt, but they have a certain way of doing things, and they are in love with a compute-centric way of doing things. Eventually it will be understood that you need a third element, called the DPU, to address these problems. Now, of course, you've heard the term smart NIC, and all your listeners must have heard that term. Well, a smart NIC is not a DPU. What a smart NIC is is simply taking general purpose ARM cores, putting a network interface and a PCI interface on them, integrating them all in the same chip, and separating them from the CPU. So this does solve a problem: it solves the problem of the data-centric workload interfering with the application workload. Good job. But it does not address the architectural problem of how to execute data-centric workloads efficiently.
Yeah, I understand what you're saying, and I was going to ask you about smart NICs. It's almost like a bridge or a band-aid. It reminds me of throwing flash storage onto a disk system that was designed for spinning disk. It gave you something, but it doesn't solve the fundamental problem. I don't know if it's a valid analogy, but we've seen this in computing for a long time.
Yeah, the analogy is close. Okay, so let's take hyperscaler X, we won't name names. You find that, you know, half my CPUs are twiddling their thumbs because they're executing this data-centric workload. Well, what are you going to do? All your code is written in C, C++ on x86. The easiest thing to do is to separate out the cores that run this workload and put it on a different processor, and let's say we use ARM, simply because x86 licenses are not available to people to build their own CPUs, so ARM was available. So they put down a bunch of ARM cores, they stick on a PCI Express interface and a network interface, and they port that code from x86 to ARM. Not difficult to do, and it does yield results. And by the way, if this hyperscaler X, shall we call them, is able to remove 20% of the workload from general purpose CPUs, that's worth billions of dollars. So of course you're going to do that. It requires relatively little innovation other than porting code from one place to another.
But that's what I'm saying.
I mean, I would think, again, the hyperscalers, why can't they just do some engineering and then give you a call and say, okay, we're going to attack these workloads together? That's similar to how they brought in GPUs, and you're right, it's worth billions of dollars. You could see it when Microsoft Azure and AWS both announced, I think, that they now depreciate servers over five years instead of four, and it dropped something like a billion dollars to their bottom lines. But why not just work directly with you guys? I mean, that's the logical play.
Some of them are working with us, so it's not to say that they're not. All of the hyperscalers recognize that the technology we're building is fundamental, that we have something really special, and moreover, it's fully programmable. You know, the whole trick is, you can actually build a lump of hardware that is fixed function, but the difficulty is that in the place where the DPU sits, which is on the boundary between a server and the network, literally on that boundary, the functionality needs to be programmable. So the whole trick is how you come up with an architecture where the functionality is programmable but also very high speed for this particular set of applications. The analogy with GPUs is nearly perfect, because GPUs, and particularly NVIDIA, invented CUDA, which is the programming language for GPUs. It made them easy to use, made them fully programmable, without compromising performance. Well, this is what we are doing with DPUs. We've invented a new architecture, and we've made them very easy to program. And there are these computations that I talked about, security, virtualization, storage, and networking. Those four are quintessential examples of data-centric workloads, and they're not going away. In fact, they're becoming more and more important over time.
I'm very excited for you guys, and I really appreciate this, Pradeep. We're going to have you back, because I really want to get into some of the secret sauce. You talked about these accelerators, the erasure coding and crypto accelerators; I want to understand that. I know there's NVMe in here, there's a lot of hardware and software and intellectual property. But we're seeing this notion of programmable infrastructure extending now into this domain, the build-out of, I like this term, a massive disaggregated network.
Hyper disaggregated.
Hyper disaggregated, even better. And I would say this, and then I've got to go: what got us here the last decade is not the same as what's going to take us through the next decade. Pradeep, thanks so much for coming on theCUBE. It's been a great conversation.
Thank you for having me. It's really a pleasure to speak with you and get the message of Fungible out there.
Oh yeah, I promise we'll have you back. And keep it right there, everybody. We've got more great content coming your way on theCUBE on Cloud. This is Dave Vellante. Stay right there.