OK, let's get started. I've got a countdown timer here, so I want to be on time. Welcome to day three of the OpenInfra Summit. I'm Toby Owen, and I lead product development and product management at Fungible. Today we're going to talk in more depth about data processing units and composable infrastructure, and I'll try to leave some time for questions and answers at the end. It's a little hard to see people, but there are mics on both sides, so if you've got questions when we're done with the presentation, I'd love for the end of this to be interactive.

So here we go. Without repeating yesterday's talk, I'm going to make the case again for why we need a DPU, go a bit more in-depth on what a DPU is and how it works, and then cover in more detail how we at Fungible have implemented DPU technology in systems that enable composable infrastructure.

We made the point yesterday that data is growing. This is one of the big drivers, obviously, for a need to change the way we think about data center architectures. The expectation is 181 zettabytes within the next three years. That's a lot of data. We've also seen some major shifts in overall trends. Moore's law has started to slow down; some people say it's plateaued, but we've gone from a doubling of CPU processing power every 18 months to every three years. In light of this exploding data, we're seeing an increasing need for acceleration at the hardware layer. There's also the continued rise of containers: we're on pace, in the next four or five years, to see installed containers outpace virtual machines. Virtual machines are still growing, certainly, but not at the same rate. And the reality is that as we move toward more and more flexibility and agility in how we manage and deploy applications, we really need the infrastructure to respond in the same way, with the same flexibility and agility. We haven't really seen that in the market.

So Fungible was founded in 2015 with the idea of driving an order-of-magnitude improvement in utilization and efficiency at data center scale, so that infrastructure can keep up with the rising demands of data and the applications that process that data. We designed what we call the data processing unit, or DPU, specifically for managing that data. It's an accelerator that's 100% targeted at data path operations, or I/O operations. And I think what we're going to see is that the DPU becomes a regular component in modern data center architectures.

A little look at history here: over the last 50 years we saw a convergence onto a single general-purpose processor, the CPU, the x86. Everything converged into that. Over the last two decades, though, we've started to see the emergence of purpose-built processors. GPUs are a great example: the CPU is really good at handling a variety of tasks, it's built specifically for that, but the GPU handles specific operations much, much better. We've also seen the rise of FPGAs and other purpose-built accelerators, because while the CPU is great for many things, purpose-built silicon can do better in specific use cases. So as this trend diverges back into a variety of processors, we believe data is the key driver of all this growth. Why not have an accelerator that's targeted at exactly that?
When we talk about data management at Fungible, we like to use the term data-centric computation. Let's look at some trends. In 1987, the CPU was 20 times faster than the network, so it made sense to put the I/O drivers in software on the x86. Fast-forward to a couple of years ago, and the CPU is now 30 times slower than the network. However, we're still running the I/O drivers in x86 software. Over that time, there's been an aggregate 600-fold change in the ratio of network to CPU performance (20 times 30). Obviously, that creates a challenge for the x86 to manage all this effectively. There are a number of issues here, and I'm not going to read everything on the slide for you, but the net is that in a lot of highly demanding use cases, the x86 just can't keep up with the I/O demands anymore. For managing these data-centric workloads, we need a better answer.

So it's worth explaining our definition of a data-centric workload. Reading from the bottom of the slide: work arrives as packets; there's a lot of multiplexing between multiple contexts in the workload; there's a medium-to-high ratio of I/O to compute, so this is definitely data that's moving around; and the computation is stateful. That's how we define data-centric. On this spectrum, the farther you get to the right on the chart, the more data-centric the workload is. At the network layer, everything's packetized, there's lots of multiplexing, and there's a ton of I/O. As you move to the left, there tends to be less data management requirement. So the idea we came up with is a data processing unit to accelerate these data management functions.

At scale, you can think of a DPU as an endpoint in the network. However, different types of resources have very different data paths. The data path you might design to interact with a CPU or a server is going to be very different from the one for storage. So we built the chip to be flexible, so that we can program different data paths specific to these types of resources. Obviously, each of these buckets of resources is a huge market in and of itself: GPUs have reached a $10 billion market size, flash storage has been around a long time and is a huge market, and CPUs, obviously, are huge. Each of these interacts differently with the network and with how data is managed, and fronting them with a DPU allows us to improve efficiency. We'll spend some time talking about how that happens.

So at Fungible, we started with the DPU. We've spent several years in R&D on the chip itself, and we've actually got two chips. One fits on the PCIe card you see on the left, and that's used as an attachment into the host server. We've also built a system around scale-out block storage, which we call the Fungible Storage Cluster, and which we'll talk about more in a minute; that uses the same chip architecture, but a larger, 800-gigabit chip. And most recently, about two months ago, we launched a solution called Fungible GPU Connect, which allows the pooling of GPUs into a chassis, with those GPUs then remotely mounted over Ethernet into the host. So this is really our whole technology stack at Fungible: based on the DPU and the accelerators built into that chip, with our operating system on top, which is where the data paths are programmed.
You can see some examples of those data paths, and then the systems. At the top, we've separated the control plane for all these systems, so the data path is independent, and it can scale, in infrastructure and in throughput, independently of the control plane. This is really built for data center scale.

So let's dive a little deeper into some of these systems and how they work, starting with the Fungible Storage Cluster. We're now on our fourth major software release of this product, which is a scale-out, all-flash block storage system. Starting with a single node, we can see up to 13 million IOPS; that's read performance, with very, very low latency and very good throughput. As the cluster grows, that performance increases linearly. Because we've implemented a lot of these data path accelerators in the silicon, in the DPU, we're able to provide all the enterprise storage features you'd expect: compression, encryption, erasure coding, clones, thin provisioning. And we can do that on a per-volume basis. This is really designed to run multiple workloads in a cloud environment, where it's important to avoid the noisy neighbor, so we've also got quality of service. Something unique about our approach to storage is that for each and every volume, you can pick and choose which attributes to apply. If you're running a database and you want it protected against two faults, you can use erasure coding for that. If you've got a different workload that you need to encrypt, you can encrypt that volume. Each volume's parameters can be tuned individually.

Here's a look at the hardware under the hood; you can roughly divide it horizontally. We've got two DPUs in the box, and those are the only processors in the box, so we're directly terminating IP on the left. We've got 600 gigabits of network connectivity per DPU in this chassis, and each DPU directly controls 12 NVMe drives.

Again, this is really designed to be a multi-workload storage solution. Whether you're running bare-metal instances to power your databases, SQL or NoSQL, those workloads can run right next to virtualization workloads. We have support for OpenStack via a Cinder driver, and support for Kubernetes via a Container Storage Interface plug-in. All of this was designed API-first, so every function and capability of the system is available through the API. We also have a GUI to interact with the system.

Let's talk a little more about erasure coding and the benefits you can get from it. Obviously, we didn't invent erasure coding, but historically it has come at a pretty high price in terms of the computation required; the math behind the erasure coding algorithms is complex. We've built a specific accelerator for erasure coding, and we actually apply erasure coding in the storage cluster on a per-node basis. In this example, you start with six nodes in your cluster, and we can apply 4+2 erasure coding: four data blocks for every two parity blocks. What that gives you is 50% overhead to protect against two failures in the storage system. Compare that to a typical double-replicated, or RF2, volume, where the requirement is 100% overhead. So you're saving raw capacity, and in aggregate you end up with more usable data.
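To make those overhead numbers concrete, here's a minimal sketch of the arithmetic. The function name is mine, for illustration only; it isn't part of any Fungible software.

```python
def ec_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Capacity overhead of an erasure-coding scheme: the extra space
    spent on parity, as a fraction of the usable (data) capacity."""
    return parity_blocks / data_blocks

# 4+2 erasure coding: two parity blocks per four data blocks,
# surviving two failures at 50% overhead.
print(f"4+2 EC: {ec_overhead(4, 2):.0%} overhead")  # 50%

# Double replication (RF2): one full extra copy of every block,
# which works out to 100% overhead.
print(f"RF2:    {ec_overhead(1, 1):.0%} overhead")  # 100%
```

The larger layouts that come up next, 8+2 and 14+2, follow the same formula.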
Now as your cluster grows, say you add four more nodes so you've got 10 nodes in the cluster, we can apply 8+2 erasure coding. That further reduces the overhead for durability: you're now spending only 25% extra for protection. Grow the cluster again to 16 nodes and apply 14+2 erasure coding, and that overhead shrinks to about 14%. So you can see that as you scale out your storage cluster, your cost per usable terabyte effectively goes down.

Just a quick note about our involvement in the community. We're a relatively new member of the OpenInfra Foundation, and we're proud to be a silver member. We do have a Cinder driver for this product; we support the Victoria and Xena releases, and we're working to upstream our driver right now. Hopefully we'll make the target for Zed. We also have a CSI plug-in, and of course everything I mentioned is also available through our RESTful API.

One other option when deploying our Fungible Storage Cluster is to use our DPU on a PCIe card as a hardware storage initiator. This isn't a requirement for using our storage; any modern Linux already has an NVMe/TCP driver natively. But there are advantages to having a DPU in the host as well. The biggest is that we offload the NVMe/TCP storage processing entirely from the host and present what looks like a local NVMe device to that server. That's a big win in terms of freed-up CPU cores: it lets the host run the application workloads it was designed for. Another advantage, now that we have DPUs on both ends of the initiator-target connection, is that we can apply additional capabilities to that connection. We can encrypt over the wire as well as on the storage, we can compress that traffic, and we can even mount boot volumes and boot directly from the storage cluster. This really helps enable the bigger data center composability vision: a server just needs to be a combination of CPU and memory, and everything else it might need, resource-wise, can be composed into it. In this example that includes the storage, even what looks like local disk, and we're seeing performance equivalent to local NVMe even though it's accessed over Ethernet.

I gave this example yesterday as well; it shows the capability of the Fungible Storage Cluster on its own, but also the value of having the DPU in the host. We participated with a couple of research institutions over the last year in some testing. The first test was with Nikhef, a research institution here in Europe. From a single host, they were able to get about six and a half million IOPS attaching to a single Fungible storage node. The San Diego Supercomputer Center repeated that test, but added our hardware initiator, and they were able to get 10 million IOPS. I think the most telling piece, though, was that they reduced CPU utilization by almost 75%. In the Nikhef example, while they hit that pretty impressive benchmark of six and a half million IOPS, the CPU was 100% utilized; in a real-world scenario, that's not very useful, because the whole computer is consumed processing storage traffic rather than running applications. In San Diego's case, you've obviously got a lot of CPU left to do real work. You're not typically going to see 10 million IOPS from a single box, or need that kind of performance, but I think it does illustrate the value of offloading this traffic onto the DPU.
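For reference, the software initiator path mentioned above, with no DPU in the host, is just the stock Linux NVMe/TCP driver plus the standard nvme-cli tool. A minimal sketch of attaching a remote volume that way might look like the following; the target address, port, and NQN are placeholder values, not real Fungible endpoints.

```python
import subprocess

# Load the in-box NVMe/TCP initiator driver (present in any modern kernel).
subprocess.run(["modprobe", "nvme-tcp"], check=True)

# Connect to a remote NVMe/TCP target with nvme-cli.
# The address, port, and NQN below are placeholders for illustration.
subprocess.run(
    [
        "nvme", "connect",
        "-t", "tcp",                        # transport: NVMe over TCP
        "-a", "192.0.2.10",                 # storage target IP (example)
        "-s", "4420",                       # default NVMe/TCP port
        "-n", "nqn.2022-06.example:vol0",   # volume NQN (example)
    ],
    check=True,
)

# The remote volume now appears as an ordinary local block device,
# e.g. /dev/nvme1n1, alongside any local drives.
subprocess.run(["nvme", "list"], check=True)
```

With the hardware initiator, none of this runs on the host CPU: the DPU terminates NVMe/TCP itself and the server simply sees a local NVMe device.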
All right, let's talk about our next offering, Fungible GPU Connect, which we just launched in April. There are a number of reasons why you might want to consolidate GPUs into a pool and attach them to hosts only when you need them. One, GPUs are an expensive resource. Two, they tend to be fairly underutilized in a data center environment: you might need one for machine learning training, but when that training job is done, it sits idle. So the opportunity to drive better utilization of a GPU means you're saving money. Another reason is that in a machine learning workflow, for example, you may need a very different CPU-to-GPU ratio during training than during inference. The ability to attach the right kind of GPU to the right host during training, detach it, and then attach maybe a different GPU for inference can add a lot of value. This is particularly valuable in environments that aren't running just one ML workload; maybe they've gone through their proof of concept, everybody's bought into machine learning, and all of a sudden there are 10 different business units all doing machine learning at the same time. Managing the infrastructure for that can be pretty daunting, not to mention expensive.

So here's what the solution from Fungible looks like. We have a chassis on the right, which we call the FX-108. It's basically a bunch of PCIe slots in a 4U box, and that box comes with our DPUs; the user then installs their own GPU cards. We connect upstream to standard 100-gigabit Ethernet switches, so there's no configuration required, nothing special at the switching layer. In the host server, we install a DPU in the form of our PCIe card, which we call the FC200; that's a 200-gigabit card. And then we've got Fungible Composer, our control plane, which we can run in a VM.

Let me give a little example of what this might look like in an environment. We've got several servers on the left, each with different GPU requirements, and a GPU chassis populated with eight GPUs. We start the first AI/ML workload, and it needs four GPUs, so we attach those to its host, and the workload runs. Now a virtualization workload comes along that also requires GPUs; it needs two, so we attach those to that host. Then our first machine learning workload no longer needs four GPUs, only three; maybe it's shifted from training to inference. So it detaches one of those GPUs, which becomes available in the pool. And now our second machine learning workload spins up, and it's got three GPUs available. You start to see how, over time, you can drive utilization of those eight GPUs across multiple workloads, even as the demands change.

Just a couple of quick pictures of our user interface for this. It's a pretty simple workflow: we've got a view of the hosts and a view of the available GPUs, cataloged by GPU type, what each is attached to, and all the basic metrics you'd expect. From there it's simply attach and detach from the available hosts. On this screen, you can see a selection box for which GPUs you want to attach, and here's a view inside the GPU chassis showing which slot is populated with which card. That can help from an operational perspective if you want to add a new GPU card or see what's available in inventory.
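Since everything in the UI is also exposed through an API, as comes up in the Q&A later, that attach/detach walkthrough could in principle be automated. Here's a rough sketch of what that might look like; the base URL, endpoint paths, and field names are my own illustrative assumptions, not the actual Fungible Composer API.

```python
import requests

# Illustrative sketch of the pooled-GPU attach/detach flow described above.
# COMPOSER, the endpoint paths, and the JSON field names are assumptions
# made up for this example.
COMPOSER = "https://composer.example.local/api/v1"

def free_gpus(session: requests.Session) -> list[dict]:
    """List GPUs in the pool that are not attached to any host."""
    gpus = session.get(f"{COMPOSER}/gpus").json()
    return [g for g in gpus if g.get("attached_host") is None]

def attach(session: requests.Session, host_id: str, count: int) -> None:
    """Attach `count` free GPUs from the pool to one host."""
    for gpu in free_gpus(session)[:count]:
        session.post(f"{COMPOSER}/hosts/{host_id}/gpus",
                     json={"gpu_id": gpu["id"]}).raise_for_status()

def detach(session: requests.Session, host_id: str, gpu_id: str) -> None:
    """Return one GPU to the pool so another workload can claim it."""
    session.delete(f"{COMPOSER}/hosts/{host_id}/gpus/{gpu_id}").raise_for_status()

with requests.Session() as s:
    attach(s, "ml-train-01", 4)         # first ML workload takes four GPUs
    attach(s, "virt-01", 2)             # virtualization host takes two
    detach(s, "ml-train-01", "gpu-3")   # training shifts to inference, frees one
    attach(s, "ml-infer-01", 3)         # second ML workload claims three
```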
This product is available today. We're particularly interested in the service provider community, because we think there's a big opportunity to add GPUs to hosts from either a bare-metal or a virtualization perspective. We're seeing a big increase in workloads that leverage GPUs, so we're excited about this offering.

I'll probably repeat a little from yesterday, but here are some numbers from our testing. The idea of remotely attaching GPUs has been tried in the past, and a typical question we get is: what about performance? How do you handle this over Ethernet? Obviously, PCIe is pretty latency-sensitive. So we ran benchmarks as we were developing this product, using a variety of benchmarks from the machine learning world, in this case ResNet. This is an example where we compare the performance of a server with a GPU directly installed to one using our solution, and across a number of batch sizes in ResNet we see essentially equivalent performance, whether the GPU is directly attached or attached over Ethernet. A second test, with a second algorithm, this time with two GPUs: you can start to see a little difference in the bar graphs, but it's really 1% to 2%, and you could argue that's within the margin of error. So there's certainly a small cost to traversing Ethernet, but a 1% or 2% trade-off in performance versus the ability to drive utilization can have a lot of value.

And really, to wrap up: data, and the applications that are driving the growth of data and consuming it more and more in real time, are driving the demands on infrastructure, and infrastructure clearly needs to respond. DPUs are going to be a major part of modern servers and data center infrastructure, particularly at scale, and leveraging systems that embed DPUs makes this technology a lot easier to consume. Our approach at Fungible is to make sure everything is accessible over Ethernet; we believe a converged Ethernet network is where the industry is going, and hopefully that will make these technologies easier to consume.

And I think that's the end of my slides. We've got about five minutes left for Q&A, and I'm told we can go a couple of minutes over if we need to. If people have questions, I'm happy to answer. We do have microphones so everyone can hear you. I don't see a lot of questions... here's one, thank you, getting us started.

OK, maybe one question: are you doing GPU over Ethernet? I haven't heard about this, just curious. Storage over Ethernet is common, but GPU over Ethernet I haven't heard of. Maybe you are doing it, I don't know. Do you have a solution for sharing GPUs over Ethernet?

Sharing the GPU over Ethernet? Yes, we are sharing the GPU over Ethernet. What we've done, essentially, is virtualize the PCIe switch and carry that over Ethernet. From the host's perspective, it sees a local PCIe switch and a locally attached GPU.

Great, thank you.

So if you speak of GPUs, are you only working with NVIDIA? You were speaking of the A10, of Ampere. Is it only NVIDIA, or are you supporting other GPUs?

Great question.
So today, in the first release of the product, we've qualified the Ampere line of NVIDIA GPUs, from the A2 up to the A100. We can really support any PCIe device in the box, but we don't want to claim that that's fully tested yet. We are testing additional GPU vendors, and additional generations of NVIDIA GPUs as well, and our expectation is that the qualified matrix will grow over time.

And is it supported by NVIDIA? They have a list of certified hardware. Is it certified by NVIDIA, or not yet?

Not yet. Actually, we don't manufacture the chassis for the GPUs ourselves; the chassis it ships with supports NVIDIA, and there is a certification at the hardware layer. So in that sense, the NVIDIA GPUs are certified by NVIDIA to work with that hardware platform. But we are working through that certification process as well.

Thank you.

Do you have MIG support for the GPUs you're passing through? You might want to leverage that. And do you also somehow license these GPUs? Any solution for that?

So if I heard you right, you were asking about MIG support. In the product, we haven't certified MIG support yet. We have tested it, and it does work, but again, there's additional work to make sure we're certified for that use. And it's worth noting that in this solution we don't provide the GPUs. We can as an option, but typically most customers will have their own GPUs that they want to use, so the licensing for the GPUs stays with the customer.

You provide scheduling of GPUs, as you showed. But if you use MIG, then you can't revoke just part of a GPU, I guess; that would have to be controlled in software or something like that, right?

Yeah. Right now, in our user interface, the unit of allocation is a single GPU. So the idea of a fractional GPU, or the other version of that question, NVLink, where GPUs are connected directly to each other, is something we have tested in the lab, but there's additional work to make sure all the edge cases are covered.

One more question: you showed the user interface. Is there any CLI or API you can use for automation?

We do have an API. Again, everything in the UI is available through the API for this product.

Thank you. Good questions. Anyone else? All right, well, thanks for your time. We're right at time now, so thanks for keeping me on time. I appreciate you coming, and please enjoy the rest of the day and the rest of the conference. Thank you.