Let's start. A lot of the sessions here are about Kubernetes and databases; I'll talk about something a little lower level, which you all use, and that is storage, and the method of how you use that storage. So that's, I guess, a very cool picture of me. When I'm not spending my time on databases or storage performance, I'm usually climbing mountains. I've been in plenty of startups, Greenplum, ScaleIO, I've been at Red Hat as well, and now I'm with a company called Lightbits.

So, anyone here know what NVMe over TCP is? Wow, one dude, beer's on me tonight. And I guess, are you using NVMe over TCP? Ah, all right, that's way too much to ask. So I'm going to go through NVMe over TCP, why we need it, and then some performance benchmarks I did with Postgres on Kubernetes, on OpenShift actually, in AWS — although NVMe over TCP is of course not just for the cloud.

So 12 years ago, NVMe came into the world, if you want to put it that way. It's basically a different, better method to access flash devices, PCIe SSDs. I don't know how many people remember or dig down into servers: we used to have SATA and SAS, and NVMe came in and created a different queuing mechanism and a different set of commands to interact with these flash-based PCIe devices. Five years later, a few people came up with the idea of accessing these fast storage devices, NVMe SSDs, over a network. That's where NVMe over Fabrics came to life. Throughout the first few years there were two initial approaches: you could use Fibre Channel, so NVMe over FC, or you could use RDMA, remote direct memory access, either over InfiniBand, RoCE, or iWARP. These are different technologies for accessing NVMe over a fabric, so a client or a server can reach storage sitting on a different server through some sort of fabric.

In 2019, another protocol and spec was added on top of that, and that is NVMe over TCP. Also in 2019 — I was not working at Lightbits back then — a taller, younger, and much smarter version of me, another guy named Sagi, wrote the spec and the first kernel module for NVMe over TCP. And Lightbits, as a startup back then, built the first storage solution that uses NVMe over TCP. NVMe/TCP has been in the Linux kernel since version 5.0, which means in most of the enterprise Linux versions that you use, whether it's RHEL or Ubuntu or SUSE, it's already in. You don't need to worry about having any kind of special driver installed. The same goes for any thin Linux operating system you run Kubernetes on top of; again, it's all going to be inside.

Today — yes, I'm from Lightbits, and our software-defined solution of course uses NVMe over TCP — but throughout the last three years we've actually seen all the major storage vendors start to move to NVMe over TCP, mainly because of the performance advantage it can provide. It really depends on whether that storage solution was initially written for NVMe over TCP or not, but it does bring a lot of advantages.

I'm not going to go too deep — this is, you know, geek level number three in here — into how everything works in the stack, because that would take two hours.
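Just to make the "it's already in the kernel" point concrete, this is roughly what attaching a remote NVMe/TCP volume looks like from a plain Linux host. The address, port, and NQN below are placeholders, and exact flags can vary a bit between nvme-cli versions, so treat it as a sketch rather than a recipe:

```bash
# Load the in-box NVMe/TCP host module (no vendor driver needed)
sudo modprobe nvme_tcp

# Ask the target which subsystems it exposes (address and port are placeholders)
sudo nvme discover -t tcp -a 10.0.0.10 -s 4420

# Connect to one subsystem; the NQN here is purely illustrative
sudo nvme connect -t tcp -a 10.0.0.10 -s 4420 \
  -n nqn.2023-01.com.example:demo-subsystem

# The remote namespace now shows up as an ordinary local NVMe block device
sudo nvme list
lsblk
```

Nothing vendor-specific is involved up to this point; the module ships with the distribution kernels mentioned above.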
But like a lot of things, you have a physical layer, your fabric. Above that is your transport, which is TCP in all the cases I spoke about, whether it's NVMe over TCP or the others. And the layers above that handle the encapsulation of the packets, how things are sent, queuing, and things like that.

Just as a recap in terms of the transport: if we're using PCI Express for NVMe, meaning we have an SSD device connected directly to our server, we address its memory directly. If we're using something like remote DMA, we're basically combining memory access with some sort of messaging mechanism, because part of it is local and part of it is remote. And when we use something like NVMe over FC or NVMe over TCP, everything is done via messages, because everything is remote from your client. Just to be clear, at the end of the day what your server or operating system sees is a regular NVMe device. The Linux operating system has, to put it simply, no idea that the /dev/nvme-something it is using is actually a remote device and not a physical device connected to the server.

So why do we need NVMe over TCP, and why was it started? NVMe direct-attach, a PCIe SSD device, is awesome. It's a very good way to communicate with flash-based devices, but it's limited, because it's local to your server, node, instance, whatever you want to call it. You cannot reach it remotely. RDMA started the journey of getting there over the network, as did NVMe over FC, but it's not so much that they are limited; it's that you need specialized network cards and, in some cases, specialized switches. It's not something you can easily move from one location to another, one data center to another, or even between cloud providers; there's a lot of hardware work that needs to be done in order to run it. And that's where NVMe over TCP came in. The whole concept and idea behind it was: you have TCP anywhere in the world, in any data center, whether it's a public cloud or your private cloud, so why not use that infrastructure, which is already there, to deliver storage to your clients or your workloads?

A few points about NVMe over TCP. As I just mentioned, anywhere you have TCP, you can use it. And as I also mentioned, there are other vendors besides Lightbits handling NVMe over TCP storage today. If you look at how instances are built in the cloud, especially at the big three, most instances have higher network bandwidth than the storage bandwidth the cloud provider allows. That is an advantage, because you are paying for the instance anyhow, but with NVMe over TCP you actually have a method to achieve higher storage bandwidth.

All right, so let's talk about what I did for the performance comparison. I was asked to talk here maybe a week ago, so I had to do things very, very quickly, and I'll actually start from the bottom. Sherlock is a performance benchmark tool that I wrote while I was still at Red Hat. It's a very simple approach to testing storage with all sorts of databases in Kubernetes, so any storage that you want to use in Kubernetes. It has code for MySQL, Postgres, Mongo, and SQL Server, because that's what I had back then at Red Hat. It uses sysbench for MySQL and Postgres, you can also do pgbench for Postgres, YCSB for MongoDB, and HammerDB for SQL Server.
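For reference, a Sherlock-style sysbench run against Postgres looks roughly like this. These are typical sysbench 1.0 flags rather than Sherlock's exact invocation, and the host, credentials, and table sizes are placeholders:

```bash
# Populate the test tables (sizes are illustrative)
sysbench oltp_read_write --db-driver=pgsql \
  --pgsql-host=postgres.example.svc --pgsql-port=5432 \
  --pgsql-user=sbtest --pgsql-password=sbtest --pgsql-db=sbtest \
  --tables=16 --table-size=1000000 prepare

# Run a 30-minute mixed read/write workload, reporting every minute
sysbench oltp_read_write --db-driver=pgsql \
  --pgsql-host=postgres.example.svc --pgsql-port=5432 \
  --pgsql-user=sbtest --pgsql-password=sbtest --pgsql-db=sbtest \
  --tables=16 --table-size=1000000 \
  --threads=32 --time=1800 --report-interval=60 run
```

The thread count is the knob that gets swept in the results discussed below.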
So because I did not have a lot of time, I started with Sherlock, and I started with sysbench because it works very nicely with Postgres. Kubernetes, why? Well, because it's KubeCon, and I used OpenShift. And Postgres is actually my favorite open-source database, so that's why Postgres was chosen.

So let's see how all of this looks. As I said, I did this on AWS. On the right side you basically have three instances — actually quite small ones from a Lightbits perspective — and what's unique about them is that in AWS they are instance-store instances: they have direct-attached NVMe devices. Lightbits runs on these three instances. As a very short description of Lightbits, because it's software-defined storage, there's nothing special about these instances. They're running, I think, RHEL 8, and it's just a bunch of RPMs that we installed. We definitely consume all the resources on these instances in terms of network and CPU, so you don't run anything else on them. And through our software we create one large logical entity that includes all the attached NVMe devices in those instances; for this instance type, there are two NVMe devices attached to each. Again, this is all over TCP, so there's no special magic, nothing like that. All I did was install an OpenShift cluster on AWS, and then install, through the AWS Marketplace, the Lightbits cluster into the same VPC. So it's all sitting nicely on the same VPC, on the same private subnet. And now all I need is some method, which is Sherlock, to deploy whatever databases I want to test.

I had, I think, only a single master and only a single worker. The worker is also not a very big one — an r6i.8xlarge, I think. And while the slide shows two Postgres pods, I think I ended up with three Postgres pods on that instance. For each of these Postgres pods there is a matching pod from Sherlock that runs sysbench, or a different workload if you're testing something else. And there's a tiny pod that runs on each of these workers — here, this single worker — that also measures performance, using mpstat, iostat, vmstat, and things like that. This is basically how it looks.

Actually, let me go back, because I forgot to mention what I wanted to compare against. In AWS — again, I didn't have a lot of time — that is the highest-end storage AWS provides, which is io2 Block Express. This is basically the high-end storage: when you need a lot of IOPS, you go to io2, and then io2 Block Express, which you can also consume with specific instance types in AWS. That's what I compared against.

So there are a lot of numbers here. As I said, I started with these three databases running on a single worker node, each of them with a dedicated sysbench pod, and I played with the number of threads stressing the Postgres databases. What you're seeing is an average of six iterations of 30 minutes each, totaling all the transactions sysbench was able to complete in that 30-minute window. And you can see the gap between Lightbits and io2 Block Express. This is not me trying to say io2 Block Express is not an awesome technology — it is. It's just that in the cloud, what you can achieve with the provider's native storage is limited, and over the network you can achieve a lot more.

Just to conclude this part — and by the way, on the right side there, one thing to always remember if you're playing with Sherlock (there are instructions in the Git repo): everything Sherlock does is meant to test storage, not CPU and not memory. That's why there is deliberately a very low cache ratio for all these pods, to force all the I/O that Postgres, or any database you are testing, is doing to go directly to the storage layer. In real-life scenarios you actually want your Postgres to have more caching, but this is how you measure the storage under a database — the cases where someone runs a big query, or you have 100 people instead of 10 suddenly sending queries, and there's a lot more I/O hitting the storage because there's not enough cache. With this method, with very little cache in the pods, you basically force things to go directly to the storage.
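A minimal sketch of that idea — a Postgres pod with a small memory limit and a small shared_buffers, so that most reads and writes actually reach the storage layer. The image, values, and claim name here are illustrative, not what Sherlock ships:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pg-storage-test
spec:
  containers:
  - name: postgres
    image: postgres:16
    # Keep the database cache deliberately small so I/O is forced to storage
    args: ["-c", "shared_buffers=64MB", "-c", "effective_cache_size=128MB"]
    env:
    - name: POSTGRES_PASSWORD
      value: sbtest
    - name: PGDATA
      value: /var/lib/postgresql/data/pgdata   # keep data in a subdir of the mounted volume
    resources:
      limits:
        memory: 512Mi      # little room for the OS page cache either
        cpu: "2"
    volumeMounts:
    - name: pgdata
      mountPath: /var/lib/postgresql/data
  volumes:
  - name: pgdata
    persistentVolumeClaim:
      claimName: pgdata    # an existing PVC on the storage class under test
EOF
```

In production you would do the opposite and give Postgres generous caching; this configuration exists only to measure the storage underneath.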
And just one last thing. It's a little bit of a confusing slide, but these are the performance numbers and actually the cost numbers from when I ran these tests. What you're seeing in blue is the transactions — the average transactions per 30-minute run — and in red is how much it will cost you to run this on AWS. Now, this can be completely different in other clouds. And again, NVMe over TCP is not cloud-specific; Lightbits actually started as an on-prem solution and added the cloud later. But it's something to think about. You're running databases — I'm sure most of you are. Once you get past a certain threshold, when the lower-end SSD offerings are not a good fit, you want to look into NVMe over TCP; it can not only boost your performance but also save you money. And I also have that thing up for feedback and everything. And if you have any questions...
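For anyone who wants to reproduce the comparison, the two storage classes involved look roughly like this. The Lightbits provisioner name and all the parameters are illustrative — check the respective CSI driver documentation for the real values:

```bash
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nvme-tcp-sc
provisioner: csi.lightbitslabs.com    # illustrative; use your vendor's CSI provisioner
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: io2-sc
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iops: "64000"                       # example value, not the one used in these tests
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pgdata
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: nvme-tcp-sc       # switch to io2-sc to rerun the same test on EBS
  resources:
    requests:
      storage: 200Gi
EOF
```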
Anybody have questions?

Hi, thank you for that. I'm wondering about the other two, RDMA and Fibre Channel — are those available in the cloud like Lightbits is? And if they are, what does the performance look like?

So again, I took this example and used AWS just because I had access and it was fast to do. iWARP — almost nobody is using it. NVMe over Fibre Channel is there; I don't think it's very successful. And with RDMA, I think you might be able to find some very specific cases. For example — we're going outside of the open-source world here — if you're doing something like Exadata storage in Oracle Cloud, that is RDMA; that's basically how Exadata works. Performance-wise, I'll be honest: there is slightly better performance if you start to use something like RoCE, for example. But the one guarantee is that you're going to need specialized hardware in order to use it.

For your instances on AWS, you took those... I can't remember which ones, r6in?

Oh yeah, i4i, yes.

And you put Lightbits on there to get that NVMe over TCP, right? So in an on-premises solution, let's say you had a Dell Unity or a NetApp device — you mentioned earlier that a lot of those already support NVMe over TCP — or is that something you need to bring in and put on top? I mean, if you have other hardware, just regular commodity hardware, can you do it?

No — the whole concept is that it's software-only. On the Lightbits side, what you saw here, there's really nothing special that... I mean, there are a lot of special things running over there — by the way, we use etcd on all these instances for a lot of things, not in the data path, but we use it — but the idea is that you're not going to need any special driver, because it's already part of the Linux kernel. Of course, what is missing in that slide is CSI. We have CSI drivers, and if it's, say, a Dell solution that uses NVMe over TCP, you're still going to need their CSI driver. NVMe over TCP will just be the transport over which the storage you provision through CSI is accessed.

Yeah, so cost — is it per DB or per transaction?

I'm sorry, I didn't hear that.

Maybe I misunderstood. The cost you're showing in the slide — is it per DB, per instance, or per transaction?

Per DB — you mean like the connection to the storage itself?

Right.

The connection to the storage itself is irrelevant to the application or the database. The default queuing mechanism will basically create a queue per core in your server, whether it's a physical server or an instance in the cloud. It has nothing to do with whoever is using NVMe over TCP. You can play with the queuing mechanism if you load the NVMe over TCP driver with different settings, but by default it creates a queue per core.

Yeah, thanks, Sagi. I was going to ask about the CSI driver, but you already answered that. I actually have two questions. One, whether you support volume snapshots. And the second is what combinations we can use — for example RAID, right? Whether we can create RAID arrays with the underlying storage. Especially for on-premises installations, I'm thinking.

Yeah. So again, NVMe over TCP is just the transport layer. With Lightbits, if you're running Lightbits on-prem, you can have us do erasure coding at the instance level. So you get protection not only from the number of replicas — in Lightbits you can choose whether you have one, two, or three replicas for every volume, it's really up to you — but you can also add an erasure-coding layer at the node level to protect against the failure of a device. For other solutions it really depends. And I forgot your first question — snapshots. Again, NVMe over TCP is just a transport. In Lightbits, everything is thin-provisioned, and we have snapshots and clones; all of these things are part of our software.

Hi, I was just curious, how does NVMe over TCP compare with iSCSI?

In terms of performance?

Yeah.

Between 10 and 20 times faster, if not more — and I'm being honest and trying not to exaggerate. To be fair, iSCSI is a protocol that has existed for many years; it was written for a very specific type of storage that was not very fast at the time it was written. So things just evolve.
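On the queue-per-core point: the host driver sets that up automatically, but nvme-cli lets you override it at connect time if you want to experiment with fewer queues. The address and NQN are placeholders again, and flag names can vary slightly across nvme-cli versions:

```bash
# Default is one I/O queue per CPU core; cap it explicitly if you want to experiment
sudo nvme connect -t tcp -a 10.0.0.10 -s 4420 \
  -n nqn.2023-01.com.example:demo-subsystem \
  --nr-io-queues=8 --queue-size=128

# The kernel logs how many I/O queues it actually created for the controller
dmesg | grep -i 'nvme.*queues'
```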
Do you see CXL changing anything, even though it does require specialized hardware? I mean, PCIe as a network — maybe more performant?

Definitely. There have been, for a few years now, attempts to run PCIe as a network, if you want to call it that. But at the end of the day, like I showed here, we had RDMA, which is mainly InfiniBand, and we had iWARP, and we had RoCE, and they did not become a big success because they required specialized hardware. And now, in the last few years, when people are moving a lot of stuff to the cloud, or at least want the ability to move back and forth, you want the components of your infrastructure to be very common; otherwise, you start to limit yourself.

That's about all we have time for. Thank you, Sagi.

Thank you.