Hi, everyone. Welcome to our demo session about latency and why it's the most important metric of your cloud. How many of you know who StorPool is? StorPool. OK, very few, right? So first, my role in the company: I'm the chief of product, one of the founders, and a cloud architect, meaning I help a lot of our customers, who are cloud service providers, build, design, and deploy cloud services. My background is packet processing, millions of packets per second on standard servers. And even before that, I did competitive programming, which is very interesting stuff if any of you know it.

What we do is a very fast and efficient software-defined storage system. We deliver it as software and services, on customers' servers in customers' data centers. We started a long time ago, in my youth, and it's a clean-slate design. It's not based on anything; it's our own design from scratch. That makes it quite a bit more capable, and it has a lot of capabilities that are different from the storage systems you may be used to. Most of our deployments are with KVM. Even though we do have some deployments with VMware, Xen, and Hyper-V, KVM is the most popular hypervisor among our customers. And we have deep integrations into OpenStack, CloudStack, OpenNebula, OnApp, and also Kubernetes. So we play a lot in this Linux/KVM stack, right?

So, back to our topic for today. Every transaction processing system, whether it's a load balancer, a web server, an application server, a database, or a storage system, has a performance characteristic that goes something like this. You have operations per second and average latency per operation. If you increase the operations per second, you also get slightly higher latency. For example, one task on four cores takes one second; two tasks, one second; four tasks, still one second. So this is still one second per completed task, right? Where this model breaks is when you put six tasks or more on four cores. Then it's not six tasks at one second per task, it's six tasks at 1.5 seconds per task. You didn't do more work; you just hit what we call a saturation point.

These systems tend to have two very different modes of operation. One we call the elastic mode: you get more demand, you get more work done. And then the congested mode: you get more demand, you don't get more work done. Where you want to run your systems to get the best user experience, pages opening fast, et cetera, is in the elastic part of the curve. So imagine you have all kinds of different systems: storage, databases, applications, message queues. You want to run them in that part of the curve. Well, you would run some of your systems higher up the curve, because it's more cost-efficient. But the problem with running them close to the knee, close to the saturation point, is that the moment you go over the saturation point, it's only pain. Only pain means operations piling up, stuff piling up in queues. And you don't get just slightly higher latency; you get effectively unbounded latency, really long queues that the system is trying to push through slowly. So this characteristic of the system, how high the saturation point is, you can call that the throughput of the system, in some metric.
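To make the knee concrete, here is a minimal sketch of that arithmetic, assuming perfect time-sharing of one-second tasks on four cores:

    # Average latency per task: flat while tasks <= cores (elastic mode),
    # then growing linearly with offered load (congested mode).
    cores=4
    for tasks in 1 2 4 6 8; do
      if [ "$tasks" -le "$cores" ]; then
        lat="1.00"                                   # elastic: one core per task
      else
        lat=$(echo "scale=2; $tasks / $cores" | bc)  # congested: tasks share cores
      fi
      echo "$tasks tasks -> $lat s per task"
    done
    # 6 tasks -> 1.50 s per task: the saturation point from the example above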
So, most system vendors, storage vendors, database vendors, like to brag about how many operations per second their system can do. Most cloud operators, when they sell you a virtual disk, tell you this virtual disk has this many IOPS. But in reality, for the performance of your application, this doesn't matter at all. What matters is how that system behaves in the part of the curve where you're going to use it. It doesn't matter how high the line goes; it only matters how low the latency is at the number of operations per second you're going to put on that system. Well, with one exception: the only reason you care about system throughput is that you don't want to get into the situation of exceeding it. So a system that can do a million IOPS helps you avoid going over the knee.

So, the demo I've prepared: I have two different storage systems, two different virtual disks. Both of them do 20,000 IOPS, meaning if I take fio and measure how many IOPS each of these volumes does, both of them will come back and say 20,000 IOPS. So both of them could be sold, say, in one cloud or another, with the same specification of 20,000 IOPS. And I'll show that application performance will be very different on the two volumes, even though they're both 20,000 IOPS volumes.

What we have is one virtual machine with 8 vCPUs and 16 GB of RAM, a virtual disk with 20,000 IOPS, and a PostgreSQL database which is four times the memory: a 64 GB database. It's a fairly small database, to be honest. In large web applications, these databases would be significantly larger than that. So it's not a very extreme case.

Can you guys see this, or should I make it bigger? Who doesn't see it? You don't see it? OK, I'll try to make it slightly bigger then. Just one second. There we go.

All right, so how do we show it's 20,000 IOPS? We're going to run this test on it. And this is not the latency test, sorry. This is random reads and writes, 50% reads, 50% writes, with block size 4K and a large queue depth, so a lot of parallel operations, on a file which is on that virtual disk. And it didn't even get to 20,000 IOPS the second time; that's the demo effect. So we can show that the volume does 20,000 IOPS: 10,000 reads and 10,000 writes, 20,000 total.

And pgbench is a benchmarking tool provided with PostgreSQL. The way we use it here completely follows the test methodology described by the PostgreSQL project. It's not some random benchmark that we wrote; it's an established benchmark. What we do here is run 16 clients in parallel on 8 threads. Why 8 threads? Because the VM has 8 vCPUs. We'll see progress every second, and the total benchmark time is 10 seconds. And what we can see here is the average transactions per second for the whole test with this many clients in parallel. Let me bring that up a bit: 1,400 transactions per second on this volume.

So what this gives us: the storage volume underneath has 20,000 IOPS, and the database we put on top does 1,400 transactions per second. So if we ask that database to do, say, 1,500 transactions per second, what happens? With pgbench's rate-limit flag... well, that didn't push it high enough, so let's try something slightly higher. There: you get a queue of operations piling up.
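For reference, the two tests in this demo look roughly like the following. The exact invocations weren't on the slide, so this is a sketch: the file path and database name (/mnt/vdisk/testfile, benchdb) are placeholders, and the flags match the parameters described above.

    # Verify the volume's IOPS ceiling: 4K random I/O, 50/50 read/write,
    # large queue depth (lots of parallel operations) on a file on the disk.
    fio --name=iops-test --filename=/mnt/vdisk/testfile --size=10G \
        --rw=randrw --rwmixread=50 --bs=4k --iodepth=64 \
        --ioengine=libaio --direct=1 --time_based --runtime=30

    # pgbench: 16 clients on 8 threads (one per vCPU), progress every
    # second (-P 1), 10-second run; reports average transactions per second.
    pgbench -c 16 -j 8 -P 1 -T 10 benchdb

    # Same run, rate-limited above what the volume can sustain; -R is
    # pgbench's target rate in transactions per second, and the progress
    # lines show the lag growing as operations pile up.
    pgbench -c 16 -j 8 -P 1 -T 10 -R 1500 benchdb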
And now database transactions are not taking a few milliseconds; they're taking four seconds and five seconds. And a few minutes later, they would be taking a lot more, because we are asking this database to do more than what it can do.

Now, if we switch to another volume: what this does is switch the storage underneath and restart the database on the faster storage. That storage is also, let's try a second time, a 20,000 IOPS volume: 10,000 reads and 10,000 writes. And that database, without the rate limit, does about 3,000 transactions per second.

So what we saw here is two virtual disks, both of them with a 20,000 IOPS limit, but on one of them the database, with the same number of parallel clients, can do twice as many transactions per second. Which means that when you go to a public cloud operator and you buy a virtual disk with 10,000 IOPS provisioned on it, that doesn't tell you anything about the performance of the application you're going to run on top. And that's the state of the industry, and it's why I'm doing this talk: when analyzing performance, we need to talk about the elastic part of the curve. Peak IOPS is interesting, but it's not where we run our applications. And the same thing in reverse: if we do rate 1,750 on a database that can do 3,000 transactions per second, asking it for 1,750 transactions per second, you obviously don't get a pile-up of operations. It's pretty happy processing these.

All right, so moving on. We are in the process of collecting interesting application benchmarks, because we think it's not well established that services which look identical on the specification sheet can be so different from each other. What we did, and it's still in progress, but we've started this process, is get virtual machines from a bunch of different public cloud operators and measure 4-kilobyte random reads and writes with queue depth 1 on each of them. On DigitalOcean, we got about two milliseconds; on DreamHost, the DreamCompute service, we got about five milliseconds average latency for 4-kilobyte storage operations. So what do we learn from this? That DigitalOcean is very good at running Ceph, right? Their Ceph cluster is three times faster than everybody else's; they've done a good job. The problem is that if you go to Amazon Web Services and you just buy the EBS service from them, you get 0.3 milliseconds average latency. So it's not competitive. If I'm an application developer, I run my application on DigitalOcean, I will outgrow DigitalOcean at some point, and I will move to AWS. It's just not competitive at all with the market leaders.

We also measured this on a production StorPool cluster. That cluster specifically is running shared hosting services, and it's about 80% full. Unfortunately, it's not a public cloud, but it's a fairly well-loaded system. And we're trying to get this running in a public cloud service so you guys can try the same thing yourselves.

Now, the way this test is designed, it tells us how far this end of the curve is from the beginning. It doesn't describe the whole curve; it just tells us where the starting point is. So if you have a seven-times difference in that, what would be the difference in database performance or application performance? And the difference is night and day.
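The queue-depth-1 measurement behind those numbers is the same kind of fio run as before, but with a single outstanding operation, so the reported average completion latency is the starting point of the curve. A sketch, assuming the same 50/50 mix and a placeholder path:

    # 4K random reads and writes, one I/O in flight at a time; the average
    # completion latency (clat) in the output is the per-operation latency.
    fio --name=lat-test --filename=/mnt/vdisk/testfile --size=10G \
        --rw=randrw --rwmixread=50 --bs=4k --iodepth=1 \
        --ioengine=libaio --direct=1 --time_based --runtime=30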
So, on DigitalOcean block storage, which is Ceph-based: if you have an application that needs, say, 500 transactions per second in a transactional database, you can run that on DigitalOcean, and you will get 20 milliseconds latency for your transactions. If you go to AWS, for the same 500 transactions per second you get 1.5 milliseconds average latency per transaction. And in case you have an application that needs 2,000 transactions per second, you cannot run that on DigitalOcean at all, but you can run it on AWS or a faster storage service. The DigitalOcean service is also limited to 10,000 IOPS, the same as the EBS volume. The only difference between the two is the latency of storage operations; they're both 10,000 IOPS volumes, right?

And I have a couple of minutes for questions. Does anyone want to stand up and defend Ceph? There are a lot of Ceph fans here, I'm sure. No? Sorry. Any questions? Sorry, I don't hear you.

[Audience] What about Ceph running over RDMA?

I've heard a lot about it, haven't tried it. Right, so we are not experts in Ceph at our company, and we don't run Ceph clusters. And even if we were running Ceph clusters, it would be considered bad benchmarking, because no one trusts us to do Ceph well. So the best option is: if you have a Ceph cluster and you run Ceph with RDMA, we can run the same benchmarks on your cluster. I am extremely doubtful that it will make any difference, because I don't think these five milliseconds of latency are the latency of transferring four kilobytes over a network. It's somewhere else in the stack; it's not there.

[Audience] RDMA allows writing directly to the memory of another host.

It does, but does sending four kilobytes over a TCP connection take five milliseconds? I don't think it does. You could send even one megabyte, at one gigabit, within 10 milliseconds. So again, the difference between the systems: both of these services, the DigitalOcean one on this chart, the green one, and the Amazon EBS service, are limited to 10,000 IOPS. So in this case it's not about how many IOPS; it's about latency. Being able to send one megabyte over the network faster doesn't help this at all. It's just completely useless for this.

[Audience] But for small input/output operations, the ability to write directly to remote memory should give you slightly lower latency.

It's very interesting and theoretical until I see it running. Until I run the same benchmarks on a system like that, I don't believe it, right?

[Audience] I was curious if you have any numbers.

No, we don't. So, last thing on Ceph: performance has been a problem in Ceph for a very long time. We've heard these promises that it's going to be fixed in the next version, that BlueStore is going to make it better, and now, apparently, that RDMA is going to make it better. But it just never happens. It's just years and years and years, and it's not getting better.

[Audience] BlueStore was released.

Yeah, BlueStore helped, but not enough. Thank you. Let's take that offline, because I'm being kicked off stage here; my timer is down to zero. Thank you, guys.