Thanks, Marisa, for the introduction. So let's start. First, a few words about us, the presenters. I'm Avi. I'm the original maintainer of the KVM virtual machine, and now I'm a co-maintainer of Seastar, which is an IO framework that was used to build ScyllaDB, an open source NoSQL database, by my company. I'm also a co-founder of the company.

Hi, my name is Pavel. I used to be a Linux kernel hacker. You probably also know me from the CRIU project. And now I'm playing in the database area with Seastar and ScyllaDB.

Okay. So the presentation is not about ScyllaDB, but to understand why we're doing all this, it will help to understand what ScyllaDB is. It is a NoSQL database with a strong focus on high performance, low latency and scalability. It's compatible with Apache Cassandra and Amazon DynamoDB, so you can run existing workloads, just with lower cost, better performance and reduced latency. Because of that, we have a heavy interest in IO performance. Here are a few of our users; I won't read the slide.

So let's start with part one of the presentation: mixed workloads. You might wonder why we're starting with mixed workloads and not with read-only workloads or write-only workloads. The reason is that modern disks are so amazingly fast that the read-only or write-only workloads are not a problem. Here you can see specifications for a Samsung SSD, and they're just amazing: six gigabytes per second read throughput. And by the way, Samsung is not sponsoring this, it's just a random spec I picked up on the web. Although I would love to receive a sample, if anyone from Samsung is listening.

So again, SSDs are pretty amazing. You get a million random IOPS if they are read IOPS, and it's pretty hard to exploit a million read IOPS in a workload, so you're not going to saturate the disk. And very often you have multiple disks in a single server. You might have eight of them, so you actually have eight million read IOPS in a server, and more than a million write IOPS per server.

So the disk itself is amazing, but it's not magic. You can get any one of those specs, but you cannot get all of them at the same time, at least on most disks. So you get some kind of mix: you might get half the read spec and half the write spec. So we need to understand exactly what kind of mix the disk supports and make sure that we're running a mix that is within the disk's capacity. If we're running a mix that is outside of its capacity, we will just get variable latency, or the workload will not complete.

So why are we interested in mixed workloads? Well, the main workload for ScyllaDB is online transaction processing. There's usually a real user at the other end, so we're interested in providing low latency for that workload. But in parallel with this workload, we also have maintenance workloads that are generated internally and run in parallel. We might be scaling out the database, adding more nodes, or removing nodes that are no longer needed after a scale-out operation completes. There is compaction, which is where the database merges multiple files: it reads the files sequentially and writes out a new file, and by reducing the number of files it improves read response. And there can be a backup operation running, so while the other workloads are running, we're also reading from the disk.
The database can also run an analytics workload in parallel with, let's say, an online transaction processing workload. The idea here is that there is spare CPU and disk bandwidth, and we want to use it. We don't want to let the machine go idle; we want to put it towards something productive like analytics. But we don't want it to hurt the main workload. And there are multi-tenancy workloads, where you're running several different OLTP workloads together using the same CPU, disk and data. That's common with microservices, where each microservice can be regarded as a tenant: they're all operating on the same data, but they might not share the same service level agreement.

So the challenge is to allow all those workloads to run concurrently, but to prevent one of them, usually the sequential workloads, from dominating the disk and hurting the other workloads. It's very easy for a sequential workload to consume all of the disk capacity and make the random IO behave poorly. And we'll see how we do that.

So part two is understanding exactly what kind of mixed workloads the disk can support, going beyond the specifications that just show you single-dimension workloads like sequential reads or random writes. To do that, we built a tool called Diskplorer. It is open source and the URL is there, so like, share, and subscribe. It's written in Python, it's based on Jens Axboe's fio, and it uses matplotlib to generate the fancy graphs. And it takes quite a long time to run, because it does very detailed experiments on the disk. So if you want to play with it, make sure you set aside some time for the tool to run.

So let's look at some sample results. These results are from an Amazon i3en instance (those are instances with an ephemeral disk), and we ran Diskplorer on it. The chart is very information-dense, so let's look at what the results mean. On the x-axis we're varying the write bandwidth from zero to one gigabyte per second, which is the maximum that this drive supports. On the y-axis we're varying the read IO operations per second from zero to around 250,000 operations per second. And this is a matrix: we're running 21 different write bandwidth settings and 21 different read IOPS settings, so 441 different experiments overall. Each experiment runs for 30 seconds and measures the read latency, which you can read off the color bar on the right. The cyan color is around 100 microseconds, 0.1 milliseconds, and the colors vary towards blue and then this ugly purple, which is about five milliseconds.

The areas of the graph that are white are areas where we were not able to push both rates. For example, if we try to push 800 megabytes per second and 150,000 read IOPS, the disk just cannot support it, even though it can support each one of them independently. But if we try to push 600 megabytes per second and 100,000 IOPS, then it is well supported, and from the color you can infer that the latency is above 100 microseconds, but not by a lot, so basically it's instantaneous. There are two charts for every Diskplorer run: the upper chart gives the 50th percentile, and the lower chart gives the 95th percentile. You can see the 95th percentile is affected much more when you're running closer to the limits, so you might want to stay out of the area where you get those purple blotches.
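For readers following along, a single cell of that 21-by-21 matrix is essentially a rate-limited fio run: a throttled sequential writer and a throttled random reader running side by side while read latency percentiles are collected. Here is a minimal sketch, driven from Python the way Diskplorer drives fio. It is illustrative only: the device path, block sizes and rates are made up, and this is not Diskplorer's actual job definition.

```python
import subprocess

# One Diskplorer-style matrix cell: fio runs a rate-limited sequential
# writer and a rate-limited random reader concurrently for 30 seconds
# and reports read-latency percentiles.
# WARNING: this writes to the raw device and destroys its contents.
DEVICE = "/dev/nvme0n1"   # hypothetical test device

subprocess.run([
    "fio",
    "--ioengine=libaio", "--direct=1",
    "--time_based", "--runtime=30",
    f"--filename={DEVICE}",
    "--percentile_list=50:95",
    # job 1: sequential writes throttled to 600 MB/s
    "--name=writer", "--rw=write", "--bs=128k", "--rate=600m",
    # job 2: 4k random reads throttled to 100k IOPS
    "--name=reader", "--rw=randread", "--bs=4k",
    "--iodepth=64", "--rate_iops=100000",
], check=True)
```

Diskplorer sweeps both rate limits across their ranges and plots the measured read latency of every combination.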
Oh, let me see. There's a question. Okay, the question was whether the writes are sequential or random. An important question, I should have said it: the writes are sequential and the reads are random. The reason this is so important is that this is how our database operates; it does not do random writes. A database that uses B-trees will generate random writes, and those behave very differently from sequential writes. But our database uses a log-structured merge tree, which means write activity is done by merging large files. An incoming write first goes to the commitlog, which is sequential. Then the data is dumped from memory to an SSTable, which is a large sequential file on disk. And then compaction picks up several SSTables that have a similar size (those SSTables are sorted, that's the first S in the name) and a merged file is written. So basically we have random reads to serve the read queries, and we have sequential reads and sequential writes to serve writes and operations like backup and scaling. So, yes, a good question.
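As an aside, the merging step described above can be modeled in a few lines of Python. This is a toy sketch only, ignoring timestamps, tombstones and on-disk formats, just to show why compaction is purely sequential: each input run is consumed in order, and the output is produced in order.

```python
import heapq

def compact(sstables):
    # Merge several sorted runs of (key, value) pairs into one sorted run.
    # Each input is read strictly in order and the output is written
    # strictly in order: sequential reads in, one sequential write out.
    merged = heapq.merge(*sstables, key=lambda kv: kv[0])
    out, last_key = [], object()
    for key, value in merged:
        if key != last_key:        # keep the first version of each key seen
            out.append((key, value))
            last_key = key
    return out

# Two sorted runs become one; duplicate keys collapse during the merge.
print(compact([[("a", 1), ("c", 3)], [("b", 2), ("c", 99)]]))
```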
Okay, so let's move on to the next instance type. This is a newer, ARM-based instance, again from Amazon, and it has a newer disk. You can see that the read and write ranges are similar to the older disk, one gigabyte per second write bandwidth and 250,000 read IOPS, but the combinations of workloads that work and that give very good latency are much larger than on the previous instance. So perhaps on the newer disk they invested some effort into improving the latency response.

Oh, I skipped over a chart here. Okay, so let's do this out of order. The reason that it's purple here and not white is that this chart was generated with an older version of Diskplorer that marks with purple the areas of the chart where we failed to perform the workload. I just didn't want to spend the hours running Diskplorer again.

So let's look at yet another instance type. This is an i3 instance, a much older instance, so the disks are not as good. You can see that even the 50th percentile gives not-so-good results, and the 95th percentile is pretty bad. Maybe the disks themselves are also older, which reduces their performance; it's hard to say, because we don't have access to those metrics.

So let's look at measurements from another cloud provider. This is a Google Cloud Platform system, and they provide the local storage in a different way: you have SSD slices which are just 375 gigabytes each, but you can merge multiple slices together into a single volume, and this is what we measure here. You can see that the results depend on the write throughput, but the dependency is not so pronounced, and basically you have to stay under 400,000 read IOPS if you want to get good latency. The takeaway is that different disks behave differently. In this case the Google disks are really managed by the hypervisor, you're not directly accessing the SSD, so they behave very differently from other disks; other disks behave more like the Amazon disks.

This is again Google Cloud Platform, but here we're looking at the persistent disk. This is a network-attached disk, not a local NVMe. And here we see a similar pattern to what we had with the Amazon disks, but notice that the rates are much lower. From a network-attached disk you cannot get the same amazing IOPS and throughput that you can get from a locally attached disk; it's about 20 times lower. But we see the same diagonal pattern that appears on many disks. There is some strange artifact in the 95th percentile, where you get bad 95th-percentile latency at the lower range of the read IOPS. I don't have a good explanation for why this happens. It could be a problem with the test; I don't know what it could be, but it's very strange that when you're running with lower IOPS you get a worse 95th percentile. It could be that you have a fixed number of requests that return with higher latency, and that fixed number of requests, when divided by a smaller number of overall requests, pokes out into the 95th percentile; at, say, 12,000 IOPS they show up, while at higher rates they get drowned out by the many fast requests. So I don't really have an explanation for that. We don't generally use persistent disks, or the equivalent in Amazon, EBS, because they are so much slower than locally attached disks.

And finally we have a hard drive. I guess everyone forgot about them, but they still exist. The numbers are really, really low: you only reach 120 (not 120,000, just 120) read operations per second. And in theory you can get to 200 megabytes per second of writes, but if you do that then your read IOPS are very low and the read latency shoots up. So use SSDs.

So that's it for Diskplorer. Again, it's open source, and if you want to experiment with it and generate graphs for your disk, then please send pull requests with your results and I will incorporate them. The README page has the same results that I presented, and I would love to extend it with results from more exotic disks. Let me check if there are questions, and if not then we will move on to Pavel's section.

So there is a question: what is the total cost if we migrate to AWS cloud? Well, it really depends on what you're doing. I see a question: performance appears to abruptly cut off on the last slide. This would be the hard disk, the spinning disk. Yes, it's because the spinning disk can only achieve its maximum write throughput if it doesn't serve any other reads. As soon as it starts to serve reads, the disk needs to seek, and as soon as it seeks it gets extremely slow. So, like I said, don't use hard disks for performance-sensitive workloads.

Okay, I guess those are the questions so far, so let's move over to Pavel's section of the talk, where he explains how we architected our IO scheduler to keep the workload within that nice cyan-colored part and out of the purple parts. Over to you, Pavel.

Yep. Thank you, Avi. So, once we know that disks behave like this, what can we do with this knowledge? That's part three of the webinar. One of the possible applications of the knowledge is to dispatch IO into the disk in a way that gets the best latency possible, and that's our goal in ScyllaDB. And here is how we do it. When pushing requests into the disk, the dispatcher should stay inside the safety area reported by Diskplorer. But there are two challenges here. First, the safety area can be quite tricky: it can be convex, or be cut off at high write IOPS, or have some other weird form. But by and large, it's generally more bluish towards the zero point and more purplish towards larger bandwidths and IOPS. A decent approximation is to treat the area as a triangle with corners at zero and along the axes. In fact, what's shown on the plot is not the full picture.
Disks have four speeds, two IOPS values, for read and write, and two bandwidths for the same. So the triangle in question is actually a triangle-like body in four-dimensional space, and Diskplorer just drew a flat slice of it. In the form of an equation, this linear safety area is the sum of the normalized bandwidths and IOPS, which should stay below some constant, something like read_bw / max_read_bw + write_bw / max_write_bw + read_iops / max_read_iops + write_iops / max_write_iops <= K. In many cases the constant is 1.0, but some disks with a convex safety area call for a smaller value, and some other disks may call for a larger value. And this equation is something we can work with, once we solve the second difficulty.

The second difficulty is that the safety area equation defines a relation between IO bandwidths and IOPS, and both bandwidths and IOPS are time derivatives, or in simple words, speeds; they are something-per-time values. When one observes a flow of requests, it's not possible to get an instant value of a bandwidth or IOPS. One needs to apply some approximation, taking sliding averages or the like. And since any math of that kind is statistics, and thus collects some history of the measurement, the speed value that we get is likely missing some spikes, so a decision based on it may jump out of the safety area.

Some good news is that we don't need to really calculate those speeds; we can take a shortcut. The shortcut is the well-known algorithm called the token bucket. It takes two inputs, first a rated flow of tokens, and second a chaotic flow of requests, and it generates an output, a rated flow of requests. Each request is assigned a certain amount of tokens, and once the bucket has at least this amount of tokens for a request, the request grabs the tokens and can be dispatched. In terms of equations, the algorithm introduces a resource R associated with individual requests; it accounts the sum of the R's that flow into it, and it guarantees that the speed at which R flows out of the bucket is limited by the rate of the tokens input. And that's exactly what we need. It's possible to take the original equation of the safety area, with a piece of paper and a pencil, or, I don't know, a mathematical package maybe, modify it a little bit, and convert it into a form that's exactly the token bucket equation: the sum over dispatched requests of (1 / max_IOPS + size / max_bandwidth) may grow no faster than K per second. The thing in the braces is the resource value, which is fairly easy to account.

So that's what we get combining it all together: a token bucket with a constant incoming rate of tokens, where each request is assigned a cost of one divided by the maximum IOPS, plus its size divided by the maximum bandwidth. And the bucket emits an outgoing flow of requests that fully conforms to the original safety equation. It only takes some effort to get used to floating-point tokens, and to the fact that the request cost is measured in seconds, but other than that, such a dispatcher can keep the outgoing flow of requests under the green line. And another piece of good news is that it's possible to control the dispatcher intensity with just a single parameter, the limiting constant, by moving the green line up and down, thus expanding or shrinking the safety area, depending on its exact form.
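Here is a minimal sketch of such a token-bucket dispatcher in Python. It is illustrative only: the class and parameter names are made up and the numbers are hypothetical; the real implementation is Seastar's IO scheduler and is more involved.

```python
import time

class TokenBucket:
    """Keeps dispatched IO inside the linear safety area
    iops / max_iops + bandwidth / max_bandwidth <= K.

    Each request costs 1/max_iops + size/max_bandwidth tokens, so a
    'token' is really a second's worth of disk work, and the bucket
    refills at K tokens per second.
    """
    def __init__(self, max_iops, max_bandwidth, k=1.0, burst=0.01):
        self.max_iops = max_iops            # e.g. 250_000 read IOPS
        self.max_bandwidth = max_bandwidth  # e.g. 1e9 bytes/s of writes
        self.rate = k                       # the limiting constant K
        self.burst = burst                  # cap on accumulated tokens
        self.tokens = burst
        self.last = time.monotonic()

    def cost(self, size):
        # One IOPS slot plus the transfer time of `size` bytes, in seconds.
        return 1.0 / self.max_iops + size / self.max_bandwidth

    def try_dispatch(self, size):
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        need = self.cost(size)
        if self.tokens >= need:
            self.tokens -= need   # grab the tokens: request goes to the disk
            return True
        return False              # not enough tokens: queue and retry later
```

Averaged over any window, the requests this bucket lets through satisfy the safety equation, with the burst size bounding how far a momentary spike can overshoot.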
However, one may say: okay, we stay inside the safety area, but that's an area. The dispatcher is free to stay at any point of it and still get good latency results, so what exactly should it do? That's right, one more step is needed here. Before getting there, let's look at another well-known dispatcher, the CPU scheduler.

In Linux, and actually in Seastar too, the CPU scheduler maintains a set of so-called entities and schedules the CPU time between them. There are two numbers associated with each entity: the runtime and the virtual runtime. The former is literally the total time an entity has had the CPU for. The latter is the runtime adjusted in two ways. First, it's divided by the entity's priority (in Seastar we call it shares), so that entities with higher priorities have smaller virtual runtimes and the scheduler gives them more real runtime. The second adjustment happens when an entity wakes up from being idle: in this case its virtual runtime can be increased, so as not to give the newly woken entity a non-preemptible boost against the others.

For IO, the same thing can be applied. The request cost is what the request needs from the disk, the sum of a normalized one and its normalized length. The IO scheduler should of course have some concept of streams, flows or groups to which it can attribute a bunch of requests. If we accumulate the costs per stream, normalized by the stream's shares, we get the exact analogy of the virtual runtime in the CPU scheduler. Using this value, we can first balance the disk throughput consumption between the different classes. And the second thing is that this normalized IO time, or whatever it should be called, defines yet another line in the scheduling decision area: on the intersection of the capacity-limit line and the weight-balancing line, there is the point at which the scheduler is going to load the disk, with the exact read bandwidth and read IOPS and the exact write bandwidth and, of course, write IOPS.
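Putting the two pieces together, a toy model of this virtual IO time balancing could look like the following sketch. Again this is illustrative only: the stream names and shares are made up, the idle-wakeup adjustment is omitted, and the real scheduler is Seastar's.

```python
class IoStream:
    """A workload class ('stream') competing for the disk."""
    def __init__(self, name, shares):
        self.name = name
        self.shares = shares    # like CPU scheduler shares
        self.vtime = 0.0        # accumulated cost / shares: 'virtual IO time'
        self.queue = []         # pending (size, payload) requests

class FairIoScheduler:
    """Dispatches from the stream with the lowest virtual IO time, while
    the token bucket keeps the total flow inside the safety area."""
    def __init__(self, bucket, streams):
        self.bucket = bucket    # the TokenBucket from the previous sketch
        self.streams = streams

    def dispatch_one(self):
        runnable = [s for s in self.streams if s.queue]
        if not runnable:
            return None
        stream = min(runnable, key=lambda s: s.vtime)  # most underserved
        size, payload = stream.queue[0]
        if not self.bucket.try_dispatch(size):         # safety area is full
            return None
        stream.queue.pop(0)
        stream.vtime += self.bucket.cost(size) / stream.shares
        return payload

# Hypothetical usage: queries get 10x the shares of compaction, so they
# receive roughly 10x the disk capacity when both streams are backlogged.
# bucket = TokenBucket(max_iops=250_000, max_bandwidth=1e9)
# sched = FairIoScheduler(bucket, [IoStream("query", 1000),
#                                  IoStream("compaction", 100)])
```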
So this is it from my side. Now we can go to the Q&A part. Yes, there were a couple of questions which I answered in writing, so if there are more questions, or if someone wants to elaborate on any of them, we'll be happy to answer. And if not, then go and download Diskplorer and run it on your disks. But be careful, because it's a destructive test, so don't run it on disks where you have data.

Okay, I guess there are no more questions. So thanks everyone for attending. Do take Diskplorer for a run on your drives, and also try out ScyllaDB and Seastar; they're pretty interesting. Oh, I see we do have a question: can you describe the differences between AWS and GCP disks? So let me pull them up. Are you, are you on mute? Yeah, we forgot to click that small button that stops audio from the tab.

Okay. So, keeping in mind that what I have to say here is somewhat speculative, I don't know how Google or Amazon implemented this, so maybe I'm just inventing a theory that's completely wrong. I believe that the Amazon disks are a more or less straightforward passthrough to regular NVMe disks, and that most SSDs will give behavior that's similar to the Amazon disks. My explanation for the diagonal line is that the disk has an internal bottleneck. It could be the controller, the small CPU within the disk that manages all the activity, or it could be the chips themselves: the SSD is composed of flash chips, and eventually they all get busy. And when they get busy, you hit the limit of the disk, and instead of the throughput increasing, the latency starts to increase. So the diagonal line shows you where that limit is hit. Let's imagine that the limit is the SSD chips. Every IO consumes some capacity from an SSD chip: you need to open a page, perform a read or a write, and close the page. If you're doing a sequential read or write, that operation is a lot cheaper than the equivalent number of random reads or random writes, but you have to hold the chip open for longer, because you need to transfer the data. And that's the origin of the diagonal line: each type of workload consumes something from the resource that is the bottleneck, either the controller or the SSD chips, and the diagonal line shows where all of that resource is fully consumed.

For Google, there is a lot more involvement of the hypervisor, so there's more software, more involvement of the host hypervisor, and the host hypervisor has a lot more resources at its disposal than the disk. So you see less interference between the two workloads. I imagine that the latency comes from the disk, but the ability to sustain reads and writes almost without interference comes from the hypervisor. You do see that there is some difference between workloads with low write bandwidth and workloads with high write bandwidth, but it's much less pronounced. My theory is that more of the workload is handled by the hypervisor, which has a huge number of cores at its disposal, so it doesn't hit a bottleneck; but the actual read latency comes from the disk itself, and there's nothing the hypervisor can do to cover that, other than cache the entirety of the disk.

And I will mention that Diskplorer takes great pains to defeat any optimizations. It pre-writes the entire disk, which is part of why it can take several hours, because the SSD can optimize reading from an area that was not previously written to, just returning immediately without exercising the SSD chips. And it avoids using a file system, so as not to have interference; it works directly with the raw disk. So there was some effort expended there to make the results reliable. But of course there might be surprises, and there can always be bugs. Okay, I hope I answered that question. If there are more questions, please ask.

I see a question: are there any free options to practice with this technology as a student? So Diskplorer, Seastar and ScyllaDB are all open source, and you can practice with them as a student or with a production workload. They're all free.

And I see a question on ScyllaDB: I suppose the assumption of random reads doesn't always hold? That's true. Yes, with a time-based compaction strategy, or when doing sequential scans instead of point queries, you will get sequential reads. But the scheduler doesn't assume that all the reads are random. The scheduler looks at every IO operation and examines its size and its direction, whether it's a read or a write, and assigns it a token cost according to the equation that Pavel presented. So we don't really look at whether it's sequential or random, but we do assign a much higher cost to the large IOs that are usually associated with sequential reads and writes, and a lower cost to small random reads and random writes, which basically don't really happen.

Avi and Pavel, it looks like another question just came through. Yeah, okay, let me read it: would it be possible to use your approach in a generic block-layer IO elevator? Yes. Well, one of the questions that I expected but wasn't asked was why not use the Linux IO scheduler.
The answer to both is that it would be possible if a lot of work were invested; it's not possible now, but it would actually make a very good solution. Our IO scheduler has one piece of information that the Linux IO scheduler does not have, and that is the assignment of IO operations to separate workloads. So our scheduler knows whether a particular read is part of a query that needs to be assigned high priority, or whether it's part of a compaction or a backup that can run with much lower priority and really should use just the idle bandwidth. And the Linux IO scheduler doesn't have this assignment. There was a previous scheduler, I think it was called stochastic fair queuing, which assumed each process was running a different workload and tried to be fair among them. But first of all, that assumption is not true: you often have one process running multiple workloads, and you don't have a way to tag that. And also the mechanisms for tagging a priority are not flexible enough; you have only a small number of priorities, whereas with our scheduler you have shares, which are more flexible. So it would make lots of sense to push this into Linux, but it's also a lot of work, and the APIs are not ready for it, because there is no way to tag IOs with the priority classes, the workload classes, that they belong to. I guess with io_uring it becomes easier now, because there is a more generic way to send IOs, and io_uring could be extended with information about the classification of each IO into separate workloads. So it would be really interesting, but also a lot of work. Alrighty.

Okay. Sorry, I thought another question came through. So thank you so much, Avi and Pavel, for your time today, and thank you everyone for joining us. Just a quick reminder that this recording will be up on the Linux Foundation's YouTube page later today. We hope that you will join us for future webinars. Have a wonderful day. Thank you again.