Hello there. Today I will be talking about hard drive performance testing on bare metal, Docker, and KVM. However, this talk will not be about comparing, say, two drives run in some way or form on these platforms. What I am interested in is scaling things up. What happens when you add multiple drives per guest or container, versus multiple guests or containers with one drive each? How do we investigate which tweaks we need to get the best performance? That will require a lot of testing. I don't know about you, but I don't feel like building a test box, manually running all the tests, and collecting the data, just so that after all of that I find out I forgot something and need to start all over again. I have done that, and it's not a pretty sight. There has to be a better way. So, how can I be lazy in my testing?

Let's begin by defining the problem we are trying to solve: in this case, how to do a ton of hard drive performance tests. Knowing which tests we want to run is important; there is probably some standard we can use. Spending the time to not only automate this testing as much as possible, but also make it rather platform agnostic, will go a long way toward saving us time. We should be focusing on interpreting the data and coming up with questions to be answered, not on the actual running of the experiment. By then I hope we will have found some kind of conclusion; if not, we still have time to make something up. Finally, we will use the remaining time to go over any questions you may have.

Warning: my employer has no idea that I'm talking here, so really, don't blame them. Also, this presentation is not as technical as most of the other talks; we will be focusing on the process, not on the coding. But there will be lots of pretty pictures.

Over here, we have researchers who need to get as close to wire speed as possible on the devices they are working with: GPUs, hard drives, and network cards. The "as close to" is the name of the game. Ideally we would provide each researcher with enough bare metal servers to perform the experiment, and insanely fast communication between them. The hope is that the "enough bare metal servers" part is covered by the Chameleon Cloud, but we do not have infinite resources to throw at the problem. Instead, we can cram a bunch of these devices into as few servers as possible and come up with some way for the researchers to fetch the devices they want.

The problem with abstracting hardware using some kind of virtualization is that something is always lost in the translation. How much is lost is the question, and that ties back to the "as close to" comment from before. To show what I mean, let's talk about three different ways to pass a network card to a VM guest or a Docker container.

First, we can pass a virtual, or scaled down (how scaled down it is really depends on the driver doing the emulation), card to multiple guests out of the same physical card. This method is called paravirtualization. It takes a lot of resources to do that abstraction at the hypervisor or container server level, so it is the slowest of the three, and on top of that it does not expose much of the original card. Most people are fine with that, but since we are doing research, that is not good enough.

Other cards have SR-IOV, which allows us to create "fake" cards at the card itself and then hand those off as before. In this case the card really becomes a mini virtual environment, a mini virtual server, but it only serves partitions of itself. (A minimal sketch of what creating those virtual functions looks like follows below.)
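To make that a bit more concrete: on a NIC that supports SR-IOV, the virtual functions are usually created with a simple sysfs write. The interface name and VF count below are just examples, not our actual setup.

    # Ask the card itself to create four virtual functions (SR-IOV);
    # the interface name and the count are examples only.
    echo 4 > /sys/class/net/enp3s0f0/device/sriov_numvfs

    # The new virtual functions then show up as their own PCI devices,
    # which can be handed to guests or containers.
    lspci | grep -i 'virtual function'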
A classical example of a card that can do this is a GPU, and some network cards can do it too. Because the card itself is doing the partitioning, it reveals more about itself, and there are cards that allow each virtual device in this mode to be programmed up to a certain point.

Then we have PCI passthrough. The speed difference between PCI passthrough and SR-IOV still needs to be checked for our context, but PCI passthrough tends to be slightly faster. More importantly, we plan on completely reprogramming at the very least the network cards, and that includes changing their firmware. How much of that is available through SR-IOV? I don't know. For now, it is safer to just hand the entire card over to the researcher, in such a way that the researcher has full control of it from within the VM guest. So for our case we are mostly interested in PCI passthrough, but we will consider investigating the other abstraction models to see whether we can still achieve the same performance and configurability that we get by handing over the entire card.

With that said, some of you have realized there is something missing here: PCIe version 4. I really would like to be talking about PCIe 5, but we are not there yet. Anyway, PCIe v4 can provide the same bandwidth as PCIe v3 using half the lanes. For the sake of compatibility, many cards scale down their bandwidth usage to match the slot they are in. That may not sound important, but consider a 100 gigabit Mellanox card that cannot reach full speed in a single PCIe v3 slot and in fact has to use two PCIe v3 slots to achieve its full 100 gigabit speed, when, if it were connected to a PCIe v4 slot, we would not need that auxiliary card. It really starts to add up, especially if you have a limited number of slots.

And then there is NUMA, which lets us take this two-CPU server, with its memory and a few PCIe slots holding some nice cards, and break the CPUs apart to create individual computers, which we call NUMA nodes, each with its own PCIe and memory slots. Those slots can be accessed faster by the CPU if they are inside the same NUMA node; the CPU can reach those PCIe slots and that memory faster than if it tries to reach a PCIe slot that sits in some other NUMA node. The bottom line is that this direct access can lead to faster VM guests, and by fast I mean we start getting close to bare metal speed. As other talks at this conference will show, there is also a lot of work being done in the NUMA domain by AMD; you really should not miss those talks.

Now, there are standards for testing GPUs, network cards, hard drives, and so on. For this presentation we will focus on NVMe hard drives, but the concepts apply just as well to the others. The standard we use is the Storage Networking Industry Association (SNIA) Solid State Storage Performance Test Specification, version 2.0.1, from February 2018. Yes, it's a mouthful. But what does it mean? It describes the process of performance testing solid state drives, which includes how to precondition the drive so it does not behave like a drive fresh out of the box. A fresh drive would look insanely fast, but that is not representative of real life, and it would skew the results.
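As an aside, the preconditioning itself usually boils down to writing the whole device a couple of times before measuring anything. Below is a rough sketch of that with FIO; the device path, block size, and loop count are illustrative assumptions, not the spec's requirements, so check the specification for the real procedure.

    # Illustrative preconditioning pass: sequentially fill the drive twice
    # so it stops behaving like a fresh-out-of-the-box device.
    # Device path, block size, and loop count are assumptions.
    fio --name=precondition --filename=/dev/nvme0n1 \
        --rw=write --bs=128k --iodepth=32 \
        --ioengine=libaio --direct=1 --loops=2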
The spec also specifies which sample range to collect. At the beginning of the testing we are in a region where the drive might still be behaving like a new drive, with its built-in optimization algorithms working at full speed. But then we start reaching a region where that matters less, because so much data has gone through the drive that it is finally behaving the way a drive that has been used 24/7 would behave. And after that there is a point that we call the steady state, and that is where we actually want to take our readings. It is defined as the point where the average readings no longer change much. In the experiments we have run here, getting there took close to a day of testing.

This standard also defines which tests to perform: specifically bandwidth, IOPS (I/O operations per second), and latency, and each of them has specific read/write ratios and block sizes for that specific test. For instance, for IOPS we do random I/O with read/write ratios of 100/0, 95/5, 65/35, 50/50, 35/65, 5/95, and 0/100, and block sizes that go from half a kilobyte up to 1024 kilobytes.

Now, let's talk about the tools we are going to be using. First there is the Flexible I/O tester, FIO, which is a multi-threaded I/O generator used for, as its name indicates, testing a given workload. It is extremely customizable, and I personally consider it one of the best testing tools out there. Then there is the Storage Performance Development Kit, SPDK, which is really a set of tools and libraries for writing high-performance, scalable, user-mode storage applications. What we are really talking about is zero-copy, highly parallel access directly to an SSD at the user space level. Drivers built with it should allow us to get much faster storage; how much faster compared to the other methods is, of course, a matter of testing. SPDK can be used in conjunction with FIO, to test devices through the SPDK driver, or it can use its own performance testing tool. And finally, we have some custom scripts that we created to make our lives easier. One simply finds which NVMe hard drives are available in a given computer. The other acts as a wrapper for FIO and SPDK: it deals with creating the laundry list of tests to be fed in, running the tests, and saving all the output in a format that is convenient for us. In this case we chose comma-separated files, which we can then pass to other scripts to look for interesting data and make pretty graphs.

The host we have been using for this testing is an old Supermicro box, which also doubles as my 3D printing filament dryer. It has 48 PCIe v3 lanes and dual Intel CPUs, each with 18 cores. The way it is set up, each CPU is an independent NUMA node, so we really have only two NUMA nodes: one that takes cores 0 to 17, and the other that takes 18 to 35. We also have Intel NVMe hard drives. The ones we are using on this machine go in the front, which is what those drive bays are for, and they use the U.2 interface, which means, as the picture at the bottom shows, they look like standard SATA connectors but with a few extra pins in the middle. The problem here is that all ten of those hard drives are attached to one single controller, which is in NUMA node 0. So there are really no performance gains to be had by running the different guests and containers in separate NUMA nodes; if we want any performance, we have to put them all in NUMA node 0.
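If you want to check this kind of thing on your own box, the NUMA placement of a drive can be read straight out of sysfs. The device names and PCI address below are examples.

    # Which NUMA node is this NVMe controller attached to? (example names)
    cat /sys/class/nvme/nvme0/device/numa_node

    # Same question, starting from the PCI address that lspci reports
    cat /sys/bus/pci/devices/0000:3b:00.0/numa_node

    # And the overall CPU/memory layout of the box
    numactl --hardware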
Out of curiosity, this is how we got the list of drives and found out which NUMA node each of them was in: we used the finddrive script we mentioned before, with a bit of help from virsh.

Now that we know the players, let's see if we can grasp what we really want to accomplish with them. What we are proposing is to compare FIO by itself using libaio, FIO with SPDK, and maybe even SPDK by itself. And that is not even counting all the optimizations we can do just by messing with SPDK, or the optimizations we can do on the virtual host side. We also have 10 hard drives, all exactly the same make and model. It would be nice to also have different models, but the real question is how similar they are: if we run the same test on each of those drives, do we get the exact same result, or is there some variation? Then, for the IOPS test, we have 8 different block sizes: 1024K, 128K, 64K, 32K, 16K, 8K, 4K, and half a K. And we have the 7 different read/write ratios we mentioned before. And that is not even counting sequential versus random reads and writes. And then there are the effects, which we really cannot test on this machine but which are there in principle, of playing with the kind of memory we give our virtual machines, the number of CPUs, the number of NUMA nodes, and so on. I will let you calculate the possible permutations, but it is a lot, like thousands.

FIO by itself lets us cut down some of those tests, but you still have to start them manually, they still have to be configured, and you still have to process the data. I tried generating the FIO test config files by hand before. For a few runs it was fine, but it soon became very easy to make mistakes. In fact, I made a lot. And that was bad, because I had to redo them, and sometimes first figure out what was wrong and then redo them. Since those tests take hours upon hours, that means I lost not only hours but days in the process. I think we can do better.

So, as I mentioned before, we wrote a script that takes the parameters, uses them to create the config files for FIO and SPDK, and runs the tests. For each of those tests we then collect the data, not only in a CSV format but also with a file naming scheme that helps us identify which run it came from. And we actually save the config files that were created to run those tests, which means we can go back to them, not only as reference documentation but also to reproduce any specific test.

Because of this automation, we asked ourselves: why just run some tests for, say, bandwidth and some tests for IOPS? Why not just run all of them? That covers all the SNIA test requirements, and it is easier to drop data. By that I mean it is always easier to drop data you collected but do not need than to try to get data you need but did not collect. If it just takes an extra day or so, who cares? We can automate it, we can line the batches up so that as soon as one batch of tests ends the next one starts, and while that is running we can get to the data we already have, process it, and plan the next batch. Automation is good.
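To give a flavor of the idea, the loop below is a minimal sketch of that kind of wrapper, not the actual torture script: it walks the block sizes and read/write mixes, runs FIO for each combination, and names the output after the run parameters. The device name and output layout are made up, and the real script additionally saves the generated config files and converts the results to CSV.

    #!/bin/bash
    # Minimal sketch of the wrapper idea (not the real script).
    dev=nvme0n1
    out=results/$(date +%Y%m%d-%H%M%S)-$dev
    mkdir -p "$out"

    for bs in 512 4k 8k 16k 32k 64k 128k 1024k; do
        for mix in 100 95 65 50 35 5 0; do
            # One random read/write run per combination; cap it at 24 hours
            # but let FIO stop early once the IOPS slope flattens out.
            fio --name="$dev-randrw-bs$bs-r$mix" \
                --filename=/dev/$dev --ioengine=libaio --direct=1 \
                --rw=randrw --rwmixread=$mix --bs=$bs --iodepth=32 \
                --time_based --runtime=86400 \
                --ss=iops_slope:3% --ss_dur=1800 \
                --output-format=json \
                --output="$out/$dev-randrw-bs$bs-r$mix.json"
        done
    done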
Now, here's an example of how that torture test script creates the config file to run with FIO. In this case it is set up to run FIO with libaio, for up to 24 hours. But if you look at the steady state statement, it will stop running the test when the IOPS slope stays below 3% over a period of 1800 seconds, and so far that has always happened before the 24-hour window ended. For every single run, my little torture disk script will create a file like that. Note also the file name: it identifies the name of the device, which test we are doing, and when we started it.

We also use Ansible to build things. We have two playbooks, actually. One builds the bare metal server and, specifically, installs Docker and KVM. Those little arrows you see there represent task files that are run by the playbook. I like to keep them separate because that allows me to reuse them for other tasks; in fact, I am going to be using the install-Docker task later today for another project. We also have a playbook to install the testing software itself. I could use it on all three kinds of test boxes, bare metal, KVM, and Docker; in fact, I have used this playbook on all three before. The reasons I stopped using it on Docker and, nowadays, on KVM were, first, speed: this way I have the image already ready to go with all the packages installed, and I can just start it and run. And second, by doing that I do not have to run SSH inside either the Docker container or the KVM guest, and Ansible needs SSH to do its thing. That does not stop me from running the playbook again if I want to: the playbook will just go through the steps and realize, oh, you already installed this package, you already downloaded these files, you already compiled and built them, so you are good; nothing needs to change unless there is a newer version than the one already installed.

Then we also have to talk about our test boxes. For the bare metal server, we really only have that one Supermicro right now, so there is no point in building a separate custom image just for it. It is just as easy to do a basic install on the machine, let Ansible build it, and off we go. In fact, I could create a custom image and deploy it through PXE boot. But for both Docker and KVM I have an image, as I mentioned earlier, with all the packages we need. When you run with this image, you start the VM guest or the container and give it a location to store the data that will survive the container being gone; by that I mean storage, a directory that is shared between the bare metal server and the Docker container or the KVM guest, and that is where we store the results from the experiments. And when we start the containers and guests, the NVMe hard drive, which is a PCI device, is fed using PCI passthrough to either the Docker container or the VM guest at start time. If any of those guests need to download something or access the network, both are on a network, and that means that if we really need to enable SSH we can do that with some port forwarding and be done.
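To make that plumbing concrete, starting one of these containers looks roughly like the sketch below. The image name, script path, and device are placeholders rather than our actual names; for the KVM guests the same idea applies, except the drive goes in via PCI passthrough.

    # Hand one NVMe device plus a host directory (which outlives the
    # container) to a test container. All names and paths are placeholders.
    dev=/dev/nvme2n1
    name=$(basename "$dev")
    out=/data/results/$name-$(date +%Y%m%d-%H%M%S)
    mkdir -p "$out"

    docker run -d --name "torture-$name" \
        --device "$dev" \
        -v "$out":/results \
        torture-image /opt/torture_disk --device "$dev" --outdir /results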
This is an example of how to run torture disk against three hard drives, NVMe 2, 5, and 9, running all those different block sizes and read/write ratios. Even though it only lists the 100/0 mix, it actually tests the other way around too. The output directory where it stores the log files is based on the ioengine name, libaio, plus the name of the hard drive, plus the date. Also, as before, the time here is the limit on how long it can run before it stops, and the steady state is set up to use the IOPS criterion we mentioned before.

Now, if you want to, for instance, start creating a bunch of Docker containers, we can do that, as in this example, by listing all the NVMe hard drives found on the server using finddrive and then looping over them. We create on the fly the directory that is going to store the log files, pass that directory to the Docker container along with the name of the device, and let it do its thing. This is very similar to what we have done with KVM. We still have to create an image, download packages, and everything else; it just takes a bit longer, much longer than a Docker container, but it really does not matter, because it is automated.

The following graphs show the results of running FIO tests using libaio on just one single drive. Remember when I said before that there is a lot of data and a lot of tests to run? This is just one single drive, and this is just FIO using libaio, and it is not even the entire list: we have bandwidth, latency, and IOPS here, but half of the tests are not shown. Still, I really like this kind of graph, because it compares bare metal, Docker, and KVM, and it really shows to others the different outcomes that can be created using different settings. Yes, we could still do the whole "after carefully considering the answers provided by bone reading, we conclude the best performance is achieved by setting the block size to X on Tuesdays and Y for the rest of the week." But this is not really a hardcore technical talk, and I am not really concerned with tuning that one specific drive for the best performance and leaving it at that. I am going to be testing all the drives, I am going to be testing all the situations, and I am going to be adding other things, so I will have many more variables to deal with.

If you look carefully, you will also note that the chart and axis titles are not really consistent. That happened because of the way I made those graphs: I literally imported the data into LibreOffice and created the graphs by hand. So I had to rely on my lack of typing skills, instead of having some kind of smart script that takes the data, computes the average for each run, for each test box or drive or whatever combination you want, and then creates the graphs. Next time, I am going to see how to create proper, nice-looking 3D graphs using gnuplot, and I want them to look good enough to put in a technical journal.

This one shows similar tests, but here we are doing FIO with SPDK, and as we mentioned before we kind of expect an increase in IOPS and bandwidth. And if you look at the scales on the axes, you will see that we did get an increase.
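One thing I want out of gnuplot, by the way, is control over exactly this kind of scaling, so the charts can share a fixed axis and be compared side by side. A tiny, made-up example, where the CSV layout and the range are assumptions:

    # Pin the y-axis range instead of letting every chart autoscale;
    # the file name and column layout are made up for illustration.
    gnuplot -e "set datafile separator ','; set yrange [0:500000]; set ylabel 'IOPS'; set term png; set output 'iops.png'; plot 'results.csv' using 1:2 with linespoints title 'bare metal', '' using 1:3 with linespoints title 'KVM'"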
About those graphs: I personally kind of do not like the autoscaling, but I do not know what to do about that yet; maybe once I start using gnuplot I will be able to control it so you can actually see the difference. There is, by the way, an article from Intel, which I will put at the end of the slides, stating that the performance obtained with FIO plus libaio is less than that of FIO plus SPDK, which in turn is less than running the SPDK performance tool by itself. I could have shown the other drives, but I am not going to, because I have not automated the plotting part and I am not going to do it manually. No, thank you. Also, now that this setup is stable, I want to start focusing on the effects that different settings, say huge pages, CPUs, and NUMA placement, have on performance.

But remember, this talk was never really about the data itself, as in "here are the results that were obtained" and so on. This was all about setting up the experiment and automating it so we can do some real experimenting. And that is what we are talking about here: make your life easier, make your experiments reproducible, and be lazy. Create pre-built images for Docker and KVM; it is nice. Have a script do the actual heavy lifting.

I would like to take the opportunity to thank the Linux Foundation for the chance to participate in the KVM Forum. And here are some useful links. You can read the 102-page SNIA specification on how to properly test a solid state drive. There are the FIO and SPDK links. There is the Intel article I talked about, and FABRIC, which is the project I am building these test tools for. And finally, the last one is the link to my finddrive script. The torture disk script is also available, but I am kind of ashamed of it; the code still needs a lot of cleaning, so I am not putting the link here, even though you can go to my GitHub and find it. Just don't tell me you did.