I think I've emphasized it enough, but if I haven't: it has to be reliable. If it's not reliable, people get angry; anger leads to the dark side, and you know the rest of the story. It has to be easy to understand, so when a developer gets the "job failed" notification from CI, they have to easily understand what the test was trying to do and how to debug it. If it doesn't meet either of those criteria, developers get upset, upset leads to anger, anger to the dark side, and so on. And it has to be fast.

Before we get to the fast part, I want to dwell on this a bit, because I've heard several talks in the last few days about CI, about how great the unit tests are and how everything runs in parallel and everything's fine. It's not that easy to make tests reliable. For example, what happens when a test only sometimes fails? So again, I'm emphasizing reliability; we're not going to talk about it in depth, I just want to stress that it's very, very important. And in the context of doing things fast, remember that speed is not worth the effort if you sacrifice reliability, debuggability, or simplicity.

So what can you do in 20 minutes? This is what the oVirt system tests can do in 20 minutes today. From scratch, they install a complete oVirt environment, starting with the engine. The engine is a JBoss application, so that brings in 300-something RPMs, including the engine RPMs, and it installs all of them. It then continues and configures the data center, the clusters, and the hosts; the hosts are, again, a yum install of tens of packages. After installing them, it adds the storage domains. The storage is actually configured as part of the tests: we set up real iSCSI and NFS targets, and then add those targets as storage domains to the oVirt system. We then continue and add a quota to the storage. We also run the log collector, because part of the basic sanity check has to be that log collection works.
Because if, at the end of the day, QE find issues and can't give us the logs because the log collector failed, that's a failure of a basic requirement. It continues on: it does some networking, adding networks and testing them, with VLANs and without, attaching and detaching. Then, since oVirt is a virtual machine management system, it's all about managing VMs: it creates a VM, configures it, adds a disk, a network interface, a console. It creates a template from a VM. It launches the VM and migrates it. It hot-adds a disk, hot-adds a network interface, live-migrates it. It takes a snapshot, a live snapshot actually, and merges the snapshot. And there are several tests I'm forgetting.

All of that runs on my very small desktop. Today I can do it in 18 minutes, and only because of a bug in one of the tests; it should have been faster. It's about 24 minutes on the laptop you see here, which has an SSD and 8 GB of RAM. So that's how fast it can actually go.

How can you do it? The very basic rule is that you can do anything you want as long as you don't cheat. Cheating, in my opinion, is anything that really changes the rules of the game and turns a real end-to-end test into something different. For example, if you pre-install the dependencies, you miss some of the tests, because in reality dependencies sometimes break: dependencies go missing, or are wrong, or the dependency chain is broken, or whatnot. Or if you mock something: mocking is great, in unit tests that's all we do, but here we need a real iSCSI server, a real NFS server, for example. We want the scenarios to be, as much as possible, real end-to-end, real-life scenarios. So, as I said, bend the rules, and I gave some examples.

It's a good question whether you want to add randomization to the tests or not.
We do want some randomization, because we don't want to test the same path over and over again; we want to expand the coverage a bit. However, if you want to be able to replay a run, that gets tricky. Yes, we could save the seed and so on, but it's tricky. So it's an open question whether to introduce randomization into some of the tests.

Why is speed important? The basic assumption here, and let's not tell everybody, is that the CI is rock solid: it either passes, or it fails because of a bug, not because of anything else. Given that, it has to be fast, and there are multiple reasons why. Developers are impatient; everything they didn't write themselves is, of course, slow. That's one reason, and we want to make developers happy, for the reasons I mentioned earlier. At least in the project I'm working on, there's a rebase race: multiple people submit patches, possibly in conflicting areas, and now they're waiting. They're waiting for the unit tests, for the static analysis, for more tests, and before everything passes, someone else pushes their code and they need to rebase. It's an almost inherent race, and it's really, really annoying for developers. So the faster it goes, the better.

And if you want to make it a gating item — I heard yesterday, I think, that OpenStack, for example, is pushing 250 patches a day — then either you have a huge cloud, huge resources that can handle that load, including draft patches and work-in-progress patches on multiple branches and whatnot, or it's just a matter of resources. If your tests are fast and don't consume a lot of resources, you can run them more and more often. These are the main reasons why we want those tests to be as fast as possible.
People told me it should maybe be a rectangle, but I drew just a triangle. Like any other program, you will hit one of these bottlenecks. You may argue that network and disk are different; I lump them together as I/O. The others are memory and CPU. Your test program is going to be bound by one or more of them, and we need to work around them or wait. The techniques I'm going to show you mainly deal with those bottlenecks. Any questions so far? Am I speaking too fast? I see smiles. OK.

So how do we optimize? The same way we always do: CI tests are just code. Ours, for example, are written in Python, but I've seen tests written in Go and other languages. There's no magic here. I'll explain later why you shouldn't micro-optimize — don't micro-optimize — although I tried last night, because it's fun and you think you'll gain a lot, but no, you don't. The rule is simple. There are two things to deal with: either the slowest part of your test code or infrastructure, or a part that may not be the slowest but keeps repeatedly doing something in the background. For example, in our case we push API requests and wait for a status: we start a VM and wait for it to be up and running. Maybe we can optimize that waiting, which takes some amount of time.

I was talking about micro-optimization; here's an example. Do you want to shave five seconds off a suite that takes 20 minutes? If you look at the numbers, they aren't very optimistic: 5 seconds out of 1200 is less than half a percent improvement. However, think about it: you're going to run this CI test hundreds or thousands of times a day. If you do it correctly, on multiple branches and multiple patch submissions, it adds up to a lot of time, and every second you save is a good save. If the changes are simple, well understood, and risk-free, that's great.
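As a concrete taste of such a simple, risk-free change, removing a bootloader's menu wait is roughly a one-liner. This is an illustrative sketch with typical CentOS/RHEL 7 paths, not the project's actual script; verify the paths on your own images.

```shell
# Stop GRUB from pausing at the boot menu on every boot of the test VM.
# /etc/default/grub and the grub2-mkconfig output path are the usual
# CentOS 7 BIOS locations; adapt to your distro and boot mode.
sed -i 's/^GRUB_TIMEOUT=.*/GRUB_TIMEOUT=0/' /etc/default/grub
grub2-mkconfig -o /boot/grub2/grub.cfg
```

Since nothing in an automated run ever looks at that menu, the saved seconds are pure gain.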
I'll give two very, very simple ones. I removed the timeout from GRUB: it normally waits a while for you to decide which kernel you want, but this is completely automated, so no one is actually sitting at the GRUB menu. And in libvirt, I simply disabled the VM's boot menu, because it always boots from the first disk anyway. Those are two independent five-second savings, so ten seconds saved.

I'll talk about self-containment soon. The reason for it is that it gives you, first of all, high performance, and second, independence from external resources, which again leads to more stability, but also to higher performance.

We use virtual machines in our tests, and actually we use nested virtual machines. Our project is all about virtual machine management, the whole lifecycle: deploying hosts, configuring the network and the storage, running virtual machines after everything is configured correctly, live-migrating them, and everything else. So we actually need nested virtualization. I'm not sure how many people are familiar with nested virtualization; it's a pretty cool technology that probably has no use except testing. Maybe there's another use I've never heard of, but it's very nice. There is one physical host, and you only need a single physical host. On top of it, we provision two virtual machines. Those two virtual machines, the L1 level, are the hosts in our system, and using nested virtualization they can actually run virtual machines on top of themselves. To the system they look like physical hosts, with almost all the attributes we need, and they can run virtual machines — but they're just virtual machines.

Lago. We use Lago. There was just a presentation about Lago, how useful it is and what it does, so I won't go into the Lago framework.
I'll just say in a few words that it's a framework that lets us run virtual machines, set everything up beforehand, run our tests, and collect the logs. It's an open source project. We started it two years ago or so, before Vagrant was really popular, and Vagrant at the time didn't have nested virtualization. If I were doing it today, maybe I would have started clean with Vagrant, but we didn't; this is what we have today, and it serves us very well.

So, faster storage. The first thing you want to handle is storage. Storage is usually your biggest bottleneck, the single slowest item in your system. My solution: use memory. People are afraid to use memory, and they shouldn't be. If you misuse it, you crash your system; but if you don't, you really gain. Nothing is as fast as your memory. Even on my 8-gig-RAM machine I can use some of the memory as storage. It's very, very simple: I just put everything I can under /dev/shm. If you don't have enough, you can use zram. I'm not sure how many people are familiar with zram; it's essentially a driver that lets you create block devices that are compressed in memory. So for example, on a 16-gig-RAM host that also needs to serve the virtual machines, I can create a 10- or 12-gig disk, depending on the test, that actually lives in memory, compressed. The performance cost of the compression is nothing compared to using a real disk, even an SSD. That's a great thing to have. And even if you do end up swapping, you can use zswap. Zswap tries to compress pages before putting them into swap, and if it has no choice and must evict them to disk, they're compressed, so they're hopefully faster to read back. I got mixed results from zswap.
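Setting up a compressed-in-RAM block device with zram, as described above, can be sketched like this. The size, filesystem, and mount point are illustrative assumptions; run as root, and note that the compressor must be chosen before the size is set.

```shell
# Create a block device whose contents live compressed in RAM.
modprobe zram                                  # load the zram driver (zram0 appears)
echo lz4 > /sys/block/zram0/comp_algorithm     # pick a fast compressor (if built in)
echo 12G > /sys/block/zram0/disksize           # logical size; real RAM use is the compressed data
mkfs.ext4 /dev/zram0                           # put a filesystem on it
mkdir -p /var/lib/testdisks                    # hypothetical location for VM scratch disks
mount /dev/zram0 /var/lib/testdisks
```

Because only the compressed pages occupy RAM, a 12 GB "disk" of mostly compressible data fits comfortably on a 16 GB host.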
Still, zswap seems to work fine overall. Now, the trick is that you can use /dev/shm both inside the VMs and outside them, which is just as good. Here are examples of what I did inside the VM. When you install with yum, yum downloads all the packages to disk, then extracts them, then saves some metadata, headers, and so on. You can do all of that in RAM; you just need to remember to delete it afterwards. After that, the packages install pretty fast, now that they come from RAM. The main bottleneck used to be downloading them, which we'll get to in a second, but the second bottleneck was actually expanding them: decompressing and installing them to disk. Done in RAM, it's much, much faster. It also conserves disk space. Why do you need to conserve disk space? Recall that I put the virtual machine disks in RAM too, and I don't have a lot of RAM. If I can conserve some of that space, it's very beneficial and lets me use the RAM fully without swapping.

So how do we create the images? The images are standard qcow2 images, and we build them with virt-builder. We have a sub-project of Lago called lago-images which prepares the images for us: a pretty standard script, nothing very different from most image-building scripts you'll see today. Lago takes care of downloading those images, and it keeps a hash, so it caches them. The first time you run, you may spend, I don't know, half an hour downloading images; the second time, they're already on your computer. Again, if you have enough RAM, they can live in RAM: on a large machine, say 32 gigs of RAM, I can put even the base images in RAM. To make sure we do it cleanly, we take those base images, create a snapshot of them in RAM, configure them with virt-sysprep, and execute them.
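The snapshot-in-RAM step can be done with a qcow2 copy-on-write overlay; here is a rough sketch, with illustrative paths (the `-F` backing-format flag needs a reasonably recent qemu-img; older versions use `-o backing_fmt=`).

```shell
# Keep the pristine base image on disk, put a thin copy-on-write
# overlay for each VM in /dev/shm. Writes land in the overlay in RAM;
# the base image is never modified.
qemu-img create -f qcow2 \
    -b /var/lib/lago/store/el7-base.qcow2 -F qcow2 \
    /dev/shm/host0.qcow2
# Boot the VM from /dev/shm/host0.qcow2; deleting the overlay returns
# you to the exact pristine state for the next run.
```

The overlay starts at a few hundred kilobytes and only grows with the writes the test actually performs.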
So the advantage here is, first of all, that you're again independent: once you have the pristine images, you can always return to that exact state. And second, it's quite fast.

We use virtio-scsi wherever we can. The reason is that, theoretically, I've heard it's faster. I didn't see it being any faster than virtio-block, but I really like discard. Discard, for anyone not familiar with it, is a SCSI feature where you can ask the device to discard blocks that are no longer in use. That's a nice feature, because remember, the disks are actually in RAM, and I don't have a lot of RAM. If, for example, an RPM install extracts a lot of stuff, it extracts it in memory and whatnot; then I issue a quick fstrim command and reclaim some of that space as memory. So that's one of the benefits of virtio-scsi.

Repositories are a very delicate issue. Lago actually reposyncs all the packages we need for the installation and creates one big local repository. Again, if you have the RAM, put it in RAM, and install only from it; ideally, you can disable all the external repositories. This has many advantages. Hopefully you have the packages you need and you're aware of the dependencies, and you don't need to fetch the metadata from whatever mirror is fast right now — or not so fast, or has a missing package so you go to another mirror. Fetching the metadata from the internet is slow, generally. Even if it's only 20 or 30 seconds to get all the metadata, it's still very slow compared to a local repository. So that's very nice: you're independent of those external repositories. But there is one huge downside: you need to keep the list of packages up to date. That's a total nightmare, and to be honest, we gave up on the idea.
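Mechanically, the local-mirror approach is straightforward. A rough sketch — the repo id, paths, and package name are illustrative, and `reposync`/`createrepo` come from yum-utils and createrepo respectively:

```shell
# Mirror the packages once, then install with all metadata served locally.
reposync --repoid=base --download_path=/dev/shm/repo     # one-time (slow) mirror
createrepo /dev/shm/repo/base                            # build local repo metadata

cat > /etc/yum.repos.d/local.repo <<'EOF'
[local]
name=local mirror in RAM
baseurl=file:///dev/shm/repo/base
gpgcheck=0
EOF

# Every install now touches only the local repo: no mirror lottery,
# no metadata fetches over the network.
yum -y --disablerepo='*' --enablerepo=local install ovirt-engine
```

The fragility the speaker mentions lives in the first step: the mirrored package list has to track upstream repo reshuffles.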
We used to disable the external repos, for a long time, but we just gave up. Just as an example, CentOS, between 7.2 and 7.3, put all the updated packages in the base repository, whereas in the past they were in the updates repository, so we had to rethink the whole thing. You could probably script it, but we haven't gotten around to it.

So how do we run the VMs? We run them using, hopefully, best practices. We don't need a lot of the devices, so we remove as many as possible: we don't need a graphics device. We do want the RNG device, which helps performance. We use virtio-scsi, and when we don't need it, we don't have a SCSI controller at all. We use I/O threads. Hopefully these are the best practices for how libvirt is supposed to run VMs.

Parallelize your tests — that's the most important thing here. People are afraid to run tests in parallel. First of all, in real life, customers and users of your project run things in parallel; they don't do one thing at a time. Parallelism adds complexity, that's the downside, of course, and we need to weigh that in. But usually a test does something and then just waits; there's usually no real bottleneck, just a sleep here, a sleep there, maybe the CPU is working while the disk isn't, and so on. So I try to parallelize as much as possible, and there are many things you can do in parallel. For example, when we add multiple storage domains, we add them in parallel: simple Python threads that just do all of them at once. If one of them fails, yes, it's a bit harder to debug, but still — in CI you hope things are usually stable and usually pass — and it reduces the time considerably. Likewise, a live snapshot on one VM and a cold snapshot of another disk on another VM can run in parallel; they're completely independent. It adds complexity, and when there's a failure you need to understand where it is, but it works, and when it works it really saves a lot of time.
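The "simple Python threads" pattern for adding storage domains in parallel could look roughly like this. The `add_storage_domain` helper is a hypothetical stand-in for the real API call; the point is the fan-out/join shape and recording failures per domain instead of killing the whole suite.

```python
import threading
import time


def add_storage_domain(name):
    # Hypothetical stand-in for the real API call: submit the request,
    # then poll until the domain reports "active".
    time.sleep(0.1)  # pretend this is the slow server-side operation
    return 'ok'


def add_in_parallel(names):
    """Add all storage domains at once; return per-domain outcomes."""
    results = {}

    def worker(name):
        try:
            results[name] = add_storage_domain(name)
        except Exception as exc:      # record the failure, don't crash the suite
            results[name] = exc

    threads = [threading.Thread(target=worker, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = add_in_parallel(['iscsi', 'nfs', 'export'])
assert all(r == 'ok' for r in results.values()), results
```

With three domains that each take the same wall-clock time server-side, the fan-out makes the whole step cost roughly one domain's worth of waiting.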
I wish there were something even more advanced that would let me run completely independent tests alongside each other without me manually deciding, OK, these two threads might be related, so let's run them together as one unit, and so on. But that's what we have today.

Diversifying is also important. Try not to do the same thing over and over again; try different configurations. For example, in our test suite we have two VMs, configured as differently as possible: one uses an e1000 network interface, the other virtio; one disk is virtio-scsi, the other virtio-block. You gain more code coverage simply by not repeating yourself. It adds more in the same amount of time, essentially; it doesn't take longer to configure things a bit differently.

Positive flows only. This one is questionable, because the most interesting stuff is actually the negative flows. Negative flows are hard for two reasons. First, they're always slow: until the application realizes something bad happened, until it realizes it needs to recover, and then the recovery itself takes time, it's difficult. Second, if things fail, it's difficult for the test itself to recover from the negative scenario. I believe you should concentrate mostly on the positive test cases, but it's debatable.

OK, what could we make faster? It's not just a wish list; these are things we could probably do. My main concern is yum; I wish yum were much, much faster. I know there's work on DNF to make it faster, to download faster, and so on. But generally, when I'm installing 300 to 500 RPMs, surely there are many RPMs that could be installed completely independently — they have no dependencies between them. I'm sure some dependency-graph magic could be applied to install in parallel whatever can be installed independently. So that's the first item I can think of.
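The dependency-graph idea can be sketched as: repeatedly take every package whose dependencies are already installed and install that whole "wave" in parallel. This is a toy illustration — the package names and dependency graph are made up, and real package managers have far more constraints (scriptlet ordering, file conflicts, shared databases).

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up packages and dependencies, purely for illustration.
DEPS = {
    'engine':   {'java', 'postgres'},
    'java':     set(),
    'postgres': set(),
    'vdsm':     {'libvirt'},
    'libvirt':  set(),
}


def install(pkg):
    return pkg  # stand-in for the real (slow) rpm installation


def install_in_waves(deps):
    """Install packages wave by wave; each wave runs in parallel."""
    done, order = set(), []
    while len(done) < len(deps):
        # everything whose dependencies are all satisfied forms one wave
        wave = [p for p, d in deps.items() if p not in done and d <= done]
        if not wave:
            raise ValueError('dependency cycle')
        with ThreadPoolExecutor() as pool:
            list(pool.map(install, wave))   # independent installs, side by side
        done.update(wave)
        order.append(sorted(wave))
    return order


print(install_in_waves(DEPS))
# → [['java', 'libvirt', 'postgres'], ['engine', 'vdsm']]
```

The first wave is all the leaves of the graph; each later wave is whatever those unblocked. With 300-500 RPMs, the waves would typically be wide, which is exactly where the parallel win would come from.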
Of course, throwing more hardware at the problem is a good solution; a faster CPU and more memory would solve a lot of the issues, but only up to a limit. We're using Nose, and I'm not sure it's the best testing framework. I know there are better frameworks; I'm sure there's something that could handle parallelism and, again, a dependency graph of the tests, and run them faster. So that's one area we could improve.

Micro-optimization. I talked a lot about not doing it, and of course I do it all day long. Most of the time it didn't really help. I tried a lot of small tricks that were supposed to make a difference, and — because it's more of an art than a science, I'm not rigorously measuring each one, rebooting my laptop, running it three times, discarding the outliers and so on — I didn't see much of a difference from them. Again, because I'm writing into memory, storage is rarely my bottleneck anymore, and a few of the optimizations I tried are storage-oriented; storage was my main bottleneck in the beginning, but it isn't now. Maybe that's why they didn't make a difference.

If you haven't listened so far — and that would be sad — remember these three items. Fast is nice. It's important, and you get great results: again, in 20 minutes we test, end to end, lots of functionality of KVM, of libvirt, and of oVirt, of course. But it has to be reliable. If it's not reliable, developers hate you: it fails, they click retry and retry, and that just overloads the CI system. Second rule: use /dev/shm. People don't use it enough. As an example, I heard a talk about running OpenShift on RHV, where they had the cheapest storage they could buy, which obviously was the bottleneck. But on that 2.5-terabyte system there was almost 1 terabyte of free memory, which they could have used as scratch disks for those VMs. There's nothing wrong with that; it's just a scale test.
If anything goes bad, you just redo that part; it's all automated anyway, there's nothing manual in the process. So they could have used it — use memory more often. Last but not least, parallelize your tests, and keep them simple. Try not to make them complex: if you make a mistake and they get complex, they're fragile; if they're fragile, they break; developers hate you, and so on. Questions? Yes?

[Inaudible audience question.] Apart from a small bug, or a missing feature where we use host CPU passthrough and not everything worked quite as we hoped, it seems to work fine. I think the majority of our hosts are Intel; I'm sure we have it on AMD too. I know that when there are bugs, the QEMU/KVM people look into them and fix them. So far it has proven, touch wood, very, very performant; we haven't hit any bugs. I would say it was surprising to see KSM on the host and KSM within the level-one host competing with each other, trying to merge pages. That might have been an issue, but it hasn't been an issue for us for a long time. No, no, we're just using x86_64, and you may be right: on another platform it may not be as mature. Yes?

[Inaudible audience question.] Yeah, so I would think QE should run these on a nightly basis, on a nightly build maybe, or on every delivery; they cannot run on every patch. CI, if it's fast and efficient, can run on every patch and be a gating item. If it's slow — and again, what's slow? Anything beyond, say, two to three minutes is slow. And if something else can run in parallel, that's fine. Maybe you can run a negative scenario alongside a completely independent scenario. I'll give an example: if something goes wrong with storage, but at the same time you're just testing that you can log into the UI, which has nothing to do with storage, that's fine; you're not wasting a lot of time. So you can make that compromise. Of course, if there's a bug, then who needs to debug it — the storage guy or the UX guy, who are on completely different teams?
So this is really the question.