[Pre-talk chatter in Czech, largely inaudible.] Real time is a little bit of a difficult topic sometimes. The most important thing that you have to understand about real time is that in a real-time system, what really matters is the maximum time that an operation takes. And this maximum time is known up front, and it's guaranteed to be the same even if the system is under very high load. So, for example, a thread wakeup. Suppose you have a thread waiting for something.
For example, a packet to arrive, or an alarm to fire. And then the packet arrives or the alarm triggers and your thread wakes up. As it turns out, there are many things that can happen between the thread being ready to run and the thread actually executing on a CPU. On a real-time system, the time it takes for a thread to wake up is known up front. So I could tell you, well, in this real-time system it takes 5 microseconds for a thread to wake up. And even if the system is under very high load, it's going to be 5 microseconds. On a non-real-time system, you really don't know how long it can take. It could take milliseconds; you don't know, it's unbounded. And where are you going to use this kind of system? It's usually used for workloads where missing deadlines is bad. An example of this is telecommunication networks — you know, the networks that we use with our cell phones, so that your call doesn't break up. Vehicle control and avionics systems usually have to be real time. And stock trading systems are also usually real time. And how do you do real time on Linux? As it turns out, you need a patch. You need to patch the Linux kernel. And the patch is called the PREEMPT_RT patch, which means real-time preemption. We also call it the real-time patch, or RT patch, but it all refers to the same thing. So you take the kernel, you apply this patch, and the kernel becomes a real-time kernel. This patch has existed for many years; it's not new at all. And many features that the people behind this patch developed to support real time are already merged in the mainline kernel. Some of these features are: having IRQ handlers as threads, high-resolution timers, PI mutexes (which support priority inheritance), making RCU read-side critical sections preemptible, and having a deterministic real-time scheduler. Each of those features is very complex, and they either originated in the RT patch or were motivated by it.
Now there is one feature which is not in the mainline kernel yet. And as it turns out, we can say that it's the core feature that actually makes the kernel real time. Strictly speaking it's not a single feature — it builds on PI mutexes: in the real-time kernel, spin locks are converted into sleeping spin locks. In the real-time kernel, the spin locks don't spin, they sleep. And they support something called priority inheritance. This priority-inheritance thing is there to avoid having a high-priority thread waiting for a lower-priority thread. This is a problem that has to be solved for real time. Another thing I'd like to say about this patch is that it's not a simple patch. We call it a patch, but it's actually a patch set with more than 200 patches. The core thing there is this spin-lock conversion, but there are other fixes that are specific to real time and are not merged yet. So the important things to understand here are: you need this patch, and spin locks are different in the real-time kernel. So we have the real-time kernel — why do we have to make KVM real time? Why do we have to make KVM deterministic? The short answer to this question is that doing this allows you to have real-time workloads in your cloud, for example. You get all the virtualization features for your workload. But there is a more important reason, and it's the telecommunications industry. As it turns out, the telecommunications industry is about to go through a revolution, and the revolution is called network function virtualization. The telecommunication networks — the ones that are built so that we can use our cell phones — are built with proprietary hardware and proprietary software. This has cost issues, it has scaling issues, and the telco guys want to virtualize all the telecommunication networks using commodity hardware, open source, KVM and OpenStack.
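The priority-inheriting locks described above have a user-space counterpart in POSIX threads: a mutex initialized with the PTHREAD_PRIO_INHERIT protocol. A minimal sketch (the helper name is mine):

```c
/* _GNU_SOURCE exposes the mutex protocol API on older glibc. */
#define _GNU_SOURCE
#include <pthread.h>

/* Initialize a mutex with the priority-inheritance protocol: if a
 * high-priority thread blocks on the mutex, the current owner temporarily
 * inherits that priority, so it cannot be preempted by a medium-priority
 * thread (the classic priority-inversion problem). Returns 0 on success. */
int init_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int ret;

    ret = pthread_mutexattr_init(&attr);
    if (ret)
        return ret;

    ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (!ret)
        ret = pthread_mutex_init(m, &attr);

    pthread_mutexattr_destroy(&attr);
    return ret;
}
```

The RT patch applies the same idea inside the kernel, to the locks that spin in mainline.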
So the telcos are the most important driver for this feature. So what's real-time KVM? When we talk about real-time KVM, what am I talking about? I will tell you first what real-time KVM is not. It is not a new kernel module, or a module option that, when set or loaded, makes your VMs become real time. It is not that. There were many kernel changes that were needed, but they are all upstream, and just a few of them are KVM specific. Most of them are fixes to get the kernel out of the way of KVM. So those changes are upstream already. As long as you have a real-time kernel — when I say real-time kernel, I'm talking about the Linux kernel with the RT patch applied — in the guest and in the host, real-time KVM is really about two things. First, a BIOS configuration that is required. And second, a very specific host and guest configuration that you have to do. Let's take a look at the BIOS configuration. As far as the hardware goes, we have used a pretty standard x86 box so far. The only detail is the BIOS configuration. The BIOS has something called SMIs, or system management interrupts. Those are special interrupts, and they are triggered by the motherboard. When they trigger, the kernel stops executing — the kernel is halted — and the BIOS runs. When this happens, you are not real time anymore, because the BIOS runs in a mode called system management mode, and it can take milliseconds. You don't know how long it's going to take. So you lose the determinism that I talked about earlier. So you have to disable this. But the problem with disabling this is that there is no "SMI" option in the BIOS. As it turns out, any BIOS option that requires system management mode activates SMIs automatically. So how do you know which BIOS options you have to disable in order to disable SMIs? Some hardware vendors have a document, usually called something like "low-latency settings" for that system. This document usually has a list of options that you have to disable.
You disable those options, you are done. You won't have SMIs. If your vendor doesn't provide this document, then you have to find the options yourself, and it's not easy — it's very difficult. The PREEMPT_RT kernel provides a kernel module, plus a Python script for the user-space part, called the hardware latency detector. This module tries to detect whether SMIs are enabled. If they are, it will let you know — and then you are not done yet; you have to keep disabling options. Now, the bad news is that some systems are not fixable. They have hardwired SMIs. Those systems are not for real time. You cannot use them for real time. Once you have done the BIOS changes, the next part is the host configuration, the host setup. The first thing is that the host requires the real-time kernel, as I have already mentioned. And then you have to do something called host partitioning. Partitioning is the process of creating two groups of host cores. The first group is called real-time cores, and the second group is called housekeeping cores. The difference between these two groups is that the real-time cores are aggressively isolated. Being aggressively isolated means that they are going to run only two things: a single application thread and a few CPU-bound kernel threads. The housekeeping cores run everything else: user-level processes, kernel threads, IRQ handlers, everything else. Now I'm talking through the concepts, because they are a little bit difficult, and after I show you the concepts, I'm going to show you how to do this configuration. So let's see some diagrams. Suppose you have a host with two sockets, two NUMA nodes. Each NUMA node has four cores. To do the partitioning here, what we could do is take all the NUMA node 0 cores and make them housekeeping cores, and then the NUMA node 1 cores are going to be our real-time cores.
So we apply the configuration that I'm going to show you shortly, and then you are going to have this: housekeeping cores, as I said, and real-time cores. As it turns out, the guest requires the same configuration as the host. It requires the real-time kernel — I keep repeating this — and you have to create two groups of vCPUs. So you are going to have real-time vCPUs and housekeeping vCPUs. In addition to this, on the host, you have to pin each real-time vCPU to a real-time core and each housekeeping vCPU to a housekeeping core. Also, the vCPU threads — all of them — need real-time priority. And we also reserve huge pages for the real-time vCPUs. And then your application runs in the guest, of course, and the real-time threads are pinned to real-time vCPUs. Let's go back to diagrams to make this easier to see. Suppose I have a guest, and this guest has six vCPUs. The guest shown here is already partitioned: I have two housekeeping vCPUs and four real-time vCPUs. Let's do the pinning. After you do the pinning, you can see here that each housekeeping vCPU is pinned to a different housekeeping core, and each real-time vCPU is pinned to a different real-time core. And then your real-time application, again, runs in the guest, and the real-time threads are pinned to real-time vCPUs. [Audience question, partly inaudible.] No, you cannot do that — I'm actually going to talk about that now. As I said, the real-time vCPUs are aggressively isolated. That's the difference between the two. So if you have a thread doing something it's not supposed to do on a real-time vCPU, you're going to get spikes. You have to move it to the housekeeping cores. I don't know if that was your question. But maybe we can... OK: no, because if you have a thread migration — a task migration — between real-time vCPUs, that generates a spike. I have a list of things that generate spikes. So it has to be pinned. It's mandatory.
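As a rough illustration of the pinning and priority setup just described, a libvirt domain XML fragment for the six-vCPU example could look like the following. This is an assumption-laden sketch: the core and vCPU numbers follow the diagrams, and exact element support depends on your libvirt version:

```xml
<domain type='kvm'>
  <vcpu placement='static'>6</vcpu>
  <cputune>
    <!-- housekeeping vCPUs pinned to housekeeping cores (NUMA node 0) -->
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <!-- real-time vCPUs pinned to isolated real-time cores (NUMA node 1) -->
    <vcpupin vcpu='2' cpuset='4'/>
    <vcpupin vcpu='3' cpuset='5'/>
    <vcpupin vcpu='4' cpuset='6'/>
    <vcpupin vcpu='5' cpuset='7'/>
    <!-- SCHED_FIFO real-time priority for the vCPU threads -->
    <vcpusched vcpus='0-5' scheduler='fifo' priority='1'/>
  </cputune>
  <memoryBacking>
    <hugepages/>
  </memoryBacking>
</domain>
```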
So, sorry. As I said, the real-time cores or vCPUs are aggressively isolated. How do you do this? It's not very simple; there are many things that you have to do. The first of them is booting — it says "host" here, but it's actually both the guest and the host — with isolcpus and nohz_full. These are kernel features. With isolcpus, you pass a list of CPUs, and the kernel won't schedule user-level processes on the CPUs listed there. OK? Only kernel threads are going to run on the CPUs listed in isolcpus. The other feature is called nohz_full. This feature also takes a CPU list, and what it does is, if possible — if some conditions are met — it disables the tick, the kernel tick. The kernel has something called the tick, which is a timer that fires 1,000 times a second. It's used to do bookkeeping, and it's an opportunity for the kernel to, for example, decide that a process has run for too long and that another process now has to run on that CPU. But it causes 1,000 interrupts per second. If you have only one thread running on a core or on a vCPU, you don't need this interruption, and it's bad for real time. So with this feature, the kernel disables it. Another thing you have to do to completely isolate a core or vCPU is to move all kernel threads off that core or vCPU. The kernel has CPU-bound kernel threads; those cannot be moved, but they usually only serve the thread running on that core, so if your thread doesn't require any service from them, they are not going to run. You also have to move all interrupt handlers off those cores or vCPUs. And then you have to deal with something I'm calling "run-on-all-cores timers": some kernel subsystems create one timer per core, and those timers do polling — they run every few seconds or every few minutes — and they can interrupt your real-time application.
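To recap the boot-time part of this isolation: using the earlier example, where cores 4-7 are the real-time cores, the kernel command line could carry a fragment like the one below (the core list is from the diagrams, not a universal value):

```shell
# /etc/default/grub — then regenerate the GRUB config and reboot.
# isolcpus=4-7  : don't schedule user-space processes on cores 4-7
# nohz_full=4-7 : disable the periodic tick on cores 4-7 when only
#                 one runnable task is present there
GRUB_CMDLINE_LINUX="... isolcpus=4-7 nohz_full=4-7"
```

The same idea applies inside the guest, with the guest's real-time vCPU numbers.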
And if you set up the system so that those per-core timers cannot interrupt your real-time application, that work is going to starve instead — so you don't want either of these things to happen. Different subsystems have different ways to deal with this. The MCE driver has a kernel option for it. The slab allocator does this too, and the solution there is to use SLUB, for example. kvmclock does this too, but we have added a new option so that it doesn't. [To an audience member:] Sorry, but could you hold that for the end of the talk? I'd like to answer questions and make this a discussion, but I'm just afraid of the time — sorry about that. OK. So, as you can see, doing the partitioning and doing this aggressive isolation is not simple. It's not trivial, right? The good news is that we have automated all this stuff. The automation here has three components. The first component is called TuneD. TuneD is a tool that implements the concept of a tuning profile. A tuning profile is a file where you list tuning steps — things like writing to sysfs, writing to procfs, assigning real-time parameters to threads, and so on. When you activate a profile, the tuning steps are all executed and applied to the system. We have created two profiles, one for the host and one for the guest. What you have to do is install TuneD, install our profiles, and then, on the host and on the guest, pass a list of cores or vCPUs that should be real time. TuneD is going to do the rest for you. It's going to do everything I have described so far — mostly, actually, sorry; there are some details that it doesn't do. libvirt is the other tool that you need to get this stuff automated. libvirt does the vCPU pinning and assigns real-time parameters to the vCPU threads. And then there is OpenStack. OpenStack support is not finished yet.
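Going back to TuneD for a moment: once the profiles are installed, activation is roughly a one-liner per machine. Treat the profile and file names below as assumptions — they match the realtime-virtual-host/guest profiles later shipped in the TuneD profile packages, but check your TuneD version:

```shell
# On the host: tell the profile which cores are real time, then activate it.
# (File and profile names are assumptions; adjust the core list to your
# partitioning.)
echo "isolated_cores=4-7" >> /etc/tuned/realtime-virtual-host-variables.conf
tuned-adm profile realtime-virtual-host

# In the guest, the same idea with the guest profile:
echo "isolated_cores=2-5" >> /etc/tuned/realtime-virtual-guest-variables.conf
tuned-adm profile realtime-virtual-guest
```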
I'm going to talk a little more about OpenStack later, but when its support is ready, OpenStack is going to provision and manage real-time guests automatically for us. Now let me show you some testing results. There are two important details about the testing. We have used a tool that is very popular among real-time people, called cyclictest. This tool measures the thread wakeup latency that I described at the beginning of the talk. And the results shown here are against the RHEL kernel, just because that's where I work and where I do a lot of measuring, but all the changes we had to make to the kernel are upstream. So I do expect that you are going to get the same results on an upstream kernel. We have created three test cases. The first one is called single VM. In this test case, we create one VM with two vCPUs: one for real time and one for housekeeping. The host is fully set up — it's fully partitioned, as I said earlier. For this test case, we run cyclictest with one measuring thread for 24 hours, and the maximum result that we actually get is 11 microseconds. I forgot to say something: cyclictest outputs three values for you — minimum, average and maximum. But for real time, we care about the maximum. It doesn't matter much if your minimum is good. You care about the maximum; that's what real time is about. So for the single-VM test case, we get 11 microseconds. This means that in real-time KVM, a thread in the very worst case is going to take 11 microseconds to wake up. When we started this work, if you were running the real-time kernel without all this configuration, without the fixes we did, it was 80, 90 microseconds. And if you are not running the real-time kernel, it's milliseconds. The other test case is called the multiple-vCPU test case. In this test case, we create a VM with eight vCPUs: two for housekeeping and six for real time.
In this test case, we have six measuring threads running — waking up and being measured at the same time — in the VM. And then you get results for all your measuring threads. We have six of them, and the best result is 14 microseconds; the worst is 19 microseconds. Finally, we have a multiple-VM test case, where we create four VMs, each of them with two vCPUs: one for real time, one for housekeeping. So in this test case, we have four VMs running cyclictest, and they all do 12 microseconds. Now, to get these good results, there are some rules that the guest actually has to follow. So there are some limitations here. The most important of them is that the real-time vCPUs — not the housekeeping vCPUs, the real-time vCPUs — are not allowed to exit to user space. A vCPU thread usually exits to user space to do I/O. This means that a real-time vCPU can only do one form of I/O, and that is network I/O with virtio and vhost, or with device assignment. Otherwise, the real-time vCPU cannot do any other form of I/O. So a question that people usually ask is: OK, how do I do block I/O? Well, you can use the housekeeping vCPUs. You can pin a thread to a housekeeping vCPU to do the block I/O. The housekeeping vCPU can exit to user space. But your real-time application cannot wait for it to complete, because then you are adding non-deterministic behavior. It has to queue the I/O and then forget about it. Another detail is that so far we have used very minimal guests. On bare metal, it is known that some hardware generates latency spikes. We don't know if this is going to happen with guests, but we decided to create very small guests. So our guests so far don't have USB, don't have a sound card, don't have a graphical display, don't have additional PCI slots. They are very small in terms of hardware.
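For reference, a cyclictest run similar to the single-VM case might look like the invocation below. The flags are standard cyclictest options; the exact command used for the talk's numbers isn't shown, so treat this as a sketch:

```shell
# Inside the guest, targeting the real-time vCPU:
#   -m    lock memory (mlockall) to avoid page faults
#   -p95  run the measuring thread at SCHED_FIFO priority 95
#   -t1   one measuring thread
#   -a1   pin the thread to CPU 1
#   -D    test duration
cyclictest -m -p95 -t1 -a1 -D 24h
```

At the end it reports minimum, average and maximum latency; as the talk stresses, the maximum is the number that matters.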
Also, there are some operations that, if performed on the host or in the guest, are going to generate a spike — they cause latency spikes. For example, CPU hotplug: you cannot do CPU hotplug or unplug on the host or in the guest. Loading or unloading kernel modules also used to generate spikes, but I think this is getting fixed or is already fixed. This was your question earlier, I guess: we cannot have task migration between isolated vCPUs or isolated cores, because this generates a spike, so threads have to be pinned. Page faults and swapping are also disallowed in real time, because they generate spikes. And the host has to use a stable TSC; this is required. This is my last slide: work that is still in progress. There is a new feature in Intel CPUs called cache allocation technology. This feature allows you to reserve a portion of the L3 cache for threads. As it turns out, we need this for real-time KVM, because with this feature we can have real-time vCPUs and housekeeping vCPUs sharing the same socket. Otherwise, if you don't have this feature, an application running on a housekeeping core could thrash the L3 cache, and this would generate a spike. This allows you to reserve a portion of the L3 cache for the real-time vCPU — and, in the end, for the real-time thread. Patches have been posted for this feature, but they are still under discussion for the kernel; this is a kernel feature. The other things we are doing are measuring network I/O latency and doing migration measurements. For the network I/O latency, we are trying to create a test case that is probably going to be useful for telecommunication networks. We create a test case where we have a VM running a DPDK application that does packet forwarding, and we measure what the latency is for a guest doing packet forwarding.
For this test case, we are using OVS and DPDK on the host, and as I said, the application that does the forwarding is a DPDK application. And finally, there is OpenStack support. I am not completely up to date on this, but a lot of the features that we need in OpenStack are already merged; there are some bits missing, and this is being worked on right now. So that's it. I hope you enjoyed it. I know it's a bit of a difficult topic, but that's it. Do you have questions? A lot of questions. [Inaudible.] Go ahead. It is wakeup latency. We are measuring how long it takes when a thread... What cyclictest does is this: a thread sets up a timer; when the timer fires, the thread wakes up. So it measures how long it took from being woken up to actually running. And this is 11 microseconds. That's a very good question. [On bare metal] it's one microsecond, two microseconds. Yeah. We cannot expect — at least not today; maybe in the future, with more CPU features — that this test case will perform the same in a guest. Actually, this is not even the goal. And this is interesting: our goal is not for the guest to perform the same as bare metal, for now, because our goal is to satisfy the latency requirements. So if the latency requirement is 20 microseconds, we are good. It doesn't matter how much better the host does. So that's a good question and a good point. [Inaudible.] I don't know who raised their hand first. So the short answer is no. I read about this, and for very simple real-time systems there was a professor, in the US I guess, who was advocating trying to prove mathematically that the latency cannot be higher than some value. But for a complex system like this, I would say it's really impossible. The testing that people do is to run the test case for three, four days, and if you don't get a spike, they say, well, that's probably good.
On the one hand, it's not mathematically proven, but on the other hand, most real-time applications don't take a very, very complex code path. It's usually not extremely complex, so you usually know more or less which code path the application is going to take. And cyclictest — we run it millions and millions of times. It's a loop, and it takes the same code path. On the one hand, cyclictest is too simple; but on the other, with real-time applications you kind of know the code path they are taking, and the test case usually stresses this code path. So if you run this millions, if not hundreds of millions of times and nothing bad happens, you assume, to a certain degree, that it's not going to happen afterwards. [Inaudible.] I don't know how we are on time. OK, so we have time. OK, I know this guy there, I remember him. No, not for now. We use this test case — we use cyclictest — because it's kind of the easiest way to get started. And in my opinion, you kind of have to do this first: you have to have good cyclictest numbers. But the truth about cyclictest is that it's not real — it's not what a real-time application will be doing. The networking, DPDK, et cetera stuff — we are doing this because it's what the people who are going to use this now actually need. That's why we are working on it. Now, one important thing about real time is that it's not like you install it and you can run any application in real time. It's really not about that. Real time is really about specific real-time applications. So if there is the need, then we do it; if there is not, then we don't. I don't know if I am answering your question, but we are focusing on the requirements we have now. About hardware acceleration, for example: if it turns out to solve some of the problems we have, then it might be worth looking at.
But sometimes — I don't know if this was clear — real time is not about being fast; it's about being deterministic. You could do something that is faster and improves the numbers but is not deterministic. We run it once and we get one microsecond, because it's accelerated; we run it again and it takes 20. Not good for real time. OK, you, I guess. Go ahead. [Audience question about debugging latency spikes.] That's the hardest, hardest, hardest part. Myself, what I do is: I have some trace points. I use ftrace, the kernel tracer. I have some trace points that are a little bit optimized for cyclictest — I know the code path that cyclictest takes — and when there is a spike, so instead of getting 11 I get 20, I look at my tracing and it tells me more or less where the spike occurred. Then I have to add more trace points to start narrowing down the area. And eventually I find it — and by then it has already taken three days, four days to find a spike. And usually it's a bug. For example, for isolcpus, we had a bug where you isolated a CPU, but the kernel was still running code to see if it could migrate tasks to that CPU. But you don't need that: you isolated it, you are not going to run anything else there, so the kernel is not supposed to be doing this. So this was fixed, and that was worth about 12 microseconds. OK, go ahead. We are testing on x86 for now. This is kind of new work — it's a little more than one year old. But I have been talking to a guy who works with ARM, and the telcos are interested in ARM. He told me that at some point real-time KVM on ARM may happen, but it's still some way off. Sorry, you had a question?