All right, ladies and gentlemen, thank you for coming out today to hear my friend, Tejun Heo. Today he's going to be talking about control of major resources in cgroup v2. What he really does is maintain the cgroup subsystem and its resource controllers, and today he's going to give us a presentation on what that's all about. Thank you.

I think this is for questions. So, yeah, thank you. I think I can just leave it on the table. So, I am Tejun Heo. I work for Facebook. I've been working on cgroup for way too long now; it's kind of getting tiring. Who here has heard of cgroup? Awesome, yeah. Who has actually used it? Who has actually used cgroup v2? Oh, yeah, you're awesome.

So, cgroup v1 dates way back. I think it's close to a decade now, maybe eight years, I don't know. The thing with cgroup v1 was that it was pure, simple, and flexible; in that sense it was kind of beautiful. You just have multiple trees of all the threads in the system. You can have as many as you want, and you can organize them whatever way you want, and then we tried to put resource control and accounting on top of those trees. It looked like we had a really flexible mechanism, we had certain vague goals about resource control, and we thought it would work out. Turns out it didn't really work out. That's why we started v2.

The thing with v1 was that we really didn't know what we wanted. We had certain ideas: yes, we need something hierarchical; and threads are pretty important in the system, so they should be the cost entity. We started there but didn't quite see where it led. With v2, we started with a much more concrete goal, which is a single sentence: comprehensive, hierarchical control of all significant resource consumption in the system. It's a fairly ambitious goal. "Comprehensive", "all": those are big words. But this is where we wanted to get to; this is the goal, and all the design decisions are geared toward achieving it. Which is different from v1. We really didn't know what we wanted to do with v1; with v2, this is what we ultimately want to achieve.

There are a lot of details going into that, but the biggest part, one of the most fundamental changes going from cgroup v1 to v2, is the concept of resource domains. And it's kind of tricky: when you use cgroup v2, you really don't see them, and once you get used to cgroup v2, there's no reason to notice them. They're not immediately noticeable in the user interface. But compared to v1, v2 comes with a set of constraints on what you can and cannot do, in terms of structuring the resource hierarchy and in terms of where things can go; it comes with restrictions. Those restrictions are there primarily to enable resource domains. I'm not going to go into too much detail, because it kind of gets boring, but I'll try to look at the big picture around them and how they're useful in achieving comprehensive resource control overall.

So, resource domains. Say you have your machine, a laptop or whatever, you boot it up, you do something on it; let's say you're copying a big file from a source file to a destination file. Then you have your cp process.
And then the system is doing a bunch of things, whatever they may be. It's reading the file, allocating memory, filling the content into memory, and writing it out, and then something kicks in in the operating system and eventually writes that down to the hard drive. All of those things happen, and some of them are directly linked to your process: reading the file content, that's your process doing it synchronously, that's your operation. Others might not be. Say you build a network application: sending a packet can be chained to you fairly closely, but receiving a packet is different. When the operating system receives a packet, it doesn't know who it belongs to yet; it's just system consumption. Eventually it gets routed to your packet queue, your socket, but for a while you really don't know. So there are a lot of different types of resource consumption in the system. On your laptop there are processes, and if you run top there's sys, there's interrupt time, which are not really well classified, and there are kernel threads doing something; they also consume a lot of resources. A resource domain, conceptually, is what contains all these resource consumptions. And without cgroup, a single system, a laptop, a phone, is a single resource domain; it's not split up. That's the second point. And third, in cgroup v2, resource domains do not nest: a resource domain doesn't contain other resource domains. I'll get to that later.

So here's cgroup, simplified. This is not the actual hierarchy; you can build a more flexible hierarchy, but the constraints in cgroup v2 allow the current implementation to map whatever hierarchy you build into something like this internally, and that's what enables the benefits of having these restrictions. Another thing is that, unlike v1, in v2 you only have a single tree. In v1, on a systemd system, you have cpu,cpuacct, you have memory, you have io; those are all separate hierarchies. And while on a lot of systems they're kept in about the same topology, they don't have to be; they can vary wildly. On cgroup v2 you have a single hierarchy. You can enable controllers at different levels, but they still share the tree: your IO controller and your CPU controller, different types of resources, still look at the same resource tree in the system. That's one big difference. The other, as I said before, is that resource domains do not nest, which means there's a difference between the internal nodes of the tree and the leaf nodes. In cgroup v2, the leaf nodes in the graph, all the Ds, can be resource domains. They don't necessarily have to be, but they can be. The internal nodes, all the Is, cannot be resource domains. They cannot contain actual resource consumptions. Only the leaf nodes can contain processes and actual resource consumptions.
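To make that concrete, here is a minimal sketch of what driving the unified hierarchy looks like from a shell. The mount point is the conventional one, but the group names are made up for illustration, and it assumes a kernel with the v2 controllers available:

    # one tree; all controllers hang off the same hierarchy
    cd /sys/fs/cgroup
    cat cgroup.controllers                       # e.g. "cpu io memory"
    mkdir workload                               # this will be an internal node
    echo "+memory +io" > cgroup.subtree_control  # distribute to children
    mkdir workload/leaf                          # a leaf that can host processes
    echo "+memory +io" > workload/cgroup.subtree_control
    echo $$ > workload/leaf/cgroup.procs         # processes go into leaves

Note that once workload has controllers enabled for its children, trying to put a process into workload itself fails; that's the internal-node restriction showing through the interface.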
The internal nodes' role is distributing resources from the top down to the resource domains, however many layers that may be. So they have different roles in the tree.

Going back to the single resource domain thing: why is it a problem to have an IO resource controller and a memory resource controller on completely different resource hierarchies? Why do we want to unify them over a single resource domain graph? We're all familiar with this output, the first one. On any Linux system, and this is from my laptop, you run free and it shows a bunch of numbers, and there's this weird column that people are bothered about sometimes; it used to be separate buffers and cached columns. That's over four gigabytes of my system right now. Well, not right now, this morning. This thing grows, and people get mad that "available" is not six gigabytes: who's using four gigabytes? People used to get upset about it. But this is the page cache, or buffer cache. When you read a file, say you open an image file which is really big, like 32 megabytes, and you have a really slow hard drive: you open it once, it reads the file from the hard drive, which takes some time, and it shows up; you can see the delay there. You close it, you open it again, and boop, it's right there. That's what the page cache does: it caches the content of file system files in memory. That's its primary purpose.

And think about writes. When you do a write system call, or you mmap a file and change the memory content, the file gets changed on the file system. That's how you write a file on a Unix system: either you do write or you use a shared map. Have you thought about what happens after you do a write? Not a direct write, just a regular write. It doesn't trigger the IO immediately. All it does is change the memory which is caching the actual file: it updates the content, and the page is marked dirty. Dirty meaning there's a difference between the memory content you have and the underlying file. Eventually the dirty memory needs to be written to the disk so the file gets updated, and on the next read you get the data you wrote last time. So that's the concept of dirty memory, and the operation of putting dirty memory into the backing file is called writeback. It's a really fundamental part of Linux memory management, or probably any Unix: when you do a write system call, it gets buffered in the page cache and written back eventually. What governs that: if you run sysctl and grep for vm.dirty, these are the parameters which govern that behavior in a system, and among them are dirty_background_ratio and dirty_ratio. Some of those sysctls exist because you don't want to keep dirty memory around too long: if you lose power, you lose it, so you want to write it out periodically. That's governed by the writeback sysctls.
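For reference, these are the knobs being described; the values shown in the comments are common distro defaults, not recommendations:

    sysctl vm.dirty_background_ratio      # e.g. 10: background writeback starts
    sysctl vm.dirty_ratio                 # e.g. 20: writers get throttled
    sysctl vm.dirty_expire_centisecs      # how long data may stay dirty
    sysctl vm.dirty_writeback_centisecs   # how often the flusher threads wake up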
Beyond the periodic flushing, there are these two ratios, the background ratio and the plain dirty ratio, and they mark the lower and upper bounds on how much memory you can have dirty. The background ratio is 10%, which means that if you dirty 10% of your memory, the kernel will start writing memory out to disk. And 20% means that if you reach 20%, even while we're writing back, we're going to throttle you: you cannot dirty more than 20% of the system, because then you're changing too much and memory allocation becomes really difficult. So that's the range, and those are the default values on recent kernels, 10 and 20 percent.

Now let's go back to copying a big file on a system; forget about cgroup. You run cp on a really big file. What does it do? It allocates memory, opens the source file, reads the content into memory; it has opened the destination file too, and it constantly issues writes to it. What happens then is that the kernel, on the receiving end of those write system calls, constantly allocates memory to receive them and fills it up with dirty pages. Because those pages are not in sync with the file system yet, if the file is big enough you pretty quickly hit the 10% background dirty ratio, and writeback starts. Writeback is asynchronous: it comes along, looks at memory, finds a lot of dirty pages, and starts issuing IOs to the disk so the data gets written out to the file system, and the pages get marked clean, and the ratio falls. No, not really. If you're doing a cp of a really large file to a fairly slow disk, the speed at which you fill up dirty memory is faster. So while writeback is doing its thing, your ratio still increases and eventually hits 20%. And at that point, what happens? Your IO is already saturated. Your disk is a slow disk, you try to issue more IO, and the disk says "I cannot handle any more", the disk queue says "I don't have any capacity available, it's all backed up, you've got to wait until I finish some". Writeback looks at that and cannot issue any more. Then the ratio hits 20%, and what do you do then? This is what in Linux we call dirty memory balancing, and it's a throttling mechanism: dirty memory throttling. This is where we finally propagate that pressure back to the actual cp. When you reach that condition and cp tries to dirty more pages in the write path, we look at the state of the system, realize we cannot let this thing dirty more pages right now, and we just stop it, force it to sleep, until more pages are cleaned. It's not quite that simple; we do more intelligent things to smooth it out, but that's the essence of it.

So the whole chain of pressure propagation: when you dirty a lot of pages, dirty pressure builds up. When it reaches a certain point, it triggers writeback, which builds up IO pressure. The IO pressure builds up and gets saturated, and then it builds IO back pressure. The IO back pressure gets propagated back to memory, which builds up dirty memory pressure, and eventually, at some point, it gets propagated back to the processes which are dirtying the pages, and they get throttled. That's how you regulate the speed of something that writes a lot of data, like cp. That's how your cp process gets throttled eventually.
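If you want to watch this chain happen on a test box, here is a rough sketch; the file path is made up, and it assumes tracefs is mounted in the usual place and the kernel has the writeback tracepoints:

    # watch dirty pages pile up during a buffered write
    dd if=/dev/zero of=/tmp/bigfile bs=1M count=4096 &
    grep -E 'Dirty|Writeback' /proc/meminfo
    # watch the throttling decisions themselves
    cd /sys/kernel/tracing
    echo 1 > events/writeback/balance_dirty_pages/enable
    cat trace_pipe | head     # shows writers being paused in the write path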
Now, think about it: there's forward and backward pressure propagation between memory management and IO. When memory gets backed up, it needs to issue IO somewhere; and when that IO gets backed up, it needs to tell the memory domain to slow down. It needs to be able to tell it. And if you have completely unrelated resource domains defined across IO and memory, you cannot really do that. When your IO domain gets clogged, you don't know who to tell; you don't know where to propagate the pressure back to. There are a lot of operations that are really important in a modern operating system which span multiple types of resources, and this isn't limited to memory and IO. The same thing applies to CPU: if you spend CPU cycles receiving packets, or spend CPU cycles processing disk data coming from writeback, where are you going to charge that? You kind of have to have these connections. So that's why cgroup v2 insists on having a shared resource domain defined across the different resource types. That's one thing.

The other thing is that we always want unambiguous resource configuration. Look at this. It's kind of an ugly graph, but compared to the previous one, the internal nodes are marked I here. Let's imagine that D2 and D3 nest below D1 and all three contain processes. That's this graph. D2 and D3 are fine: they're leaf nodes, so nothing changes for them. They contain processes, and the anonymous consumption produced by the activities of those processes is contained in them. Another way to think about it: this is what happens on a regular system. If you take cgroup out, D2 could be a whole system by itself, processes plus the accompanying anonymous consumption. And while we don't have a clean definition, and we usually don't have complete control over how resources are divided between those anonymous consumptions and the various processes, we do have a fairly good working convention for them. We assign certain priorities to them and it mostly works out; sometimes people need to tune the priorities in various ways or whatnot, but in most cases it just works out. And what cgroup usually tries to do, no matter the resource type, is take that system-level behavior and carve it out into smaller chunks. So D2, while it contains all different types of resource consumption, kind of maintains the system-wide behavior; it tries really hard to stay close to that, and it works out most of the time. So D2 and D3 are fine. Now look at D1. D1 is a resource domain too, so it contains processes and the resulting anonymous consumptions. And at the same time, it contains the resource consumptions coming from D2 and D3.
Now you have competition between your own resource consumptions, both direct and anonymous, and your child domains' resource consumptions, inside your own tree. And it's really unclear how you should distribute your resources among these entities. If D1 didn't have its own consumptions, nothing would be ambiguous, because all three, D1, D2, and D3, are cgroups; they all have cgroup configs, and the configs dictate how they get resources. When dividing D1's resources, D2 and D3 know, with respect to each other, how much they want to get; that's what the user configured. But if you start throwing resource consumptions into D1 itself, then you need a whole extra set of knobs to control how to divide the resources between your own consumption and your children's. In v1 we didn't have any rule about it, and it sucked, because the controllers approached the problem completely differently and the behavior was surprising; it would change from version to version. It's just not good. So that's the second reason we're doing resource domains and have the "resource domains don't nest" rule. I'm not going to go deeper into that here. By the way, if you have any questions, raise your hand. I'm not sure I have enough material anyway, so questions are good.

So, that resource-domain-based hierarchy is the core of cgroup v2; that's the core concept it's built on. On top of that there are resource distribution models, so that different controllers don't invent their own distribution schemes and instead follow the same conventions. Weights are simple: proportional control. If cgroup A has a weight of 3 and cgroup B has a weight of 1, then A gets three times more of the resource than B. That's what proportional control means. And it's work-conserving. Who has heard of work conservation? Okay. When you're scheduling something, work-conserving means you don't let the total amount of work done go down. What it basically means: say you have a weight of 3 and the other guy has a weight of 1, and only the other guy has something to do. In that case the weights don't matter: the other guy gets everything that's available. The system won't go idle just because I have the higher weight; the total amount of work is conserved, hence work conservation. So weight control is work-conserving, which means it only means something when there's contention for the resource. If nobody else is requesting it and one dude with a really low weight has something to do, that dude can get complete consumption of the resource at that time. It only takes on meaning in a competitive situation.

Then there are limits. Limits are upper bounds expressed in absolute quantity. Memory is the easy example: while you could say "you get three times more memory", we don't really do that; it's not that useful. We set the limit in absolute terms. You get 64 megabytes, you get 600 megabytes. These knobs are either max or high. Max is usually not work-conserving: no matter what, you're not getting more than that. If you set max to 32 megabytes, that's it. It's cgroup v1's model, a hard limit. Even if the system has 8 gigabytes of memory and nobody is doing anything else, you're still getting only 32 megabytes. That's max.
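Here is what the weight semantics look like on the actual interface; the group names are made up, and this assumes the cpu controller is enabled on the parent (at the time of this talk the v2 cpu controller was still out of tree):

    echo 300 > /sys/fs/cgroup/A/cpu.weight    # A gets 3 shares...
    echo 100 > /sys/fs/cgroup/B/cpu.weight    # ...for every 1 share B gets
    # under contention A gets ~75% of CPU and B ~25%; if A goes idle,
    # B can take 100% -- that's work conservation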
How high is defined is a little different depending on the controller. It usually means something more flexible, more malleable. For example, with memory, if you cannot possibly make forward progress within your max limit, you get killed; you get OOM-killed. The high limit is not that harsh: you can go over high slightly for some time, you just get slowed down really severely. You don't get killed. It's a lot easier to use and less drastic. High is usually not work-conserving, but it could be, depending on how it's implemented; that's up to the specific design of the resource controller.

Then there are protections. They go in the opposite direction from limits: you're trying to guarantee something rather than limit something. Min means the entity configured with a minimum value has to be guaranteed that minimum. Low is softer, in the same way that high is soft. The specific meaning can differ, but it usually means that anything which is under low, meaning it's not yet getting its softly guaranteed amount, gets priority over whoever is over their low protection. It usually works as a priority scheme: if you're under low, you have priority over anybody who isn't low-protected. As such, it's usually work-conserving: if nobody under low is requesting the resource, somebody else can use it. Usually work-conserving, not always, but it should be.

And there are allocations. Allocations are different; allocations are hard. Limits and protections are both boundary-setting: you've got to stay between these bounds. That's what limits and protections do. Allocations are just "you get this amount, that's it", a hard allocation. We rarely use them, but we probably will in the future.

Now I'm going to go into some detail on the memory controller. It turns out that memory control is really hard. Really, really hard. We spent a lot of energy and time working on it. The memory controller currently implements three knobs in cgroup v2: low, high, and max. Low is: you're most likely to get at least this amount, otherwise the configuration is really bad. If the system is overloaded and you're under low, you have priority. That's what low does. High is the main control knob. It's kind of similar to the soft limit in cgroup v1, but the v1 soft limit was really not usable, because its semantics were never well defined. High is the soft upper bound: you can go over it, and you get slowed down really severely once you try, but you still don't get killed. Max is the same hard limit: if you cannot make forward progress because you need more memory than max, you get killed.

Another difference from cgroup v1 is that in v2 a single limit, whether it's low, high, or max, covers all the memory consumption that your resource domain may incur in its operation. Whether it's file system cache, dentry cache entries, inode caches, or network buffers, it's all accounted under the same limit. This is different from v1: v1 had different silos for different types of memory, and that didn't really work out. So we're trying to be comprehensive about memory consumption, or any other consumption, really. And because it now has the advantage of shared resource domains, writeback works properly: all that memory pressure forward-and-backward propagation works properly now, which is great. It took a lot of work, but it does work now.
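Concretely, the three knobs sit next to each other on every leaf cgroup; the numbers here are arbitrary and the group name is made up:

    echo 4G > /sys/fs/cgroup/job/memory.low    # protection: keep at least this
    echo 6G > /sys/fs/cgroup/job/memory.high   # soft ceiling: reclaim hard above it
    echo 8G > /sys/fs/cgroup/job/memory.max    # hard limit: OOM-kill beyond it
    cat /sys/fs/cgroup/job/memory.current      # one number: cache, slab, buffers, all of it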
The other thing, the thing that's still missing, is memory pressure measurement. That's still being worked on, but it's right around the corner; internally we have something like the example here that we're playing with, and we're actually seeing it work out. I'm going to go into a bit more detail because this is really interesting. Memory pressure. You have a system, you have a set amount of memory, and you have your workload. How do you usually tell whether you have enough memory, too much memory, or too little memory? Can you think of a single value you could look at to decide that? What ratio? Cache ratio? What if you're copying a big file? Say you're copying a 16GB file and you have 2GB of memory. Would you benefit from more memory? That's actually a really good example. If you have a really slow disk, and you're receiving packets over a really fat pipe from the network, so you're downloading a big file really fast and your disk is really slow, the only memory you need at that point is a window just big enough to saturate the IO bandwidth without hindering the network bandwidth. Once you reach that point, whether you have more or less doesn't matter: whether you have 32GB or 3GB, the speed is the same. You're not memory-constrained. On the other hand, if you have an in-memory database with a 64MB working set, up to 64MB you're completely fine; then you take memory away megabyte by megabyte by megabyte, and you get hurt. Your performance drops really fast, because if you lose pages of your cache and you don't have enough memory, you have to fault them back in, and all that.

The point I'm trying to get to is that in Linux, or I think any other Unix, there's no single number you can look at to tell whether something needs more memory or not. There just is none. We don't have it. The kernel doesn't know; it has never tracked that data point. Sizing memory for a given workload has always been more or less a trial-and-error thing. You try something, it seems to work; sometimes it falls over, you give it a bit more, and it looks stable. And it's really dependent on the type of memory usage: is it a streaming copy kind of thing, or a random-access workload pattern? It's not an easy problem to begin with, even on a system without cgroup. But once you deploy cgroup, you're segmenting the available memory in your system into smaller chunks: four gigabytes to this guy, three gigabytes to that dude. They're smaller chunks and you have to size them, so sizing becomes a lot more challenging. The silos are smaller, you have to do it a lot more often, and you also want to be more precise so you don't waste memory. So sizing actually became a major problem in using the memory cgroup. That's why with cgroup v1 you see all these userland OOM handlers: if you're using the memory cgroup in v1 and being a little aggressive about it, you're seeing a lot of OOM kills and you're handling that somehow. There's a reason for that: we really don't have a way of measuring memory pressure other than getting OOM-killed.
So what we're working on is something a lot better; I think this is actually really cool. We implemented something which is canonical and time-based, so it's not tied to the implementation. cgroup v1 actually has a pressure notification thing, memory.pressure_level, but that measures the efficiency of memory reclaim operations, so it's heavily dependent on the workload type and on the implementation. While you can get some kind of measure out of it, you can't really interpret it with much meaning. You just know that memory reclaim is suffering; that's the only thing you can tell, and what that actually means changes depending on the workload, and across different kernel versions the implementation would behave differently. The memory pressure measurement that cgroup v2 is going to implement is time-based, and the question it asks is: in a given domain, was everyone blocked on memory, and for how long? The answer comes out roughly as a percentage: say, in the past one-minute period, we spent 20% of the time waiting on memory when we could otherwise have been running the workload. So when it says 20%, it's saying that for 20% of the time I couldn't do anything because I needed more memory; if we'd had more memory, we could have spent 20% more time on the CPU actually running.

It also distinguishes between being memory-bound and IO-bound. Going back to the copying example: if you're, say, uploading a big file, reading a file from disk and sending it out, and your disk is really slow, you're bound on the IO. You're still waiting on memory faults when you touch the file pages, but you know it's IO-bound: giving you more memory is not going to accelerate you in any way. The way we tell the difference is that the kernel memory management subsystem tracks something called refault distance. If you read a page, it gets evicted, and after a while you end up reading the same page again, you wouldn't have needed that second read if you'd had enough memory for your page cache to cover it. That's called a refault, and we track refault distance for a number of purposes. Looking at that, we can decide whether a stall comes from memory pressure or from IO pressure, so we can meaningfully decide when a process or thread is waiting for memory versus waiting for IO. We track that, and it's going to be exposed as percentages. The other thing that was originally proposed was "was anyone blocked on memory". That's not as important a piece of information, but combined, it can give you a better look at the macro behavior. The important number is: was everybody blocked on memory, and for how long.
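For what it's worth, this work later shipped upstream as PSI, pressure stall information; on kernels that have it, the per-system and per-cgroup files look like this (the numbers are illustrative):

    cat /proc/pressure/memory     # or /sys/fs/cgroup/<group>/memory.pressure
    # some avg10=0.00 avg60=11.25 avg300=4.80 total=123456789
    # full avg10=0.00 avg60=8.72 avg300=3.50 total=98765432
    # "some": at least one task was stalled on memory; "full": all non-idle
    # tasks stalled at once; avg60=11.25 means ~11% of the last minute lost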
Moving on to IO. IO is a mess right now. I'm not quite happy about it, but it's really hard. We have different parts implementing different policies. There's CFQ, which is one of the elevator implementations in the Linux IO stack, and it does weights. Weights behave the same way as any other weights, and weights are really nice to use: easy to configure, easy to conceptualize, and CFQ provides that. But it's problematic in two directions. One is that the implementation is kind of not that good: it involves a lot of heuristics, and its behavior becomes fairly surprising in certain cases. BFQ is probably going to replace it at some point; it's another elevator, on the MQ side, and that would probably be better. But there's another, more fundamental problem: when you say "I get twice as much IO as the other guy", what's your unit? How do you quantify it? With CFQ it's kind of easy if you have a hard drive. A hard drive is really slow, and you can approximate consumption by the time you occupy the disk: you issue IOs from only one group at a given time, measure how much time that took, and move on to the next guy. And time is great. Time can't be cheated; everybody agrees on it, as long as you're not traveling too fast. It's a great absolute measure of your resource consumption. The problem is that with SSDs it becomes impossible to do. Well, it is possible, it just becomes too expensive, because you cannot time-slice things that way and get reasonable performance out of the device. Also, to mask the device's back-end operations, there are idling periods between the slices, otherwise you don't get any isolation, and those idling periods get really expensive on SSDs. So we can't do that. That's where we are with weights. It'd be great if somebody could implement working weights for SSDs, but I don't know, I'm not too hopeful. If somebody figures it out, great; I have no idea.

Then there's another part, blk-throttle, which implements limits. Right now it's only max, which is not work-conserving. You just say this guy can get 32 megabytes per second, or 100 IOPS, whatever comes first, and even if the IO device is otherwise completely idle, once that group hits the limit it's not getting more. That's fine for some use cases, but not for a lot of others: you paid for that device, you want to be using it. So we're working on, not io.high, it's io.low. We're working on io.low; Shaohua is working on it. And it's really finicky to configure; it's a bad interface. What we're basically doing, because we don't know how to decide what an IO resource is or how to quantify it, is exposing all the parameters and letting you figure it out. It's not a good interface. I'm not happy about this either, but it's the best we can do right now. So that's where we are with IO. And memory and IO working together to control writeback is one of the major upsides of cgroup v2.

Moving on to CPU. CPU is not merged yet. There's a lot of argument around it; the scheduler maintainers are not quite happy and have different ideas about how cgroup v2 should be designed, so we'll have to argue it out, but eventually it will get there. For the time being I'm maintaining an out-of-tree branch, so you can pull that and use it if you want. It's really not a big problem: systemd, for example, supports all of this in newer versions, so you just patch the kernel and everything works. In terms of interface, CPU control is one of the good ones, because CPU time is time. Time is awesome: it's easy to quantify and we can all agree on it. So we have weight-based control, "I get three times more CPU cycles than you do"; we can do that, and it is work-conserving. It also implements limits, bandwidth, meaning that in any given period, for example 100 milliseconds, this group can get 20 milliseconds. It sets a max, an upper limit, on consumption.
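The limit knobs described here map to io.max and cpu.max in the v2 interface; the device number and group name are illustrative ("8:0" is conventionally the first SCSI disk):

    # "32 megabytes per second or 100 IOPS, whatever comes first"
    echo "8:0 wbps=33554432 wiops=100" > /sys/fs/cgroup/job/io.max
    # "in any 100ms period this group can get 20ms" of CPU
    echo "20000 100000" > /sys/fs/cgroup/job/cpu.max   # quota/period in usec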
The only problems we have with CPU right now, there are two. One is that the performance overhead is not great; it's a bit too high. You see fairly high scheduling latency overhead when it's being actively used, and the general overhead is also a bit too high, so we're looking into it. The other part is that because the CPU controller hasn't been merged yet, not a lot of attention has been paid to it from the other parts, so it doesn't yet work with the other controllers to account for cross-resource consumption: for example CPU cycles spent receiving packets, or CPU cycles spent reclaiming memory. Those are not properly accounted yet, but we're going to get there. That's one of the long-term goals. So that's where cgroup v2 is. Any questions?

So, oh, sure. The question is: given that there is bandwidth control, and the same applies to proportional control too, there's always the possibility of a priority inversion, where a low-priority cgroup, which is already throttled because of its bandwidth limit, is holding a lock, and a higher-priority cgroup, which has bandwidth to execute, is waiting on that lock. Priority inversion. And the question is whether there's any plan to implement priority inheritance. I think the answer is no, because, I don't know, people have tried; it's just really hard. But there are ways to work around it. I'm not into the scheduler side of it, but if you think about it, this happens in other places too. Take journaled IO: you have a metadata update, you're low priority, you're in the middle of a journal update, and you get throttled; a higher-priority guy comes in, tries to issue IO, and gets blocked on the journal. Similar things can happen elsewhere. On the IO side, the way we're planning to handle it is to always do journal IO at high priority, but charge the overhead back after the fact. We'll probably do the same thing with network packet reception: when you receive a packet, you just have to receive it, otherwise you're just dropping packets. So you receive it, and in the process the target cgroup may go over its own limit, but we still charge it regardless, ignoring that it blows over the limit, and then make it pay later. So we can have that charge-after-the-fact kind of mechanism to handle, to avoid, some of those priority inversions. But in a generic manner, like priority inheritance through mutexes, I don't know; people have tried that, and mostly weren't happy.

Okay, the question is: what are the things that are not yet resource-controlled, and you also mentioned Intel's CAT. I forget what it stands for, Cache Allocation Technology, something like that. So CAT is not part of cgroup. It just didn't really fit, so it has its own configuration interface. In terms of cgroup, the things which are not controlled are the things I've been mentioning. For example, journal IO: we currently charge that to the root cgroup no matter who causes it, which is bad. CPU cycles spent during packet reception get charged to the root cgroup, which doesn't make any sense. CPU cycles spent encrypting writeback IO get charged to root; doesn't make any sense. Memory reclaim, same story. So all these system operations: we account for some of them, but for a lot of system operations we're not quite accurate in how we account and
charge them. That's one of the main missing pieces at this point. And unfortunately, it seems we're going to have to employ a more case-by-case approach, because different problems need different solutions. You cannot charge packet by packet; it's not going to work, there are too many packets and each one is too small a unit, so you have to aggregate them. You kind of have to be smart about each case. So yeah, that's going to take some work.

Yes, sure. That's going to be the basic approach, but the thing is, there are a lot of packets, and we don't want to be charging each packet's reception individually; we can't do it packet by packet. So we'll have to aggregate the total CPU cycles consumed and then charge that out according to the number of packets that went to the different domains. It's just about aggregating the overhead and making it manageable. But yes, that's where we're headed. Sorry, what was that? Yes, that's going to be... yeah, it has to be aggregated in some way.

The question is: how do you implement minimum, or low, limits if there aren't enough resources in the system for the configuration? If you add up, for example, all the low configs in the system and that's more than what the system has, what do you do? That's the question, I think. For low it's kind of easy, because low is a priority scheme. What low means is that if you're under low, you have priority over groups which are over their low. So if there's somebody under low in the system, you're not getting resources beyond your own low limit; that's what it means. If everybody's under low and you still don't have any resources to give out, then you just fail to give them out, and that's what your config means. For min, I don't know; the thing is we don't have any min actually implemented yet, so it's to be decided. But I think it's going to come down to the same thing: we try to give it, we don't have the resources, it's an invalid configuration, and we just don't give it out.

In the back. Sorry, what's the intuition? So the question is why we slow down processes when they reach the memory high limit, is that the question? Well, it's not exactly that we slow the processes down; we don't necessarily slow them down. If a process doesn't try to allocate more memory, we don't mess with it; it runs fine. When it tries to allocate more memory, what we do is make the process that's trying to allocate reclaim memory instead. If it reclaims enough to go back under high, it can get more memory; until then, it cannot get new memory. What that means in practice is that it spends most of its time trying to free more memory so it can get more, and that's what ends up slowing it down. It's not about slowing it down as such; it's tasking it with reclaiming memory rather than running its own workload.

The question is: if you artificially create a noisy neighbor, how confident is cgroup about guaranteeing isolation? I don't know; it should be fairly okay. For CPU and memory I think we're fairly safe, because no matter what the other guy does, as long as that guy is contained, ignoring the cases we're not covering yet, things like causing a lot of metadata IO, which gets charged elsewhere and breaks down the isolation. But those are the
cases we're eventually going to cover. As long as we stay inside that coverage, it's fairly isolated; you can't really mess up the other guy that much for CPU and memory. For IO it's fuzzier, because our method of control isn't that reliable yet. But once io.low gets usable and more established, I think we'll be able to say something like: no matter who does what, with the right configuration you can guarantee a certain level of latency to this cgroup up to a certain IO capacity. So we should be able to give you all the guarantees you'd need in terms of IO.

Sure, sure. Yeah, that gets more tricky. The follow-up question was that if you use things like NCQ and deep queueing, a neighbor can put a lot of entries in the queue, and then it doesn't really help whether the other guy gets a chance to queue or not behind a really long queue. I think that conflates two things, but one thing: have you heard of bufferbloat on the network side? This is exactly bufferbloat on the IO side, and we're actually trying really hard to address it. Even without cgroup, plain writeback tends to build these huge queues. That's what you get when you cp a big file to a slow disk and then an ls takes like five seconds: you've built up this huge buffered queue in front of your disk, your next IO gets queued on top of it, and it doesn't matter what your priority is. The realization with network bufferbloat was that that long buffer doesn't do anything for you; it doesn't help with anything, it just increases latency. So you can make it a lot smaller and get the same performance at lower latency. That's the basic idea of controlling bufferbloat, and we're doing the same thing with IO. The io.low implementation is actually latency-based for exactly that reason: what it tries to guarantee is that you get a certain level of latency, so it's really aggressive about controlling which commands get queued and how many get queued.

Okay, sure. Which parts, IO and CPU? Okay. CPU isolation should be fairly reliable; IO is a lot harder. Between memory and IO: IO is harder because it's slower and the devices are a lot more ambiguous in their behavior; you can't really get much data out of them about how they operate. Memory is hard because it's stateful. Stateful means that once you've given certain memory to this guy and you want to move it to that guy, that operation is not trivial: you have to write back, you might not be able to clean the pages, it might OOM; it gets really difficult. CPU is not bad: you can do everything on the fly, only your current scheduling cycle matters, so in that sense CPU scheduling isolation is fairly robust, conceptually simpler. That should stay that way; I don't think that's going to change. And we want to achieve a similar level of isolation across the other resources too. Thank you very much. Yeah, I enjoyed it; it was fun.

All right, ladies and gentlemen, thank you for your time and thank you for coming in today. Yannick Brozio is going to give us a talk; please take it away.

Hi everyone. In many industries, we know what testing means. In the automotive industry, we make sure people don't die: we just crash the car and verify. In the medical industry, we do medication tests to make sure
people don't die; that's also an easy one to test. If we get to a domain we know a bit more about, software development, we have many, many methodologies to make sure testing happens: TDD, unit testing, good integration testing, having a good process. Good kernel testing is what I'm going to talk about today: how we can make the kernel better for everybody.

A quick word on me: I'm a production engineer at Facebook, been there for quite a bit now, working on the kernel team. My focus is twofold. The first part is to make sure we can get our new kernels onto every machine in our fleet as fast as possible. But the second part, which is actually the first part, is to make sure that the kernels we develop, with all the features we add, work well and don't crash our fleet. We've developed a set of tests and a set of infrastructure to verify that, but most of it relies on tools that the open source community has developed over the years to validate the kernel. That's what I'm going to cover, mostly, plus a bit of the Facebook infrastructure we developed to reuse all those components.

The first question is: what are we trying to validate, what's difficult about testing the kernel, and what's different about it compared to the other software we're used to testing? The first tricky part is that we have a vast variety of use cases. Linux is used almost everywhere: you get low-power devices that control your temperature, all the Internet of Things devices, which can make a really big botnet; you have these huge data centers that Facebook and other big companies use to give you a lot of cat pictures, a really important use case; it's used in cars, it's used in ships, on your TV, almost everywhere you can think of. And along with all these use cases, it's also used on a vast variety of hardware: something as simple as your phone, your laptop, bunches of racks, rockets. All these use cases and all this hardware make testing the kernel really, really vast and really, really diverse. The same software needs to work for everybody.

The other big difficulty is the sheer number of changes we need to validate every time. To give you some quick numbers: in one release you might get about 1.5 million changed lines, or these days more like 2 million lines of code changes, across around 10,000 to 15,000 commits. If you take a range, for example between the 3.10 and 4.0 kernels, it was something like 120,000 commits and 7.5 million lines of code changed. How do you, as a human being, make sure every change is correct? The way we do things, we have a hierarchy of maintainers, but it's really hard to make that airtight, so we need some tools, some tooling, to cover everything.

So what's different about the kernel compared to the usual software you run and test if you're a software engineer? The first thing is that the kernel encompasses the whole machine. When you run it, it runs on everything; it's not just one part of it, it uses the whole machine. We can sometimes use a virtual machine to confine it to a smaller subset of the machine, but that won't trigger every problem you might hit on specific hardware or in a specific use case, and the timing will be different. And since it uses the whole machine, to change the kernel you actually need to reboot the machine: you restart, and then you run whatever you need. We could use kexec to just start the new
kernel right away, but there are still many issues with that, mainly hardware support in different drivers that makes it unstable in some cases. Some hardware works well, some won't, so if you use kexec for testing, you'll always have some doubt: is it a kexec problem causing my crash, or is it actually the new feature I've been developing?

The other part is that there are no real unit tests for the kernel. Usually when you write a test, you test one function: you write a small program that calls the function, you pass it some input, you get some output. In the kernel you don't really have this mechanism. You basically run something external to the kernel, a program that makes syscalls, and you expect some reaction. There are a few examples we'll see where you can load a specific kernel test module, but it's quite limited. And the last tricky part is that the kernel works at a really low level and interacts with the actual physical world. The question is always: is it the software, does the kernel have a problem, or is it the firmware in that special NIC or some other component in your machine that's actually broken? Is it the hardware that has some issue, or something caused by a random bit flip in memory? You always have this doubt, so you need to make sure you have something to validate the hardware before you blame the kernel.

So, the first part: what kinds of things are we looking for when we test the kernel? First: are we as fast as we used to be? We introduce a lot of new features, a lot of new changes, support for new architectures. Did my new change, my new cgroup v2 feature, break the world or make things slower? We had an example in the past where somebody added a lock in some syscall, which created contention if you were using that syscall a lot, and that slowed down every operation. So we want to make sure that as you get a new kernel, you don't get slower. We always aim to be faster: even with more code and more features, we aim to be more performant in every release. The second part is stability. We want to make sure we don't break the world. As I said, you want something fast, but you also want something that works. The basic question is: am I delivering the right functionality? I might not be crashing, but I call a syscall and the side effect isn't opening the right file, or isn't sending the packets to the right person. The third big class of problems, a lot more prominent these days, is security. It's not really a functionality problem, it's not crashing, but something gains access to something it shouldn't have. And often these problems are just subtle changes: somebody added a syscall or changed the way a parameter is used, and forgot to update some boundary check in the validation process. So we want to catch all these main classes of problems.

So what are we doing? The community has put a lot of work into improving this over the years. We've seen many bug reports, lots of people reporting performance regressions, and there's been, I would say, a more conscious effort in recent years to provide more testing in the kernel world. In the old days, most testing was code review. If you're familiar with the kernel development process, everybody sends patches to mailing lists and they get reviewed; a lot of people look at the code. That's the main mechanism, and it's still the main mechanism used in the community to catch most problems. A lot of eyes on the code makes the code better: lots
of people will look for different problems. Some people look at the security aspects, some people will think "hey, the memory usage here is bad", so lots of people looking at the code really does produce good code. The second place where a lot of testing happens is with each maintainer. The way kernel development works, each section of the kernel has a maintainer, and there are maintainers of maintainers, and so on. When you send a patch, you send it to one of these people, who sends it up to somebody else, until it reaches Linus, and hopefully at that point it compiles and works. We've seen issues recently where Linus got really angry at a maintainer who sent not-even-compile-tested patches; Linus flames a bit at some point, but at least it gets to the top and somebody looks at that code. Usually each maintainer has a set of tools they use for testing. The sad part is that those tools are not public, or not really well defined, or they're a bunch of scripts run on one machine or a set of machines, depending on the area they work on. So it's hard to know what coverage those tests actually provide. We usually trust the maintainers to do the right thing; some will do more, some will do less, so it's all a bit unequal in that area. Hopefully a few projects can turn that into a more rigorous set of testing.

Over the years, a series of projects has created test suites that package bunches of tests; I think these days there are about six or eight main ones doing standard testing. The common theme of all these tests is that they're a bunch of programs you run on your machine that will trigger something in the kernel. It's hard to target one specific part, because the interface to the kernel is all a bit opaque: the main thing you have access to is syscalls. From a user-space program you call syscalls, that does something in the kernel, and you do enough of that to get coverage.

The first one, probably the oldest, is the Linux Test Project. It's been there a long time, and its focus is really calling a bunch of syscalls and making sure the POSIX semantics are respected; there are about 3,000 tests in the project. One of the main problems with this project, and a lot of the other ones, is that as the kernel evolves, the kernel changes, expectations change, and sometimes you break the tests. We've been running it in our production test system, and most of the problems I've found by running LTP were broken tests in LTP. That's the big challenge with these user-space suites: maintaining them so they keep working with the latest release. Still, LTP is a good baseline, at least to make sure you don't break something really obvious.

Another area with a lot of testing is the block and IO level. xfstests is a great tool, and despite the name it's not just for XFS: it's actually used for all the main file systems, Btrfs, ext4, XFS, and a few others. There are thousands of tests that create files, delete files, do weird directory interactions, create directories and merge them together, use extended features like snapshots, use strange attributes, and then run and extract some information, even some basic performance information. It's hard to get real performance numbers from this kind of test, because everybody runs them on different hardware and you can't get a common metric for what's fast on an SSD versus what's fast on a spinning disk, but at least it gives you a good idea of whether something should take 20 minutes versus 1 minute, that kind of thing. So there are a lot of tests in that area.
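To give a feel for how these suites are driven, here are the standard entry points; the device paths are made up and would need to match your test box (xfstests usually takes them from local.config):

    # LTP: run the syscall suite
    ./runltp -f syscalls
    # xfstests: point it at scratch devices, then run the quick group
    export TEST_DEV=/dev/vdb TEST_DIR=/mnt/test
    export SCRATCH_DEV=/dev/vdc SCRATCH_MNT=/mnt/scratch
    ./check -g quick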
Similar to that, fio, the flexible IO tester, exercises more of the block IO layer. It's driven by job descriptions that simulate bunches of IO operations: writing blocks on different schedules, different operations, like multiple threads writing to the same file, multiple processes, something writing big blocks, something writing small blocks, or just churning through a lot of blocks. It looks at the operations, makes sure you don't corrupt blocks, makes sure the queues are sent to the block layer properly. With these two you now have good coverage of the block layer, but there's still a lot of the kernel out there.

One more recent initiative is the kernel selftests. The good thing about this project is that it's a bunch of system tests that reside in the kernel source tree: if you take a kernel checkout and go to tools/testing/selftests, you have a set of tests right there. They evolve with the kernel, so if you check out a kernel tree, all the tests there should work with the kernel you're running. That wasn't entirely true at the beginning; the maintenance wasn't there and not everybody was aware of it, but it's getting much better, and there's a lot of activity in that area. Since it's really close to the kernel, maintainers don't have to switch to a different tree; you don't have to check out LTP to find the tests, they're right there with you. That's usually good practice when you develop code in any language: in your repo, create a test directory and put your tests there, so you maintain them with your code. It's probably one of the best places we have, because all of it is related to low-level drivers, low-level systems, and mostly the new interfaces as they're added. It's hard to get everything added there and maintained: if you have a big legacy system, the thought "I should write a bunch of tests for my old legacy system" always seems cumbersome, but if you have time, it's probably a good idea to start adding tests there.

As for actual unit tests, what we have is quite limited: this is the whole list of test modules. These are modules you'd normally never load, because they don't do anything for real; they just exercise a bunch of kernel internals. For example, the test_printf one makes sure printf inside the kernel works properly. The test_bpf one creates a set of BPF programs, not random, a fixed set, loads them, and makes sure they don't create problems in the kernel, don't crash it, and operate properly. That's the closest thing we have to unit tests in the kernel, and it also resides in the kernel tree. The third thing residing in the kernel tree is perf test. The perf command line is on the user-space side, but it relies on a syscall called perf_event_open, which is quite complex, so there's another bunch of tests for it. All the failed ones on this slide are because I ran the tests as a regular user, so I didn't have access to all the interfaces, but it shows there are a lot of tests. About six or seven releases ago this list was 10 tests, really, 10. So about a year and a half ago there were 10 tests in there, and in the last year and a half a lot of effort has gone into creating tests; now we're close to 50 different tests covering different parts of perf.
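As rough examples of how each of these is invoked; the fio flags describe a made-up job, and the selftest target list is just a sample:

    # fio: a small synthetic job
    fio --name=randwrite --rw=randwrite --bs=4k --size=1G --numjobs=4 \
        --filename=/tmp/fiotest
    # kernel selftests, from a kernel source checkout
    make -C tools/testing/selftests TARGETS="net timers" run_tests
    # a test module reports pass/fail to the kernel log (if built as a module)
    modprobe test_printf && dmesg | tail
    # perf's built-in self tests
    perf test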
Any questions? — So the question is: is it necessary to test the kernel internally at Facebook, versus relying on the validation done upstream? We do carry some changes, but the big thing is that we want more tests, more coverage, because we have different hardware and different use cases, and the upstream tests don't cover everything. We want to run them on our hardware: the interaction of a specific machine with a specific NIC will create different kinds of problems. So it's mostly about the interactions. And, as I'll get to later, these tests are not really run by everybody, and they're not run automatically across the community — some maintainers run some tests for some specific use case, but that's not enough to get consistency for everything.

Next question — so the question is whether the kernel is tested when it's released. It's not released by an organization; it's released by Linus, and it's not fully tested by Linus. He does run a bunch of tests, but he doesn't run every test available in the whole world. The whole community is responsible for running those tests, and what we're aiming at is to at least automate running all of them on all the upstream kernels.

The next area: what we've seen so far is a bunch of single-purpose tests, each exercising a specific feature. Kernel fuzzing gets us a much wider net, because we're not testing one specific aspect. Kernel fuzzers are basically tools that perform a bunch of random interactions, or random sequences of interactions, against the syscall interface. With LTP, you open file A, read it, close it, move it — that's a specific, scripted sequence of operations. What Trinity and syzkaller do instead is: "I'm going to open a device, or open an interface, and throw a bunch of random numbers at it, and see whether the kernel handles all those numbers properly." You'll see out-of-bounds errors, invalid memory usage — input that was never properly validated — and crashes. As you'd expect, these are really good for shaking out security problems too: if random garbage can crash the kernel, an attacker can probably do worse, and there's a lot of money in those bugs. And as you fix the problems the fuzzer finds, it won't find them in that spot again, because that path has been hardened — so there's an ongoing effort to extend the fuzzers to more syscalls and more sequences of operations. Sometimes you only find a problem if you open something, do some operation, and then do something else, so they try to generate random sequences that still make sense, like real-life usage — a wide coverage of usage patterns you could never write by hand. perf_fuzzer is a special case of this just for the perf_event_open syscall: it takes a struct as its options, and the struct is quite big, so there's a lot of surface to cover there. The toy loop below shows the basic idea of what these fuzzers do.
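For flavor, here's a toy Trinity-style inner loop — entirely hypothetical Go, and genuinely dangerous to run on a machine you care about. It hammers ioctl(2), one of the widest multiplexed syscall interfaces, with random arguments; real fuzzers are much smarter, biasing inputs toward valid fds, plausible pointers, and known command numbers, and recording the sequences that crash.

```go
package main

import (
	"math/rand"

	"golang.org/x/sys/unix"
)

// WARNING: this is a fuzzer sketch. Random ioctls can wedge or crash a
// machine; that is the point. Run it only in a throwaway VM.
func main() {
	for i := 0; i < 100000; i++ {
		fd := uintptr(rand.Intn(1024)) // maybe a real fd, maybe not
		cmd := uintptr(rand.Uint32())  // random ioctl command number
		arg := uintptr(rand.Uint64())  // random value or bogus pointer
		// Ignore the result: we only care whether the kernel survives.
		unix.Syscall(unix.SYS_IOCTL, fd, cmd, arg)
	}
}
```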
The last set of tools used for this kind of validation is the specific analysis tools. Compiler warnings are a good example: we aim for zero warnings, and that's usually true, but sometimes — especially when you move to a new GCC — a bunch of new warnings trigger, and there are cases where a warning is actually OK, for instance when there's inline assembly in the middle doing some weird operation the compiler doesn't know about. So it's a good thing to check the warnings, but you need a whitelist to triage them, and that's hard to maintain. Coccinelle looks for specific patterns in the kernel that are known to be bad — say, using memory after you've freed it — so it can flag "if you call this, that structure of code is really bad or exposes a security problem". We now also have KASAN, the kernel address sanitizer: you can compile the kernel with it to validate that your memory usage is good. The big problem is that everything gets really slow, so it's good for testing but you can't really run it in production all the time. And there's a bunch of other debug options in the kernel that add extra information or trigger more checks on some specific low-level feature and print information when you hit an unexpected path. You have to evaluate those reports in real time, but if you're trying to debug something, adding those validations — those extra if-statements — will help you flush out a problem. A few of the Kconfig symbols involved are shown below.
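As an illustration, here are a few of the real debug/hardening Kconfig options of that kind — which ones you'd actually enable depends on what you're chasing, and all of them cost performance:

```
CONFIG_KASAN=y              # kernel address sanitizer -- big slowdown
CONFIG_DEBUG_KMEMLEAK=y     # track suspected kernel memory leaks
CONFIG_PROVE_LOCKING=y      # lockdep: validate lock ordering at runtime
CONFIG_DEBUG_ATOMIC_SLEEP=y # warn when code sleeps in atomic context
CONFIG_DEBUG_LIST=y         # extra checks on kernel linked-list updates
```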
So far we've seen the set of tools the community really uses for testing the kernel. But as I said, having the tools means you also need to run them — as with any test project, even for user-space software, if you don't run your tests, your tests are useless. So there's a bunch of automation projects that cover some of the testing; there are about six projects these days doing something here.

The first one, probably the most useful to the kernel community so far, is the 0-day project, mainly maintained by Intel. It looks at every patch sent to the mailing lists, and at all the main git trees, and tries to compile them. If there's a failure, it sends an email to the patch author and the maintainer: "hey, this patch doesn't apply to the latest tree", or whatever broke. They also do a bunch of limited runtime testing. The big problem is that it's all a bit opaque — we don't know exactly what Intel is doing in those tests — but at least it does build testing and reports back to the upstream community, so when you send a patch you have some confidence it compiles in configurations you might not have tested yourself.

A lot of these projects just do build-and-boot testing, and you might think that's mere compile testing — why doesn't everybody do that? But look at the .config file you can generate: there are so many options and so many interactions that it's really, really hard to get proper coverage. Most people, when developing, will do an allyesconfig or an allnoconfig, and maybe check "does my patch compile with my settings, on my desktop, for my distro" — but what if somebody disables a lot of stuff, or enables some really specific stuff? Will my code still compile? We miss some of that, and automated tooling makes it much easier to test a bunch of different configuration combinations.

The other project, which is more community-based, is the kernelci project. This one mostly targets embedded development; it's based on the Linaro tooling and mainly does boot tests at this moment. They have, I think, a couple hundred target boards or simulators, so whenever there's a new kernel release — and they also watch a few specific trees, like the embedded tree or the real-time tree — they compile the latest, install it on a bunch of actual physical embedded boards across different architectures, like ARM, MIPS, even x86, and see whether at least this config will boot on the system. The question was: is it run by the community? Yes, it's mainly run by a bunch of community users; I think it's backed by — I don't remember the name — there's an organization that deals with embedded development that handles the funding, and people can contribute build machines or other resources to the project.

Another similar test system is kerneltests.org. It's also aimed at testing across architectures, and it's quite a bit simpler because it mostly uses QEMU — it doesn't run physical hardware for every architecture — but it tests 14 architectures: all the usual ones plus some of the more esoteric ones the kernel supports. Again it mainly does build-and-boot tests only; it doesn't run a bunch of system tests. But even just "does my system come up" — you'd be surprised how often that detects problems, where the machine simply doesn't boot, especially at the architecture level; if you mess up the initialization, you get problems. It's based on the Buildbot project, a build automation system — a bit dated these days, but still usable.

You might have been to the Phoronix website — they cover benchmarking of open-source tech, mainly — and they have this OpenBenchmarking initiative, which they run themselves and where you can contribute benchmark results from your own machines. They track performance regressions, and especially power regressions, in the kernel over time; they run validation on a lot of releases, lots of people contribute results, and you can search by kernel version or type of hardware.

The latest newcomer in this domain is the Fuego project, mainly driven by the LTSI — the Long Term Support Initiative — which is a consortium of people working on embedded development who want a really long-supported kernel for the automotive industry or the media industry. There's not much available there yet: they have a bunch of test descriptions and they want to run automation, but they're still working on their infrastructure.

If you want to implement your own automated testing, there's the Avocado framework, which is kind of the new implementation of the Autotest framework; it can be used to automate your own kernel testing. They don't do kernel testing themselves, but you can use it to automate your infrastructure. We tried using it, but hit a bunch of complications — the Avocado project was not mature enough when we were building our system, and it wasn't fully adapted to our use case.

So that's mostly it — at least the projects I'm aware of doing kernel automation. We can probably do more; contribute to those projects if you want to do more automated testing, or create your own. I don't think we can have too much testing in this area.

At Facebook, the way we do things currently: we build on our super old branch, which is a 4.0 kernel; we do testing on the current branch, a 4.6 kernel; we'll soon start on the next one, probably 4.10; and we're working on actually testing the upstream kernel too. Our kernels are not that dissimilar from upstream — we basically backport a few patches.
But the way we work, we have this upstream-first model: the work we do, we actually do on the upstream kernel, and then we backport it onto a more stable kernel that we've validated into the tree we deploy from. It takes quite a bit of time to deploy a new kernel to the whole fleet, so we use that time to stabilize the features and at least the performance metrics — we don't rush it. As we do more testing, our hope is to be able to release upstream kernels more often and rebase to the latest more frequently.

What we decided is that we won't go and write a lot of new tests — there are many tests out there — so let's reuse what exists. The first time around, we tried to take those tests and put them into an existing test infrastructure, and it turned out to be really, really hard, because the expectations Autotest had didn't fit the way our systems are structured. So instead, we reused the internal tools we already have for normal software testing: we have a tool called Sandcastle that builds every commit every engineer makes and runs tests on it. We basically plugged our kernel into that — we do the building on the same infrastructure, build the kernel, create RPMs, store them somewhere, and reserve machines we have dedicated to kernel testing. We usually try to get a few samples of every machine type we have in the fleet, so that covers a wide range of CPU generations, memory configurations, and network card configurations. Question? — The question is whether that includes mobile, or just data center: what we currently test is just data-center hardware; we don't cover Android and that kind of thing in this kernel testing. Reusing the existing tooling took some work at the beginning, but it made our lives much simpler: we don't have to maintain the whole infrastructure, and we can keep our focus on the actual kernel-testing part.

But what we have is not sufficient. There are many tests across all the areas I mentioned, and not everything is covered. One new thing that would be interesting to use: there's now a kernel coverage tool, kcov, which could be used to measure the coverage of all the existing suites and see which parts aren't tested yet. As we know, line coverage is not a sufficient metric to know you're well tested, but at least it gives an idea of where the tests reach — that's one aspect the community is working on, getting some metrics of which tool provides what coverage.

There are a few areas where we want to add more tests. Today we run LTP, we're about to bring in kcov, and we want a few others, because there are areas where we're rather lacking good testing. The first one is network testing, and the big difficulty is that you cannot do network testing with just a single machine: you need to coordinate tests across 2, 3, 10 machines to exercise the protocol interactions and verify whether you have speed regressions. That requires more complex infrastructure work to automate. A lot of people on the networking side use, for example, iperf for network traffic generation: they take two or three machines and a bunch of scripts that, say, start the programs — it's mostly done manually these days. We want to add more there and then contribute it back. The sketch below shows why even the trivial case takes two coordinated machines.
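Here is a bare-bones, iperf-flavored throughput probe — a hypothetical sketch, not any real tool: one binary, run as "probe server" on one box and "probe <server-addr>" on another. Real tools measure far more (latency, retransmits, parallel streams), but even this minimal version needs two machines started in the right order, which is exactly the coordination problem.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: probe server | probe <server-addr>")
		return
	}
	if os.Args[1] == "server" {
		ln, err := net.Listen("tcp", ":5201") // iperf3's default port
		if err != nil {
			panic(err)
		}
		conn, err := ln.Accept()
		if err != nil {
			panic(err)
		}
		n, _ := io.Copy(io.Discard, conn) // drain whatever the client sends
		fmt.Printf("received %d bytes\n", n)
		return
	}
	conn, err := net.Dial("tcp", os.Args[1]+":5201")
	if err != nil {
		panic(err)
	}
	buf := make([]byte, 1<<20) // push 1 MiB chunks for five seconds
	start := time.Now()
	total := 0
	for time.Since(start) < 5*time.Second {
		n, werr := conn.Write(buf)
		total += n
		if werr != nil {
			break
		}
	}
	conn.Close()
	fmt.Printf("sent %.1f MB/s\n", float64(total)/1e6/time.Since(start).Seconds())
}
```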
The other kind of test we want to do is synthetic tests that replay the kind of workloads we actually run. We've seen plenty of tests that trigger some part of it, but take HHVM: the way we use HHVM in our fleet is not the same as how the open-source community uses it, or how Wikipedia uses it. HHVM is a good example because it's open source — we can just run it and run tests against it — but there are other programs that are Facebook-specific and require a lot of the surrounding infrastructure, whole services. So we want to build small applications that do the same kinds of operations on the kernel side, and eventually even open-source them into one of the test suites.

The third part is measuring performance regressions. As I said earlier, it's hard to compare performance over time: if one test on one machine takes 10 seconds, and the same test on another machine takes 15 seconds — is that a regression, or is it because the CPU is slower, or it has a different hard disk? We want to find a solution, at least for us: some metrics tracked over time, and maybe shared metrics. Especially since we have Open Compute hardware, we could share that via the project: with really standardized hardware we know that, say, compute type 1 is exactly this hardware, so we could share metrics for that specific type. But again, it's hard to measure over time; we'll see what we add.

The other kind of extension: when we find a specific kind of problem, we usually try to write a test case for it and put it into the automation. We're not as diligent about that as we could be, and we could share more of it — we'll look at doing that in the next year. And I think this applies to the whole community: if you find a kernel panic, you should probably write a test that triggers that panic and put it somewhere — in kselftest or some other test suite. It's not a habit we have in the kernel community, and I think it would be a good one: if you see a bug, write a small repro, maybe contribute it to kselftest or to LTP, put it somewhere — it's better for everybody, so we don't all catch the same problem again.

But one of the biggest areas where we get our validation is canaries. We basically build a kernel — a release candidate of what will become the final kernel — and deploy it to the fleet, in prod, in a safe way: we deploy to a small subset. In the first phase of a canary, we have systems that can do what we call shadow traffic: sending the same requests to two similar sets of machines. Then we can easily compare the metrics: if one set serves fewer requests, or each request takes 5% more CPU, you know the new kernel has a problem somewhere. That's really good for finding the initial issues. Then we deploy to more of the fleet and just let it sit there: do we see more kernel panics? more memory usage? is the average CPU utilization higher? That last one is not a great metric, but it gives a rough idea. A better metric is usually the application telling us "my latency per request is higher" or "I'm doing fewer requests on this new kernel". You really want to watch metrics you care about, not just aggregates like CPU utilization — the overall CPU utilization might legitimately be higher because the kernel is doing more work to
schedule your programs the right way. For example — we talked about cgroup earlier — doing a bit more work to isolate your resources, so each process gets the right amount of memory, can give you better throughput. So you need to be careful about what you measure.

That's basically what we do for kernel testing at Facebook: we build every hour, we make sure we track the results, and at some point we'll share the results of upstream testing. We're really moving toward that, since we carry fewer and fewer internal patches; hopefully at some point we'll have no out-of-tree patches and can just run a raw upstream kernel. Then it becomes easy to say "this upstream kernel just broke our use case, or our performance" — we should fix that and share it with the community.

The last question is: what can you do? I think kernel quality is not just Linus's problem or the maintainers' problem; it's a shared problem among all users of the kernel. The main thing you can do is run one or more of the test suites I've mentioned and report problems. If you see something broken, report it; run the latest kernel, run some different .config files, and make sure you report what you find. There's a bugzilla — there's always the question of whether it's really looked at; some maintainers look more, some less — but ping somebody about your problem, ping your distribution, and make sure it's recorded. Add more tests: is there a use case you care about, a syscall you care about? You want video gaming or graphics to work really well? Write test cases for it and contribute them to one of the projects I mentioned, or create a new project, publicize it, go talk about it at conferences. We just want more tests; we need to increase the coverage. If you can, fix the bugs: you find a bug, fix it — it's a good way to get introduced to kernel development, and an existing bug is easy to prove if you have a good repro. For example, TJ talked about cgroup earlier: that's an area with a few tests in LTP but no big test suite, and it would be really interesting to extend the testing there. So find an area you're interested in and talk to the maintainer — maintainers usually have a good idea of what needs to be tested, and of what's done manually today that could be automated. There's still a lot to do, and we need to improve. I think overall the quality is improving, but as we've seen, over the years there's been a lot of news about major crashes and security problems that have surfaced — we want to make sure we don't get that kind of news anymore. So that's mostly it; that's the kind of challenge we work on at Facebook. Thank you. Questions?

Yeah — the question is how long our test cycle is. I'm not sure exactly what you mean — a test run, or the release process, how many weeks? It depends what we're validating. We build a new kernel every week that we release to the fleet: it's a release candidate that we deploy to some machines. We do a build every hour, and run a bunch of tests every hour, on every kernel branch we have. We release a kernel internally every 4 to 6 weeks — still the same major kernel — and we rebase to a new major kernel every year at Facebook: we did 4.6 last year, and we'll probably do 4.10 or 4.11 this year. So that's the current cadence for releasing kernels, and that's the validation period we
go through. Some services take more time — a few releases — to get a good validation, with the cycle of fixing bugs and getting there; others will say "we have really good performance metrics, the kernel is good" right away. As we add more automation to the testing, I think we want to shorten those periods — the rebase periods too — and be able to release to the whole fleet every month. That's a goal we're aiming at, but we're not there yet.

The other question — the question is basically: as you validate a new kernel, you might find performance regressions and have to tweak things to gain the performance back. Yes, that's a big challenge. People have set tunables on some previous kernel, and on the new kernel the impact is different, or some default value changed and you forgot to revalidate it, and you need to go change those tunables again. It's a big part of the validation process we do: looking at the performance metrics, then asking, did a parameter change? Is it a memory problem — maybe we need to tweak the dirty ratio or some other setting to gain the performance back? One of our dreams is to find a way to automate that — just go tweak the knobs, do some AI to find the right values — but that's probably a really far-down-the-road dream; we don't have a good solution for it. For new knobs, it's experience: we learn as we go, and if one team has figured something out, we can apply the same solution to other teams, but there's no holy grail in that area. You need good kernel experts, and you look at the history — was there a new knob added between releases? That's another benefit of doing more frequent releases: fewer things change between releases, and you can focus your investigation on a more specific part.

Yeah — the question is whether the kernel testing process at Facebook includes using the Facebook applications. Yes, that's a big part of the canary phase. Basic tooling-level testing will find the obvious problems, but the more subtle problems usually only show up when you run an actual, real workload — time and again, and a few colleagues here can back me up with some of the problems they've brought me. Sometimes you need the full workload: the kernel behaves differently when you use all the memory and all the CPU together, when you stress the machine a lot, and that's where you tend to find the subtleties in the performance world.

How do we report? Let me check if I can get one of the dashboards up quickly. We have dashboards and alerts for each test we run — let me see if I have something here. Yeah, exactly. And one problem we had recently: tests were red and nobody read the results. Some of the tests in prod broke, we didn't have an alert on that area, and they stayed broken for a few days until somebody went and looked manually. So you need to make sure you have alerting, or people looking at the graphs, to make this work properly. I don't have the presentation for it here, but we have a bunch of dashboards that show at least the trend, per kernel version, and the list of major failing tests — like, these three tests failed in
the last two hours — and we can go back through that kind of information. I can show you after the talk. Anything else?

Yeah — the question is about my mention that Avocado wasn't really working for us. The main reason was just that we already have tools to manage our machines. Avocado has tooling to reprovision a machine, connect to it, install a kernel — but we already had internal tooling for that automation: reformatting machines, connecting to them, installing software on them. It was simpler for us to reuse those components than to adapt Avocado and Autotest to use our internal pieces; the automation layer was already available in the other tools we had, like I mentioned, in Sandcastle. At that point there wasn't much left to gain. There is one interesting component we might look into — I don't remember its name again — a test-matrix generator: you say "this is the hardware I have, this is the set of tests I have, I want to run all the tests on all the hardware; generate me a test matrix so I'm sure I cover everything". That could be worth extracting and reusing, but the rest was just too much glue for us. If you use some of the open-source components for provisioning or connecting to machines, those can be leveraged; we're just a bit too different for that. The important part for us is actually running the tests — the execution framework.

The question is whether we do performance testing on VMs, on virtualization. Not really. The main reason is that we have almost no VMs in production, so the part that matters is the performance on actual hardware; we prefer doing performance measurement on real hardware rather than in a virtual machine, which gives you a different kind of result. We might use VMs at some point for more functional testing — quick build-and-boot regression combos — more than performance regression. And we have really good tools to provision a machine: if we break it, it's really easy for us to redo it, so bare-metal testing works for us.

Yeah — the question is whether we've adopted the full CoreOS container stack. The answer is no: we have our own container solution internally, developed before Docker and all these things existed, so we're really invested in that. We do use standard cgroups — and cgroup v2 — inside that solution, and our infrastructure runs on CentOS 7, so we use some of the systemd stack in prod there, but not the full CoreOS setup. Anything else? Alright — thank you, and grab me afterwards if you have any more questions. Thank you.

Hello — is it only the truly dedicated left? Sunday until three? Oh, OK. Alright — we might deliver on that promise today; if each of you keeps a finger crossed, we'll be good. Beautiful. Alright, just a couple of minutes after — welcome. This kind of reminds me of grad school, where only the serious stuck around to do the extra work; I sense I'm in the presence of some like-minded nerds. This is good. We're going to talk about a few things today: one is unikernels in general — a little bit of an intro there — and then we're also going to talk about a project called UniK, an open-source project that is a bit like what Docker is to containers; UniK is sort of that to unikernels, except younger. So as we go through these, if I go through anything too
quickly, take a quick note of the URL — the slides are up there already, and so are a couple of other talks I've given. Whether we choose to suffer through more than the next 45 minutes, that's up to you. OK — yeah, let me go back real quick. So while folks are writing that down: quick, let me get to know you a little better. Who are we today — Linux systems administrators, generally? Are we developing? Are we operating? Somewhere kind of in the middle? Are we consulting? Very good — you will point out my mistakes today, along with Brendan. Very good; OK, good, we'll have some fun.

A little bit about me: I gave a talk earlier on Thursday and made the crack about this being my favorite subject, so I won't make that crack again — despite it being, you know, not true anyway. So you have some background on me and can either believe or disbelieve the things I'm about to tell you for the next 40 minutes. These are various activities I'm into. I've been fairly container-oriented for the last three years: I was at Cisco at the time, organized the Docker-at-Cisco community, helped steward some new innovation projects there, then went off to Seagate and did some cloudy things — some things with OpenShift v3, with Kubernetes — and I've generally made myself a nuisance organizing meetups out in Austin and some local conferences out there. I'm coming to you from Austin; I'm still working on my Texas accent — it doesn't come through, but it does occasionally, so harass me at will. It's a small enough audience that you can harass me in person, and if not, feel free to do it online. My current role is at SolarWinds — I've been there coming up on seven months now, focused on technology strategy; we've got a small team with some bright folks working on innovation projects. I thought I'd ask real quick: folks, any of you familiar with SolarWinds software? Yeah — you know, it's got a healthy set of portfolios that I won't necessarily walk you through today, but they're very relevant to monitoring and management of infrastructure: historically on-premises oriented, but with some cloudy, some SaaS offerings out there as well. So that cuts short some of my other questions.

Anybody suffer through the talk I gave on Thursday? Yeah — the guy didn't learn the first time. But I thought I'd quickly note, to the extent that containers have entered your world: if you're looking for a free tool to understand what containers you're running in your environment and get some visibility into them — understand what network traffic is flowing between them, the volume and direction of that traffic, maybe even do some bandwidth tests between them to verify how performant the different types of network drivers you're running are — it's a very early tool, but it could be beneficial to you.

This is the part where I say — well, where I ask the question: so, you're into kernel development, but how many of you are new to unikernels as a topic? Very good. Excellent — you are in good company, I guess is what I'd say. Like I said before, I've had a heavy focus on containers and I'm just making my journey into unikernels, so we're going to do some of that today. And I thought I'd offer up this perspective — it really is just my perspective. I've had folks come up and say, "well, now that unikernels are here, I'm just going to skip containers and go right to those", and that may very well be true. My perspective is that our future is much more an "and",
sort of all of the above. Just as I think you were talking about container orchestrators, and the choice there, and trying to figure out which one is the right one — while it's painful to make that choice, I for one am glad we have a choice of container orchestrators, as well as a choice of different types of infrastructure, because these are built for, and work for, different use cases. It isn't one-size-fits-all here. So my perspective is that we'll be living in a world where one or more of these work in conjunction in the same environment, and we're going to see if I can't talk a little about that today and give you an example of it with UniK.

I'll say this of the infrastructure I just talked about: depending on the conversation you're having, I'd say people consider infrastructure maybe a secondary concern. If we understand why the infrastructure is there — and this is probably not new to any of you — we run infrastructure in order to run the applications, or maybe network services, whatever those are. So to the extent that people care about the application, the application kind of becomes king, the center of focus. I tend to think that for some of the newer software development projects — those done in a continuous way, those done in a SaaS way — development often leads, and it's often the developers who, as they go in and define the project, end up defining the infrastructure as well. I think that's relevant to our talk in that they may elect unikernels as the best fit for a particular application — maybe not. I guess I'll risk this question: if the application is king — just assume that for a second — who would be the queen? Maybe the developer. Sort of a tongue-in-cheek question there.

Anyway: the developer — and not just the developer, but the operators, everyone involved, the systems administrators — experience challenges with the way we do applications today. Let me see if I can't enumerate them. We've got a challenge around fat systems: systems that have many layers to them, not all of which — as we're going to see today, at least in a unikernel world — we necessarily need. If what I said before about that central focus on applications is true, these are the things that are front and center and near and dear to the application; so is the rest, but our hope, with the promise of portability of containers and portability of unikernels, is that some of those other layers become less and less important.

There are some other issues, other challenges we face today with the way our applications are built and deployed. There are inefficiencies: we've got a long — and I guess "long" is a relative adjective — a long time for VMs to start up, a long time for containers to start up. We're talking maybe a couple of minutes with a VM, probably seconds with a container, and then getting into milliseconds with a unikernel — and that starts to change the game a little about how you run your applications, how you run your services. Our systems today have been designed — if you think about them — as general-purpose, multi-user operating systems; they're
built with many users in mind; they're not necessarily built to run a specific application under a specific use case. We're going to start to contrast that against unikernels. Interestingly, a lot of the way we've developed software has certainly evolved over time — particularly, in the forefront of my mind, the difference between on-premises, enterprise-architected software, the way that's downloaded and installed, versus continuously delivered SaaS software that's just iterated upon; there's a significant difference there in my mind. But I'll say that outside of that, the way we package up our applications and deploy them hasn't changed nearly as much as the hardware has evolved over time. Think back to the mainframes of the '50s and '60s: with extremely expensive computers, the business decisions had to be about gleaning as much value out of them as you could. That's not necessarily the case anymore — we've got phones that are as powerful. Though, interestingly, back to my point that our infrastructure future is more of an "and" than an "or": we've still got mainframes processing millions of transactions a day.

Anyway, there are other issues here. Others have given talks, others have actually written free e-books on the topic of unikernels — to the extent that I'm young in my journey here, I'd point you at free e-books like that, and at some other talks that may do this better justice — but I'll highlight some of the problems we're facing and why it is that people are compelled to go to unikernels. Beyond the ones I just said, there's security. Consider the lines of code within the Linux kernel: if I'm not mistaken, it's somewhere around 2 million. That's large, and it makes for a large attack surface. And as you deploy a given distro, like Debian or Ubuntu, I think that's on the order of 4.5 million lines of code — oh, tell me — in an Ubuntu distro? Oh, OK — so I've got a zero off. Very good — so it's worse than I'm painting. The other issue — tell me if this is true — is that as you go to deploy a given operating system, you'll often find you're running package managers pulling in additional packages you don't necessarily need, maybe running additional services, and you've got drivers — for floppy disks, say — that you're not using, depending on what your application is doing. So you've got these risks lying around.

So when I iterate through some of the promises of unikernels — and I for one will certainly say, back to what I said before, that this isn't a panacea; these probably make containers look like more of a panacea, and there are still some really compelling use cases for containers — let's try to define and come to understand what a unikernel is. I don't know if many of you know Ian — I forget his last name — the CEO at DeferPanic; he ended up giving a definition at this last OSCON. He said it's a way of cross-compiling applications down to very small, lightweight, secure virtual machines — essentially down to bootable images. Yes sir — impressive, impressive. And that's actually a good kind of figure to hold in your head as you go to think about what unikernels can cross-compile down to: some of them get into the kilobytes, which again is sort of game-changing. So if this is kind of our stack here, as you go to compile a unikernel, what ends up
happening is that you can essentially describe a unikernel as a library operating system: you go through a process by which only the necessary libraries your application is using get pulled into what becomes a machine-bootable image — and that becomes the unikernel. Those are bootable: oftentimes people run unikernels on hypervisors like Xen; sometimes they run them inside VMs — though I think you'll hear people argue that VMs aren't necessarily the best place for unikernels; you can certainly run them there, but maybe you only get so much of the efficiency — and they're also bootable on bare metal. When you think about how small these are, and how secure they are, IoT as a use case really comes front and center: Raspberry Pis and other devices.

This is arguably not all-inclusive, but it's one perspective on what the landscape looks like right now — both the different types of unikernels themselves and the projects within the ecosystem. What often gets described here — and I'm probably going to bastardize this — is that the unikernels you see on the general-purpose side are the ones where you might be able to take an existing application and compile it into a unikernel; those are the POSIX-compliant unikernels, so conceptually much of the code you've written today is likely to work, though that isn't always the case. The language-specific unikernels are more where you'd start with a clean slate and implement using one of the languages they lead you into. Interestingly, a couple of the projects here, like Jitsu — folks heard of that one? — are neat: Jitsu is just-in-time summoning. It's a forwarding DNS server, and as it receives a DNS request, in the time it takes to respond back to the DNS client with the IP address of the server for that service, it will invoke and spin up the unikernel that then services the request when the client comes back. Neat microservices, in that way.

So there's an improvement in security: to the extent that there's no multi-user support in unikernels — it's a single user space, a single address space — there are no passwords or authorization info lying around to be hacked, and many of the attack vectors I'd intimated at with the larger operating systems are conceptually gone. There's sort of this in-the-box security that comes from being so much smaller: really only the libraries you need are linked in, statically. Contrast that with how folks will sometimes go out and turn off SELinux, or not define their seccomp policies or their AppArmor policies until after they've deployed — and then, in good faith, mean to come back and define those policies. Anyway: there's just more security out of the box.

We also get to the point where unikernels might actually be the canonical representation of immutable infrastructure — is that a familiar concept to most of you, the pets-and-cattle stuff? Not at all? Anyway — a quick reminder about microservices, what they're intended to be: unikernels can really begin to look a lot like that. I think if you talk to folks about microservices, and you're doing it in a containers context, and you mention you're running multiple processes in containers, you might get slapped on the wrist for running more than one process in that container. Now, that isn't necessarily the wrong thing to do — there's absolutely a call for running multiple processes in a
container — but what you're going to find with unikernels is that they're going to want to run a singular process; even things like SSH won't be there. Yes — good question; I'll repeat it: the question was, when you're talking about immutable infrastructure, how do you deal with state — where do you store that stuff? My understanding — and we're going to see it with the UniK project here — is that you can statically link to an external volume, and in a similar fashion to how you might treat volumes today with containers, you'd do so with unikernels as well. Alright. And then there are some efficiency savings, probably some dollar savings, as you go to share access to a high-end system — you get to utilize that infrastructure more efficiently. So: lots of promises.

If we talk about UniK for a minute — or for the remainder of the talk: UniK is a tool for doing that cross-compilation. I think I tried to put it succinctly before and say that UniK is sort of like Docker for unikernels; it helps you build unikernels and deploy them. Now, for the most part it wraps the other projects we were just talking about — it leverages rumprun and OSv and MirageOS — really orchestrating, kind of pulling together the value of those other projects we just saw, but making it easier and putting it into the hands of mere mortals like me. So it is kind of akin to how Docker builds and deploys. UniK has what we're going to find are called providers — many targets it can deploy to, and we're going to look at what those are. It compiles these languages, at least currently, and like I said it can target many different virtualization platforms; we're going to look at the specific ones it incorporates. It's building upon the shoulders of the other unikernel projects. It's a young project: it was first announced this last May — no doubt the team was working on it before then — so it's about nine months old. The project itself is stewarded by these folks here; these are links to the project's Slack, the Twitter, and the GitHub. The GitHub has some juicy details, and the team is always highly encouraging of folks reaching out and engaging. I think the GitHub has about 1,200 stars right now, for whatever that means, and the Slack channel has just under 100 folks on it; it's been growing.

There are three major components to the project. One is the unik daemon — it's an API server, and we're going to take a look at what its function is. There are the compilers that unik will take advantage of, like we just talked about — unik supports the languages listed under those unikernels. And then it supports different providers: to the extent that it's building a unikernel, it needs to do so specific to your target, your provider, and take into account what that form factor looks like — in the case of AWS, that's an AMI, a really small AMI. So a pretty decent set of integration there. Also, if you're familiar with Docker Hub as a community exchange for pushing and pulling the images you build, the same deal is happening with this project — it's called UniK hub. It's up and available; there isn't a UI to it, but it's cloud-hosted right now, backed by AWS S3. You can download it and run it on-premises, though if you do, out of the box it's still going to want to push to S3. The project itself — like we just went through —
has the list of providers supported, but it also supports these two architectures — and in this case, with the sweet spot for unikernels probably being the world of the Internet of Things, the support for ARM becomes important in that regard. The team itself has done demonstrations of running unikernels on Raspberry Pis — and of cooking toast — on stage.

There are a couple of very interesting integrations in the cloud-native ecosystem, if you will: there's Cloud Foundry, if you can read it up there, and Kubernetes. The integration for Docker is this: the team understands Docker is on fire, and many people have come to understand that CLI and that way of dealing with containers — that's not legible, is it? — anyway, UniK has an integration with Docker such that if you want to use docker run or docker ps, you can: you use the -H flag to point at the unik daemon, the API server we had on the last slide. In essence, you can do things like docker run, and UniK will spin up a unikernel in place of a container — so if you've invested in that interface and you've got scripts around it, you can still use it. You can't necessarily do docker build, though, because that's just such a different process for compiling unikernels than it is for building containers — one caveat to note. I'll show a quick sketch of what that pointing looks like right after the other integrations.

UniK also integrates with Cloud Foundry as a runtime: there's a UniK Cloud Foundry buildpack that you can take and install in a Cloud Foundry deployment, and that gives Cloud Foundry the ability to essentially orchestrate unikernels. That's very helpful, because it's one area we haven't really talked about yet: to the extent that UniK is helping you build and deploy unikernels, it's not sitting there on an ongoing basis orchestrating and managing them. If you deploy a unikernel and it dies, UniK isn't sitting there running a reconciliation loop to make sure it's back up — so tools like Cloud Foundry really help with that, and not just Cloud Foundry but Kubernetes as well. The integration there works in a similar fashion: now that the team has done an integration with Kubernetes, Kubernetes supports not only multiple container runtimes, like Docker and rkt, but also UniK. So back to what I said before about the future being an "and" — my belief that you'll sometimes find VMs and containers running together, and VMs and bare metal together — with this integration, Kubernetes is able to run containers and unikernels side by side. Kind of nice. I'll say it again: like I said before, it's a young project, so there are some caveats to this functionality.
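As a sketch of that Docker integration: the -H flag below is standard Docker (it just points the client at a different daemon); the address and port of the unik daemon are my assumption here — in this demo the UniK components listen on port 3000, but check your own setup.

```
docker -H tcp://localhost:3000 ps          # list running unikernel instances
docker -H tcp://localhost:3000 run <image> # boot a unikernel from an image
```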
But I thought I'd try to deliver on the promise from before about walking through what the project looks like: how do you get it up and going, how do you install it, what do you need on your box — all very fresh for me. Actually, on that topic: I've got a MacBook Pro up here, and I've got Docker for Mac on it, which is actually important in terms of how the unik binary — the one you use to build unikernels and deploy unikernels — gets built: its make process leverages Docker containers, so just during the initial setup you need a Docker daemon, a Docker host, present. To get started it's pretty simple: you go out to the GitHub URL, git clone it on your box, cd into that directory, and do a make binary. That kicks off a long process of going out, pulling down different Docker containers, and essentially building the unik binary specific to your environment — all really rather straightforward.

Let me see if I can't walk you through this on the terminal, starting from the point where I've done that and I've got the unik CLI, the unik binary. So — the question is: what's the build artifact we're coming out with? The process we're about to go through is: we're going to write a very small Go HTTP server, then take unik and tell it to build a unikernel that runs our small Go program, and the output unik gives us is a unikernel — I think it's about a 40-meg unikernel that in this case uses rump as the unikernel base. The end artifact is a roughly 40-meg, machine-bootable image, and in our case we're going to boot that image up. We do have the choice in unik of those different providers — AWS and OpenStack and the ones we'll list — but in this case we're just going to boot it on VirtualBox on my Mac, as — to your point — a machine image.

Going back real quick: we've built unik, we've got the unik binary on our box, and we do a unik configure — everybody, this is good, right? This is large? — and it's essentially going to ask us a series of questions about what our providers are, what the target systems are that we might want to deploy to. That becomes important — well, obviously important — so that unik can deploy to those environments. Also, depending on the environment, unik might want to deploy a small listener daemon, a small listener agent — we're going to see that it does that on VirtualBox — as a way of (the way I should have diagrammed this out) the binary speaking into, in this case, the VirtualBox environment: a small agent that can actually affect that infrastructure, spin up new VMs. It's not always a required component, but we'll see it here. So in this case we're not going to do AWS or GCE or OpenStack or QEMU or ukvm; we are going to do VirtualBox. I press no — maybe — let me go back. And with respect to VirtualBox (which, by the way, is typically used in a development/test scenario), there are two choices for the type of supported networking: host-only networking or bridged networking. I think the recommendation is that you use host-only networking, so that once your unikernel is deployed, it's connected to your host network and you can just hit it directly. So I type in host-only, and then it wants you to type in the name of that network adapter — in this case it's vboxnet0 — and then we go through saying no, we don't have any other places we want to deploy to. What that ends up doing is writing, in a small hidden folder in your home directory, a small daemon-config file. In this case it's a very simple file: it just lists what your providers are — the one we just set up, with a name, the vbox adapter, and those two choices. Really rather straightforward; it comes out looking roughly like the sketch below.
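From memory of the UniK README — so treat the exact file name and field names as assumptions that may differ by version — the daemon config is a small YAML file along these lines:

```yaml
# ~/.unik/daemon-config.yaml (illustrative)
providers:
  virtualbox:
    - name: my-vbox
      adapter_type: host_only
      adapter_name: vboxnet0
```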
One thing I will say — and maybe I'll show it now: I've got my VirtualBox here, and one piece of setup I've done is that in the network preferences I've gone in and created a host-only network, and it's called vboxnet0. All I did was click the plus sign; it created the network, and it will handle DHCP serving for VMs that spin up on that network — if we look at it real quick, it's got a range starting at .100 and going up to .254. So we've configured our environment, which looks generally like this.

Now we need to spin up the API server, the unik daemon: unik daemon, in this case in debug mode, which is going to be horrifically verbose. What ends up happening is that the daemon spins up and sends out a UDP broadcast to find whether there are any instance listeners — any of those agents I was talking about. Like I said, we configured the binary to use the VirtualBox provider, so it knows it's looking at VirtualBox locally and wants to use that host-only network we gave it; it goes out looking for a listener there, and if it doesn't find one, it goes ahead and creates one — a small unikernel running as a VM — and in fact it did: this is just a small UniK instance listener. OK, so now we have that UniK instance listener sitting out there, listening on port 3000. We'll just leave the daemon running and open up a new shell.

Since we've got it up and we're connected to the daemon, it's probably time to familiarize ourselves a little with unik itself, so we can start fiddling with it. So — obviously the CLI command is unik, and it takes these different commands. If we were to get familiar with the environment a little, we could run a unik ps — is that legible? I hope it is — and we do see one instance in there; if you're familiar with Docker, this kind of feels the same. That one instance is the listener instance we just saw. If we take a look at what unikernel images we've created in the past: right now we've just got that one, the instance listener. If we take a look at what providers we've configured: we've configured VirtualBox. So you can just kind of run through these commands. I said there was a UniK hub where you could push and pull unikernel images; we could do a quick search against it — unik search goes out to that hosted service — and there's not much in the community catalog: about four or five example unikernels you can pull down and run locally.

So we've begun to familiarize ourselves with unik; let's walk through that developer workflow we were talking about before — let's make a small little Go HTTP guy, create a unikernel out of it, and deploy that unikernel. Within my environment we've got hopefully the smallest amount of code we could possibly have for creating a Go HTTP server: we've got our http.go, and we've got our Go dependencies defined. If we go into http.go, there's a very small set of packages; we're essentially just spinning up an HTTP response writer and writing out the smallest bit of HTML here. If we do a go run http.go — I told the Go server to listen on port 80, I'm sorry, 8080 — we see we don't have any issues in our syntax; our code works, and we can stop that HTTP server from running. The whole file amounts to roughly what you see below. Now what we need to do is go out and build a unikernel from that Go code.
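For reference, the demo's http.go is essentially this — a reconstruction of what's described on screen rather than a verbatim copy, but the moving parts (one handler writing a scrap of HTML, listening on 8080) are the same:

```go
package main

import (
	"fmt"
	"net/http"
)

// The smallest useful HTTP server: one handler, one port. UniK will
// compile this, plus only the libraries it needs, into a bootable image.
func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<h1>Hello from a unikernel</h1>")
	})
	http.ListenAndServe(":8080", nil)
}
```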
Let me step you guys through this command here — actually, I'm afraid that as soon as I paste it, it's going to run. So: the command is unik build. We're going to give this particular image a name — I've just called it go-calcote, to encourage myself, I guess. We're going to tell it where to pull the Go code from: our local path, the one we're currently in, with the HTTP guy in it. We tell it what unikernel base to use — in this case we're choosing rump as the base unikernel — and what language we're using: Go. And then we need to tell it, as it's building this unikernel, what the ultimate environment is that we're looking to deploy into, so that it builds it in an infrastructure-conscientious way.

The question was — and this is one I'm going to have a horrific answer for — within Java, as you go to build your application, you've no doubt got all kinds of different dependencies; in a unikernel environment, how does it know which dependencies to pull in, specific to your application? I will not give that answer its best — I'm unqualified there, I guess is what I'd say; that is some of the magic behind it. Part of the reason I'm drawn to UniK is that it handles this. Specific to this example — for the Go language specifically — there's this Godeps folder; if we go in there, there's one file, a Godeps JSON file, and that in part plays a role in identifying where some of those dependencies are. UniK itself leverages some of the capabilities those unikernel environments already have — in this case it's using rump as its base unikernel image, and those individual projects already have the ability to compile your code into a unikernel themselves. UniK as a project is coming in on top to say "let me make this even easier for engineers to come in and use", and it adds a lot of value around where to deploy to, but it absolutely sits on top of the existing capabilities. So I think part of the answer isn't even UniK-specific: it's however those compilers identify which libraries are used by your application.

OK, good — let me run this build command again. Oops — I'm in the wrong directory. OK. So we executed the command we just stepped through: it takes our simple Go file from the current directory and builds it for VirtualBox; it's actually executing the build in the other window, the one I'm running in debug mode. It generally takes about 30 seconds or so — looks like it completed in 23 seconds here; it just built that small a guy. So we gave it the name go-calcote; it's built for the VirtualBox infrastructure; it's been created; and it's of a size of almost 40 megs. If we now run the unik images command, we should see we have two images: the listener, and the one we just built, that 40-meg guy.

Alright — the next step: we built it, so let's deploy it. (Yeah — I was going to make it a little prettier and add some things beyond the Go HTTP server, a little image, but that doesn't really matter; we built it based on what it said before.) If we want to go ahead and deploy it, this is the command: it's unik run. We're going to give the instance we're spinning up a name — in this case just scale15x — and we're going to say from which image the new instance should be deployed: the one we just made, go-calcote. Put together, the two commands look roughly like the sketch below.
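The flag names here are as I remember them from the UniK README, so treat the exact spelling as an assumption and check unik --help on your version; the shape of the invocation, though, matches what the demo walks through:

```
unik build --name go-calcote --path ./ \
     --base rump --language go --provider virtualbox
unik run --instanceName scale15x --imageName go-calcote
```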
So let me clear the screen here and just paste the unik run, give it this name, from that image. And did it? The screen kind of goes off here, but what you should be seeing at the end of the line is that the state is pending; it received the command, it's processing, spinning up the unikernel. Really, it's already done. So if we do unik ps, we should see that we have that instance running. It's spun it up in VirtualBox, on that host network that we talked about before, and we should be able to go to this IP address on port 8080. And yeah, I was going to change it out with a pretty image, but hey. Anyway, if we go into our VirtualBox environment, we've got a new VM running, the 39-meg guy called scale15x.

There are some things the team is working on. So, to the extent that UniK is intended to help people do that build and do those deployments much more easily: part of the weak area for unikernels is the ability to debug a unikernel once you've deployed it. It's a single thing, a single process with multiple threads that you can run, and one of those processes isn't going to be SSH, because that would be a second process, so getting into it to debug it is hard. We're kind of going back to that immutable infrastructure idea; the way these things run doesn't lend itself very well to petting it and updating it over time. But it also creates an issue: to the extent that you're running a unikernel and you're having problems with it, you need to debug it. One of the things that the UniK project provides is a command called unik logs. Unfortunately, I don't think this command works in my environment, but we'll give it a shot. It's a simple command, unik logs, and you tell it what instance you'd like to get the logs from. Actually, I used the command wrong; it's just instance equals scale15x. It does work in my environment: it pulled back the logs saying that UniK was bootstrapping, deployed it, got it up and running, got it an IP address, which is nice. That's one way of getting at some of your standard out and standard error. Another way that UniK provides is, if we go back to the unikernel itself, UniK will expose a port by default on 9967, slash logs, and it will spit out those same logs there.

That's my understanding, that there isn't a process manager included in the unikernel box, if you will. I will caveat that with: I know as much as I have learned. So, in that ebook I pointed out, Russell Pavlicek, who wrote it, said, and I quote: there are no functions for managing a process in there, so no starting, stopping, restarting, no supervisor if you will. So, you know, beyond that I have to claim ignorance. Yeah, oh, okay. So I guess, to answer the first part, back to the IoT use case: where unikernels could fit really well is to be spun up on the metal, if you will, on the IoT device. Okay, yeah, I'm not going to venture further; that would be my thinking. The reason I didn't say that is because my belief is, hey, that's Rump's responsibility, but I don't know if Rump is necessarily doing it, because it would have the same time-bound challenge as well. So, yeah, there were kind of two ways that UniK helps getting at some debug info: one was using the unik logs command, where you just give it the instance name and it pulls the logs over.
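The run-and-inspect sequence from the demo, in command form (again, flag names are approximate reconstructions from the talk; the instance IP is a placeholder):

    # spin up an instance from the image we just built, then inspect it
    unik run --instanceName scale15x --imageName go-calcote
    unik ps                          # lists running instances and their IPs
    unik logs --instance scale15x    # bootstrap log, assigned IP, stdout/stderr
    # the same log is also served by the instance itself on port 9967:
    curl http://<instance-ip>:9967/logs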
Yeah, I think so. My understanding is that it ends up opening a UDP port on 9967 for each unikernel, listening to the world, which strikes me as something you're going to want to be aware of and make sure you've got the right security around; but I think the initial thought there is for debug. Over HTTP as well, that same port will spit out the same logs. Yeah, it's a pain area of unikernels in general, and so to the extent that they're helping facilitate that, I totally agree. Again, I just want to be totally transparent with you guys about my level of knowledge, but it might also just be the case that UniK is facilitating a mechanism to retrieve logs, and then it's up to your application; it might not be taking into account any common infrastructure metrics you might expect, and it might be sort of up to you to spit things out to standard out and standard error. Are any of you guys familiar with Prometheus? Anyway, within the cloud-native ecosystem it's a popular metrics tool, time-series-based metrics, and it really, really wants to do that only over HTTP, and really wants to pull the statistics. So this seems like a perfect way for a system like Prometheus to get at those metrics. Well, I am all tapped out, or there might be a couple other nuggets in there that I know, but yeah, that's good. I wonder if that isn't dictated, again, by the choice, in this case, of Rump; let me go back to the slide that talks about that. Yeah, that's right. Moreover, if you consider how quickly the project has come together, and just this small handful of folks that have been focused on it full time... but yeah, I think the answer generally is that, hey, it is. Actually, the next project that this group is working on is called LayerX, and it's intended to be even further up the orchestration stack, so yeah, they are absolutely just leveraging a downstream capability. A compiler of compilers is a good way to put it, kind of. Well, thanks for the education, I guess. Okay, excellent. Oh yeah, it is the URL at the bottom. Thanks so much, guys, it was nice.

Yeah, who can say what to whom, you know. Because, you know, I went to a number of the presentations in IoT, and I haven't slept that night, with all the security fears. But this looks like it could handle a lot of that stuff. You could essentially compile something that says: this goes into the chip with this network; all the chips have these serial numbers, this one goes here, this one goes here; and you install them, and it would be really, really hard to break in. Right, right. And you could even match them to the NIC number, and, you know, if you could break into that, I'll buy you lunch. Don't worry about me buying lunch. So, I'm not a black-hat hacker, but I've worked in medical and I've worked in aerospace, and I have one core competence: I'm really, really good at being scared. Professionally paranoid; they pay me for this, I'll tell you. And I'd like to know more about what you guys are doing. Right, thanks.

Thanks, everyone, for coming to the last SCALE session. The topic for this presentation is unusual ways of using application checkpoint/restore.
The agenda of this talk is: first of all, we'll see what checkpoint/restore is, and what checkpoint/restore in user space is; then we'll go on to see the default application for checkpoint/restore, which is live migration; we'll see some other interesting use cases beyond live migration; and finally I'm going to introduce you to some of the software that we either just released or are about to release that is also very relevant to checkpoint and restore.

A few words about myself. The name is Kirill Kolyshkin, and Kirill is just a Greek version of Cyril, as they say in France; there was a guy who invented the Cyrillic alphabet, Cyril, Kirill. People always have a problem calling me by my name, so I have a few alternatives for you: you can always call me by my Starbucks name, or by my seafood name, which is not shrimp, which is krill. Well, I've been doing containers since before the word appeared, and I'm mostly known for leading the OpenVZ project, which was containers before the term containers appeared; I led it right from the start and still do today. I also happen to be a veteran speaker and exhibitor at SCALE: my first SCALE was 4x in 2005, and while I had a few omissions, that's a lot of SCALEs for me. I'm originally from Moscow, Russia, and I'm now living in Seattle, Washington. The SCALEs marked in red are the ones I visited from Moscow, came to speak or exhibit or both, and the blue ones are from Seattle.

So I'm going to talk about checkpoint/restore in user space, that is, CRIU. Well, you could say CRU, but somehow we decided to call it "kree-oo". The story of CRIU goes as follows. Around 2004, when we were working on OpenVZ containers, we had our own kernel, our own fork of the Linux kernel, and we implemented checkpoint/restore for containers, to be used for live-migrating those containers between machines. One goal of the OpenVZ project was to merge as much of the container functionality we had in our own kernel upstream, and we succeeded with quite a lot of stuff; I mean, currently you can run containers on a vanilla kernel, and that's partially due to our work. The biggest piece was probably the network namespace and the PID namespace, some cgroups functionality, et cetera. But it didn't happen that way with checkpoint/restore: it never got accepted, we just failed miserably. It was in-kernel code, it was pretty complicated, and it touched a lot of different subsystems in the Linux kernel, except maybe for drivers, so each and every subsystem maintainer hated us for sending this code, for making their nice little code crappy and bloated by checkpoint and restore. So we decided to get around it and implement the same functionality in user space. That was a crazy idea, but now, after four years, we can definitely say that it works for us, and it works for everyone else.

The main idea here is to use the /proc file system to gather information about the running processes; there's a lot of info in there about a process, so we can checkpoint processes by using that information. The other thing is we can use the ptrace debug facilities, and we also use some other mechanisms that I'm about to tell you about. That worked, but in some cases it requires a few kernel patches; so far we have about 200 kernel patches merged into the Linux kernel in the scope of the CRIU project, and starting from some kernel version, CRIU just works on that kernel. So in some cases there is no info in /proc, and we have to patch the kernel, add some mechanisms, and so on and so forth. Well, I should have clicked the space bar.
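To get a feel for the kind of per-process state that is already sitting in /proc, you can poke at your own shell; these paths are standard Linux, the selection here is mine:

    # inspect a process the way a checkpointer would (here, the current shell, $$)
    cat /proc/$$/maps      # memory mappings: ranges, permissions, backing files
    ls -l /proc/$$/fd      # open file descriptors and what they point to
    cat /proc/$$/status    # credentials, UIDs/GIDs, signal masks
    cat /proc/$$/cgroup    # cgroup membership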
So, one other thing: currently CRIU is a low-level tool. It lets you checkpoint and restore, and it's of no big use by itself unless you're doing some advanced thing; it's better when it's integrated into other things. So far, CRIU is integrated in Virtuozzo and OpenVZ, where it is now used for live migration and checkpoint/restore instead of that old in-kernel implementation we had. It's also integrated in Docker, I don't remember which version though; there is now docker checkpoint and docker restore, and it uses CRIU to checkpoint and restore a Docker container. It's also integrated with LXC and LXD, same thing; integrated in runc, which is the low level of Docker; and so on, there are some more projects. So we are very well integrated. As for CRIU itself, we are currently at version 2.11, and we are doing predictable monthly releases; it's mostly to make the whole process predictable and make the distro vendors happier, so we know the date of the next release and so on.

So let's take a look at what checkpoint is. The objective of checkpoint is to collect the complete state of a set of running processes, everything that the processes are, in order to be able to recreate that state later, maybe in a different place. It all starts with freezing a process tree, so the state is fixed and no longer changing: the first thing we do is freeze the process, or process tree, or container. The next stage is we carefully collect and dump the complete state of this app or container, and that includes lots and lots of things: opened files, sockets, pipes, all the established network connections, all the signals, all the memory mappings, CPU registers, credentials, UIDs, GIDs, all the groups, and all the timers. There's a lot of things to collect, and we collect it all and save it to image files, just some files on disk. We use three mechanisms, as I mentioned earlier. There's a lot of info available from /proc. We use the ptrace debug mechanisms for some stuff. Finally, we have a very interesting thing that we call parasite code injection. There is certain information about a user-space process that can't be obtained by anyone else except the process itself; that includes the process memory, and some other things too. So at some stage of dumping, what we do is we have our own very small piece of code that we call the parasite; we inject that code into the process, we let it run on behalf of that process, and it collects some things for us. The author of the original idea of parasite code injection is Tejun Heo, who, if I'm not mistaken, delivered a presentation this morning right here in this room, so kudos to Tejun. Finally, once we've collected everything, we can either kill this process set or let it keep running; it depends on what we want to do. This is a very simplified view of what checkpoint is, and the details are available at our website, criu.org.

Restore. Restore is the same thing in the opposite direction: we have those image files with the complete state of the processes, and we recreate those processes; CRIU morphs itself into this set of processes. So CRIU runs, reads all the image files, and figures out which resources are shared. For example, there can be an open file descriptor which is shared among a few PIDs; we cannot just do open three times, we need to do open just once and then pass the file descriptor around through inheritance. So at this stage we need to carefully figure out which resources are shared, and what the proper order of creating those resources is, for proper inheritance.
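In command form, the dump/restore pair being described looks roughly like this (the -t, -D, --shell-job, -v4 and -o options are real criu options; the PID and paths here are made up):

    # checkpoint: freeze the tree rooted at PID 1234 and dump it to image files
    sudo criu dump -t 1234 -D /tmp/imgs --shell-job -v4 -o dump.log

    # later, possibly on another machine after copying /tmp/imgs over:
    sudo criu restore -D /tmp/imgs --shell-job -v4 -o restore.log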
The next stage is we actually go ahead and fork the process tree, all the processes that need to be there; we just do the forks at this stage. Actually, we have to recreate the exact process IDs, because PIDs are not something you can change, so we have to recreate the exact process IDs, and so on. Then we restore some basic resources, like those open files; we jump into the namespaces, if there are any; we recreate private virtual memory areas, sockets, and all the rest. Finally, there is that little thing called the restorer blob. That's another small piece of code, not unlike the parasite code we have for checkpointing, but it exists here because at this point the process tree we just created is still CRIU, and we don't want any traces of CRIU in there. So we need to unmap all the CRIU stuff and map in all the stuff that is to become the application; we need some intermediary code to unmap CRIU, you know, unmap itself, and then it becomes what we restore into. That restorer blob also restores some particular resources: timers, for example, so they won't fire prematurely; it restores the exact memory mappings; it restores credentials, because for some things that we did before we needed root, and now we don't need it, so we drop it; and it also restores the threads. Again, it's a very simplified explanation of what restore is doing, and the details are at that URL.

Now, the default application for checkpoint and restore is live migration. Live migration works on top of checkpoint/restore, and it works like this. We freeze the processes or container, let's say a container, and we checkpoint it. Then we copy everything to the destination server, from server A to server B: we copy the container file system, the container's files; we copy those CRIU images we just created; we copy some metadata, I don't know, the Docker config of a container. And on the destination system we run the restore, we unfreeze it, and it keeps running. So here comes live migration: you can move a database server, you can move pretty much any container that is not directly tied to some particular hardware, and containers usually are not tied. You can live-migrate any set of processes that you wish to, if it's dumpable and restorable.

There is something wrong with this live migration. Any guesses? Maybe? Okay, I'll give you a hint: it's not really live. This copy stage takes a lot of time, and during the copy the container is frozen; it doesn't appear to be live. It's not dead, not exactly dead, it's still there, but it's frozen. So we have some tricks up our sleeve to make live migration much more live, and one trick is to use iterative memory migration. A lot of the data that we have to migrate is just application memory; you can run a huge database server occupying a terabyte of RAM, and we need to copy that terabyte over to a different server. As this terabyte of memory is not really changing every second, what we can do is dump the memory of the running process, or running set of processes, and copy that memory over while our server is still running fine; and then we do the dump again, and this time we dump only the pages that were changed. This is one part we had to modify the kernel to support, the so-called soft-dirty bit, dirty memory tracking, so we can ask the kernel: please keep track of the pages that get modified. So we dump everything, then we ask for that list and dump again, and we do so iteratively, hoping the set will shrink, and it usually is shrinking.
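The soft-dirty tracking mentioned here is exposed through /proc; one tracking round looks like this (the value 4 for clear_refs and bit 55 in pagemap are the documented kernel interface; the PID is made up):

    # start a tracking round: clear the soft-dirty bits for PID 1234
    echo 4 > /proc/1234/clear_refs
    # ... let the task run while its memory is being copied ...
    # any page written since then now has the soft-dirty bit set
    # (bit 55 of its entry in /proc/1234/pagemap), which is how the
    # next dump round picks out only the changed pages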
Finally, at some stage, we decide to freeze it and do one last dump, which will be way less than one terabyte; we do that one last dump, and then we copy and restore as usual. That way we can save a lot, and you can cut the frozen time from minutes to maybe seconds.

There's yet another approach to it, called lazy migration; so you either use iterative migration or lazy migration. Iterative migration is when we pre-copy the memory; lazy migration is when we post-copy the memory. Instead of copying everything, we freeze and copy just the bare minimum, and then we unfreeze the container, but it doesn't have most of its memory yet. Before unfreezing, we set up a sort of network swap device, with the swap being the source server's RAM: when the app tries to access those pages, it gets page-faulted, and those page fault handlers go to the source and fetch the pages, and in parallel with that we proactively migrate the remaining pages, while the app is already running there. This is another way to reduce the frozen time. It's still a work in progress in CRIU; there are people working on it as we speak, and it's implemented through the userfaultfd feature of the Linux kernel.

Now, yet another trick to make live migration more live. I'm not sure if this is very visible, but this is what we do in order to copy the data: first of all we do criu dump, and we write files to the disk; then we do scp or rsync or whatever to copy the data over to the destination, meaning we read everything from disk and we write everything to disk on the destination; finally we do criu restore, which reads the data from the destination disk. As you can see, the problem here is we do twice the amount of reads and twice the amount of writes, and it's slowing things down, especially taking into account that disk is not blazingly fast. So we can use a couple of tricks to make it better, and it's called diskless migration in our lingo. Before doing the migration, we either use an existing tmpfs instance or mount our own, which is basically a file system in RAM, so we don't have to do any disk I/O; and we run a very simple thing called criu page-server, which just receives the pages and writes them to that tmpfs. And on the source we say criu dump, or criu pre-dump if it's iterative migration, and we say: don't dump to the disk, dump to the page server, here's the IP, here's the port number. And it dumps right there to the destination, and it goes to RAM, and then criu restore just reads it from tmpfs, which is a memory-to-memory copy; so no disk I/O, no double reads, no double writes. I could probably talk about live migration for hours, but there is just one final thing I want to tell you: CRIU is not live migration. CRIU is checkpoint and restore; it can be used for live migration, right?
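Put together, a diskless migration round looks roughly like this (criu page-server with --images-dir and --port, and criu dump with --page-server, --address and --port, are real criu options; the hosts, PID and paths are made up):

    # on the destination host: keep images in RAM and receive pages over the net
    mount -t tmpfs none /tmp/imgs
    criu page-server --images-dir /tmp/imgs --port 27

    # on the source host: dump straight to the destination's page server
    criu dump -t 1234 -D /tmp/imgs --page-server --address dst.example.com --port 27

    # back on the destination: restore from tmpfs, a memory-to-memory copy
    criu restore -D /tmp/imgs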
And the full live migration is done by another tool that we have, called P.Haul, the process hauler; there's a little humpbacked horse, you know, that carries your process around. It's at criu.org/P.Haul. And that concludes my presentation... just kidding.

Beyond live migration. If anyone read the description of my talk, you know some of the scenarios that CRIU can be used for. One such scenario is a save button for a game that lacks one: before going around the dark corner, you just do criu dump, and then you can return to this place. And you can obviously do so with any other app: when you run, I don't know, some kernel compilation and forgot to start it in screen, you can checkpoint it, run screen, and restore it inside the screen, or tmux. You can save the state of a long job, same thing but not for a game: consider you are in bioinformatics or that sort of thing, and you're running some huge number crunching, like, I don't know, protein folding; there's an application that does lots of computation, and it takes weeks to finish the task. It goes down, and you lose weeks of work. So you can do periodic checkpoints, like every half an hour; worst-case scenario, you lose half an hour of work in case of some disaster. So this save-the-state is mostly for HPC stuff. And then you can always do load balancing within a cluster: move your containers around the cluster in order to balance the load better on your systems, and of course a container is migrated together with its network connections, so your users won't notice it. This is a live migration application, of course, nothing really specific.

And then there's that thing I don't have a good name for, so I went for "open-heart surgery". The idea here is to do something with the images in between checkpoint and restore. You checkpoint your process tree, you checkpoint your container; then you go to those image files and you modify something: you modify a file name that is opened, or you modify some ID in there. We have a tool for that, called crit, and it lets you basically convert those binary images into, say, JSON; then you can modify the JSON and convert it back to binary, and then you restore not exactly what you checkpointed. Sometimes you have to; there are some funny cases when you want to do something like that. And I'm calling it open-heart surgery because this is really like a live patch: you checkpoint, change something, and restore, and it keeps running, but it's a little bit changed now.

Anyone from the audience have any other scenarios in mind? Any cases? Yes, please. Well, the Docker image format is just a file system, and what I mean here is we don't save the file system; CRIU doesn't deal with the file system, we save the state of the running processes. To do live migration you need to copy those files over, and in order to do a successful restore you have to have the exact state of the file system: if some file is changed or lost, you might not be able to do the restore. Fortunately, we have file system snapshot mechanisms for that, and this is out of scope for CRIU; you can use a dm snapshot for that, for example, which is a feature of the Linux kernel, or you can use a ZFS snapshot. Yep, yep, yep; that basically means that you can, like, pause and unpause everything, or... yeah, that's right. Yes, please. Yeah, of course; you can have some state that you want to start from, and you start right there from the state: instead of recreating it, you restore from it. Right, right.
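The crit round trip being described looks like this (crit's decode/encode subcommands and the -i, -o and --pretty options are real; the file names are made up):

    # binary CRIU image -> human-readable JSON
    crit decode -i files.img -o files.json --pretty

    # edit files.json (e.g. change the path of an opened file), then go back
    crit encode -i files.json -o files.img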
I'm actually going to talk about this case, yeah; I have a whole slide about this case. That's great. Yes, please. So you mean instead of starting from scratch every time, you can start from some intermediate state; that applies to simulation, or neural networks. It can also apply to some QA work: there is that specific state you want the application to be in in order to be tested, and it takes a lot of resources to go into that state; you need a lot of steps and dances around for the app to be in that state. The next time you want to test it, you just test from the checkpoint, and it's the same. Yeah, good.

So I have a few more cases, with a little bit more detail. The first case is really very unusual: we use CRIU to test the Linux kernel. The thing is, CRIU is pretty advanced software; it does a lot of things with the kernel that no one else does. The other good thing is we have a pretty extensive test suite that comes with it, like a few hundred test cases that are run in different scenarios. Of course we use this test suite to test CRIU, to make sure we don't break everything, and we test it on the stable kernel to check if CRIU is still working for us. But we also use linux-next from git, that is, what is to become the next Linux release, and we run the same tests, on stable CRIU this time, to find potential bugs in the kernel, potential regressions. So far we have found more than 10 kernel bugs this way, just by running some Jenkins jobs against the linux-next git repo. There's that URL, criu.org slash linux-next, which lists at least some of those; we decided that we needed to document those cases.

Next thing, let's call it time travel, or, if you want to be less fancy, reverse debugging, or rewind; same thing. Basically, you do periodic checkpoints, and in between you do some work, and then you can restore, rewind, time-travel to any old state. A good example is that runkit.com service, which, I believe, one of the CRIU developers works for. What RunKit does is a web environment for JavaScript developers, and they do have that big rewind button. Here is a quote from the website: "time traveling debugging, using the revolutionary new technology called CRIU, which allows notebooks to snapshot the entire environment." That means you can rewind the processes, the file system, really just about anything in your session. Of course, as I said before, CRIU doesn't care about the file system; for that, I believe, they use a dm snapshot or something like that.

Speeding up applications: instead of starting the application from scratch every time, you just start it once, and when it's fully initialized, you checkpoint it, and next time you use restore instead of start. Well, that's, I believe, the same thing as was just mentioned about doing demos. We did a quick test with the Eclipse IDE, which is famous for starting very slowly: it takes about a minute and a half to start, and it takes about 15 seconds to restore, and in both cases we had the very same state of Eclipse. Right, you want to restore the SQL server in, like, a hot state, with the caches already filled in; I see, yeah, that's right. And I know someone who is using this mechanism for Apache and PHP, probably because there are a lot of modules in PHP, and it does lots of reads and then places these modules in memory, does some linking; the interesting thing is, it's faster to just restore it. And one last scenario would be the booting of a mobile phone: you can boot it once, checkpoint it, and then on the next restart, instead of a cold start you can use a hot restore.
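As a sketch of the start-once, restore-many idea (the criu flags are real; whether a given app survives depends on what it has open, X connections being the usual obstacle, and options such as --ext-unix-sk exist for some of that):

    # start the slow app once and let it finish initializing
    eclipse & APP_PID=$!
    sleep 90

    # checkpoint it; --leave-running keeps the original alive as well
    criu dump -t $APP_PID -D /tmp/eclipse-imgs --shell-job --leave-running

    # from now on, "starting" the app is a restore: seconds, not minutes
    criu restore -D /tmp/eclipse-imgs --shell-job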
I'm not sure, but I think either Samsung or Huawei was working on something like that using CRIU, although I don't know any details.

This scenario was also mentioned: the deferred, or remote, debug. We actually talked to the nginx guys, and they said they have that big problem: like, I don't know, they have some big bank as a client and a pretty expensive 24/7 support contract, and when the nginx server misbehaves, they want to debug it, and the client of course wants to restart it. And then there is a dilemma: if you restart it, the bug is not fixed, it will resurface and it will bug you again; if you debug it, that means downtime. Well, you can alleviate that, of course, but sometimes not, if it's, like, a front end or something, and the only choice is to restart. But yeah, you can checkpoint it, and then you can have both the restart and the debug, using those saved image files. Same thing was just mentioned as forensics.

One other pretty weird scenario: we use Travis CI extensively to test CRIU. Anyone in the audience using Travis? You guys. So Travis, I also love it, but one problem is they only provide that one specific old version of Ubuntu, 14.04, and we want to test some of the latest kernel stuff that we added recently: we want Ubuntu 16.04 and we want, like, the latest kernel, and you can have neither. You cannot have your own kernel, you cannot have any other distro; with Travis you're locked. You can have another distro in a container, but you can't run your own kernel. So the initial idea was pretty simple: just upgrade that Ubuntu in place. You just change trusty to xenial in the apt sources, do apt-get upgrade, apt-get dist-upgrade, and then you need to reboot into the new kernel. The problem with this approach is that there is that specific control process that Travis runs to keep control of the VM, and once we reboot, that process is gone, and Travis loses this VM: it thinks it's nonexistent, we cannot do anything, it becomes unusable after the reboot, and I guess some garbage collector comes to kill that VM some time later because it's lost. So the idea is to make that control app survive a reboot. It has the network connection to Travis, it's like an SSH session, and they run something under that SSH session. So we can do all the same things, but before rebooting, we checkpoint the Travis control app; it's not really one app, it's, again, a small tree of processes; and after the reboot we restore it. It ended up not as easy as I'm telling you: there were some interesting workarounds we had to implement, because with a massive upgrade from 14.04 to 16.04 pretty much every file on the system is replaced by something else, and there were a few other things, but there is an article that describes all the magic tricks we had to implement. It took about a day to make it work, but it's working. So, in a sense, we hacked the Travis VM, and we can now enjoy running the latest and greatest Ubuntu plus our own compiled kernel in the very same environment, and use this fantastic Travis service. And the good thing is we already found one particular kernel regression using this testing in Travis: there was a fix that came in two patches, and the subsystem maintainer applied the first patch and didn't apply the second one, so we got broken, and we found it exactly using this scenario.

All right, that concludes the part about the different use cases, and now I'm going to tell you about a couple of things that we have been working on lately. libsoccr, which is socket checkpoint and restore, is a library that I'd like to tag as LEGO for TCP sockets. What it does is checkpointing and restoring of an established TCP connection.
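The in-place upgrade step described here is plain apt plumbing; the novelty is wrapping it in a checkpoint and restore of the Travis control process tree (a sketch):

    # point apt at the next release and upgrade in place
    sed -i 's/trusty/xenial/g' /etc/apt/sources.list
    apt-get update && apt-get -y dist-upgrade

    # before rebooting into the new kernel: criu dump the Travis control
    # process tree; after the reboot: criu restore it, so Travis never
    # notices that its VM went away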
TCP is pretty complicated, and it takes a lot of effort to be able to checkpoint and restore TCP sockets. This is yet another case where we had to patch the kernel for it to work: we introduced the so-called TCP repair socket mode, which basically lets you recreate a socket without any network activity. It's like, you can do a send that is not a send, it just fills in the buffer; you can set various options; but none of that leads to any network activity. So CRIU uses this TCP repair feature to restore network connections, to restore TCP sockets. And then we figured out that CRIU might not be the only user of the feature, so we took the code from CRIU and split it out into a separate nice library which does this thing for you; it's basically a library wrapper for the kernel functionality that we have. It's part of CRIU now, but it's a separate library, so if you're only interested in moving network connections between machines, or checkpointing and restoring network connections, or doing some tricks with open TCP sockets, you can use this libsoccr thing. One scenario is you can have a front server accept a network connection and then move it to a different machine, and then someone else works with it, and you don't have to use the full CRIU load for that.

Then there is another thing that we split out into a separate library, and it's called compel. That's that parasite code injection thing I mentioned earlier; that's parasite code injection for the masses, so anyone can do things like that. With compel you can do two things: you can perform system calls on behalf of someone else, or you can run your own code inside someone else's process, and this is what we use during the checkpoint stage. This thing is currently in development, and it's only available from the CRIU dev branch, but we are going to finally release a development preview of the feature this month in CRIU 2.12; or you can just check out the CRIU dev branch, cd compel, and see what's going on with it. Again, this is still work in progress; we can change the API, so please don't rely on it being stable; it's 0.1 at the moment, and there is much more info about it at criu.org slash compel.

I'm not sure if we have much time left, but if we do, I can show you a simple case of using compel. So, you include compel.h, and then you call compel_stop_task to freeze the process identified by PID; then you call some magic called compel_prepare, which prepares the process to be victimized by our parasite code. Then we can immediately execute some system calls inside this foreign process; in this very example we execute the close system call, so we close a file descriptor inside another process. Finally, we run compel_cure to cure the victim, and then we resume it. Once it's resumed, that file descriptor that was open is closed; I'm not sure, maybe it will crash, maybe it will figure something out, but this is a good way to, for example, do some fault injection. Actually, nowadays strace can do fault injection as well, but this is another way of doing it. So, it's as simple as that, and the new stuff here is the preparation and, you know, the rollback. Executing your own code inside a foreign process is a little bit more complicated; I mean, we tried very hard to make it as simple as possible. Oh, one final thing: you write this proggy and you compile it with some specific includes, and then you link it with some specific libraries; this is how you do it if you're doing it from the command line, and a Makefile example is similar.
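The syscall-injection flow just described maps onto compel calls like these (reconstructed from the talk and the compel documentation of the time; the speaker warns the 0.1 API may change, so take the header name and exact signatures as approximate):

    /* Close fd 42 inside a foreign process, compel-style. */
    #include <compel/infect.h>   /* header path per the dev branch; an assumption */

    static long close_remote_fd(int pid, int fd)
    {
            struct parasite_ctl *ctl;
            long ret = -1;
            int state;

            state = compel_stop_task(pid);   /* freeze the victim */
            ctl = compel_prepare(pid);       /* prepare it for victimization */
            /* run close(fd) on the victim's behalf; no parasite blob needed */
            compel_syscall(ctl, __NR_close, &ret, fd, 0, 0, 0, 0, 0);
            compel_cure(ctl);                /* roll back, remove all traces */
            compel_resume_task(pid, state, state);
            return ret;
    }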
Executing your own code, the code that we call parasite code, goes further. First of all, of course, you need to write it: you do some proper includes, and then you have to have this function, which is like a main for a parasite, called parasite_daemon_cmd, and it has two arguments, which are a command and a pointer to some data. And you can do it like this: if the command is "close", then you do the close, and the pointer is a pointer to the integer which is the file descriptor, and so on and so forth. So this is the very short example of the parasite code we're going to inject. Once it's written, we need to prepare it. First of all we compile it with some specific CFLAGS, because there are some things that GCC wants to do for us that we really don't want in parasite code, so we have to disable something, enable something; that's all hidden under compel cflags. Then you link it, with not CFLAGS but LDFLAGS, of course, and we link it together with some plugins that I'm going to tell you about in a minute; and finally, as a result, we have a binary blob: this is our code, compiled and ready to be injected. Currently it works this way: you generate a C header that contains this blob, as well as some auxiliary functions for this thing to work, and we end up with parasite.h, which contains that binary blob inside it.

Now we are ready to write the injection code, the code that is making this parasite code work. Of course we include that header we just generated, and then we do the same thing as with the syscalls: we stop the task, we do compel_prepare on the victim; then we do some other things, like the parasite setup, where the header is doing some magic to link the parasite; then we run compel infect, which actually infects the victim with this parasite code. Then we prepare the arguments, supplying the type of the argument, which here could be int, and then we initialize it; in this scenario we are going to close file descriptor number 42. And then we do compel_rpc_call_sync: we supply the command, and that's it; the argument is already initialized, and argument passing is through memory, you cannot pass it any other way. You can execute as many of those calls as you like; you can do it for threads, which I'm not going to cover, and you can do it in a sync or async manner, which I'm not going to cover either. Once you're happy with your parasite's work, you cure the victim and you let it continue, and this is what we do during the checkpoint stage. Can I have the next slide, please? OpenOffice...

Some constraints about the parasite code: there is no libc in there. libc is, most probably, linked into your main binary, but this is our own code and it's not linked to libc, so there are no string functions, there are no memory functions, there's no printf; you cannot do anything from there, even a syscall is a call into libc. So the solution is we have some very small libraries that we call plugins, which let the parasite do some actual, real work. We have the standard plugin so far, which is linked by default, and with that you can do system calls, you can do strlen or memcpy, and it also has some, probably ugly and basic, printf and vprintf and so on implementations. So it's not business as usual when you write parasite code, but we try to make it as good as possible. The other plugin, which is still work in progress, lets you set up a shared memory region between the parasite and the master process, and it's still not ready for prime time.
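A minimal parasite along these lines might look like the following (parasite_daemon_cmd and the compel cflags/ldflags steps are from the talk; the command constant, the header path, and the exact hgen invocation are my assumptions):

    /* parasite.c: runs inside the victim; no libc here, only the
     * compel std plugin (sys_* syscall wrappers, memcpy, a basic printf). */
    #include <compel/plugins/std.h>

    #define PARASITE_CMD_CLOSE_FD 1   /* our own command number, illustrative */

    int parasite_daemon_cmd(int cmd, void *args)
    {
            if (cmd == PARASITE_CMD_CLOSE_FD)
                    sys_close(*(int *)args);   /* args points at the fd number */
            return 0;
    }

    /* Build sketch, per the talk:
     *   cc $(compel cflags) -c -o parasite.o parasite.c
     *   ld $(compel ldflags) -o parasite.po parasite.o $(compel plugins)
     *   compel hgen -f parasite.po -o parasite.h
     * parasite.h then carries the binary blob plus the glue the injector includes. */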
Anyway, that was compel. You can use it for some things, and I'll just give a couple of examples here. You can do log rotation without the application knowing about it: you just close the file descriptor, copy the file over, open it again, and keep the application running, and your log is rotated even if the application doesn't support it. You can monitor a process from the inside; for that, the parasite needs to clone itself, so it will keep running, and then it can have some pipe and report some vital stats about the application, being, in effect, a thread of that application. You can update an application, just like runtime patching of the application; it can actually be used to patch a security hole without restarting everything, and I know that big vendors like Amazon have problems with that: they cannot restart everything immediately, and once the next Dirty COW surfaces, everyone is having a big problem. We had a guy who was trying to work on that, and that means you can change some code in a running app, or you can re-link it to a new version of the libraries once those libraries are available. And then, finally, you can probably have some application that works badly and has memory leaks, and a little thread of yours can do freeze-and-cure on those memory leaks, so you can keep this crippled thing running forever. There's much more, again, at criu.org slash compel, usage scenarios. I hope you have some questions. Is that normal? Questions?

Yes. We work with TTYs, so we work with terminals. About X applications: I think we work with X applications as long as it's not the X server itself; if it's just some X application, it uses standard mechanisms, so we can checkpoint it too. Well, of course we cannot checkpoint and restore everything; that sort of thing can be a problem, so one thing you can do is give it a try and see if it works for you. We support more and more cases with every new release; for example, we were not able to checkpoint and restore a non-established TCP connection, one that is, like, in a FIN state, or in the pre-accept state, and we had to add a couple of extra patches to do that. We cannot checkpoint and restore anything that deals with some real hardware, because part of the state is in the hardware and we cannot checkpoint and restore that. But it works for lots of things. I mean, the worst-case scenario is, of course, you checkpoint something, then you try to restore it and it fails, and if you killed the original, that means you lost it. So the way to fix that is to keep it in a frozen state until you've made sure you have restored it. Then it works.
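That keep-it-frozen advice maps to a real criu option, --leave-stopped; a sketch (PID and path made up):

    # dump, but leave the original tree stopped instead of killing it
    criu dump -t 1234 -D /tmp/imgs --leave-stopped

    # if the restore elsewhere succeeds: kill the stopped source copy
    kill -9 1234
    # if it fails: just let the source continue, nothing was lost
    kill -CONT 1234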
So this is how live migration works: it keeps the source in the frozen state until everything is fully restored; once it's restored on the destination, it gets killed on the source; and if something is not working, we still have the source frozen, which we can unfreeze and say, well, it didn't work, but nothing is broken. Any other questions? Yes. I believe it's from 3.10, which is a pretty old kernel by today's standards; yeah, you can use CRIU with that. And then we have the criu check command, which tells you which features are supported and which features are not; it checks the kernel properly. And we have the criu.org upstream kernel commits page, which lists each and every patch that we merged into the kernel and what it does; I said it's about, I think, 201 patches so far. But yeah, again, the best way is just to give it a try and see if it works for you, and if not, we are here to help. I'm here to help: use my contacts here, it's either kolyshkin on Twitter or kir at openvz.org, and CRIU also has its own Twitter, and we have a pretty good community that is there to help each other. Any other questions? Yes. Well, I doubt that; you mean gdb, as in the debugger? No, there you cannot run your own code inside a foreign process context; you can just take a peek at what's there. You cannot just close a file descriptor; you can just see what's there, you can stop it, you can analyze it, you can see the variables, but that's about it. You had a question, please. Well, Docker is a good example: you do docker checkpoint and docker restore, and inside it's all CRIU. I can't think of any others at the moment. All right, if there are no more questions, thank you, everyone, for coming, and I hope you enjoyed it. Whoa, turn it down; you hear that buzz? That's lovely.