So, okay. Let's start and let's welcome our next speaker, Pavel.

Yeah. Hello. I work for Virtuozzo. Virtuozzo is the new name of my current employer; we used to be called Parallels, you probably know us. We did, and still do, the OpenVZ project, one feature of which is container live migration, and this is what I'm going to talk about today. So the talk will be about why someone might want to live migrate a container, why someone might want to avoid doing it, because live migrating a container is a tricky task, and how tricky it is and why. Most of you probably know that live migration of, for example, virtual machines doesn't seem to be that complex, so during my talk I will sometimes refer to virtual machines for comparison, to show the complexity of container live migration.

So, live migration is quite simple. It's pretty much like teleportation: you take an object, in our case a container, you get its state, send it to the place you want it migrated to, and then restore the object there. In a few words, it's pretty simple.

People use it for several reasons. One example is load balancing. If you have a cluster of nodes which run jobs that cannot be restarted easily, and one of the nodes becomes overloaded with tasks, you can take some of them and move them to other hardware nodes to balance the load between them.

Another usage scenario for live migration is updating the kernel. It works pretty simply. You free a hardware node of critical workloads, of containers or virtual machines, by moving them to another node; of course, you need a cluster to do that. Then you reboot the node into the new kernel, and then you either rely on your balancer to move the workload back or move it back manually. This case can actually be handled without real live migration: there is a technology, based on the same saving and restoring of state, that can replace the kernel on a running node without actually migrating anything. But even then, updating the kernel uses the very same core technology that saves and restores the state of containers.

The third example is upgrading or replacing hardware. It's the same as updating the kernel, but in this case you cannot avoid real live migration, because you have to power down the node. And in the end, it just looks great when you have two servers, you see a container on one of them, you press a button, and then you see it on the other. It looks cool.

The technology is nice, but sometimes people would like to avoid live migration, because it's really complex, and just as with teleportation, weird or unwanted results can happen. And if you want to avoid live migration, there are options for that. For example, you can try to balance not the load on your cluster, but the cause of that load. If you have web servers that generate load in response to the traffic they receive, you can try to balance the traffic itself, redirecting incoming requests to different nodes and hoping that different requests will cause roughly equal workloads on the servers.

Another way to avoid live migration is a microservices architecture. This is when your application is written so that it can be shut down at any time and started again, possibly somewhere else. With this architecture you can balance load without live migration, just by keeping the necessary number of instances of your application on the least-loaded hardware nodes. And there are yet more ways to avoid live migration.
The last two are about updating the hardware or the kernel on the nodes. The first is what we call crash-driven updates; we see some people doing this. They keep running the old kernel until it crashes or hangs, and once it does, they have to reboot anyway, so they just reboot into a newer kernel, which of course should be pre-installed. It works, people use it. It might not be very nice, but still. A nicer option for doing updates without live migration is planned downtime: you send an email to the people running workloads on your nodes saying that by the end of this week we will power down the nodes sequentially, so get prepared, maybe move your workloads manually, or just live without them for several minutes.

So let's now look into live migration in detail, to find out why it's really complex. Live migration, taken simply, as I said, consists of three steps: you get the state of a container, then copy it to another node and restore it there. One important thing worth mentioning is that before getting the state of a container, the container must be frozen, because we need a state that is consistent from the container's point of view. We cannot take the states of individual processes while letting the others run, because then the states we get might not be consistent with each other. Of course, there is a theoretical possibility of taking a container's processes and live migrating them individually, one by one; but that is quite a complex technology, and what we would effectively get is a container scattered between two nodes. There is no live implementation of that. So, to get a correct state, we freeze the whole container (a minimal sketch of the freezing step follows below), then read the state, copy it to the other node, restore it there, unfreeze, and clean up on the source side.
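For the freezing step, on Linux a whole group of processes can be stopped atomically through the freezer cgroup, which is a reasonable way to picture it. A minimal sketch, assuming a cgroup-v1 freezer hierarchy mounted at /sys/fs/cgroup/freezer and a hypothetical container group named ct100 (an illustration, not CRIU's or OpenVZ's actual code):

```c
/* Sketch: freeze/thaw a group of processes via the cgroup-v1 freezer.
 * The mount point and the "ct100" group name are assumptions made for
 * this example. Freezing is asynchronous: real code should re-read
 * freezer.state until it reports FROZEN before dumping anything. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_state(const char *state)
{
    int fd = open("/sys/fs/cgroup/freezer/ct100/freezer.state", O_WRONLY);

    if (fd < 0) {
        perror("open freezer.state");
        return -1;
    }
    if (write(fd, state, strlen(state)) < 0)
        perror("write freezer.state");
    close(fd);
    return 0;
}

int main(void)
{
    write_state("FROZEN");  /* all tasks in the group stop running */
    /* ... get the state, copy it to the destination node ... */
    write_state("THAWED");  /* unfreeze (e.g. on the source, on failure) */
    return 0;
}
```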
The critical parameter of this process is the freeze time: the time between the moment the container gets frozen and the moment we unfreeze it. This is what people care about most when they live migrate a container, because people typically work with containers over the network, and when a container is being live migrated, you notice it as a small slowdown of your operations: the network seems to get stuck for several seconds, for however long the container is being migrated. So the smaller the freeze time, the better the live migration.

To reduce the freeze time, there are several options. First of all, we can try to make getting the state and restoring from the state faster. But the truth is, these two operations are extremely hard to optimize, because they consist of fixed steps which have to be performed: a container has several processes, each of them has to be analyzed, and for each of them the state has to be read. There is not much room for optimization there. The biggest portion of the freeze time is actually the time it takes to transfer the state from one node to the other. And in most of the cases we've seen so far, more than 90% of the state is the contents of the memory the processes use. The information about which processes we have, what files they have open, their connections — all of that is less than 10% and gets transferred within fractions of a second. The other 90% is whatever data sits in process memory.

So if we can move the memory transfer out of the freeze time, we can shrink it significantly without even trying to optimize the save-state and restore-state stages. And luckily, we can do two tricks with process memory. First, we can efficiently track which memory pages are changing over time; we can ask the kernel for that information. Second, at least in theory — I'll get back later to how this holds up in practice — we can restore processes without any memory at all, and restore the memory on demand as the processes start accessing particular areas of it.

So, to move the memory transfer out of the freeze time, there are two ways. The first is called memory pre-copy. It's pretty simple: we turn on the memory-changes tracker in the kernel, and then start copying the container's memory to the destination node without freezing the processes. This step is called an iteration. Once we've copied all the memory, we can either do what we did before — freeze the container and get the state, but this time without the memory, because we have already transferred it, then copy and restore, and the container is migrated — or we can decide that the amount of memory that changed while we were copying is still too big and try one more iteration: we reset the changes tracker, copy what has changed since the previous pass, and decide again.

In this case the live migration is safe. Once it finishes, you can be sure that your container is alive and fully present on the destination node. You can just power down the source and the container will still be alive; it has no ties to the source node. That's the good side of pre-copy. What's bad about pre-copying memory is that it's unpredictable. There is no reliable way to predict how much time you will spend on iterations, because before you start you have no hints about how actively the processes are using memory and how much of it will change while you copy it. And this gives us the second disadvantage of pre-copying: the iterations do not guarantee that the freeze time will always be small enough. If the processes are changing memory actively, then on the next iteration you might see that you still have lots of memory to copy, and it might still take tens of seconds, or minutes, or the like. Of course, you can choose to say: OK, I will not live migrate this container, because it would take so long that the migration wouldn't really be live. But beyond that, with this method there is nothing that can be done.
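The memory-changes tracker mentioned here is, in mainline kernels, the soft-dirty bit: writing "4" to /proc/&lt;pid&gt;/clear_refs resets it for all of a process's pages, and bit 55 of each /proc/&lt;pid&gt;/pagemap entry then reports whether the page has been written to since. A minimal sketch of both sides, with error handling trimmed:

```c
/* Sketch: the soft-dirty page tracker. reset_tracker() arms tracking for
 * one process; page_is_dirty() checks one virtual page afterwards.
 * A real pre-copy loop does this for every process in the container. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE  4096
#define SOFT_DIRTY (1ULL << 55)   /* bit 55 of a pagemap entry */

void reset_tracker(pid_t pid)
{
    char path[64];
    int fd;

    snprintf(path, sizeof(path), "/proc/%d/clear_refs", pid);
    fd = open(path, O_WRONLY);
    write(fd, "4", 1);            /* "4" clears all soft-dirty bits */
    close(fd);
}

int page_is_dirty(pid_t pid, unsigned long vaddr)
{
    char path[64];
    uint64_t entry;
    int fd;

    snprintf(path, sizeof(path), "/proc/%d/pagemap", pid);
    fd = open(path, O_RDONLY);
    /* pagemap holds one 64-bit entry per virtual page */
    pread(fd, &entry, sizeof(entry), vaddr / PAGE_SIZE * sizeof(entry));
    close(fd);
    return !!(entry & SOFT_DIRTY);
}
```

A pre-copy iteration then amounts to: reset the tracker, copy everything once, and on each later pass copy only the pages whose soft-dirty bit is set.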
The second way to copy the memory without freezing the processes is memory post-copy migration. It works the opposite way. You freeze the container and get its state, but skip all the memory contents: every single page is simply not included in the state you save. You copy that state — as I said, it's pretty small, it gets copied within fractions of a second. Then you restore the container, but arrange things so that, from the processes' point of view, all the memory they use sits in swap. And you have to set up a subsystem — I will get back to it a little later — that reads the memory over the network while the processes are running and puts the pages with the memory contents into the proper places.

With this, you can estimate quite well how long the live migration will take. You can estimate the amount of data you transfer in the first step, the size of the state without the memory; you know the total size of the memory the container uses, and you will transfer it only once. Knowing the network speed and the sizes, you can say: OK, my live migration will complete within, say, five minutes, of which the container will be frozen for a couple of seconds. But that's the only good thing about memory post-copy. The problem with it is that it's really unsafe. Once you have live migrated your container and it has started working on the destination node, you still cannot shut down the source node. If you power it down, or if it crashes, the container can die as well, because some parts of its memory may still be on the source node, not yet copied. And the second bad thing about memory post-copy is that it actually slows applications down significantly. When you start an application without a single page in RAM, it starts accessing its memory immediately, and then it waits for the memory transfer, waits for the kernel to find each page and put it into the process address space — and this happens with every single page. For the first seconds, maybe minutes, applications work slowly, as if you had pushed all their memory into real swap and asked them to keep running.

So with these two tricks, a real live migration — one that keeps the freeze time so small it's almost unnoticeable by humans — is somewhat more complex than just read state, copy, restore state. It includes several optional iterations of memory pre-copy, then freezing and saving the state, then copy, restore and unfreeze, and optionally post-copying the memory into the container.

To get an idea of why this is so complex for containers, let's look at the things we have to deal with. When we live migrate a container, we effectively take a tree of processes with all the kernel objects these processes use: files, sockets, sessions, process groups, all the memory — and compared to virtual machines, the memory layout is much more complex for containers — plus all the Linux-specific stuff like inotify and signalfd descriptors. And along with the processes we have to take the container's environment, which consists of namespaces and cgroups, and of course the memory contents itself.

The first complexity comes when we try to pre-copy the memory. The problem is that there is no such thing as "container memory". There is no place in the kernel where you can go and ask it to give you all the pages used by this container. Instead, you first collect the processes that sit inside the container — that is relatively easy — but then you have to individually analyze every single process to find out which memory it uses. And since we're talking about Linux containers: in Linux there are effectively four types of memory a process can use — anonymous private, anonymous shared, file-backed private and file-backed shared. All of this has to be collected by the system that reads the state of the process, all the sharing of pages and mappings has to be resolved, and only after that do we have an idea of what memory is in use by the container and can actually start copying it to the destination.
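As a rough illustration of that per-process analysis, the four kinds of mappings can already be told apart in /proc/&lt;pid&gt;/maps: the fourth character of the permissions column says private ('p') or shared ('s'), and the presence of a backing path says file-backed or anonymous. A sketch, with the pid a placeholder (resolving page sharing, as described above, takes considerably more work than this):

```c
/* Sketch: classify a process's mappings into the four kinds named in
 * the talk by parsing /proc/<pid>/maps. Pid 1234 is hypothetical. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/1234/maps", "r");
    char line[512];

    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f)) {
        unsigned long start, end;
        char perms[5], path[256] = "";

        /* line format: start-end perms offset dev inode [path] */
        sscanf(line, "%lx-%lx %4s %*s %*s %*s %255s",
               &start, &end, perms, path);

        int shared = (perms[3] == 's');      /* 's' shared, 'p' private */
        int file_backed = (path[0] == '/');  /* anon maps have no path  */

        printf("%lx-%lx %s %s\n", start, end,
               file_backed ? "file" : "anon",
               shared ? "shared" : "private");
    }
    fclose(f);
    return 0;
}
```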
Then comes the second step, when we save the state of the container. Again, unlike virtual machines, where the state is read from a tree of objects of known number and mostly fixed size — virtual hardware, mostly — a container, even an average container, forms a graph of objects consisting of thousands of nodes, and this graph is directed. Every single node has different attributes you have to deal with. And again, there is no single place in the kernel you can call and say: OK, save me the state of this set of processes. Actually, we tried to go that way some time ago: we sent a patch set to the Linux kernel community adding a subsystem that saved and restored the state of processes. But the kernel guys decided that this big button that does all the magic was bad design; nobody wanted it in the kernel. So instead of doing all this in kernel space, we decided to do it in user space.

So we started a project called CRIU. It's a system-level tool that does all the magic I'm talking about: it takes a tree of processes, analyzes which processes are there and which objects they use, all that stuff about memory, and saves this information into a set of files on disk. To achieve this, CRIU uses quite a lot of kernel APIs. It opens and reads quite a lot of proc files, it uses several netlink protocols to get information, for example, about active connections, and it actively uses the debugging interface, the ptrace system call. And the thing is, all these APIs are quite diverse. There is no single, unified API to get state information out of the different types of objects in the kernel. Every new object that appears and that we have to support in CRIU typically has its own API to get information from.

It's pretty much the same for restore. Unlike virtual machines, where you create a fixed set of objects and just copy their state in, we have to recreate this whole graph of thousands of objects. And yet again, as I said, there is no single entry point into the kernel where we can feed in some information and say: OK, please start all this stuff from scratch. Instead, CRIU recreates everything literally by hand: it forks processes, it opens files, it calls all the system calls that can get a tree of processes into the state we need. And there is a word emphasized on the slide there, because some configurations that we see processes get into on Linux boxes have no direct way of being re-entered.

The simplest example is process sessions. There is no system call that takes an arbitrary process and puts it into an arbitrary session: the session a process lives in is either inherited from its parent, or the process creates its own session from scratch. So if we see some tricky combination — say, a process living in a session without a session leader, while being a child of some third process — we cannot just take the process, create a session and move it there with, say, three calls. We have to create fake helper processes that keep sessions open while they are being populated, then do tricks with re-parenting processes in the tree, and so on. This stays true for quite a lot of kernel APIs: the APIs for re-creating things are just as diverse, and CRIU has to deal with all this complexity to recreate that state.
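To see why sessions are awkward: the only way a process ever changes its session is setsid(), which always creates a brand-new session and only works from a process that is not already a process-group leader; there is no call that moves a task into an existing session. A small self-contained demonstration:

```c
/* Sketch: setsid() can only create a new session in a fresh child;
 * nothing in the API joins an existing session, which is why CRIU
 * needs helper processes to rebuild unusual session layouts. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    printf("parent: pid=%d sid=%d\n", getpid(), getsid(0));

    pid_t pid = fork();
    if (pid == 0) {
        /* a group leader may not call setsid(); a new child may */
        pid_t sid = setsid();  /* child becomes leader of a new session */
        printf("child:  pid=%d sid=%d\n", getpid(), sid);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}
```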
And the last interesting step is memory post-copy. This has already been implemented by Andrea Arcangeli; the thing is called userfaultfd. Some of you have probably seen it in the kernel — Andrea did it to make post-copy memory migration work for KVM. In simple words, userfaultfd is a file descriptor through which you can catch page faults from the kernel and push pages back into it to resolve those faults, instead of letting the kernel find the pages somewhere itself. The current userfaultfd implementation assumes that the process taking page faults in its address space and the process reading events from the file descriptor are the same process — maybe different threads, but still one process sharing its virtual memory. For containers this is not the case. We cannot make the processes we have just restored handle the userfaultfd events and repopulate their own address spaces: the process that accesses the memory is one process, and the process that owns the userfaultfd and feeds memory into it is another. And the current implementation of userfaultfd doesn't allow for that. For example, if the monitored process calls fork, or remaps areas, or does tricky stuff with madvise, the process listening on the userfaultfd can get confused — no notifications about any of this arrive on the descriptor. This is what's called the non-cooperative mode of userfaultfd, and it's currently work in progress. Hopefully it will hit mainline someday, and then we will have post-copy for container live migration.
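For reference, here is roughly what the cooperative flow — the one that does work in mainline, kernel 4.3 and later — looks like: one thread of a process faults on a missing page, another thread reads the event and resolves it with UFFDIO_COPY. This is a trimmed sketch with no error handling; in the container post-copy case described above, the handler would have to live in a different process and fetch the page over the network, which is exactly what the in-tree API did not yet allow at the time of the talk:

```c
/* Sketch: cooperative userfaultfd — fault in one thread, resolve in
 * another. Build with -lpthread; needs Linux 4.3+. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PAGE 4096

static int uffd;

static void *handler(void *arg)
{
    struct uffd_msg msg;
    static char page[PAGE];

    (void)arg;
    read(uffd, &msg, sizeof(msg));         /* block until a fault */
    memset(page, 'x', PAGE);               /* stand-in for fetching the
                                              page over the network */
    struct uffdio_copy cpy = {
        .dst = msg.arg.pagefault.address & ~(unsigned long)(PAGE - 1),
        .src = (unsigned long)page,
        .len = PAGE,
    };
    ioctl(uffd, UFFDIO_COPY, &cpy);        /* resolve the fault */
    return NULL;
}

int main(void)
{
    uffd = syscall(SYS_userfaultfd, O_CLOEXEC);

    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    char *area = mmap(NULL, PAGE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)area, .len = PAGE },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    pthread_t t;
    pthread_create(&t, NULL, handler, NULL);

    char c = area[0];   /* faults here; the handler supplies the page */
    (void)c;
    pthread_join(t, NULL);
    return 0;
}
```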
Other than this, before live migrating a container we have to do one thing which is pretty similar to virtual machines: we have to check that the CPUs on the source and destination nodes are compatible. This is slightly more complex than for virtual machines, because for virtual machines the feature masks reported by CPUID can be emulated inside the guest, so you can paper over some source and destination CPU incompatibilities. Containers see all the CPU features of the real hardware, so the check has to be done more carefully. And the second container-specific thing is that we have to check that the kernels on the source and destination nodes are compatible. Here I'm not talking about binary compatibility — the kernel community is known for keeping binary compatibility for ages; you can be pretty sure the binary API of a 4.0-something kernel is the same in 4.2.4. This is mostly about feature presence: if you have certain, I don't know, netfilter modules loaded on the source node, the same should be present on the destination. For virtual machines this is not an issue, so this is a container-specific thing.

All the stuff I've just described is already implemented in two projects. The first I have already mentioned: it's called CRIU, which is an acronym for Checkpoint/Restore In Userspace. It's a tool written in C which does all the time-critical stuff: it saves state and restores state, dealing with all the complexity I have just described, and it also provides an API to set up and actually run memory pre-copy and post-copy (post-copy is still not in mainline, but still). But that's only saving and restoring the state. To properly orchestrate all of this — to handle the iterations for pre-copy and to set up the connections between the source node and the destination one — we have a sister project called P.Haul. Right now it's basically a Python script that performs all the necessary pre-checks before live migration and then calls CRIU to start copying the memory and saving the state, copies it over the wire, restores, and possibly sets up memory post-copy, again using CRIU. And one more feature we have put into P.Haul is handling the file system. If you have a shared file system between source and destination, that's simple, you don't have to do anything special; but if the file system is not shared, it also has to be copied between the nodes. That makes the live migration longer, but yet again, there are ways to avoid doing this while the processes are frozen, which keeps the freeze time low.

Both projects are open source. The main entry point is the criu.org website; it's a wiki, and there is a category called P.Haul where we collect all the live-migration-related stuff. If you want to participate, join our mailing list — it's on criu.org. We post news on the Google Plus page and on Twitter, and the source code sits on GitHub. This is all I have; if there are any questions — we have presents for the best question asker. CRIU is a firebird, a bird from Russian fairy tales, and P.Haul is the Humpbacked Horse, from yet another Russian fairy tale.

The question was how we live migrate a library's state. When we get the state of a process, we do not actually care whether a particular virtual memory region inside the process is a library. When we read the virtual memory regions of a process, we don't care whether it's a plain file, a library, or anything else. We just look at the flags the mapping was created with: whether it is shared or private. If it's shared, we just record the start and end of the region and which file was mapped. If it's private, we go to the kernel and ask which pages have been copied-on-write from that file, and take those into the image. That's it: there is no difference between getting the state of a library and of any other file mapped by an application.

The question is about security context. Right now the whole thing only works if you run CRIU as the superuser, as root. In that case it can create everything and then trim down its capabilities to whatever is appropriate. If someone wants to run CRIU as a regular user — and we have such a request from the OpenMPI guys, who run their stuff as a plain user on the host — saving the state can mostly be done regardless of the security restrictions imposed by the kernel. Restoring the state is quite tricky, and it's not yet implemented — not because of capabilities, but because we need to recreate the processes with the exact same PIDs they had before. For an unprivileged user this is only possible if you have a user namespace and a PID namespace, but in that case what you have restored might not be completely equal to what you had previously. So this is yet to be resolved. Right now restore only works for the root user, which can do anything.
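A side note on the "exact same PIDs" problem: on mainline kernels, a PID can be claimed for the next fork() by writing the desired value minus one into /proc/sys/kernel/ns_last_pid, a root-only knob that was added to the kernel for checkpoint/restore. A simplified sketch, ignoring the race with other forks in the same PID namespace:

```c
/* Sketch: claim a specific pid for the next fork() via
 * /proc/sys/kernel/ns_last_pid (needs root / CAP_SYS_ADMIN).
 * Racy if anything else forks meanwhile; a real restore controls
 * the whole pid namespace, this is just an illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

pid_t fork_with_pid(pid_t want)
{
    char buf[16];
    int fd = open("/proc/sys/kernel/ns_last_pid", O_WRONLY);

    if (fd < 0)
        return -1;
    snprintf(buf, sizeof(buf), "%d", want - 1);
    write(fd, buf, strlen(buf));  /* next allocated pid will be `want` */
    close(fd);
    return fork();
}

int main(void)
{
    pid_t pid = fork_with_pid(12345);   /* hypothetical target pid */

    if (pid == 0) {
        printf("child runs as pid %d\n", getpid());
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}
```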
What's the impact on the performance of the container? It's not performance degradation, it's the freeze time, when the container doesn't respond. When using post-copy memory it really depends on the application. From what we've seen, at least among OpenVZ users, they mostly suffered from the freeze time, not from the delay after restore, because with post-copy migration there are tricks that reduce the impact: we can check which pages the container will need first — by looking at the instruction pointer register and a few more registers — and fetch that couple of pages in advance. It doesn't take long, and the processes get them immediately. So the slowdown is really not that big.

You are talking about TCP, probably? Or any connection? OK. So netlink sockets are relatively simple, because a netlink socket is a connection between a process and the kernel, so we just recreate the socket. If we see that there is data sitting in the socket, for now we refuse to dump that container, because it's quite an unusual situation: a process typically reads data from a netlink socket immediately. So we can unfreeze it, wait for a second and try again, and the socket will be empty. Unix sockets are not that complex either, because they connect processes within one container, so we can read them using the unix_diag module in the kernel — that's the machinery the ss tool uses to fetch its information — and we can actually read the data from a socket without removing it: there is the MSG_PEEK flag to the recvfrom system call, which just copies the data out and leaves it in place. At restore time we do the reverse: we recreate the sockets and write the data back into them.

The most interesting case is TCP connections. For TCP connections we have patched the kernel; you can google for the TCP_REPAIR socket option. It's a socket option with which we can freeze a socket and get all the critical TCP information — sequence numbers, timestamps, window scaling factors, all the parameters that sit in the socket. Then we can create a socket, put it into repair mode again, and force it to take the sequence numbers, timestamps and all the rest from user space. This is what makes TCP socket live migration work.
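A sketch of the sequence-number part of that TCP_REPAIR dance (in mainline since kernel 3.5; needs CAP_NET_ADMIN). In repair mode the socket emits no packets, its queue sequence numbers become readable and settable, and connect() just installs state without a handshake. Error handling and the other parameters listed above (timestamps, window scaling, queued data) are omitted:

```c
/* Sketch: read TCP sequence numbers on dump, set them on restore,
 * using the TCP_REPAIR socket options (CAP_NET_ADMIN required). */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_REPAIR               /* fallback for older libc headers */
#define TCP_REPAIR       19
#define TCP_REPAIR_QUEUE 20
#define TCP_QUEUE_SEQ    21
enum { TCP_NO_QUEUE, TCP_RECV_QUEUE, TCP_SEND_QUEUE };
#endif

static void select_queue(int sk, int q)
{
    setsockopt(sk, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
}

void dump_seqs(int sk, unsigned *snd, unsigned *rcv)
{
    int on = 1;
    socklen_t len = sizeof(*snd);

    setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on));
    select_queue(sk, TCP_SEND_QUEUE);
    getsockopt(sk, IPPROTO_TCP, TCP_QUEUE_SEQ, snd, &len);
    select_queue(sk, TCP_RECV_QUEUE);
    getsockopt(sk, IPPROTO_TCP, TCP_QUEUE_SEQ, rcv, &len);
}

void restore_seqs(int sk, unsigned snd, unsigned rcv)
{
    int on = 1;

    setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on));
    select_queue(sk, TCP_SEND_QUEUE);
    setsockopt(sk, IPPROTO_TCP, TCP_QUEUE_SEQ, &snd, sizeof(snd));
    select_queue(sk, TCP_RECV_QUEUE);
    setsockopt(sk, IPPROTO_TCP, TCP_QUEUE_SEQ, &rcv, sizeof(rcv));
    /* ... then bind(), connect() to the old peer (no packets are sent
     * in repair mode), and finally switch repair mode off ... */
}
```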
Yes, this is how seamless kernel update works. You could just say "reboot" and then restore, but in that case the freeze time would be quite big — not because the kernel boots slowly, but because you would have to read everything back from disk. We have a proof-of-concept technology that keeps the container in RAM while rebooting the kernel with kexec, so it doesn't disappear, and we can do it many times faster.

Yes, there is one thing: currently we see that quite a big portion of getting the state is spent on accessing proc files — opening, reading and, surprisingly, closing them. Closing takes something like 30% of that time; a strange thing, but it does. So we have, again, proof-of-concept patches that implement a netlink protocol that returns this information in binary form, and not process by process but in a batch manner. It saves a lot of time, but it's not yet upstream.

With what? Docker. The answer is: we took part in creating the namespaces and cgroups inside the kernel, on top of which Docker later introduced Dockerfiles and its layering management system, but we and they are two different companies and teams of developers. In the existing versions of Virtuozzo we still don't do it, unfortunately, but the next Virtuozzo will, which will be Virtuozzo 8; something like that will appear. Thanks.

Thank you everyone for attending the presentations. We really appreciate your feedback; feel free to provide it on our official website. You can also tweet about the event or write a blog post. There is actually a competition for the best blog post, so you can get some prizes for that. Thank you very much.