Hello everyone, I'm Pavel Emelyanov. I'm a principal engineer at Parallels, working on our server virtualization product. Today I'll describe a time and space travel service for Linux, a project called CRIU, which appeared at Parallels and which I'm currently developing and maintaining. I will describe what CRIU is, say a few words about the project history and current state, and then we'll mostly concentrate on the usage scenarios: the ones we invented ourselves, that is, why we use it at Parallels at all, and the ones we got from people who found our project and found it useful in their own projects and setups. That will include live migration and rebootless kernel upgrade, the two things we do at Parallels, but I'll also describe such things as slow services startup, advanced debugging and testing, and a little bit more. If you have questions during the course of my talk, just raise your hand and I'll try to answer.

So CRIU is an acronym. It stands for Checkpoint/Restore In Userspace, and the project is basically about this. Consider you have a set of processes running on a Linux box. These processes may have some set of resources, like open files, memory mappings, memory-mapped files, timers. These things can sometimes be shared: for example, if a process opens a file and then calls fork, the file becomes shared between the two processes, and this is not the same as if the two processes had opened the same file with two separate calls. All these processes may live in what is now called containers, which means namespaces and control groups are involved.

So what CRIU does is it reads the state of these running processes and saves it into a set of files; we call them image files. CRIU tries to read the full state of the processes. The information CRIU obtains should be enough, and in our tests it does appear to be enough, to recreate the same set of processes later, on the same box or on another one. What CRIU recreates from these images should be exactly the same as it was before, from the processes' point of view, of course. The first operation is called checkpoint, or dump as a shorter alias; the second one is called restore, or sometimes restart. So CRIU is about these two things: it takes a dump of the process state, puts it into files, and can later restore it back.

That is what "checkpoint" and "restore" mean. And what does "in userspace" mean? Before explaining this, I have to say a few words about what was there before CRIU appeared. The whole thing started within the OpenVZ project. It's a container virtualization project which provides Linux containers for users. One part of the project was the so-called container live migration feature: we take a container and move it from one hardware node to another without stopping it, and a remote user just notices a small freeze in service. One of the goals of the OpenVZ project was providing this functionality, I mean container virtualization, within the upstream Linux kernel, and so far we have more or less succeeded with that. But when we tried to make the checkpoint/restore feature available upstream, we first tried to do it as a kernel module, or as a kernel subsystem, and this approach was not accepted by the kernel community. So we made another attempt and did the whole thing from userspace, and this approach seemed to work.

So "in userspace" here means the following. If we take a process running on a Linux kernel, its state sits not only in userspace; it includes not only the memory pages the process works on.
It also includes a complex set of kernel objects which reside in the kernel and let the process work and run. So it seems natural that this state, which is mostly concentrated in the kernel, would be easier to expose from a kernel module. But as I said, this approach was rejected, so we implemented it in userspace, which means that CRIU is a regular Linux program. It's a command-line tool which dumps information about Linux processes using standard kernel APIs: it uses ptrace, proc files, netlink, sockets, and all sorts of other calls that can give us the process state. The same stays true for the restore. Restore is again running the command-line CRIU tool; it goes to the kernel through standard APIs and recreates the same set of kernel objects which were there before, and eventually the CRIU process transforms itself into the process it is trying to restore. So that's why it's called "in userspace".

The project was started about two years ago. It started with an RFC patch set on the kernel mailing list which provided several extensions to the kernel API regarding process memory management, and it also contained a very small command-line tool which was able to dump and restore a very trivial process, one that opened a file, mapped a file, and then busy-looped waiting to get dumped and restored. After that we worked for about a year before we did the very first release. It appeared in 2012; it supported the x86 64-bit architecture and basic stuff like open files, memory mappings, timers, and similar things which applications typically use. Since then we have merged a relatively large number of kernel patches, all of them about extending the Linux kernel API one way or another so we can dump, and sometimes restore, more things.

Currently the project's latest release is version 0.7. It supports the x86 and ARM architectures, and with an unmodified Linux kernel 3.11, which was released a couple of weeks ago, it supports all the stuff that typical applications use. From time to time we take real Linux applications and check whether they can be checkpointed and restored with CRIU. So far we have checked Linux daemons like Apache, MySQL, and MongoDB; system services like cron and sshd; various Java programs; we've checked that GCC works, and graphical applications under VNC. If we have time, I'll try to show you a demo of how VNC with X applications is supported. And we have checked command-line tools like tar, bash, and top inside screen.

So this is it; now I will tell you how CRIU can be used. Of course, the initial usage scenarios we started with came from Parallels: the reason we wanted a checkpoint/restore feature is live migration, a thing that is especially useful in a cluster, and we also use checkpoint/restore to upgrade kernels on nodes without doing what looks like a reboot to the server's user. And we have several usage scenarios which came from other people who started to use CRIU: slow services startup, HPC guys who came with the need to do periodic snapshots of applications, and similar stuff.

So let's start with live migration. What is live migration? If you have two hosts, host A and host B, and have processes running on one of them, you can migrate them from one box to another without stopping them to do it. It's a several-step process: first you dump the processes into image files, then you deliver these image files to the other host, then you restore them, and that's it.
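In command-line terms, those three steps map onto the criu tool roughly as follows. This is a minimal sketch, not something shown in the talk; $PID, the image directory and hostB are placeholders, and the exact options should be checked against the criu man page for your version.

    # on host A: checkpoint the process tree rooted at $PID into ./img
    criu dump -t $PID -D img -v4 -o dump.log

    # deliver the image files to host B
    rsync -a img/ hostB:img/

    # on host B: recreate the same process tree from the images, detached
    criu restore -d -D img -v4 -o restore.log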
From the processes' point of view, what has happened? They were stopped and then resumed, and they turned out to be running on a different host, but they were not restarted: if they are services, they still serve requests and so on and so forth.

This live migration can be optimized in two ways. First of all, it's worth mentioning that while doing this live migration, all the processes have to be stopped so that the image files with the state you get are consistent. Copying the image files to the other machine and restoring them may take some time, and during this time the services will appear stopped to the clients. This time, which is called the freeze time, should be reduced as much as possible to improve the quality of service.

This time can be reduced, first, by using a technique called memory pre-migration: we take the processes, get their memory contents and start transferring them to the other machine, but do not stop the services yet. This can be done with CRIU with the help of the so-called memory tracker, which appeared in Linux 3.11. If we do this, we can later stop the services and transfer only the contents of the memory that has changed since the previous pass; with the memory tracker we can find those parts of memory. This improves the freeze time significantly. Another thing that can save some time during live migration is using a shared file system or shared storage between both ends, like NFS or similar. If you live-migrate applications using these two optimizations, then even for huge applications like Oracle or other databases the freeze time will be seconds or even less.

This live migration feature is quite useful in a cluster. For example, if you have a cluster of, say, three nodes, you can do load balancing: if host A and host B are overloaded with processes and host C is relatively idle, you can live-migrate processes from A and B to C, thus probably improving the overall performance of your cluster. Another usage is doing it the other way around, moving services back onto fewer hosts when those services are mostly idle. For example, if these are Apache or nginx web services and it is night in your data center, so people do not use them actively and there is a low amount of requests, you can live-migrate them all together onto, for example, one host and power down the freed ones, thus saving on power.

Another usage of this feature, I mean live migration in a cluster, is helping the administrators of the cluster with node maintenance. If you have host A and you want to power it down to replace hardware, to add RAM, to do something else, or to replace host A completely with some newer box, you can, yet again, move the services to another host, then do the maintenance you need, replacing or fixing what is required, and then optionally move them back. I say optionally because such things are better done by some dynamic load balancer which decides by itself how to utilize the resources.
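Coming back to the freeze-time optimization mentioned above: with the memory tracker the bulk of memory can be copied while the tasks are still running, and only the last, small dump happens with them frozen. A hedged sketch of how this looks with the criu pre-dump command; the directory names are made up, memory tracking needs the soft-dirty support from Linux 3.11, and the flags follow the CRIU documentation on iterative dumps, so details may vary by version.

    # first pass: copy memory into images, but leave the tasks running
    criu pre-dump -t $PID -D img/1

    # further passes copy only the pages changed since the previous one
    criu pre-dump -t $PID -D img/2 --track-mem --prev-images-dir ../1

    # final pass: the tasks are frozen only for this small incremental dump
    criu dump -t $PID -D img/3 --track-mem --prev-images-dir ../2

The bulky early images can already be in flight to the destination host while the application keeps running, which is what shrinks the freeze time.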
Okay, another usage of checkpoint/restore, which also came from Parallels, is a kernel upgrade on a host without a reboot. If you have a host with some kernel and want to upgrade it, currently this is typically done by stopping the whole node, replacing the kernel, and then booting the services back up. It can be faster or slower, but in any case you stop your services and have to start them from the beginning. With the checkpoint/restore feature it can be done, if not faster, then at least without interrupting things for a long time. You dump the processes, then replace the kernel. For better performance at Parallels we use kexec, because when you reboot the kernel on modern server hardware the majority of the time is taken by the hardware boot-up after the reset: it scans memory, devices, and does some other things, and doing it with kexec saves a lot of time. Then you restore the services back. At that point you have a host with an upgraded kernel and the services are up and running again, but they have not been started from the beginning: they have resumed their execution from the point at which they were interrupted on the old kernel.

Now I will describe some use cases and ideas we do not use at Parallels but have, as I said, received from people who tried to use CRIU. One of these scenarios is slow service startup. Consider you are trying to start something on a box, like you type "service something start". What happens after that? A process is spawned, then it does its internal initialization: it loads config files, initializes some caches, maybe creates some users, initializes some pools, does other things interesting from its point of view, and by the time it is ready to serve requests some time has passed. It can be a bigger time or a smaller time, but if we dump this process, or processes, at the moment they are ready, and then, instead of starting the service again, we restore them, the time needed may be smaller. This may improve your usage scenario. For example, this case came from people who did testing of Java applications. They wanted to run a bunch of tests against a Java application, but they didn't want to run them sequentially; they wanted to run each test on a freshly started application. The thing is that the application starts within the Eclipse IDE in about a minute, and a test typically takes a couple of seconds to run, so running a bunch of tests would mostly be busy with starting the application again and again. They took CRIU, started the application, checkpointed it, and then, instead of starting it again, they restored it; the restore took a couple of seconds, so they saved a lot of time and were able to run all the tests in a much shorter time. So yes, if the time to restore a service is smaller than the time to start it, then it's probably your case.

Another thing that can be done with CRIU is periodic snapshots. If you have an application, you can take a first dump of it, which will be some set of image files, and let the application keep working. While it works, it changes its internal state: it does some things, I don't know, opens files, changes variables, serves requests, and so on and so forth. We can then take a second dump of the application faster, producing a smaller image, with the memory tracker feature I mentioned. Applications do not modify big amounts of their memory while they work. Well, of course they can, but typically they don't, and the amount of memory that has changed, even over a relatively long time like minutes, is usually much smaller than even the working set of the application, because they read some parts of memory and those parts do not change. So we can take these snapshots several times, and later we can take any of them, not necessarily the latest one, and restore the processes into that state. They will be exactly in the state they were in when we took the dump.
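As a rough illustration of such incremental snapshots, here is a sketch, not from the talk, assuming the --leave-running, --track-mem and --prev-images-dir options described in the criu documentation and made-up directory names:

    # first snapshot: a full dump, but keep the application running
    criu dump -t $PID -D snap/1 --track-mem --leave-running

    # later snapshots mostly contain memory changed since the previous one
    criu dump -t $PID -D snap/2 --track-mem --leave-running --prev-images-dir ../1

    # after a crash or power failure, restore from whichever snapshot you like
    criu restore -d -D snap/2

Each incremental snapshot keeps a reference to the previous images, so the chain has to be preserved up to the snapshot you restore from.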
This scenario is wanted rather badly by HPC people. If they have some computational task which performs complex solving, complex computation that takes days or weeks, they want to take regular snapshots of it, say every 15 minutes or half an hour, or whatever interval they find better. In case they have a power failure in their server room, they do not have to restart all the computations from scratch, from the very beginning, spending again these days or weeks; instead they just restore their computational task from the latest snapshot and continue the calculation, losing, for example, 15 minutes of it out of days or weeks.

Another story with checkpoint and restore is about debugging applications. Consider you have a host which is in production; an application works on that server, doing some very, very useful thing, probably for your customer. If this application happens to be in trouble, for example it gets stuck somewhere internally in a deadlock or a livelock, or is misbehaving from the customer's point of view, or whatever, you'd really like to debug this thing to find out what exactly is going on, but since the server is in production you cannot spend a lot of time doing this. With CRIU you can take a developer host and live-migrate this stuck application to it, but not necessarily killing the application on the original host; instead you do something that brings it back into a normal state, maybe a restart, maybe a config reload, maybe something else through its command-line interface, and let the application on the production host continue running. And on the developer host you can use a debugger: you can attach to it, trace it, check what's going on, why it got stuck, dump memory, and you can do it again and again, restoring it and checking what's wrong with the application.

Then there is the advanced testing thing. This scenario looks similar to what I have just talked about with the people doing Java testing. Say you have an application which serves requests, for example network packets or GUI events or something like that, so it changes its state somehow, and you want to test some combination of requests and so on and so forth. Eventually you get your application into some states which are very interesting from your point of view. Now, what can be done if you have some new test, or new hardware, or a new environment, I don't know, newer libraries or newer clients, and you want to test this complex state your application happened to be in after several steps of previous testing? Without CRIU, yet again, you have to replay all the previous history, putting your application into that complex state you want to test. With CRIU it obviously can be sped up: you just restore it into that state and do whatever you need.

We actually have a little bit more, some of them funny use cases. For example, I've heard that some people use CRIU to move a process into a screen session when they forgot to start it under screen in the first place, and someone used this thing to save the intermediate state of a Tux racing game, because it doesn't have a save button in between. We have also considered emulating the suspend-to-RAM feature for a Linux VDI session, like when you close the lid of your laptop, and similar stuff.

So this is it. The project actually started as purely a tool to live-migrate containers, and right now we have a tool that can dump and restore the state not just of a container but of an arbitrary Linux application. A container is not a strict requirement for CRIU; it can dump applications running without any container stuff at all.
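For the debugging scenario mentioned above, the key point is dumping without killing the production process. A minimal sketch; the host names, paths and the use of gdb are just examples, not part of the talk:

    # on the production host: take the state, but leave the service running
    criu dump -t $PID -D /tmp/stuck -v4 --leave-running
    rsync -a /tmp/stuck/ devbox:/tmp/stuck/

    # on the developer host: revive the copy and attach a debugger to it
    criu restore -d -D /tmp/stuck -v4
    gdb -p $PID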
We started with only two use cases, live migration and kernel upgrade without a reboot, but have received various other use cases from people. As I said, the latest CRIU version 0.7 and Linux 3.11 can do all the stuff I described previously, and you don't have to add any additional patches to 3.11; everything you need is merged in there. At most you'll have to turn specific configuration options on.

Well, it might seem that the whole tool is quite a simple thing, like we do a core dump of the application, copy the files, then do the things GDB does, but it's actually not so. Within CRIU we have solved several interesting problems: we implemented the memory tracker for the Linux kernel; with CRIU we can live-migrate established TCP connections, so we can dump only one end of a connection, move it to another place and restore it there, and the peer won't notice; we have solved the problem of detecting which kernel state is shared between our processes; and a lot of other things. I will talk about some of them later at Plumbers, especially about TCP connection live migration at the networking microconference. But this talk was mostly about the usage scenarios, and right now let's try to have a small demo.

A small demo. Okay, so here we have a KVM box with Linux running inside, and we have a top application running. I kind of forgot to run it under screen. This is how it looks: we have a terminal with top in it. Let's try to move it under screen. To do that, we first dump top itself. We dump the whole thing into the directory "d", ask CRIU to write a log file with increased verbosity, and use the -j key, which tells CRIU that top is an application running with a so-called external resource. That resource is the terminal which top uses to display its information. From CRIU's point of view this terminal is external, because we dump top but we cannot dump the terminal; it is used by others as well, and we do not dump it. So something leaks out of the set of processes we dump, and we ask CRIU to handle this.

Okay, as you can see, there is no more top; it was dumped into the directory, and here is the set of dump files. Now we can launch screen. This is how it looks: that's the screen with a bash in it. Now we'll go into that screen and restore what we have just dumped. Let's try it. Here is top; it's up and running again. And from the process list point of view it looks exactly as we wanted: here is the screen, the bash, the command-line interpreter. The CRIU process itself is kind of an unavoidable thing here, because we need to restore top with exactly the same process ID it had before, but since we start CRIU from the shell, from bash, we cannot make CRIU itself have this desired PID. Instead, CRIU forks another process with the PID we want, so CRIU appears there in between. We can detach from the screen and see that it looks like a regular screen session. That's it.

I have just a few more words to say. CRIU lives at the criu.org site. It's an open source thing; we have all the sources in a Git repository, we have a Google+ page where you can find some news about the project, and we have a mailing list with some amount of technical discussion. It all works pretty much the same way as Linux kernel development works. And this is me; you can ask me questions via email, or right now if you have any.
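For reference, the demo above boils down to roughly these commands. They are reconstructed from the narration, so treat the exact flags as approximate; -j is the short form of --shell-job, the option for jobs attached to an external terminal.

    # dump top into the directory "d" with a verbose log; -j handles the
    # external terminal the task is attached to
    criu dump -D d -o dump.log -v4 -t $(pidof top) -j

    # start a screen session, and inside it restore the dumped task
    screen
    criu restore -D d -o restore.log -v4 -j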
Yep. Okay, so the question is what happens with namespaces and control groups. Currently CRIU works like this: it takes the root process of the tree you want to dump, then collects all its children and the other processes, and checks whether the namespaces this root task lives in are the same as the ones CRIU itself lives in. If they are, it considers that no containers are currently in the game and just dumps the process state, leaving the namespaces alone; at restore time it simply restores the processes into whatever namespaces CRIU is run in. But if the root task lives in another set of namespaces at dump time, CRIU will first check that all the children live in the same namespaces, so for now we support only plain containers. After that it will dump the processes and will also dump information about the namespaces: for IPC namespaces it will read the sysctls, shared memory segments, semaphores and other stuff; for UTS namespaces it will get the hostname information; for network namespaces it will read the network devices, routes and addresses. It stores all of this in the image files, and at restore time it creates new namespaces and restores the whole tree inside them.

Yeah, sure. It's x86 64-bit and ARM v6 and v7. We have a success story of someone migrating from a 32-bit ARM to a 64-bit ARM, but 32-bit x86 is not supported: it has a different set of registers and some kernel-specific stuff. So with x86, no, only 64-bit.

Things that are specific to a process, for example the nice value and the parameters set with the sched_setscheduler system call, are handled within CRIU, but the parts that are about the CPU control group are dumped using external scripts. CRIU has a couple of hooks inside which can launch scripts that help dump whatever environment the caller considers important. So for the control group we just dump the whole tree of it.

Okay, if we're talking about Unix sockets: with Unix sockets everything is supported. If you have an in-flight connection that is connected but not yet accepted, we will dump it and restore it in the same state, so one end will be in the connected state and the other one will be about to accept it. With TCP sockets it's a little bit more tricky; we only support two modes. The first is a socket in the listen state with no in-flight connections, so its request queue is empty. The second is a socket in the established state, so the three-way handshake has finished, and only then can we dump it. The transitional states, like SYN-sent or anything in between, are not yet supported; well, we know how to do it, but haven't yet made the required kernel support. But the established case is already in 3.11; it's the TCP repair thing.

No, currently CRIU doesn't require any other stuff to be running, but in the next version CRIU will be able to work as a system service. This thing came from the fact that a lot of the kernel API we use to get the state of a process is restricted to the root user, and people asked us whether a program can request a self-dump with CRIU, that is, ask CRIU to dump itself. Due to this restriction to the root user we cannot make this thing a library that a process gets linked with so it can just call a function. Instead we implemented it as a two-part thing: first, there is a CRIU service which runs as root, as a system service, and second, a protocol over Unix sockets. If you want to request a self-dump, you open the Unix socket and send a request to the service, "please dump me", and that service process, since it runs as root, can use all the APIs to perform the dump. But other than this, no, you just run CRIU from the command line and it does what you want.
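To illustrate the answer about established TCP connections above: dumping and restoring them has to be requested explicitly on both sides. A sketch assuming the --tcp-established option; while the images are in transit, packets belonging to the connection also have to be blocked, for example with a netfilter rule, so that the peer does not reset it.

    # on the source host
    criu dump -t $PID -D img --tcp-established

    # ...transfer the images, keeping the connection's packets blocked...

    # on the destination host
    criu restore -d -D img --tcp-established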
Was that a question? No, currently that's not possible, because if you do the criu restore command, it forks another process, and if you want the restored process to be init, PID 1 will already be taken by CRIU itself. But the code is ready for that: I mean, if there were no other processes in the system, we could, instead of forking, start restoring into the CRIU process itself from the image. It would require patching CRIU, but technically it's ready for that. Okay, thanks for coming.