Hi everyone, and welcome to our session. My name is Isabel Jimenez. I'm a Distributed Systems Engineer at Mesosphere, and I work with the security team. And this is Kapil, so I'll let him introduce himself. Hi, I'm Kapil Arya. I'm a Distributed Systems Engineer as well at Mesosphere. I'm also an Apache Mesos committer and a DMTCP developer. In this talk, we are going to talk about process migration in the container orchestration world. So let's get started.

Here's a brief overview of what this talk is going to look like. We'll first jump into the motivation and see why we need this and why we are actually talking about it here. Next, we'll introduce the concepts behind process migration and checkpointing, followed by an introduction to Apache Mesos, which is a two-level scheduler for the data center. Then we'll combine checkpointing and process migration with Apache Mesos, and we'll see a demo at the very end.

So let's get started with the motivation. The key thing to notice here is the idea of stateless versus stateful applications. In the container world, we always talk about stateless applications, where you don't have any local state. You start from a vanilla image and then you perform transactions. Things are stored in remote storage somewhere; it could be distributed or centralized. Killing an application instance doesn't really affect things much, because all of the transactions that are happening are pretty atomic and pretty simple. However, a stateful application by definition has local state. So if you were to, say, kill an application a few minutes into its execution, and it has not taken a snapshot or saved its progress, you lose the compute time from the last save up to the current point. Here, again, you start from some image, but you do some pre-computation to get to a working state where you actually begin to do your real work. A non-graceful shutdown will result in the loss of compute time. If it were a graceful shutdown, the application would know that it has to save whatever state it has locally to a remote server or a database and so on.

When we are dealing with scheduling of such applications, stateful and stateless, the stateless applications are fairly straightforward in that sense. When you scale up, you basically launch new instances on new nodes or a new cluster, and they pretty much all start from the vanilla image. The idea being that everything is immutable: you start from there, you do the transactions, and when you want to scale down, you just kill the extra instances. If the need comes and you want to, say, schedule some high-priority task, you can easily kill the additional instances that are no longer needed or that need to be preempted, and your high-priority task can get the node or the resources right away. However, in the case of a stateful application, scaling up may or may not be difficult depending on how the application has been written. For example, if it's an application that has to do some precomputation or an initialization phase, then it will need that much time before the application is productive. And if you kill an application that is already running, and it's not a graceful shutdown, then you will lose the computation time and so on.
So basically what that means is that if you have a high-priority task coming in, then killing some instances of the stateful application will definitely result in some compute time loss. And you can think about the same thing for the cases where you have to move some application instances from one node to another. This could be because you want to take the node down for maintenance, or there's some hardware failure, and so on and so forth.

So in short, what we can say is that the modern container orchestration tools are actually optimized for stateless applications. I'm not saying they are designed for stateless applications, but they are more optimized towards stateless applications, because that is what the entire world is using. You have this immutable image where you start from, do the computation, kill it, and then if you have to start again, you start from scratch. So the question then is: how can we make it better for stateful applications to run and to survive in this world that is optimized for stateless applications? Well, the answer is really simple: make them stateless and then you are done. Fine. But how do we make them stateless? That's the next question. And the answer is something that you won't like: rewrite them. Start from scratch, and this time write a stateless application. Again, not scalable, not something that you want to tell your developers to do. I wouldn't like it if someone were to tell me to do this. So what's the alternative? The alternative here is to combine the application with the notion of process or container checkpointing, and then do migration and things on top of that.

So this is what we will talk about in this next section on process migration. I'll introduce what it actually means and how it applies to applications and so on. Before we begin, just a brief overview of the terminology. When we say process migration, people usually refer to migrating a single operating system process, like a single Linux process, from one node to another. When we talk about container migration, we are talking about a container, which can have more than one process inside, being migrated to a different node. Finally, there is virtual machine migration, for example VMware vMotion, which I believe everyone has heard of, where you take the entire virtual machine and migrate it to a new node or a new data center, maybe across the continent. In all these cases, the key point is that the process or the container or the virtual machine was running before the migration, and it continues to run after the migration.

So we'll see what is actually involved in doing such a migration. This is a very general recipe that pretty much works in all these scenarios. You first pause the running process or container or virtual machine so that its state is now immutable. You then take a snapshot of the current state. You copy over the snapshot to the target node or the new data center or the new cluster. And finally, you restart from the snapshot that you just took, and you have the application or the virtual machine up and running. Taking the snapshot is often referred to as checkpointing or snapshotting, and the restart part is called restart, plain and simple. The one key thing to note here in terms of process migration is that you want to make sure that it's not visible to the outside world.
Because if the outside world can notice that the application is not there, then that's bad for the application. The way we do that is to ensure that we have minimal downtime, which is kind of obvious. And of these four steps, the biggest chunk of time is spent taking the snapshot and then moving it over to the new node. So if you have a huge application, it may take on the order of seconds or minutes to take the snapshot of the running application, and migrating it across the network to the next node or the next cluster is going to be an expensive operation. Ideally, one would want this to be on the order of milliseconds. So there are a lot of shortcuts and techniques that people have used and researched over the years. For example, there are things like demand paging, delta compression, and taking snapshots at regular intervals so that you only move the pages that have been modified since the last transfer, and so on. Basically, in all of these cases, you're trying to reduce the time spent in steps two and three. So sometimes you have two application instances running, one on the original node and one on the new node, but the new instance, the migrated instance, has not actually caught up yet; you keep taking incremental snapshots and moving them across, and eventually you have a very tiny window where you shut down the original application, move the latest state to the new one, and restart it there. I won't go into the details of how this is done; there are several research papers and blog posts that cover it.

So in terms of checkpointing: when someone says we are trying to checkpoint a process or a container or a virtual machine, what does that mean? The textbook definition would be that checkpointing, or checkpoint-restart, is the ability to save the state of a set of running processes to disk so that you can later restart them. It's a very simple thing. You have this computation running on a set of nodes; you take a snapshot, and now you have a bunch of files on disk. And because they are files, you can move them around, copy them over the network, or do all sorts of operations on them. And finally, when it's time to restart, you just take the image and restore the process state, or the application state, from that particular snapshot. Okay? Now, how is checkpointing related to process migration? Process migration inherently requires checkpointing. All you are doing is taking the snapshot, so you are checkpointing, and in some sense you are restarting at the same time as you are taking the checkpoint. You want to minimize the time needed for checkpoint and restart by parallelizing some of these operations. For the sake of simplicity, we will just look at checkpointing; the optimizations are something we can do later on.

So now I want to give a quick demo of what checkpointing feels like. Okay? It's a dead simple demo. I'll show you the C file that you can see here. It's a while(1) loop where we print a counter and then sleep for two seconds. Nothing fancy in here. And if I run the binary, you can see it's just counting one, two, three, and sleeping in between. For the purposes of this demo, what I'm using is the DMTCP tool. So now I'm launching this application under it. I'm using the interval flag, so I'm telling it to take a checkpoint every five seconds.
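For reference, the commands in this demo come down to roughly the following. This is a sketch: the binary name ./counter and the checkpoint file name are illustrative and may differ slightly from what's on screen.

    # launch the counter program under DMTCP, taking a checkpoint every 5 seconds
    dmtcp_launch --interval 5 ./counter

    # after killing the program, the checkpoint image is left on disk
    ls ckpt_counter_*.dmtcp

    # restart from the checkpoint image; the counter resumes where it left off
    dmtcp_restart ckpt_counter_*.dmtcp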
So we'll see that in about five seconds it will take a snapshot. It should have been about five seconds by now, and I'm killing the application. Notice that I killed it right after it printed four. Looking at the directory listing here, I have a .dmtcp checkpoint image. Let me see: it's roughly 2.9 megabytes, that's the checkpoint image size. Now, at some point I could move this checkpoint image to a new node, but for the purpose of the demo I'll just restart from it right here: dmtcp_restart and the image name. And as you can see, it resumed from right after it took the checkpoint; it took the checkpoint sometime after it printed three. So that's pretty much the demo.

Going over some quick use cases for checkpointing. Traditionally, checkpointing has been used mostly in the HPC world, where things are really compute-intensive and losing even an hour of compute time is pretty bad, because you are running the application on thousands of nodes, and losing one hour of computation means losing 10,000 CPU hours from your overall allotted quota, and so on. This is where you have your application running on a thousand nodes: what if one node goes down? Do you want to lose the entire computation? No. You want to take snapshots so that if you lose something, you restart from the previous one and you are good to go. Scheduling and process migration we already talked about. Debugging is also a very interesting use case. Think of the checkpoint image as your ultimate bug report. You have your program running and you are taking frequent checkpoints; take the last checkpoint before the failure and send it to the developers of the application, and they can restart from it. It will be exactly the same scenario: they run it and eventually hit the bug again, assuming of course that there are no races, because if you have races you may have to run it multiple times.

There are also faster startup times. As we talked about, if some application has a huge initialization time, then instead of running a thousand application instances that each spend four hours initializing, you initialize one instance, take a snapshot, and then restart a thousand copies of it. That basically saves you a bunch of CPU cycles. And then there are things like save and restore of a workspace: say we are working in a MATLAB session and we have a bunch of windows open. You don't want to recreate them every time; just take a snapshot, take it with you on a USB key, and restart on your laptop. Then there's speculative execution, and managing long tails. The last one is actually quite interesting. These days we have 16-core or 24-core or 32-core machines, really fat machines, and we have these multi-threaded applications. So say you have an application with 16 threads, and they run for, say, 12 or 16 hours, and then 15 of those threads have exited, but there is this one thread that is going to take another 4 hours to do some accumulation or post-processing of the data that the other 15 threads generated. In most cases the 15 remaining cores are getting wasted, because the resources are still claimed for the entire application even though the remaining work is effectively single-threaded. That is what is meant by managing tails. You can optimize this by moving 16 such processes onto a single node, so that each remaining thread is consuming one core, and now you have the 15 cores, or 15 nodes, available to do other work.
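To make the faster-startup use case a bit more concrete, here is a rough sketch of that pattern with DMTCP. The ./simulation binary and node name are made up for illustration, and dmtcp_command assumes the default coordinator that dmtcp_launch starts.

    # run the application once and let it go through its expensive initialization
    dmtcp_launch ./simulation &

    # once initialization is done, ask the DMTCP coordinator to take a checkpoint
    dmtcp_command --checkpoint

    # copy the checkpoint image to other nodes and restart a copy there,
    # skipping the initialization phase entirely
    scp ckpt_simulation_*.dmtcp node42:
    ssh node42 dmtcp_restart ckpt_simulation_*.dmtcp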
So anyway, in terms of stateful applications, the rough claim here is that if you take a stateful application and add checkpointing to it, it is almost equivalent to a stateless application. It won't be as smooth, because you will have to take checkpoints, and checkpoints take time and so on, but it's almost equivalent to having a stateless application, and that's what I meant at the beginning when we asked how to make stateful applications stateless. So again, scaling up is now trivial: you take a checkpoint and replicate it across the nodes; scaling down, you can just kill instances; and migration is fairly straightforward. One thing to note here is that the cost of checkpointing often depends on the memory footprint of the application. If you have a memory footprint of a gigabyte and you are writing the checkpoint image to a regular disk, then assuming roughly 100 megabytes per second, it will take 10 seconds to dump the checkpoint image. If you have some fancy hardware, you can get pretty amazing speeds, like 60 gigabytes per second and so on, but depending on the cluster and the application, this is usually the overall bottleneck in creating a checkpoint. Okay, so now I'll invite Isabel to go over the Apache Mesos framework and talk about checkpointing there.

Hello again. I'm going to introduce Apache Mesos, or as we like to call it, the data center kernel. The idea behind Apache Mesos, the project itself and the paper it is based on, was to solve two main problems. First, deploying applications in data centers is becoming more and more difficult, almost impossible; traditionally, most data centers are statically partitioned. So the idea was: why can't we run applications on our data center as easily as we run applications on our phone? Mesos is trying to solve that problem. And second, building distributed systems is very hard. You have to handle a lot of different scenarios, and managing a distributed application means taking care of state and taking care of resources. Those are all difficult things to do, and Mesos provides a solution for that. How does it do that? It abstracts the data center as one computer: it basically aggregates all the resources of all the different machines in the data center and provides them as a single block. At Mesosphere we are building what you can call an Apache Mesos distribution; it's called the Data Center Operating System. In this definition you can see that an operating system applies to a computer's hardware resources; what we want to do is apply that to the data center's hardware resources, and that's how you get a data center operating system.

The main concept to understand about how Mesos works is that Mesos is a two-level scheduler, based on an offer-based model. Mesos itself cannot run applications on its own; it needs a second-level scheduler. So if you've heard about Mesos, you've probably heard about these things called Mesos frameworks. A Mesos framework is simply a distributed system that has a Mesos scheduler. So like I was saying, Mesos is this kernel abstraction for a data center, but it's also an easy way to write distributed systems. The most popular framework in the Mesos ecosystem is Marathon.
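As a rough illustration of what using a framework like Marathon looks like from the user's side (this is not part of the talk's demo; the JSON fields follow Marathon's REST API, assuming Marathon is listening on localhost:8080):

    # POST an app definition to Marathon; its scheduler then accepts Mesos offers
    # and asks the master to launch the tasks on some agents
    curl -X POST http://localhost:8080/v2/apps \
      -H 'Content-Type: application/json' \
      -d '{
            "id": "/my-app",
            "cpus": 1,
            "mem": 128,
            "instances": 3,
            "container": { "type": "DOCKER", "docker": { "image": "nginx" } }
          }'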
So for example, with Marathon in this schema: you run Marathon, which is a distributed system that has a Mesos scheduler. It receives a resource offer from the Mesos master, it decides to launch a task on that offer, and then the Mesos master has that task run by an executor on an agent. The task updates its state to the master, and the master sends the status back to Marathon. Please note here that the executor is somewhat separate from Mesos itself, because a Mesos framework can also provide a custom executor.

The second main idea behind this abstraction of the data center is that in a partitioned data center there will always be times with idle resources, whereas if you just aggregate everything, you can take advantage of all resources at all times. It saves a lot of money. So how does this actually work? The three main components of Mesos are the scheduler, the Mesos master, and the Mesos agent. The Mesos agent starts by advertising its available resources to the Mesos master. The master then sends those as an offer to the scheduler, and the user decides to act upon that offer; in this case it's just a Docker image with some of the resources from that offer. The scheduler can decide to use only part of the resources in that offer; that's not a problem, Mesos will take care of offering the leftover resources to another scheduler, or to the same scheduler in another offer. The scheduler then sends the accepted offer to the Mesos master, and the Mesos master simply forwards it to the agent the offer came from. And once the task has a state change, like staging or running (it can also fail), that goes back to the master, and so on back to the user.

Here you can also see that you can specify that you want to run a custom executor. At the agent level, what this does is that the agent sends the launch-task command to the executor, and the executor actually launches the process. It then waits for the task status to come back and updates the agent, and so forth, like I said on the previous slides. That line there that looks kind of weird is the Mesos isolation, which is provided at all times, whether you use a custom executor or the default one in Mesos.

So all of this is very nice, but when you start launching multiple schedulers on Mesos, the master becomes this message hub. The way Mesos is implemented, it uses an algorithm for fair resource allocation, a weighted fair resource allocation, but that's not always sufficient to guarantee that the resources are fairly shared. So you can also plug in your own allocation algorithm: there's an allocator module, so you can write your own. At this point you start having something that looks more like a cluster: you have multiple agents, and you have multiple applications, or at least multiple schedulers. Mesos also provides high availability for its masters, as you don't want a single point of failure there. How it works is simply that we use ZooKeeper for consensus, so when the scheduler or the agent is trying to talk to the master, they just find the leading master through ZooKeeper. When you put it all together, this is probably a better illustration of what a real cluster looks like: you'll have hundreds of applications running on schedulers, a pretty big quorum of masters, and a typical data center running on Mesos can go to tens of thousands of nodes. So this is more realistic.
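For reference, the ZooKeeper-based master discovery mentioned above is just a matter of how the masters and agents are started; a hedged sketch of the usual flags, with illustrative hostnames and quorum size:

    # each master registers in ZooKeeper and takes part in leader election
    mesos-master --zk=zk://zk1:2181,zk2:2181,zk3:2181/mesos --quorum=2 --work_dir=/var/lib/mesos

    # agents (and schedulers) find the current leading master through the same znode
    mesos-agent --master=zk://zk1:2181,zk2:2181,zk3:2181/mesos --work_dir=/var/lib/mesos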
If I had shown you that full-cluster slide at the beginning, you'd probably have been kind of disoriented. So how do we bring these two things together, two-level scheduling and container migration? I'm going to show a small demo using runC. Why runC? How many of you have heard about OCI? Okay, cool, some of you. OCI is the Open Container Initiative, a specification effort that's trying to standardize the way you define what a container image is, and runC is based on that. It's well integrated with CRIU, which is a checkpoint/restore tool. It's also a very lightweight container runtime, and it's compatible with Docker. Mesos has had Docker as a first-class citizen for a few releases now, and having something that is compatible with it allows me to not change anything on the user interface side.

So what do I do now? I have my own scheduler that I call Vault. When I run my application, I do exactly the same thing as before: I specify a Docker image and I specify my executor. At the agent level, the agent will just talk to the Vault executor, which will launch the task. And this task will not run in a Docker container or a Mesos container; it will run in a runC container. So every time I launch a task, it will be in the same sandbox, but with the isolation provided by a runC container.

So, demo time. Here I have a Docker Compose file that creates a very basic Mesos cluster with the most basic pieces: a master, an agent, and my framework. When I do docker-compose ps, I can see my components running. Now I can hit the Mesos UI; I have one agent available with its resources. And this is my Vault UI, so I can run an image from there. For the demo I chose to run a Redis container, because Redis is this cache tool that basically keeps everything in memory, so when you checkpoint it and restart it again, you can see that what you put in memory is still there. I'm just going to give it two CPUs and 50 MB of memory, and here I say the image is redis, the command is redis-server, and I specify: please don't use the default executor, use mine. I send that, and my task is running with this ID. It's the same one in the Mesos UI: in the Mesos master you can see that my task is running, and I can go check the sandbox. In the sandbox I have my executor, I can see that it ran and was used, and that it created the OCI bundle with the rootfs for the Redis image, so you can see everything there.
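For reference, the runC side of the rest of this demo boils down to a handful of commands. This is a hedged sketch: the container ID, bundle path, and service layout are illustrative, and in the demo the checkpoint and restore are actually triggered from the Vault UI rather than typed by hand.

    # bring up the mini cluster (master, agent, Vault framework) and check it
    docker-compose up -d
    docker-compose ps

    # inside the Mesos agent container: list runC containers and open a shell in one
    runc list
    cd /path/to/oci-bundle            # runC commands are run from the bundle directory
    runc exec -t <container-id> bash

    # inside the container: put something in Redis memory
    redis-cli set hello world
    redis-cli get hello               # -> "world"

    # back on the agent: checkpoint the container (CRIU under the hood), then restore it
    runc checkpoint <container-id>
    runc restore <container-id>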
So this is all great, but what happens when I go and look at my mini cluster? I can exec into the Docker, sorry, the Mesos agent container and see what's going on. I can do a runc list and see that, yes, my Redis container is running. I can exec into it: with runc you can also exec, you provide the container ID and a command, so we're going to run bash. The way runC works, you have to be in the bundle directory, so now I am inside my runC container running the Redis image, and I can just run redis-cli and, for example, set a simple key, hello, to the value world. And when I do a get hello, I get it back. This is just a simple runC container running, nothing very fancy for now.

In my UI I made two buttons, one for checkpointing and one for restarting. This almost-download-looking button here, maybe you don't see it very well, will checkpoint that container. So when I checkpoint here, my task fails. Why is that? Because right now, the way we integrated this for the demo, it's not integrated into Mesos: the current states available for Mesos tasks do not include checkpointed or restored, or checkpointing or migrating; there are only the regular Mesos task states for now. That's one of our aims with this demo, to get the community to help us integrate this with Mesos by creating demand. So here, when I do runc list, I see that my container is no longer running, but it is still there, and I can actually look at the checkpoint image: it's here, this is what a checkpointed image looks like, these are the files that are actually used for the restore.

So when I go back and use my second button, which restarts my container from that checkpoint image, I just hit it and it does the same thing: it restarts with the same task ID, so the state comes back as running. And when I go back to the Mesos UI, I can see that my task failed five minutes ago and now it's running again; the sandbox is exactly the same one. If I go back and do runc list, my container is now running again, and since I'm already in the bundle I can exec into it as I did before with my container ID, and when I run redis-cli and do a get hello, the "world" value is still there.

So, like Kapil was saying, this opens the door to the ultimate debugging, as he was calling it, to process migration, live migration between nodes, and all of those cool things that come with checkpointing, but in a distributed way. Now, the things that are left to be done on the Mesos side: just as Docker was made a first-class citizen for running containers in Mesos, we'd like to make checkpointing a first-class feature integrated into Mesos, to provide checkpointing as a service in Mesos. It's not trivial; we still need to make all the transitions transparent to the scheduler and the executor, and we'll need to integrate the task states into Mesos directly, so add new task states, as I was saying, like checkpointed and restoring. We'd also love to support multiple checkpoint services. Right now I use runC because it's well integrated with Docker and it uses CRIU, but for the live migration that Kapil showed before in his demo, DMTCP works way better. And it's not really about better or not, it's about the user experience: DMTCP gives you a better experience at the user level, while CRIU is more system oriented. So thank you very much. Do you have any questions?

Thank you, this was good. The question I had is: with this migration mechanism for containers, have you done any performance studies on what its limitations are? To give you an example, let's say
you have a very high-performing network application, take for example something you want to put in an EPC, an evolved packet core, or a mobile gateway or something like that, the sort of infrastructure that handles really, really high traffic, and sometimes you can't just shut that down and migrate it to another host. This would be extremely useful, but at this point we are looking at process migration with VMs, and there's a certain class of applications that you cannot use this for. My hope is that as people move to using containers and building these as microservices, the class of applications where you cannot use process migration or container migration continues to shrink. But I'd still like to hear your views on what the limitations are from that perspective, or whether you've done any studies on that.

I'll let Kapil take that one. So I guess the cost of migration is still there: no matter whether we go from virtual machines to single processes or containers, that particular cost will always be there. The thing is, if you have, say, heavy network traffic with a given process, there's not much you can do fundamentally to change that. There are some research articles which point to having this duplication going on in parallel, so you have a process running as a slave of the first process and so on, but that again is not something you can just put out and say it solves all your problems. So there is no easy answer to that, I'm afraid.

Okay, thank you. Any more questions? Okay, thanks again, then.