Welcome to another edition of RCE. This is Brock Palen. We are back with new shows for 2013, and again, you can always find us online at rce-cast.com. You can find all the old shows there, subscribe, look up our blogs, and find all of our Twitters and stuff like that. I also have with me Jeff Squyres of Cisco Systems, one of the authors of Open MPI. Jeff, thanks again for your time.

Hey Brock, welcome to 2013, even though it's, you know, really December 2012, but don't tell anybody that.

Yes, we're recording this ahead of time. It'll be the first show released in the new year, but we are recording it at the end of the previous year.

Scheduling. So, Brock, okay, today we have DMTCP, and we have two people from Northeastern University here who are developers of it. We have Gene Cooperman and Kapil Arya. I'm sure I killed that name. Let's give the standard disclaimer that we're ugly Americans and we're terrible at pronouncing other nationalities' names. So why don't we have you guys introduce yourselves and give the correct pronunciation of your names?

Hi, I'm Gene Cooperman, really pleased to be here. Just a little bit about my bio: way back, a long time ago, my degree was in applied math. At some point I got heavily involved in parallel computing, computational algebra and so on. One thing led to another, and at some point I started worrying: if this program keeps on running for a long time and then the computer crashes in the middle, what are we going to do? And so I became heavily invested in checkpoint-restart, maybe around 2004-2005.

Hi, I'm Kapil Arya, and I'm a fifth-year PhD student with Professor Cooperman. Before coming here I finished my bachelor's in computer science in India. Glad to be here.

Okay, so give us a little bit of background on DMTCP. It was only recently brought to my attention. What is it?
So it stands for Distributed MultiThreaded CheckPointing. This is what happens when you ask the students to come up with a name for a project. But the good news is that you can Google it and you will always get our project. You'll never get something else. The idea is that you have a program running and you want to save its state into a checkpoint image, restart it later, restart multiple copies, whatever you want to do, and it should be totally transparent. You shouldn't have to modify the binary. You shouldn't have to modify the operating system. It should just work. That's the basis of what we're doing.

Now, what is the point of this? Why is checkpointing a good thing? You said a second ago that, you know, I've got this long-running program and the computer dies. But is that a practical problem? Does this happen in real life? And what does having something like DMTCP mean to the application developer and the end user?

So yeah, it's a really important problem. In high-performance computing, with batch queues and so on, people have known about this forever. Suppose you're given a certain slot of time, up to 24 hours, and your program is going to take 36 hours. What do you do? This is one of the places where people would often come to us. But then there are a lot of just ordinary people working on the desktop who want to run a program for a long time, maybe on their laptop, and they don't really want to keep their laptop in one place for the next 72 hours. So it just adds a lot of flexibility. And later in the program we hope to talk about some more unusual applications of checkpoint-restart, which go far beyond the original high-performance computing.

So a lot of high-performance computing applications, large parallel applications, already have checkpointing built into the application itself. Why make something like this?
So this is a great point. People distinguish between application-level checkpointing and system-level, or transparent, checkpointing. At the application level the programmer has to work harder to make it happen. If you're running on some kind of supercomputer, then the machine is expensive and you just spend the human resources to do it, and so that's not really the area in which we play. But ordinary researchers have a smaller research group. Maybe they're scientists, and they don't want to spend their time writing unusual code in order to save only their data structures and nothing else. If they get it wrong, they just have to start over. Using DMTCP it's a no-brainer. You just start under DMTCP and you ignore it. And whenever you want you can take a checkpoint. You can do it on a timer at regular intervals. You can have your program ask for a checkpoint when it wants it. And then it just works behind the scenes, which is the way the best software should work.

So that was the benefits. Compared to application-level checkpointing, though, what are some of the downsides to using this transparent checkpointing?

So the downsides. Initially, we've taken a philosophy that we don't modify the kernel, the binary, or anything. Therefore, when it comes time to checkpoint, we have to work harder to interrogate, for example, the kernel about any missing state. We do support distributed computations, so we have to work harder to figure out what data is in the network. So especially in the early years of DMTCP our coverage was not as good as we would like. We've been working to improve that coverage, adding a number of heuristics and, more recently, plugins, so that the end user can easily add coverage to checkpoint aspects that we don't directly handle.

So that sounds kind of like magic, right? You said you don't change the application. You don't change the kernel. How does this work then?
Yeah, it's a good point. I should say this works primarily under Linux right now, although there seems to be a recent port to Android that we're also excited about. And because we can checkpoint a virtual machine, we can checkpoint Windows inside the virtual machine. In any case, to take the simplest example, suppose you have an open file at the time of checkpoint. In Linux you go to the proc file system and find what your open files are. Using lseek you can determine the current offset, and you save that. Then when you restart, we restore all of memory, so the program doesn't even know that it was checkpointed. It just assumes it's continuing. Our code opens the file on the original descriptor number, looks up the saved offset, sets the offset back to what it should be, and so on.

Now you said earlier, too, that it can fire via a timer. Is that via a signal, or what mechanism is that?

The simplest way to do it is to let DMTCP use its own internal timer. So the D stands for distributed. When you have distributed processes you want to have a central coordinator that talks to everybody. So the end user talks to the coordinator, and all of his application processes register themselves with the coordinator. A late process on a new computer can just join the computation by calling up the coordinator. And then the coordinator has a timer. When the timer goes off, it will send a message to each of the individual processes saying it's time to checkpoint. At the same time, we have added a hijack library in each user process, and that hijack library has created a checkpoint thread, which is our code and is listening to the coordinator to find out when it's time to checkpoint.

I see. So you're running off in an extra thread over there, and it can just wake up whenever you choose, whether it's by an event or a timer or whatever you want. How do you
stop the main thread, or actually all the other threads, and restart them? That seems like you have to get into a little bit of plumbing there.

We do a little, but it seems to work well. So the sequence of events is this. Typically the user will just start up their own application. That application might transparently create a coordinator if there is not already one at the default location. After that, the application runs our hijack library first, currently using the LD_PRELOAD feature of Linux, but there are other ways to do that. This creates our own checkpoint thread. The checkpoint thread then sets up a signal handler. By default it's SIGUSR2. Then, when it's time to checkpoint, our checkpoint thread sends that signal, SIGUSR2, to each of the user threads. The user thread goes into the signal handler that we declared, and now the user thread is executing our code. Since we control him now, we force him into a lock and he has to wait there while we save all of user-space memory to disk.

So does this thread literally not have to do anything until the point of a checkpoint? So there's no overhead for running when not checkpointing?

Certainly our checkpoint thread does nothing while the user threads are operating, and we take this as a point of principle because we don't want to have any race conditions. So either we're running or the user is running, but never both at the same time. We do have some minor overhead in another respect. There are certain subtle cases where we want to know what the user thread is doing, so we put wrapper functions around certain system calls, usually the ones that are called only infrequently. This can tell us, for example, when a socket connection was opened, and then we record it. In this case the user thread will ask for a socket connection through the normal system call. That will be intercepted.
He'll end up executing our wrapper function first. Our wrapper function, still being executed by the user thread, will save in a special place information about this new socket and then pass on the parameters to the normal system call, which then returns, but all under control of the user thread.

So there have been other transparent checkpointing projects out there. Most of them relied on kernel modifications. Why did you decide not to do that, and because you don't modify the kernel, what does that cause you to have to do?

Yeah, so this is certainly a favorite subject of ours: what are the pros and cons of kernel-based checkpoint-restart versus completely transparent, no modifications? If you just want to get a quick checkpoint-restart up and running rapidly, then probably the easiest thing is just to go in there and modify the kernel. The problem is that the kernel keeps changing over time, and so there's now a responsibility to maintain code within the kernel. Our view is that we try to stay close to the POSIX API. The POSIX API is extremely stable, and as long as we stay close to that, the code doesn't have to change. So it's been many, many Linux versions now where we have not had to change anything about DMTCP. There's a new kernel and everything continues to work. So this is probably the biggest difference in philosophy.
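The open-file example from earlier in the conversation (look up the open files in the proc file system, record each offset with lseek, then reopen on the same descriptor number at restart) can be sketched in a few lines. This is only an illustration under stated assumptions, not DMTCP's actual code: the function names `snapshot_open_files` and `restore_open_file` are hypothetical, it is Linux-only because it reads `/proc/self/fd`, and it handles regular files only.

```python
import fcntl
import os
import stat


def snapshot_open_files():
    """Record path, open flags and current offset of every open regular file.

    Hypothetical helper: scans /proc/self/fd (Linux only).
    """
    saved = {}
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        try:
            if not stat.S_ISREG(os.fstat(fd).st_mode):
                continue  # skip pipes, sockets, terminals, directories
            saved[fd] = {
                "path": os.readlink(f"/proc/self/fd/{fd}"),
                "flags": fcntl.fcntl(fd, fcntl.F_GETFL),
                # lseek with offset 0 from SEEK_CUR returns the current position
                "offset": os.lseek(fd, 0, os.SEEK_CUR),
            }
        except OSError:
            continue  # descriptor was closed while we were scanning
    return saved


def restore_open_file(fd, info):
    """Reopen a saved file on its original descriptor number and offset."""
    flags = info["flags"] & (os.O_ACCMODE | os.O_APPEND)
    new_fd = os.open(info["path"], flags)
    if new_fd != fd:
        os.dup2(new_fd, fd)  # move it onto the original descriptor number
        os.close(new_fd)
    os.lseek(fd, info["offset"], os.SEEK_SET)
```

A real checkpointer would also have to deal with deleted files, files that changed between checkpoint and restart, and descriptors shared between processes; none of that is attempted here.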
We have to work harder to discover what's inside the kernel, since we're not there. We do that through system calls and the proc file system. But on the other hand, because we are staying close to the POSIX API and the proc file system API, our API doesn't change, and therefore the code continues to work across most versions of Linux 2.6.

Okay, so then the one question would be: you can keep track of threads that get spawned. What about applications that do the fork-exec, you know, where it completely loses track of its child?

Yep, so one of our rules is that we try to be contagious. We have our hijack library in the initial process. If it calls fork, there's no problem. Our hijack library now exists in the child process also, and we have a wrapper around fork to make some minor adjustments. If exec is called, again we make sure that the LD_PRELOAD environment variable is set, so we get our hijack library. And to take it one step further, we also have to be contagious across machines. For example, if we're checkpointing an MPI application, since we're transparent, we don't know that it is an MPI application. But what will happen is an MPI application will usually start on one node and then generate processes on other nodes using SSH. Since we're spying on system calls anyway, we look for any call to SSH. If we find that they're passing an SSH command, then we modify that command line to make sure that LD_PRELOAD will also be present on that command line. So we're contagious through fork, through SSH, obviously when new threads are created, and so on.

Now what happens if the MPI doesn't use
system or exec and SSH and things like that, but some other resource manager API to launch on remote servers?

So if they're using a different resource manager, again, there has to be some system call or call to a library that is used by the resource manager to create new MPI processes. So on the principle of being contagious we would put wrappers around that and make sure to include our hijack library via LD_PRELOAD or a different mechanism.

So I would like to add here, we recently did this for Torque. Torque actually uses its remote spawn API to create remote processes. So if an MPI application is running on Torque, it actually is not using SSH, so we put wrappers around the call that actually spawns the processes, and we stay contagious.

Okay, let me specialize you a little further. We worked together a little bit before to get DMTCP into Open MPI. What do you guys do there? Because in Open MPI we don't really do that at all, right? We actually launch a remote daemon, and the daemon is the guy who launches the MPI processes and things like that, and we might use a variety of APIs, certainly not just SSH. So how do you guys work under Open MPI?

So that's a good example, because there we have two ways of doing the checkpoint. The first way is what we described earlier: essentially DMTCP is sitting on top and checkpoints every process in Open MPI, including daemons and anything else. But the other way is this. Open MPI has a very nice checkpoint-restart service with a well-defined API. In that case, Open MPI assumes that it will be responsible for stopping the network traffic, and then it will notify the checkpoint package for each individual process, saying it's time to checkpoint this one process. So in that case we use only the lowest layer, which is MTCP. There's no D because it's not distributed anymore. It's for a single process. It's simple.
There's no coordinator. Our library again is a hijack library in the user's process, and that single process is then checkpointed.

Okay, so for a little bit of clarification here, for me, who has not done this: this is something that I as a user can install and use with the system MPI library that's been installed by the system, right?

Correct, and at Open MPI they've qualified our MTCP attachment as basically working.

Okay, so this all happens on the node that initially spawns this, and this gets trapped and attached before it goes to the other machine to start the other process?

Yes, basically what happens is that these good Northeastern folks have written a plugin for Open MPI that uses our underlying checkpoint framework. And so when you mpirun a process, as Gene was saying, they're not really involved in the distribution of the process. But we might wake them up later, so to speak. If Open MPI requests a checkpoint, it'll halt all the network traffic and then jump into the MTCP library to actually checkpoint each individual MPI process. That is correct, right, Gene?

Yeah, that's a nice explanation.
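The "contagious" SSH launch described earlier, where a call to SSH is spotted and its command line is modified so that LD_PRELOAD is also present on the remote side, might look roughly like the sketch below. Everything here is an assumption made for illustration: the function name `make_contagious` and the library path in the usage note are hypothetical, and the argument list is assumed to be simply `ssh host command...` with no SSH options, which a real wrapper could not rely on.

```python
import os


def make_contagious(argv, hijack_lib):
    """Return a copy of argv in which a remote ssh command preloads hijack_lib.

    Simplifying assumption: argv looks like [ssh, host, command, args...]
    with no ssh options in between. Anything else is passed through unchanged.
    """
    if len(argv) < 3 or os.path.basename(argv[0]) != "ssh":
        return list(argv)
    ssh, host, remote_cmd = argv[0], argv[1], list(argv[2:])
    # Prefix the remote command with env so the variable is set on the far side.
    return [ssh, host, "env", f"LD_PRELOAD={hijack_lib}"] + remote_cmd
```

For example, `make_contagious(["ssh", "node2", "./solver", "-n", "4"], "/opt/example/hijack.so")` yields `["ssh", "node2", "env", "LD_PRELOAD=/opt/example/hijack.so", "./solver", "-n", "4"]`, while a command that is not SSH comes back unchanged.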
Thank you. And in a similar way, going beyond MPI, there's Condor for high-throughput computing, and there also we have a similar mechanism for working with them. In their case, there's a certain glue script, and the glue script, instead of running the application directly with Condor, will also bring in the MTCP libraries and run the two together. And so for Condor also, one can get automatic checkpoint-restart. Oh, and I should say this is on their vanilla universe. They have an older standard universe from the 1990s, where they have a more limited form of checkpoint which works very well for single-threaded processes, but I guess they don't have the resources to develop more general checkpoint packages.

So you mentioned Open MPI, and you mentioned Condor, which is both a resource manager and an environment. What other MPI libraries are you guys known to work with, or other novel parallel distributed systems?

So Kapil, you've done more of this work.

Yeah, so we have verified it with MPICH2 also. In terms of parallel things, OpenMP is also supported, and so is Cilk. So, yeah, we do support all these.

Oh, OpenMP. There were some interesting horror stories in getting that working that you may want to tell a little about.

Yeah, with OpenMP things were quite different from the rest of the applications. One of the things that bit us was the rapid creation and destruction of threads. As it turns out, OpenMP actually creates a lot of threads and destroys them pretty rapidly, so we ran into issues of PID conflicts, or rather thread ID conflicts.
So basically, whenever a thread is created the kernel assigns a thread ID, and later on when a new thread is created it assigns a new thread ID. But what if you have created so many threads that thread IDs have now wrapped around, and a later thread gets the same thread ID as an older thread? Those were some of the issues which we ran into, and that actually was where we had to design a new layer of virtualization where we virtualized all the PIDs and TIDs. So instead of telling the application about the kernel IDs, we actually generate a virtual ID for each process and thread.

So this is a good example, going back to the discussion we had earlier about kernel-based checkpointing versus user-space transparent checkpointing. In kernel-based checkpointing you have the luxury of saying: at restart time, I'll look up what the old process ID was, and I will start the new process with exactly the same process ID, and I just hope that there's no current process that has already reserved that process ID. So this is one of the basic challenges. Prior to checkpoint the process saved its process ID. After restart it thinks it has the same process ID. In the kernel method they say, yes, you do have the same process ID. We will guarantee it. On the other hand, in our case we put wrappers around every system call that might ever refer to a process ID, and when the user asks what his process ID is, we don't give him the true process ID. We give him a virtual process ID, and then we proceed to translate between virtual and real process IDs always. Then at restart time the real process ID may change. The kernel may assign a new real process ID, but we just adjust our table, and the user is still working with virtual process IDs. So we think that this is actually more flexible.
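The translation table described here can be pictured with a small sketch. It is a toy model under stated assumptions, not DMTCP's implementation: the class `VirtualPidTable`, the two wrapper functions, and the starting virtual ID of 40000 are all hypothetical names and values chosen for illustration. The application only ever sees virtual IDs; wrappers translate to real IDs on the way into the kernel, and a restart only has to rebind the table.

```python
import os


class VirtualPidTable:
    """Two-way map between stable virtual IDs and whatever the kernel assigned."""

    def __init__(self, first_virtual=40000):  # starting value is arbitrary
        self._next = first_virtual
        self._virtual_to_real = {}
        self._real_to_virtual = {}

    def register(self, real_id):
        """Hand out a virtual ID for a real one (idempotent)."""
        if real_id in self._real_to_virtual:
            return self._real_to_virtual[real_id]
        virtual_id = self._next
        self._next += 1
        self._virtual_to_real[virtual_id] = real_id
        self._real_to_virtual[real_id] = virtual_id
        return virtual_id

    def to_real(self, virtual_id):
        return self._virtual_to_real[virtual_id]

    def to_virtual(self, real_id):
        return self._real_to_virtual[real_id]

    def rebind(self, virtual_id, new_real_id):
        """After restart the kernel assigned a new real ID; the virtual ID stays."""
        old_real = self._virtual_to_real[virtual_id]
        del self._real_to_virtual[old_real]
        self._virtual_to_real[virtual_id] = new_real_id
        self._real_to_virtual[new_real_id] = virtual_id


def getpid_wrapper(table):
    """What the application sees when it asks for its process ID."""
    return table.register(os.getpid())


def kill_wrapper(table, virtual_pid, sig):
    """Translate the virtual ID back before calling the real system call."""
    os.kill(table.to_real(virtual_pid), sig)
```

The design point is that only the table changes at restart: `rebind` swaps in the new real ID, and every virtual ID the application stored in its own memory before the checkpoint remains valid.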
There's less danger of a thread ID conflict or process ID conflict in the case where a huge number of threads are created.

Let me jump back to the OpenMP bit there. I have a clarifying question to ask about that. Before, when you were talking about checkpointing single-threaded processes, you had mechanisms to make that main thread stop and then restart it and so on. So how do you do that when there could be lots of threads running? Do you guarantee to stop them all, or do you just kind of catch them wherever you catch them, or how does that work?

So basically we use the same signaling mechanism to stop all the user threads, and once everyone has entered the signal handler, we make them wait for the lock, and that's the point where the checkpoint thread starts and saves the checkpoint image to disk.

Currently we put a wrapper function around thread creation, so that we know all the threads that the user has ever created. In an alternative design, we could use the proc file system to discover all the threads the user has created.

Okay, so the threads wait for a signal to actually synchronize and checkpoint. What about for uninterruptible things like I/O?

So with I/O, when you have suspended the thread there might be pending I/O, like data in buffers or data in the network, and we actually drain that data before we write the checkpoint image. We query the kernel or the network to get every single bit of data into the process memory, and then we just save it to the checkpoint image. And similarly, when we write, we flush all data to disk at checkpoint time.

Okay, and then you're keeping track of all these threads. But you mentioned with OpenMP, it creates and destroys threads very, very quickly. Do you start getting into some noticeable overhead in that kind of situation?

So yeah, this was a very interesting case when we were actually developing this work.
We noticed that there was considerable overhead because of some of the things that we were doing, mostly related to thread join, where the main thread actually waits for the threads to die. There were some issues in that particular implementation that caused the overhead to be as high as 20%, but in the current code base we have brought it down to pretty minimal. So it's like less than a percent or so.

I should add that we have been really lucky to have a great user base that gives us a lot of this feedback. So the issues with performance for OpenMP were brought to us by some of our users, in fact for the SVN trunk even before a release, and so we're always grateful to communicate with our users.

So now, you mentioned stuff for disk activity, but what do you do about the network? Because the network is a little more amorphous. You can't necessarily guarantee what's happening on the other side. What kind of guarantees do you give about socket reads and writes?

That's a great point. In fact, in 2004 DMTCP started as an undergraduate project, and the undergraduate student actually came up with this initial idea. Other people have done this before, of course, but nevertheless, for him it was original. The idea is that if there's data in the network, then at the time of checkpoint, after we have forced all the user threads to stop, we look at each of our sockets, and at the send end of the socket we send a special cookie, some unusual data pattern. Then there's a barrier, and then we look at the receive end of every socket and we proceed to read from the receive end of every socket until we see that cookie, and when we see it,
we know that now there's no more data in the network. So this is great when it's a closed computation. But in addition we have also had to worry about the rest of the world. Sometimes there's, say, Vim. We can checkpoint it, but then it will keep a socket open to the X server. There are examples where the NSCD daemon will have a shared memory region. So for each of these cases, we've also had to develop certain heuristics. But here we believe we have an advantage in doing this totally in user space. A kernel-based checkpoint-restart package could not handle this and would still need a user-space component in order to supplement it for these reasons.

All right, continuing on in a network vein, what about the, you know, OS-bypass APIs like verbs, where a lot of what's happening in the network is directly in the hardware and you have no insight into it? You can't probe the OS, the proc file system, or anything like that to find out what's going on. Do you wrap all of those calls too, or how do you handle outstanding
completions?

So there also we have to wrap. This was work of which a lot was done by an undergraduate student here, Gregory Kerr. I've forgotten all of the details, but I remember there was a certain context associated with this, and we had to create a shadow context beyond the context. There we would put wrappers around the various verbs functions that had to deal with this. We would give the application the shadow context, and then we would copy to the real context that the hardware actually talked to. But the other part of this, ultimately, is that at the time of checkpoint we just have to pause a little to guarantee that there's no more network traffic.

So something that as a sysadmin I'd be curious about is: do you guys integrate with any of the existing resource managers and schedulers for, say, checkpointing a process that the scheduler has decided needs to be preempted to make space for a higher-priority user?

Some of our users have implemented that. I guess there was an example with Grid Engine. Kapil, you were talking with them.

So there was one particular example. They actually wrote some scripts to use DMTCP with Grid Engine to do all the preemption and so on.

And then more recently, Amazon.

So yeah, there is this one guy who contacted us because he wanted to use DMTCP to checkpoint scientific computations which are running on the Amazon EC2 cloud. So there also, he is writing DMTCP plugins to integrate DMTCP with the Amazon EC2 APIs to do the preemption and checkpoint and restart.

Okay, and there was something on your website I noticed, URDB. What is the relationship between that and you guys?
So URDB stands for Universal Reversible Debugger. It's a nice thing, although I must warn you that it actually is deprecated, and we have a new cousin of URDB which we are planning to release early next year. That is called FReD, which stands for Fast Reversible Debugger. So instead of talking about URDB, why don't we talk about FReD itself?

Yes, URDB was to handle single-threaded processes. FReD handles multi-threaded processes. The first thing that was needed is that we needed to be able to checkpoint ptrace. We had a series of these challenges, and ptrace was the next one. Eventually we did it. Still later, we converted that into a plugin to make everything more modular. So it's not part of the core of DMTCP. It comes with the distribution, but one can load it in or unload it. This gives us the ability to checkpoint GDB. If you want a reversible debugger, then the way you can do it is to put DMTCP on top of a GDB session and allow the user to give the debugging commands. Suppose he has given a hundred commands since the last checkpoint, and you want to go back one command. Just restart and go forward 99 commands. To make this work, there are several software components that one needs. You need to checkpoint GDB. We can do that because we can checkpoint ptrace. You need a deterministic record-replay component, because if you have multiple threads, you want to make sure that on replay you arrive at the same place where you were originally. And you want, in our case, a Python script at the top to control all of this, to control the checkpoint-restart and the deterministic record-replay. And naturally you need the debugger. Mostly we're using GDB as the debugger, but the method works for many debuggers.

When you say next year, you actually mean this year, right? Because we're in 2013, wink wink, right?
Yes, that's right.

So that's actually pretty fascinating. Let me ask a clarification question about the deterministic replay bit. Do you support everything in GDB? Like if I run for a while in GDB and then hit Control-C to get a command prompt back, you know, I interrupt my program. Do you support that kind of mode as well?

Definitely not in the initial version. Right now our goal is to get the basics working and then get feedback from the users about what they want in practice. Already the release of FReD has been delayed because we were worried about making sure that it was close enough to production quality. Then we can release it as beta software, get feedback, and eventually maybe get into the subtle issues that you're talking about.

Okay, so if I'm writing an application and I'm just planning on using DMTCP to checkpoint it, do I need to be aware of anything to make sure that my application that currently works keeps working?

I would say no. What we try to tell people is that if the application doesn't work, then we consider it a bug. Tell us, and we want to fix it. Because we're working totally in user space, we have a number of heuristics to handle special cases. We mentioned NSCD, the name service caching daemon, X servers, and so on. But ultimately we want to continue to add heuristics and plugins to handle these extra cases.

Let me ask this. It's a little off topic here, but it's something I like to ask everybody: what kind of source control do you use, and why? I think you actually said Subversion already, but I always love to hear people's reasons for why they pick what they pick.

So in 2005 or so, Subversion seemed the right way to go. It's easy to use and it worked. But I guess now there's also more of a preference for Git, so we try to support both of the various schemes. Kapil can tell you more.

So with the centralized thing, on the SourceForge project page, we have Subversion there. But we also have a GitHub project page, which is in some sense secondary. But that is
where we have the Git repository, and I personally have been using Git over SVN, or git-svn, for the past couple of years or so for the development. We just don't want to move the Subversion repository to Git at SourceForge. We just want to make all people happy.

Yeah, keep the historical context. I want to go back to the different versions easily. But who knows, maybe in the future we could switch to Git as our primary version control system.

All right, I want to flash back to much earlier in the conversation. You said something that I knew I wanted to return to. You said that you can now checkpoint Android. Can you give us a few details on that? Why is that useful? What's cool about it? Who did it, and all these kinds of things?

So, yeah, this is something we're really excited about, and it happened only in the last month. We were quite surprised. Internally, I was working with an undergraduate student to port the lowest layer, MTCP, to Android, and then in the middle of that we noticed a new GitHub source repository which was talking about porting DMTCP to Android. We got in touch with the developers, Kito Cheng and Jim Huang, working for 0xlab in Taiwan, and they've done a magnificent job. They have slides about it. Android is developing very rapidly, so there are still some things to catch up on, but it seems to work and it allows us to play in a whole new area.

Separately, what's the use case for that? So you want to checkpoint your phone, or?

Okay. Oh, and I should add, in February we had ported DMTCP to ARM also. It works on Intel. So yeah, why would you want to checkpoint your phone? In Taiwan there are many companies that develop for smartphones, and, you know, as they say, if there's a bug in the field, what are they going to do about it? Especially suppose it's some device, say your refrigerator or TV. Are you going to bring that in? So what they would like to do is,
when something crashes, to just checkpoint it, and then they can develop the software so that the customer will send the checkpoint image to the developers for analysis. And they're also very interested in our work on FReD, because then they can also get some of the history of what the customer was doing and also send that to the developers.

There's another use case. They want faster boot times, and so they are using some sort of virtual machine which they checkpoint, and then when they're booting up the device, instead of actually going through the entire boot procedure, they can just restart from the checkpoint image and continue, saving the boot-up time.

But that's an excellent point. That will lead into another thing we'd like to talk about: checkpointing a virtual machine very fast.

Well, let's talk about that. Virtualization is pretty hot, particularly in the company that I work for, Cisco. We play a lot in the virtualization space, and there's a tremendous amount of innovation happening in that area across the entire industry. So how are you guys contributing to that?

So we just put a technical report on arXiv.org this week. As I said, if we can't checkpoint it, then our view is that's a bug and we want to fix it. One of the things we'd noticed is that we could not checkpoint virtual machines. So why not? And since we have this plugin architecture, which allows us to make things modular, we could just have a different plugin for each new virtual machine. So with some students, Rohan Garg and Komal Sodha, we've now done this. It works for KVM/QEMU. It works for user-space QEMU. In fact, in that case apparently our checkpoint-restart always worked for user-space QEMU. We just never got around to testing it. And we've made it work for lguest, which is a very small virtual machine, great for learning how virtual machines work.

And the other thing that we're really fascinated by is this. Since we wanted to take a whole snapshot, we want to save not just the virtual machine but also the guest file system, and how can we do that fast? Well, now there's this "butter" file system, Btrfs. Several distributions are talking about making that their default in about a year. Because this is a copy-on-write file system, it allows you to take a snapshot while the running process continues to write to the guest file system. So now we can checkpoint in 0.2 seconds currently, and we haven't even tried to make it fast. This is just what works out of the box. When we get interested in making it faster, we will probably get down to maybe 10 milliseconds. It's hard to say.

That is pretty impressive. So those are some pretty strange uses of stuff there: checkpointing refrigerators and virtual machines and cell phones and other stuff. So what's coming in the future of DMTCP?

Kapil, would you like to talk about some ideas?
So one of the things that I've been working on is the plug-in architecture for DMTCP. We want the users to be able to write their own heuristics. Of course, we can take care of doing a very robust checkpoint-restart of the entire process, but sometimes the users may want something special. For example, if they have a huge memory footprint and they don't care about half the memory, then why checkpoint it? So we basically want them to be able to write plug-ins, and these plug-ins will take care of specific tasks that currently DMTCP as a whole is handling. We have also moved our own stuff into plug-ins so that we can reduce the size of the DMTCP core. But we definitely want the users to be able to write the plug-ins.
And then I should say that it seems like each year we have more and more developers and semi-developers around the world who are contributing to DMTCP, and that's been exciting also. I believe we're really getting to some kind of critical mass now, where it's hard for me to predict what we will have a year from now, because people are contributing new ideas from anywhere. But that's part of the motivation for why we think these third-party plug-ins are so important, and we really want to support them.
Okay, so where can people find more information about DMTCP?
Probably the easiest way is to Google DMTCP. The good news about having an unusual name is that there are no conflicts. But the main site is dmtcp.sf.net, on SourceForge. Kapil, do you have other places they should look at?
No, if you just start from the SourceForge web page for DMTCP, you'll find all the links out there, including examples of supported applications.
And there's something we call the closed-world assumption: as long as your application is not talking to the rest of the world, we believe we can checkpoint it. When it is also talking to the rest of the world, that's when you need plug-ins. For example, if you're talking to a database, you could write a plug-in to disconnect from the database at the time of checkpoint and then reconnect immediately afterward, when you resume.
Cool. Okay, well, Gene, Kapil, thank you very much for your time.
Thanks. We really enjoyed this.
Thanks, guys.
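The database plug-in idea mentioned in the closing discussion can be sketched roughly as follows. This is only a toy illustration in Python, not DMTCP's actual plug-in interface: the names `FakeDatabase`, `DatabasePlugin`, `Checkpointer`, `pre_checkpoint`, and `resume` are all hypothetical, and the "checkpoint" here is just a deep copy of a dictionary. It shows the general pattern of a plug-in closing an external connection just before the state is saved and reopening it right after.

```python
import copy


class FakeDatabase:
    """Hypothetical stand-in for an external database connection."""

    def __init__(self):
        self.connected = False
        self.events = []  # records every connect/disconnect, in order

    def connect(self):
        self.connected = True
        self.events.append("connect")

    def disconnect(self):
        self.connected = False
        self.events.append("disconnect")


class DatabasePlugin:
    """Hypothetical plug-in: disconnect before a checkpoint, reconnect on resume."""

    def __init__(self, db):
        self.db = db

    def pre_checkpoint(self):
        self.db.disconnect()

    def resume(self):
        self.db.connect()


class Checkpointer:
    """Toy checkpointer (an assumption, not DMTCP) that calls plug-in hooks."""

    def __init__(self):
        self.plugins = []

    def register(self, plugin):
        self.plugins.append(plugin)

    def checkpoint(self, state):
        # Let every plug-in sever its ties to the outside world first.
        for plugin in self.plugins:
            plugin.pre_checkpoint()
        # With the "world closed", save the state.
        image = copy.deepcopy(state)
        # Resume: plug-ins restore their connections in reverse order.
        for plugin in reversed(self.plugins):
            plugin.resume()
        return image


def demo():
    db = FakeDatabase()
    db.connect()
    checkpointer = Checkpointer()
    checkpointer.register(DatabasePlugin(db))
    state = {"step": 41, "results": [1, 2, 3]}
    image = checkpointer.checkpoint(state)
    state["step"] += 1  # the running program continues after the checkpoint
    return db, image, state


if __name__ == "__main__":
    db, image, state = demo()
    print(db.events, image, state)
```

In this sketch the saved image never sees a live connection, which is the point of the closed-world assumption: anything that talks to the outside world is handled by a plug-in, while the core only saves self-contained state.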