Welcome to another edition of RCE. I'm your host, Brock Palen, system administrator at a public university, and I have with me my good co-host Jeff Squyres from Cisco Systems and the Open MPI project. Jeff, thanks again. Hey, Jeff, I should point out that we're going to skip a show here. I'm going to be traveling to TeraGrid '09 for a week and I'm not going to be able to do a recording during that time, so there's going to be a two-week pause before we have the next show out. But we will be back with another show later.
Good to know.
Okay, and our show today: we have Paul Hargrove from the Berkeley Lab Checkpoint/Restart project. It's checkpointing software for the Linux operating system, and maybe it runs on some other stuff, but we'll find out from him. He's at Lawrence Berkeley National Laboratory, if I have that right. So Paul, welcome to the show.
Thank you.
So Paul, I know BLCR works on Linux, but can you give us a quick rundown of what it is and whether it runs on any platforms besides Linux?
Sure. Well, the idea of BLCR, Berkeley Lab Checkpoint/Restart, is to provide the ability to save the state of HPC applications and restart them again later. It is really targeted at Linux systems, but I've heard people talk about having some sort of port to some Cygwin thing that was developed. I don't have the details on that; it's not something that we're doing directly. As far as I'm concerned, Linux is our target.
Okay, so real quick: I'm familiar with checkpointing, but normally checkpointing is built into the application itself. Every so many time steps, or every so much wall clock, the application will save its state to a file. Is BLCR a library that helps with that, or is it more like OS-level checkpointing?
BLCR is an OS-level checkpointer. When we started this project there were at least two major reasons why we were interested, not in replacing, but in augmenting what applications can do on their own. While application-level checkpointing is pretty good for handling fault tolerance, or the issue of how long a job can run in a given queue, there were two things that the NERSC center, the National Energy Research Supercomputing Center at Berkeley Lab, was using OS-level checkpointing for on their Cray system at the time, and we've tried to emulate those ideas. The two things of interest there were really more system-administration-oriented. One was job-migration capabilities: being able to repack the torus on a Cray T3E was very important to getting jobs through on that system. The other was being able to manage the scheduling of the system in such a way that very long-running jobs or very wide, full-configuration jobs were limited to a certain part of the work day, or actually not the work day, the midnight to 4 a.m. time slot. So BLCR's original selling points, when proposing this work to the Department of Energy, were more in the scheduling area and the job-throughput and system-utilization benefits than in fault tolerance. Now, of course, fault tolerance is something that checkpoint/restart is good for, but it's harder to do efficiently: we have the ability to checkpoint at the OS level, but we don't have any application-level knowledge.
So I should point out real fast that when we say checkpointing, and you mentioned restart in there, it's really the ability to restart an application from some point in time before it's done. It could be saving its state and picking up from any point.
So with the OS-level checkpointing you're actually able to save an application mid-run and restart it from that point in time without having to start from the beginning.
Somewhere else too, right? I mean, that's kind of what Paul just mentioned: it can be useful in Cray's example, where they were repacking the torus to get better network utilization, presumably. But say a node is about to go down, or you just need to give up some nodes altogether, and so you pause your application and restart it later, and that later might be on different nodes in your Linux cluster, for example. Are these all correct examples, Paul?
Yes. Actually, while we were originally targeting cluster systems, there are a number of grid-related projects that have started using BLCR for migration. So the network-of-workstations, volunteer-computing-grid sort of people have looked at using BLCR for migrating jobs away from machines when they go back to their normal use. You know, this machine is someone's desktop 9 to 5, but overnight it runs someone's engineering or scientific computing job. They've looked at using BLCR for managing those types of scenarios, where they can migrate the job off to another resource that might be available.
All right. Now Paul, you mentioned that you have no application-level knowledge. What do you mean by that? This is presumably down in the kernel somewhere. Are you, in a sense, just making a big core dump file that can be restarted later? How does that work?
Well, yes. One could think of what's in there as a core file. In fact, that's actually what we experimented with originally: before we had the ability to restart things, we were just dumping core files and examining them with gdb to make sure we were getting the right process. As for the application-level knowledge, I guess I sort of jumped into the middle of something instead of the beginning or the end. Brock mentioned that applications typically do checkpoint, and this is often using a sort of minimal representation of what the application needs. What we're doing is dumping the entire memory image of a process. Now, there are some optimizations we're able to make. For the executable and shared libraries, for instance, we're storing just the path name to that file and not actually copying the contents. But everything that's in the heap and in the stack, and similarly anything that relates to an mmap of an unlinked file, which is something often done for temporary files, is all saved in the image. And we don't have the knowledge that this gigabyte of memory was something that was malloc'd and freed and doesn't mean anything to the application anymore; the OS still had it in the memory map of the application. So in that sense it is like the core dump, and it's capturing all of the memory.
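The save-by-path versus save-by-contents split described above can be pictured with a small model. This is only a sketch under stated assumptions: the `Region` type, the `plan_image` and `image_bytes` helpers, and the sample paths and sizes are all hypothetical and are not BLCR code or its on-disk format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Region:
    """One entry in a (hypothetical) process memory map."""
    kind: str                   # "exe", "shlib", "heap", "stack" or "mmap"
    size: int                   # bytes
    path: Optional[str] = None  # backing file, if any
    unlinked: bool = False      # backing file was deleted (typical for temp files)


def plan_image(regions: List[Region]) -> List[Tuple[Region, str]]:
    """Decide, per region, whether to record a path or copy the bytes."""
    plan = []
    for r in regions:
        if r.kind in ("exe", "shlib") and r.path and not r.unlinked:
            plan.append((r, "path"))       # file still on disk: store its name only
        else:
            plan.append((r, "contents"))   # heap, stack, unlinked mmap: dump the bytes
    return plan


def image_bytes(plan: List[Tuple[Region, str]]) -> int:
    """Bytes of memory that end up inside the checkpoint file."""
    return sum(r.size for r, how in plan if how == "contents")


# Made-up example process.
regions = [
    Region("exe", 2_000_000, "/usr/bin/solver"),
    Region("shlib", 1_800_000, "/lib/libc.so.6"),
    Region("heap", 1_000_000_000),   # includes freed-but-still-mapped memory
    Region("stack", 8_000_000),
    Region("mmap", 50_000_000, "/tmp/scratch", unlinked=True),
]
plan = plan_image(regions)
print(image_bytes(plan))
```

The point of the model is the asymmetry: the executable and libraries cost almost nothing in the image, while the heap is saved whole, whether or not the application still cares about it.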
It's also capturing the registers, the signal handler registrations, and a number of other things that are stored in kernel data structures and don't exist in some particular region of the application's memory. These are things the application may be able to query through the OS, but since we're working on the OS side of the interface, we're able to get them a little more efficiently and a little more thoroughly. You can't always access these things from user space, and this is one of the things that distinguishes BLCR from some of the user-level checkpointing libraries: we can get information, for instance, about the files that are open or the files that are mmapped without needing to dive into the proc file system, and without needing to do tricks like interposing on or wrapping parts of libc to track the open and close calls. Tricks like that are not necessary because we're working on the kernel side of things. So there's a trade-off between being able to capture the information efficiently and accurately, versus having to capture all of it without discriminating what was really essential to the application.
Cool. So since you're down in the kernel, are you part of, you know, Linus's kernel? Are you part of Red Hat or any other distro? How do you distribute the BLCR software?
So there are a couple of parts to that question. Are we in any distros? We're working on that.
Are we part of the Linus Torvalds kernel? The answer there is definitely no. The decision from the very beginning was that we were going to write BLCR as a kernel module and not require any patches to the kernel. That means there haven't been any patches to submit upstream for inclusion, for one thing. But that choice was made because, looking at our target audience of HPC centers, especially ones that are funded by DOE, often you're buying from an integrator or a vendor that just won't support your system if you're patching the kernel they provide. With a loadable kernel module, a given center can just not load the module at startup, or unload the module afterward, and make their service call for whatever may be wrong, and the vendor will still support the system. With a patched, custom-compiled kernel, they would have to reboot the system back to the vendor's kernel to get support. So we've stuck with a patch-free approach, and the result is that there's nothing to feed back to the Linus Torvalds kernel. As far as the distro thing goes, we're not officially part of any particular distribution, but we have a few users out there who are pushing for inclusion in some of these package download sites. The RPM Fusion site that is used with Fedora systems has, for Fedora 10, a BLCR package that one of the users packages up, so one can just use the standard tools there to easily install BLCR.
So because it's a kernel module, the administrator at a site needs to actually install it; it's not just a library. If a user has an application they want to use BLCR with, they have to convince the administrator at their site to install the kernel side of it for it to be available. What are some of the other limitations out there?
You mentioned that you don't actually package up all the dynamic libraries, all the .so files, and you just include paths to them. What if I try to resume that checkpoint on a machine where the libraries may be in a different location? Will it fail or will it work?
Um, well, yes or no. As I said, we do have some interest, some users in the grid community, that were not part of our original target audience, and certainly not part of what we proposed to the Department of Energy when we started the project. But at their urging, and to satisfy some of their needs, we do now have the option of actually saving the executable and shared libraries. It greatly increases the size of the image that's saved, but it does allow us to optionally get around the issue you just raised, which is that the shared libraries may not be the same or not be available on the target system in a migration, or even on the same system after it's been shut down for maintenance. So let's say you've taken a checkpoint, brought the system down, and upgraded from one distro version to the next, and glibc was updated as a result. We wouldn't be able to restart normally in that circumstance. But the restart can work if, when you took the checkpoint, you, being a good system administrator and having realized that you were upgrading glibc, gave BLCR the option, for that particular set of checkpoints anyway, to do the extra I/O and save the shared libraries.
That's a limitation that we still have by default, but there's a way to get around that one. Other limitations, since you went there: BLCR really is targeted at HPC, and as a result we realized that MPI is an absolute must. You have to be able to checkpoint more than just a serial process. Yet we don't have the resources in our group to implement the necessary integration for all the potential high-speed networks and all the potential MPIs. So we sort of punted on that, you could say. Instead of checkpointing any sort of network I/O (no sockets, no InfiniBand), BLCR just has a callback mechanism, and the MPI implementers are stuck working with that. We can deal with that laundry list later; the MPI implementations have done pretty well on that. But because BLCR isn't natively handling network communication, if someone has an application that's using something other than MPI for communication, BLCR won't be able to handle that. We recently had someone who, for what sounded like a class project, wanted to play around with checkpointing Apache, and we just had to tell them: well, we can't deal with TCP sockets, so you're not likely to get what you want out of that. We've had people who wanted to checkpoint an application that was connected to a MySQL server, and we told them similarly that we won't be able to handle the socket connection to that, and most likely we won't be able to do anything useful with the data in the SQL database. So we had to tell them that they weren't likely to be able to do much with that. There are a lot of non-HPC circumstances, people out there with client-server applications, that we can't do much for.
So you're really concentrating mostly on the compute and the memory and those kinds of resources, and not so much the network resources. Is that a fair characterization?
Yes, we're doing the part that we're able to do efficiently and effectively.
We're sort of specializing in those types of things, the stuff that the kernel manages, in an almost pass-through way. As for the networking, checkpointing a TCP stack is a research area in itself. Trying to track the drivers for Quadrics, Myrinet, InfiniBand, all those high-speed network things, is not something that we were able to think about doing in a fully generic way. If it's done in an MPI runtime, it needs to integrate at the level of that implementation and not be fully generic. So it seems much more effective, in terms of hours of human work that go into it, to provide a callback mechanism. BLCR has this way to register, on the user-space side of things, so in libmpi, a callback that's similar to a signal handler. It's invoked when the checkpoint is requested and allows MPI to save its state: enough information to reestablish connections, for instance, to make sure it's not losing any messages, or that it has the ability to replay them if it's willing to lose them, and whatever other bookkeeping may be necessary for that particular implementation.
Well, that certainly seems fair, because some of those networks that you named are also OS-bypass, so the OS, and therefore you as a kernel module, don't even have access to the state. The only state that is visible is going to be known to the MPI. And since I obviously represent Open MPI here: we've done a lot of that integration work. Actually, 99% of that work has been done by Josh Hursey from Indiana University.
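The division of labor just described, where the checkpointer only announces the checkpoint and the communication layer takes care of its own network state, can be sketched as a callback registry. This is a toy illustration and not BLCR's actual C interface: `CheckpointCoordinator`, `ToyMpi`, and every method name here are hypothetical.

```python
from typing import Callable, List

Resume = Callable[[], None]


class CheckpointCoordinator:
    """Toy stand-in for the checkpointer's side of the callback contract."""

    def __init__(self) -> None:
        self._callbacks: List[Callable[[], Resume]] = []

    def register_callback(self, callback: Callable[[], Resume]) -> None:
        self._callbacks.append(callback)

    def request_checkpoint(self, save_image: Callable[[], None]) -> None:
        # 1. Let every registered layer quiesce; each hands back a resume step.
        resumers = [callback() for callback in self._callbacks]
        # 2. Only now dump the process image.
        save_image()
        # 3. Resume in reverse registration order (last quiesced, first resumed).
        for resume in reversed(resumers):
            resume()


class ToyMpi:
    """Hypothetical message layer that owns the network state."""

    def __init__(self, peers: List[str], log: List[str]) -> None:
        self.peers, self.log, self.connected = list(peers), log, True

    def on_checkpoint(self) -> Resume:
        self.log.append("drain in-flight messages")
        saved_peers = list(self.peers)   # enough information to reconnect later
        self.connected = False
        self.log.append("close connections")

        def resume() -> None:
            self.peers, self.connected = saved_peers, True
            self.log.append("reconnect to " + ",".join(saved_peers))

        return resume


log: List[str] = []
mpi = ToyMpi(["node1", "node2"], log)
coordinator = CheckpointCoordinator()
coordinator.register_callback(mpi.on_checkpoint)
coordinator.request_checkpoint(lambda: log.append("write image"))
print(log)
```

The coordinator never touches a socket. Everything network-specific lives in the callback, which is why the same checkpointer can sit underneath different communication libraries.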
He's done a really great job. But yeah, it seems like a perfectly logical and reasonable division of labor there, and division of state knowledge: a higher-level runtime that's extending, for example, the single-process concept to a multi-process job has got to be the one that coordinates the network checkpointing side of things. That makes perfect sense to me. But actually, I'm sorry, I want to take a step back here. Let's take the parallel bit out of it and explain a little bit more. What you've provided in BLCR is completely transparent. Let's say it was just a plain vanilla serial job. It can be as simple as this: your job was running, and then all of a sudden it's paused into a checkpoint file or something like that. Then the system administrator took the system down, and 20 hours later the system administrator put it back up, and that job had no clue that it even happened. There's no extra code in that job at all, right? Is that a correct example? Is that the level of transparency that you were going for?
You've got it 99% correct. There is no modification made by the user to their source code, so it's transparent at the application level. But just as when you move from one MPI to another you may be required to recompile or relink, there's a little gotcha there. We do require a small amount of code to be linked into the application. When we're working with an MPI integration, that's actually hidden down in the mpicc wrapper, which links in our library. It needs to, because it's calling into that library itself to manage the callbacks. But we also have the ability to use the LD_PRELOAD environment variable.
So if you did have that truly unmodified example, being all those programs in /bin and /usr/bin, we can use LD_PRELOAD to insert our small amount of code at runtime into those applications. That brings up another weakness: we do not have that capability if you have a statically linked application. Static linking is not a very common case, but it is one that we do have to work around by telling people not to do it, documenting that linking statically is not going to work out.
So far we've mentioned system administrators starting a checkpoint. If I'm just a home user and I've got this on my box where I'm developing stuff, and I want to use it, how do I actually tell BLCR I want to take a checkpoint now?
Well, basically we have three commands in the BLCR release, around which various batch schedulers can integrate things. But if we're not using a batch system, then there are basically three commands. cr_run is provided to handle that LD_PRELOAD thing. So let's say I wanted to start a bash shell and I was going to checkpoint that, so I was going to have a sort of interactive session.
I was going to checkpoint that and restart it later. Let's say that hibernate and suspend don't work on my laptop. I could actually do cr_run bash -login and have, well, yeah, I guess that's a reasonable example. I could just be running stuff in that normally, and then I could take a checkpoint of it with the command cr_checkpoint. There are a bunch of arguments that are really not worth trying to describe out loud. That would create a file that would be usable at a later time to restart from. The bash session would continue, and you could do a bunch of work there. And let's say your battery suddenly fails. When you get back to AC power later, you could run cr_restart and give as an argument the file that was created when we did cr_checkpoint. It really is as simple as that. And the example of bash is not a made-up one; that's something we actually have done. We have a test suite, and a lot of the corner-case tests are really messy C code, but for a lot of very simple things we just have bash scripts that do various things that we checkpoint and restart to test some of the easy functionality.
So if you checkpoint bash, and bash is running something else, BLCR will also checkpoint all the children of that process? It will automatically do all that? I guess that makes sense.
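The three-command workflow above can be written down as argument lists, for example to hand to `subprocess.run`. This is a minimal sketch: only the command names `cr_run`, `cr_checkpoint`, and `cr_restart` come from the conversation. Passing the target process ID as the sole argument to `cr_checkpoint`, and the file name `context.4242`, are assumptions to check against the BLCR documentation, and the helper functions are hypothetical.

```python
import shutil
from typing import List, Sequence


def run_argv(program: str, *args: str) -> List[str]:
    """Start a program under cr_run so it can be checkpointed later."""
    return ["cr_run", program, *args]


def checkpoint_argv(pid: int, extra: Sequence[str] = ()) -> List[str]:
    """Ask for a checkpoint of a running process (PID-as-argument is an assumption)."""
    return ["cr_checkpoint", *extra, str(pid)]


def restart_argv(context_file: str) -> List[str]:
    """Restart from the file that an earlier checkpoint produced."""
    return ["cr_restart", context_file]


def blcr_available() -> bool:
    """True only if all three commands are on PATH."""
    return all(shutil.which(tool) for tool in ("cr_run", "cr_checkpoint", "cr_restart"))


# The laptop scenario from the conversation, with a made-up PID and file name.
print(run_argv("bash", "--login"))
print(checkpoint_argv(4242))
print(restart_argv("context.4242"))
print("BLCR installed:", blcr_available())
```

Building the argument lists separately from executing them keeps the sketch runnable on a machine without BLCR, and makes it easy to add whatever real options a site needs through `extra`.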
It would kind of need to, but...
It will actually do that. In earlier versions of BLCR, that was the biggest limiting factor, I would say more so than the inability to handle communication or static libraries: it was originally only able to handle a single process. It could have been multi-threaded with pthreads, but it had to be a single process. For a few years now we've had the ability to handle, not arbitrary sets yet, but a process and all of its descendants: children, grandchildren, et cetera. If a program tried to do what we call daemonizing, where a process starts a process that exits after starting a third process, then we've lost the connection between that process and its grandchild. We'd miss that. But we have the ability to handle POSIX session IDs and process group IDs, and in that way we don't actually depend on the parentage. So both process trees and groups we're able to handle, and that gets the children of the bash. Actually, I guess you've reached another one of my limitations: the terminal handling that a full-screen editor like vi or emacs does, we're not able to checkpoint and restart correctly. So if you were running emacs at the time you took the checkpoint, then when you restarted, the display modes, what is it called, the terminal settings, would not necessarily be correct. You'd probably have to save your file, exit emacs, and restart it to get your screen to not look all garbagey. But again, not exactly a huge HPC limiting factor there.
Yeah, I was going to say, it sounds like BLCR is not something that's going to change how you edit source files, but it can definitely be helpful when whatever you were editing is running for 48 hours and the hardware looks like it's going to die 24 hours into it.
Right, and I should say a lot of these limitations that BLCR currently has are not huge technical limitations.
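The daemonize case discussed earlier in this exchange is easy to see with a toy process table: walking parent links misses a grandchild whose intermediate parent has exited, while selecting by session ID still finds it. The tables and helper names below are hypothetical illustration data, not anything read from a real kernel, and the sketch assumes the grandchild stayed in the original session (a daemon that creates its own session would not match either way).

```python
from typing import Dict, Set


def descendants(root: int, ppid_of: Dict[int, int]) -> Set[int]:
    """All processes reachable from root by following parent links downward."""
    children: Dict[int, list] = {}
    for pid, ppid in ppid_of.items():
        children.setdefault(ppid, []).append(pid)
    found: Set[int] = set()
    stack = [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def session_members(sid: int, sid_of: Dict[int, int]) -> Set[int]:
    """All processes that share one session ID, regardless of parentage."""
    return {pid for pid, s in sid_of.items() if s == sid}


# bash is PID 100. PID 110 is a normal child with its own child, 111.
# PID 102 was started by a short-lived child of bash that has since exited,
# so 102 now has init (PID 1) as its parent.
ppid_of = {100: 50, 110: 100, 111: 110, 102: 1}
sid_of = {100: 100, 110: 100, 111: 100, 102: 100}

print(sorted(descendants(100, ppid_of)))
print(sorted(session_members(100, sid_of)))
```

The tree walk returns only the processes still linked to bash through living parents. The session query also returns the orphaned grandchild (and bash itself), which is the sense in which session and process group IDs remove the dependence on parentage.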
They're more a matter of us having finite time and finite resources. We prioritized things that we saw as being HPC-critical first, then stuff that is HPC-useful second, and stuff that's not relevant to HPC just hasn't been addressed. Someday, maybe some user will contribute the stuff necessary to deal with things like the terminal settings or other things that we haven't covered. There is some online documentation describing what the priority list was at some point. I think it's still reasonably correct; the only thing that's probably out of date is that the list of which things we have gotten to is probably incomplete.
Is there a way to tell cr_run to automatically dump a checkpoint every 15 minutes of wall clock? Is there any way to do that, or is that pretty much the responsibility of a third script called from cron?
At this point we don't provide anything that manages periodic checkpoints automatically. As I said, those three executables are really meant as the building blocks for building up other types of scripts. If you're dealing with a batch system, for instance, it would be more sensible and more appropriate to instruct the batch system to be responsible for the checkpointing, since it usually has some requirements for where the checkpoint files have to be for restart. It's the logical agent to do that. But yes, let's stick with the bash session: if you wanted to have your bash checkpointed every 15 minutes, you could write your cron or at job or whatever, or you could just have a little background shell script that slept for 15 minutes and then invoked the checkpoint, you know, while 1 or whatever.
Okay, so what resource managers do you know of that actually integrate with BLCR? If someone wanted to check whether their resource manager supported it, what are some of the more popular ones that actually support it?
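The "sleep for 15 minutes, then checkpoint, in a loop" idea mentioned above might look like the following. This is a sketch under stated assumptions: it assumes `cr_checkpoint` accepts the target process ID as its argument, it does not address how successive checkpoint files are named or where they go, and the `periodic_checkpoint` helper with its injectable `run` and `sleep` parameters is hypothetical (the injection only exists so the loop can be exercised without BLCR installed).

```python
import subprocess
import time
from typing import Callable, List, Optional, Sequence


def periodic_checkpoint(
    pid: int,
    interval_s: float = 15 * 60,
    count: Optional[int] = None,
    run: Callable[[Sequence[str]], object] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Sleep, request a checkpoint, repeat. Loops forever when count is None."""
    taken = 0
    while count is None or taken < count:
        sleep(interval_s)
        run(["cr_checkpoint", str(pid)])   # PID-as-argument is an assumption
        taken += 1
    return taken


# Dry run with stand-ins: record the commands instead of executing them,
# and skip the real 15-minute waits.
calls: List[Sequence[str]] = []
waits: List[float] = []
taken = periodic_checkpoint(4242, count=3, run=calls.append, sleep=waits.append)
print(taken, calls, waits)
```

On a real system the same loop would be started in the background next to the job. Under a batch system, the scheduler is the better place for this, as noted in the conversation.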
So one of the challenging things about working with open source software is that people can do whatever they want with your software, and they don't always have to tell you about it. So my list is probably incomplete at this point. The one that we're aware of and have actually worked productively with is the TORQUE resource manager. TORQUE being in the PBS family, someone else could probably take their patches and port them to OpenPBS, and we have talked with Altair Engineering about getting this into PBS Pro. So I would speculate that PBS Pro will have this functionality at some point in the future, if they haven't already. We have been contacted by Platform Computing about LSF. They only really asked licensing questions and haven't asked any technical questions, so I would assume that they have looked at doing this, but I can't even guess whether they're doing it or not. The other major one that I can think of right now is Grid Engine. There is some work that's been done that is available if you just Google for it, instructions for how to set up BLCR with Grid Engine, but I don't believe it would necessarily be a complete integration. I haven't tried that one myself.
So speaking of which, and based on the organization where you work I can kind of guess, but let's make sure for real here: what is the license of BLCR?
BLCR is completely open source, as is required for the money we receive from the Department of Energy. But the different pieces are actually subject to different licenses, partly just based on where the code is derived from. We have a great deal of code in the kernel, in the Linux kernel, and that is, as appropriate and pretty much by necessity, under the GPL license. We do have, as has been mentioned, the BLCR library that, for instance, the MPI calls into to handle its callbacks, as well as the small version of the library, the stub we call it, that one LD_PRELOADs. Those are all under the LGPL license, to allow them to be used without conflict with any application regardless of what its licensing may be. And then the other user-space pieces are, well, gee, I can't remember if we ended up putting some of those under BSD or not. I know we had this question with our own licensing people. So there you go, I can't answer that question accurately. I believe we attempted to put the user-space pieces, which are not all that big, the smallest part of the whole thing, under a BSD-style license so that people could rip those up and do anything they wanted with them. So I'll stick with that as my final answer.
Excellent. Well, I'm glad to know we stumped the interviewee. That's always a goal here. Okay, so Paul, while we're talking about integration with other pieces of software, let's talk about MPI. I already mentioned that BLCR is integrated with Open MPI.
Which other MPIs is BLCR integrated with?
Well, the first MPI we were integrated with is LAM/MPI, but that no longer really counts as an active MPI project. In addition to Open MPI, we are integrated with MVAPICH2 from Ohio State, D.K. Panda's group, that being a very popular MPI for InfiniBand networks. We have had discussions with the group at Argonne for MPICH2, and they now have some funding from the same project that funds BLCR that will get that work eventually done. We've also had discussions with a few other groups. As I mentioned before, Platform Computing contacted us about the LSF batch scheduler, but they've also discussed integration with their MPI, and we've had discussions with Intel about integration with their MPI.
So didn't you guys have an announcement recently where you had gotten BLCR hooked in on the Cray XT platform?
Oh, yes. I'm sorry, that was probably the most important answer, for me anyway, to that previous question. Cray did the work to integrate their MPICH2-based Cray Portals MPI with BLCR, and they also funded in part the work I mentioned earlier on TORQUE. So using, I guess it's the 2.2 or 2.3 system software release from Cray, one is able to use BLCR to checkpoint and restart MPI applications on a Cray XT through their batch system and through their aprun job launcher.
Does this only work on Compute Node Linux, or does it work on UNICOS?
It's Compute Node Linux, or I guess they prefer just calling it CNL at this point.
When administrators are setting up their resource managers to use BLCR, what's the common way you see it used? Do they checkpoint every certain amount of time to protect against hardware failure, or do you more often see checkpoints for preemption, suspension, hardware changes?
Well, again, we've gotten back to the weak point of working on an open source project, which is that people don't tell you what they're doing with your software.
Certainly the TORQUE system has inherited from the original PBS work command-line options for specifying periodic checkpoints, so that's something that's available automatically as a command-line option at job submission or as part of a PBS script. But I'm not aware that systems are set up autonomously to do periodic checkpointing of all jobs on a regular basis. So I don't really have a good answer for what a typical setup is, because I'm not even completely sure what people are capable of scripting on their own. Again, our focus in doing this wasn't originally the fault tolerance work, so we haven't ourselves provided how-tos or scripts or tools for getting all of that stuff done. So again, I don't know what a typical deployment out there is.
Yeah, actually we've heard many times on this show that people aren't exactly sure what their users are doing. So on that same note, of the uses you do know, what's the most unusual case? You already mentioned checkpointing your login shell. What's some of the strangest stuff you never thought you would see BLCR actually doing?
Let's see. How do I do this without actually naming the... well, I'll start with one of the two unusual uses that I've heard of that were definitely not part of our original expectation. One is that we've had users from both the Australian and United States defense departments, and they're actually using BLCR to take checkpoints of battlefield simulation software. These are sort of deterministic, event-driven simulations used in, you know, warfare planning. They're using BLCR to checkpoint those, and I guess it's both for fault tolerance and for something that's sometimes called branch points: the ability to take this application, which is just another simulation, not unlike HPC except I don't believe they're parallel, and be able to checkpoint it and then restart, in their case, multiple instances.
That's why they're calling it a branch point: they restart with different inputs. So the scenario in the warfare simulation may say you cross the river at this point, or you don't, and they want to see what both those options look like. So they start two instances, one running down the you-crossed-the-river choice and one where you didn't. That's a scenario we never really thought of when starting to work on BLCR. The other one is... go ahead.
So they actually took the checkpoint and they've basically restarted it twice?
Yes.
That's funny. I never even thought about that.
Yeah, well, as a debugging thing it actually just sort of makes sense. It's also a computational steering application, if you want to think about it that way: for an application that itself wasn't designed to do any sort of steering, you now have the ability to stop it at some point and potentially modify an input file, or provide the next piece of input if it was going to prompt or wait for input. So to continue with my answer, the other thing that we were approached about, and I do not know if it was a completed project: there was a cash register manufacturer,
I won't use their name, that was interested in using BLCR as a way to checkpoint and restart the inventory control application that was running in Linux, apparently, in one of their cash registers. I guess they didn't have any sort of suspend-to-disk or hibernate type of functionality, and they were looking at using BLCR at shutdown to save the state of this software. We inquired why they weren't just modifying the software to save its state to some sort of database, and their response was that the application, given its accounting nature, was certified by some authority, and they weren't permitted to modify it without resubmitting it for certification. They apparently believed that using BLCR and checkpointing it didn't require recertification.
Wow, that's cool.
Yeah, I don't know if they actually followed through with that. My guess would be that they probably would need new certification if they did that, but that wasn't their thought.
In a slightly different line of questioning here, we kind of missed this one when we went by it before: what determines the size of the checkpoint file that is created, and what do people typically do with it? At least among your users of BLCR, what do you see people doing? Because it's my understanding that this file can actually be pretty large, and if you're doing it with a large MPI job, you could have dozens or hundreds of processes running, all dumping out these large files. What are some scenarios, and how do people handle that?
Well, I may not be able to answer the how-do-they-handle-it part, but I can certainly answer the first part: how large are these files? As I mentioned before, we basically save all the memory.
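Since the image is essentially the process's memory, the total volume for a parallel job is roughly per-process memory times the number of processes, and the time to write it follows from the available bandwidth. A back-of-the-envelope sketch, where the function names, the 2 GiB per process, the 512 processes, and the 1 GiB/s aggregate write bandwidth are all made-up illustration values rather than measurements:

```python
def checkpoint_volume(bytes_per_process: int, processes: int) -> int:
    """Approximate total checkpoint size: every process dumps its own memory."""
    return bytes_per_process * processes


def write_time_s(total_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on checkpoint time, ignoring metadata and contention."""
    return total_bytes / bandwidth_bytes_per_s


GIB = 2**30

total = checkpoint_volume(2 * GIB, 512)   # 512 processes at 2 GiB each
seconds = write_time_s(total, 1 * GIB)    # shared file system at 1 GiB/s
print(total // GIB, "GiB in about", round(seconds / 60, 1), "minutes")
```

Even with these modest numbers the job writes a terabyte per checkpoint, which is why the I/O cost, rather than the checkpointing mechanism itself, tends to dominate for large jobs.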
So if you do ps or top and you look at the RSS column, resident set size, that's often a very good first-order approximation to what our checkpoint size is going to be for a given process. And then you multiply that by the number of MPI ranks, and you've got a fairly large I/O problem to deal with for any sizable MPI job. This is something that I know is a big issue on something like a Cray, where the total amount of memory in the system may be very large, and the time to checkpoint that may actually be on the order of an hour. The time to actually do I/O for the entire memory of a Cray system to disk may be on that order.

How do people deal with this? Probably not very well. This is a fundamental issue with doing system-level checkpointing: the I/O required is significantly larger in most cases than what an application-level checkpoint would do. We are trying to address this in a couple of different ways, so I'm glad you asked the question; I get to give a plug for our current work.

There are at least three things that we know can be done to try to address that. One of them, perhaps the most obvious, is compression. One could try piping BLCR's output to gzip or bzip2 or whatever. We're actually taking an approach now of doing that same sort of thing, but at the kernel level, so you're not talking about an extra process with a bunch of extra context switches and copies back and forth between kernel space and user space. But the compression can only be so efficient. We do have to use lossless compression, so the compressibility of someone's application is going to depend a great deal on what sort of data their memory contains. So there's no reliable factor I can quote for what compression can achieve.

The second one, which people are probably familiar with or may think of, is what's called incremental checkpointing. The idea there is that if we take a checkpoint and do all that I/O, and then an hour or two later we go to take another checkpoint, the application may not have actually written all of its memory pages since that previous checkpoint. So the idea of incremental checkpointing is that, at the second or any subsequent checkpoint, you don't do the I/O to write out duplicates of pages that have been unchanged since a previous checkpoint.

In addition to those two approaches, we've started working on something else: the ability for an application to give hints. An application, from my point of view, also includes the MPI library; it's something in user space, I guess. It's the ability to give hints to say that this region of memory does not need to be checkpointed if you take a checkpoint. This is something that I've talked to Josh Hursey about in the Open MPI case. There are receive buffers that are known to be unused, not to contain any useful data, at the time the checkpoint is taken. And since BLCR and Open MPI are already sort of having a dialogue through the library interface, Open MPI can conveniently tell BLCR about these sections of memory that don't need to be included in the checkpoint. The potential exists for numerical libraries or even applications to use a similar interface, when it becomes available, to advertise work arrays or scratch arrays that may consume a lot of memory but not have any useful data.

And finally, with this we may actually be able to approach the efficiency of an application-level checkpoint. If the application wants to write to this interface instead of doing its own application-level checkpointing, it could decompose its data down into the fundamental pieces and exclude the pieces that can be reproduced, recalculated for instance, from that fundamental data. That is usually the gist of how they do their application checkpoints: separating out the parts that are reproducible from those that are not.

That's very cool. Actually, I'll be very interested to see when this stuff comes out, and I'll probably be a fly on the wall in some of your conversations with Josh. Let me ask you, since we're talking about future stuff: what else is up and coming for BLCR? What do you see as the future?

So I can talk a little bit about things that I know are likely to come in the next six to twelve months. The things that I just discussed are all hoped to be present in our late October, early November release. We always try to get one out in time for Supercomputing. So we expect to see the incremental if we're lucky, the compression definitely, the memory exclusion almost definitely.

Another feature that I'm working on right now is actually based on some work that was done by a student at North Carolina State University in conjunction with some folks at Oak Ridge National Lab. It was actually done with LAM and a very old version of BLCR, so LAM's not completely dead, apparently. They called it a job pause mechanism, and you can go Google that for the papers. That's one of the reasons I didn't give the names, because I'd probably leave one out. The fundamental change they made in BLCR for that particular paper was the ability to take an application process that is running and, without it exiting, direct it to reload its memory, its registers, its signal handlers, its file handles, all that, from a previous checkpoint. So that, to me, is rollback in place. That is something we are working on reintegrating, updating it from the old version of BLCR it was written for and getting it in with standard calling conventions and such, so that we could actually take a process that is currently running and force it to, you know, time warp backwards and start back from a checkpoint that was taken.
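As a rough illustration of the incremental checkpointing idea mentioned above, here is a user-space toy in Python. It is not BLCR code and does not reflect how BLCR implements the feature: a kernel-level checkpointer would more plausibly track dirty pages, whereas this sketch swaps in per-page hashing to detect changes. The names `IncrementalCheckpointer`, `split_pages`, `restore`, and the 4096-byte `PAGE_SIZE` are all hypothetical.

```python
import hashlib

PAGE_SIZE = 4096  # illustrative page size in bytes (an assumption)


def split_pages(memory: bytes, page_size: int = PAGE_SIZE) -> list:
    """Cut a flat memory image into fixed-size pages."""
    return [memory[i:i + page_size] for i in range(0, len(memory), page_size)]


class IncrementalCheckpointer:
    """Toy checkpointer that emits only pages changed since the last checkpoint.

    Change detection here is done by hashing each page, which is a stand-in
    for whatever dirty-page tracking a real system-level checkpointer uses.
    """

    def __init__(self):
        self._digests = {}  # page index -> digest recorded at last checkpoint

    def checkpoint(self, memory: bytes) -> dict:
        """Return {page index: page bytes} for pages that differ from last time."""
        changed = {}
        for index, page in enumerate(split_pages(memory)):
            digest = hashlib.sha256(page).digest()
            if self._digests.get(index) != digest:
                changed[index] = page
                self._digests[index] = digest
        return changed


def restore(checkpoints: list) -> bytes:
    """Rebuild the memory image by replaying a full checkpoint plus its deltas."""
    pages = {}
    for delta in checkpoints:
        pages.update(delta)
    return b"".join(pages[i] for i in sorted(pages))


if __name__ == "__main__":
    memory = bytearray(4 * PAGE_SIZE)
    ckpt = IncrementalCheckpointer()
    full = ckpt.checkpoint(bytes(memory))    # first checkpoint writes every page
    memory[PAGE_SIZE + 10] = 0xFF            # the "application" touches one page
    delta = ckpt.checkpoint(bytes(memory))   # second checkpoint writes only that page
    print(len(full), "pages, then", len(delta), "page")
    assert restore([full, delta]) == bytes(memory)
```

The point of the sketch is only the I/O saving: the second checkpoint carries one page instead of four, and a restart has to replay the full checkpoint followed by every later delta.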
So one of the simple examples I use for testing is a program that just sits there: it prints one, it sleeps one, then it prints two, and then it sleeps one, and so it's just counting. So if I were checkpointing that, I might see it go one, two, three; I take a checkpoint; it keeps going four, five, six; then I say roll back, and it would just pick up again from the three: four, five. This was something that they were using in this original paper to deal with rolling back the non-failed processes in an MPI job. With what was originally done in LAM, and what's currently also done in Open MPI and most of the others, the approach at failure is that you have to have the entire MPI job fail and exit, and then you restart and go back and create all these processes again from scratch. So what they were looking at was improving the efficiency of the restart by having just the one failed process exit and restart, and having the other n minus one do this in-place rollback and not have to exit. This is also helpful when dealing with the batch scheduler. Now, most of them want to kill your entire job if one process exits. If you can get around that one-process problem and continue to use the others without them being destroyed by the job scheduler, then you can continue to use your existing reservation rather than going back to the end of the queue.

We would really like that, because about a third of our cycles are provided by some sort of preemption setup. And for the parallel jobs, if we just need one of the CPUs from that 32-CPU job, we call off the whole thing. It'd be nice if we could give them just another CPU, if that's available, and restart that. It's not necessarily a failed one, but we need to move just that one. And we don't have to do all that crazy disk I/O and everything else; we can just do it from the last checkpoint for that one process we move someplace else.

Right, and that's actually the sort of stuff that BLCR is meant to be good at. With BLCR, in the particular scenario you've described, you wouldn't necessarily have to go back to the last checkpoint. If you need to do it now, then do it now: take a checkpoint right this instant. At the time you make the scheduling decision that you want to preempt that one application node, you would take a checkpoint of it. And BLCR checkpoints to any file descriptor; it doesn't even have to be to disk. So with a little bit of middleware that I don't have, one could conceivably have this preemption take a checkpoint of that one process to a socket that crosses the network to the new node, and restart the process there without ever having to hit the disk.

Oh, that would be nice, because the disk is slow and it'd be two trips across the network.

Right. And then with this in-place rollback you would direct the other n minus one tasks in the MPI job to roll back to... sorry, they wouldn't have to roll back. You wouldn't need to do any rollback for those, given the ability to do this on-demand checkpointing. Because BLCR is in the kernel, it is preemptive by its nature. It doesn't have to wait for a periodic checkpoint to roll around. So for preemption type things, BLCR may have an advantage over what application checkpointing is usually designed to do. I know with some of them you can send SIGUSR1 and it'll take a checkpoint at the end of the current solve or whatever, but those solves may take a significant amount of time. So it's not really all that preemptive in most application-level checkpoints.

Yeah, I would really like to see that. I would love to get my hands on that soon.
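To make the "checkpoint to any file descriptor" point concrete, here is a small Python toy, offered as a sketch under loud assumptions rather than anything BLCR provides. The counting program from the testing example is modeled as a `Counter` object, its state is serialized with `pickle` (a stand-in for a real process image), and a local `socket.socketpair()` plays the role of the network link between the old node and the new node. The names `Counter`, `checkpoint_to`, `restart_from`, and `migrate` are hypothetical, and none of this is the missing middleware discussed above.

```python
import pickle
import socket


class Counter:
    """Stand-in for the application: it just counts upward."""

    def __init__(self, value: int = 0):
        self.value = value

    def step(self) -> int:
        self.value += 1
        return self.value


def checkpoint_to(stream, app: Counter) -> None:
    """Write the application state to any writable binary stream."""
    pickle.dump({"value": app.value}, stream)
    stream.flush()


def restart_from(stream) -> Counter:
    """Rebuild the application from a readable binary stream."""
    state = pickle.load(stream)
    return Counter(state["value"])


def migrate(app: Counter) -> Counter:
    """Stream a checkpoint through a socket and restart from the other end.

    The socket pair is local here; in the scenario described above the two
    ends would sit on different nodes, and the state would never touch disk.
    """
    old_node, new_node = socket.socketpair()
    with old_node, new_node:
        with old_node.makefile("wb") as out:
            checkpoint_to(out, app)
        with new_node.makefile("rb") as inp:
            return restart_from(inp)


if __name__ == "__main__":
    app = Counter()
    for _ in range(3):
        print(app.step())        # the original process counts up to three
    moved = migrate(app)         # on-demand checkpoint, streamed, restarted
    print(moved.step())          # the restarted copy continues from there
```

Because `checkpoint_to` and `restart_from` only assume a stream, the same two functions would work with a regular file opened in binary mode, which mirrors the idea that the destination of the checkpoint is a choice made by whoever supplies the descriptor.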
Once that's available.

Yeah, well, a lot of this is sort of the missing pieces out there: the integration, the middleware bits. We need people out there who are interested enough in seeing these applications, these uses of BLCR, to help with some of that development and some of that testing, because we don't have the resources or the funding to pursue all of those, you know, applying BLCR to this circumstance or that circumstance.

Okay, so how is the project actually run currently?

So how is it run? BLCR is one of several projects that are funded by the U.S. Department of Energy through a parent project called CIFTS, which is an acronym for Coordinated Infrastructure for Fault Tolerant Systems. Again, Google for it; you'll find it. It's a multi-institution project that is coordinated through Argonne National Lab. It includes us at Lawrence Berkeley National Lab, and it includes Indiana University, Ohio State, Oak Ridge National Lab, and the University of Tennessee, Knoxville. I believe I caught everybody. A lot of the university stuff on those is for MPI work. I did say previously that there is money going to Argonne for MPICH-to-BLCR integration; that's part of what their participation is. So that's our source of funding and our collaborators, and it's a DOE interest in the fault tolerance side of things as well as the stuff that we can do beyond that: the preemption, the migration, things like that.

Okay. So Paul, I have to ask this of everybody, since I'm an open source developer guy myself: what do you guys use for version control?

We're still in the ancient days of CVS at this point.

Oh my.

Mainly because it works for our small group. Usually we only have two or three people at LBL who work on the project, and maybe one or two students for a summer or sometimes a longer period of time. With so few people, it's not been worth our time to investigate and make a change. But we recognize that this is just beyond the point of clay tablets. We're in the stone age and we know it. Part of that is also that we do not have a publicly accessible CVS. To keep the lawyers happy, we have a strong definition of what "release" means, and allowing people access to an open source, or sorry, an open revision control repository makes the idea of a release sort of fuzzy.

Is a checkout a release?

Yeah, yeah. But I'm happy to drop snapshots and things like that for people, if there's a team out there that wants to do some development and needs to see what the latest looks like.

So, website, mailing list?

Again, Google is your friend. I think if you enter BLCR you'll probably find it all, but I'll go ahead and rattle off the URL: http://ftg.lbl.gov/checkpoint, all one word, and that's actually a redirect to the longer URL that I don't try to give out. There's a mailing list and some other places to ask questions, documentation, and a download of the current release. All those things should be available from that website, but if someone just wants the email address, that's checkpoint@lbl.gov.

Okay, and if someone does want to actually hack away on the code, should they contact you on the mailing list about getting a snapshot, or should they actually work with the current release?

Well, they should definitely contact me on the mailing list, because I'd like to know, just from my own curiosity, what they're working on, but also to coordinate, and to let them know if someone else has already told me they're working on the same thing, and maybe let them coordinate with each other. But in general it's usually safe to develop stuff with the released CVS version, I mean with the released version, sorry. I'm pretty good at merging stuff together, and the CVS tends to be quite orderly as a result of having so few people working in it.

Okay, well, thanks a lot, Paul. This was a good time, and I definitely liked hearing about this. It's something a number of us should probably be using more often to get more value out of our existing equipment. This show will be up on www.rce-cast.com. You can subscribe to the podcast there and get a feed. We do a show every two weeks, but we will be skipping one this next week as some of us will be gone. So thanks a lot for taking some time out, and we'll have another show in the future.

All right, thanks, Paul.

And thank you both for your interest.

No problem. Thanks a lot.