Welcome to another edition of RCE. Again, this is Brock Palen. You can find us online at rce-cast.com, where you can find links to the blogs, the Twitters, and all that other fun stuff. I also have with me again Jeff Squyres from Cisco Systems, one of the authors of Open MPI. Jeff?

Hey folks. Yeah, it's that traditional post-SC letdown time of the year, where you're just trying to finish out the year and everything takes a little lull. So we've only got one show this month, and it's the last one of the year, but forgive us for that. We'll be back in 2014 with bigger, better, more stuff. In that light, today's show features some of our friends that we ran into at Supercomputing, and we have some cool stuff to talk about. Right, Brock?

Yeah. And this is a dubious distinction for this episode: it's the first time our guests have ever attended via iPad. We record via Skype, and I'm pretty sure we've never had anybody attend on Skype via iPad before, so a minor first for the show.

Well, this is how you get around some of those workplace rules, or not really get around them. I believe they're also on a trailer off-site just to be able to make this call.

So our guests today are Kathryn Mohror and Adam Moody, both of Lawrence Livermore National Laboratory, who are going to talk to us about SCR, the Scalable Checkpoint/Restart library. Kathryn, why don't you take a moment to introduce yourself?

Hello, everybody. I'm Kathryn Mohror. As he said, I'm a researcher at Lawrence Livermore National Laboratory. I work on scalable tools for high-performance computing, for example fault tolerance tools and performance analysis tools.

Adam?

Yeah, so this is Adam Moody, also here at Livermore.
I graduated from Ohio State and then started my career out here in 2004. I was mainly hired to support the computer center, with an MPI background, and through that I also got involved with fault tolerance, which is how SCR, the Scalable Checkpoint/Restart library, came to be.

I've got to say something here, because I have a point to make. I'm at the University of Michigan, and I'm very sorry about a certain football game. I'm sure you're very happy about a certain football game. Most people don't know that the University of Michigan and Ohio State are rivals. He has buckeyes around his neck right now. I'm virtually punching you in the face.

I am a sad panda. That is more politically correct. Yes, thank you.

Okay, let me start off the official part of the podcast. Hey, Adam, Kathryn, why don't you tell us what checkpoint/restart is? Let's start with a simple case, the single-process case. What is the goal of checkpoint/restart, and what is it?

Yeah, so the idea of checkpointing is that if you have a long-running job that could be interrupted, you occasionally want to save the state of that job, so that if it does get interrupted, you can recover from that state. The analogy I like to use is: imagine you're working on a Word document, writing a paper. You occasionally save your work, because if Microsoft crashes, you don't want to rewrite a couple hours' worth of work. Maybe every five minutes you save whatever you've done, and then you can restart from that if you get the blue screen of death.

Now, we don't usually get the blue screen of death in high-performance computing. So what reasons do you have for checkpoint/restart?

Well, in HPC it's the same idea. You don't get the blue screen of death, but only because we're not running Microsoft, right?
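Adam's single-process picture can be sketched in a few lines of Python; the file name and save cadence below are invented purely for illustration:

```python
import json
import os

STATE_FILE = "checkpoint.json"  # illustrative name, not an SCR convention

def run(total_steps, checkpoint_every=5):
    """Do `total_steps` units of work, saving state every few steps."""
    step = 0
    # On startup, resume from the last checkpoint if one exists.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            step = json.load(f)["step"]
    while step < total_steps:
        step += 1  # ... one unit of "real work" ...
        if step % checkpoint_every == 0:
            with open(STATE_FILE, "w") as f:
                # A crash now loses at most checkpoint_every - 1 steps.
                json.dump({"step": step}, f)
    return step
```

If the process dies at step 12, a rerun reloads step 10 from the checkpoint file and redoes only steps 11 and 12, instead of starting over from zero.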
But Linux also fails, and the hardware also fails. The bigger problem is that you're using far more processes. A typical HPC job at a large computer center might use a thousand, ten thousand, a hundred thousand, even a million processes at once, and any of those could fail, either because of software problems or hardware problems. These simulations also tend to run for a couple of weeks at a time, so you're almost guaranteed that something will fail along the way. To combat that, people use checkpoint/restart: each process saves its state periodically, and then you can restart the job in the case of a failure.

So it seems like a really naive way to do this would be to just serialize everything in RAM and write it to disk. What are some of the problems with doing something like that?

Well, primarily the high overhead of reading and writing the checkpoints to the parallel file system. When you have a large number of processes in your job and they're all trying to read or write at the same time, the parallel file system becomes a huge bottleneck, and this overhead is a pretty serious problem. For a large-scale job, it could take on the order of 30 or 40 minutes just to write out a checkpoint, and during that time the entire job is blocked waiting for it to finish, so the machine is not well utilized.

Actually, just one clarification point: it's not dependent on a parallel file system, right? It's writing to any stable storage. Just to clarify that.

Well, the approach that SCR takes is writing to any kind of stable storage, but typically, without a library like SCR, the only storage available is the parallel file system.

Okay, so then maybe you could broaden your definition a bit. I asked Adam specifically about checkpoint/restart in a single-process case, and you've alluded to these big parallel jobs.
There are lots of processes running. How do you apply checkpoint/restart to an HPC job, just to expand that definition a little?

Well, it turns out we actually force a failure every so often, just because we time-share the machine. Applications will run for maybe weeks at a time, but we only give them maybe 12 hours at a time on the machine, so that multiple people can make use of it at once. So at least every 12 hours they have to stop the job, and then in another 12-hour window pick up where it left off. The way people have dealt with that is that each process will open a file, write out whatever state it needs to capture, and close the file, and those files are saved on the parallel file system, because that's where the data can persist between runs.

So why go directly to a parallel file system? A lot of times nodes have a local drive. Why not do a file per process and aggregate them up later?

Well, that is what SCR does. But the problem with doing that naively is that if you lose a compute node, you've lost its checkpoints, unless they're pushed out to some stable storage, or you apply a redundancy scheme, like copying a checkpoint to a different compute node to prevent loss in case of a failure. That's where SCR comes into play: it does all this magic for you, and you just write your checkpoints like you normally would, but you experience much lower overhead.

I see. So one of the issues you're running into might be the actual failure of the node, or even the disk itself, not just this 12-hour reset for time-sharing purposes. And in that case, if you wrote to the local disk, it's effectively gone and you can't get it. Is that what you're saying?

Correct, yes.

Now, what other kinds of failures can occur?

Well, so, the compute nodes. We call them nodes, right?
They're really individual computers, but because "computer" has too many syllables, we say "node" all the time. A node is basically made up of a motherboard, a processor, some RAM, a network card, and maybe a disk on the node, or not. Any of those components could fail. The ones we actually see most often are the power supply on the motherboard. Network cards are also a common failure, and every now and then we'll see some processor failures or some RAM failures. The other big failure mode, though, is software, and the parallel file system.

Oh, right. Yes, the parallel file system also fails, unfortunately.

Now, when you say the software fails, I would imagine it would fail for all processes in the job, like a user error. Are you talking about something else?

Well, each of the different processes is going to be executing with different data, so one reason might be that there's some instability in the computation, and they get a floating-point exception or something like that on one of the ranks and not the others.

So what would happen in that case? They would restart from one of the checkpoints taken previously, right?

Right. The MPI job would abort in that case, and then SCR would restart the job using the most recent checkpoint, which could be stored, or cached, on some node-local storage, or maybe on the parallel file system if the cached copies got corrupted.

Now, you say SCR would restart it. I'm used to checkpoints requiring user intervention to say, hey, restart from a checkpoint. Is this something that's built in? Is it a process that wraps your normal job? What's going on here?

Right.
So when you run your job on an HPC system, you usually use some launch command. For example, on a SLURM system you would use srun, and in the case of SCR you would use our wrapper script, scr_srun. This script does a lot of things, one of which is launching your job for you, noticing if it dies, and restarting it for you if it does. It does other things too, like checking to make sure that all the nodes in your job are still healthy. If any of them has gone down, and you happened to allocate a couple of extra nodes in your job, it can pull in the spare nodes and ignore the failed ones, so you can continue computing. It also makes sure that the last checkpoint taken gets pushed out to the parallel file system, so that the next time you restart your job in a new allocation, you'll have that last checkpoint.

Okay, so we've started to touch on some SCR-specific things. There have been a lot of checkpoint/restart systems in the past, and some are still out there. Give me the top two or three things that make SCR super awesome and cool and different from the others, and why you felt the need to create SCR as opposed to the systems that were already out there.
Well, the motivation really came, I guess, from the MPI background. We had a system we brought in way back in 2007, named Atlas, and like any new system, you encounter all kinds of software failures and hardware failures. We had an application trying to run on that system, and they had a deadline to meet, but the system was just too unreliable to allow them to run. It was taking something like 30 or 40 minutes for them to save a checkpoint to the file system, but they were failing maybe every hour or two, so they were essentially spending all of their time writing checkpoints and not doing any real work. Doing a back-of-the-envelope calculation from the MPI side, that 30 or 40 minutes felt like a really long time to me. It was apparent that if you could somehow store the data in memory on the cluster itself, using the high-speed network rather than writing it to disk, you could checkpoint much faster, and then checkpoint more often, so that you could actually make progress. So what SCR does is store these checkpoints in node-local storage. At that time we didn't even have disks on that cluster, so we were storing in memory. Because of that, we were able to save a checkpoint in 10 seconds instead of the half hour it normally required, and with that, the application could checkpoint maybe every 10 or 15 minutes instead of every four or seven hours, or whatever the code was using before.

So do I just tell SCR what my storage hierarchy is, like where it should do its first checkpoint and where it should move it after? From node memory, to node-local disk, to parallel file system?

Yeah, that's right. Right now, because each system at different centers is configured a little differently. Some have local disks and some don't. Some have RAM disks.
Some don't. Some have parallel file systems, maybe multiple parallel file systems that could be used. Right now, all of that has to be configured into SCR using a configuration file, and then, based on the speed of each device, the library can save checkpoints at different levels at different times.

So does this require administrator intervention to get SCR going, or does it live completely in user space?

It's completely in user space. The only requirement is that it has been ported to support the resource manager on the machine. Currently SCR works, or should work, pretty much out of the box on Linux clusters that use SLURM as the resource manager. In the past year and a half, though, we've been making a lot of porting efforts. It will run on Cray XTs, assuming that RAM disk is enabled, because they don't typically have it enabled, and we've started porting to Blue Gene. We're also happy to help anyone set it up on their systems or for their particular resource manager.

Now, you say this is all user space. How do you grab the process state? Are you really just grabbing the malloc state and the stack state and things like that? Do you capture the program execution stack, or is it just data?

No, this is what we call an application-level checkpointing library. What you're describing is more of a system-level checkpoint method, where the system takes a snapshot of, as you say, the process, and dumps it all out. SCR works by having the application explicitly make its own I/O calls and put out the data it wants to save. This has advantages, because you end up storing a lot less data: you don't have to store the entire memory space, just the data structures the application needs.

I see. So does the application register the data structures it wants saved with you somehow?
Or otherwise indicate, say, whenever a checkpoint occurs, I need you to save this page or this block of memory?

It's really all POSIX I/O based. The application will open whatever file it wants to create, write its data to that file, and close it. What it has to do is ask SCR where it should open the file, and SCR will direct it either to, say, the RAM disk on the node, or to a local SSD, or maybe to the parallel file system. The application opens the file in that place, writes all of its data, closes the file, and then tells SCR when it's done. So SCR keeps track of all the files written by all the processes, and it also applies whatever redundancy scheme to the data once all of the processes have finished writing.

I see. So the notification the application gets, what SCR provides there is a hook for "it's time to checkpoint," "it's time to restore," and things like that.

Yeah, exactly.

Okay. Now, on the website for SCR, you also describe something called multi-level checkpointing. Can you describe that?

Well, that's a general term for the kind of library that SCR is. It refers to the idea that you take different levels of checkpoints, which have different costs associated with them and different levels of reliability. For example, a level-one checkpoint might write just to local storage, just to the RAM disk, and that's really cheap; but as we said before, if you lose that compute node, you've now lost that checkpoint. So then we move up to what we might call a level-two checkpoint, which applies a redundancy scheme to that local checkpoint: you might copy it over to a partner node to prevent loss. And level three could be storing it out to the parallel file system, for the event that there's some sort of catastrophic loss, or maybe just the end of your allocation.

Wait, wait, wait.
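The flow just described, ask the library where to put the file, write it with ordinary POSIX I/O, then report completion so redundancy can be applied, can be mocked up as follows. The class and method names here are invented Python stand-ins for the idea, not SCR's actual C API:

```python
import os

class ScrLike:
    """Toy stand-in for an SCR-style library (invented names, not the real API)."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir   # e.g. RAM disk or a local SSD
        self.registered = []         # files written during the current checkpoint

    def route_file(self, name):
        # Direct the application to the fastest storage tier for this file.
        path = os.path.join(self.cache_dir, name)
        self.registered.append(path)
        return path

    def complete_checkpoint(self):
        # In real SCR this call is collective: once every rank reports done,
        # the library applies a redundancy scheme to the registered files.
        done, self.registered = self.registered, []
        return done

# The application writes its own data structures with plain POSIX I/O.
scr = ScrLike(cache_dir=".")
path = scr.route_file("rank_0.ckpt")
with open(path, "wb") as f:
    f.write(b"application state")   # only the data the app needs, not full memory
saved = scr.complete_checkpoint()
```

The key point is that the library never interprets the data; it only learns where each file lives so it can replicate, scavenge, or drain it later.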
You said copy to a partner node. So SCR can be shuffling the local checkpoints among the nodes while you continue on with your next step, so you can lose a single node but not have to hit the parallel file system yet?

That's right, except that currently the job is synchronized and stopped while this happens. It's not a very long operation; it only takes maybe 10 seconds, if you're using memory, to save a checkpoint of say 600 megabytes per process, and that includes the time to shuffle the data to another node and save it there. If you're using SSDs, it might take more like three to five minutes. So it's a pretty quick process, and applications tend not to do it too often, maybe every 10 or 15 minutes, so the cost isn't too high.

So is SCR itself an MPI application that's doing that communication memory-to-memory? And how does it interact with storage? Does it use MPI to read from one disk and write to another? In-memory I/O and disk I/O are different; how does it actually move stuff around?

Yeah, so SCR is really written on top of MPI and POSIX I/O; those are the two interfaces it needs. As far as getting data to a remote disk, all the operations are synchronous, so all the processes essentially enter SCR function calls at the same time. There are about six SCR API calls, and five of them are collective. All the processes enter the call at the same time, and then we use MPI to exchange data between nodes.

Okay, so everything is function calls; there aren't extra SCR processes being started up. It's really a library: you say "I'm going to save this," and "I'm writing it out," and SCR captures that operation, so it's aware of what the file is called and where it is, right?
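To put numbers on why these fast, frequent checkpoints matter, here is a rough model built from figures mentioned in the conversation (roughly 35-minute parallel-file-system writes versus roughly 10-second in-memory ones). The interval formula is Young's classic approximation, which the guests don't mention by name, and the model ignores work lost after failures, so treat it as a sketch:

```python
import math

def young_interval(write_min, mtbf_min):
    """Young's approximation for a near-optimal checkpoint interval (minutes)."""
    return math.sqrt(2.0 * write_min * mtbf_min)

def overhead(write_min, interval_min):
    """Fraction of each compute/checkpoint cycle spent writing (ignores rework)."""
    return write_min / (interval_min + write_min)

# Parallel file system: ~35-minute writes, failures every couple of hours.
pfs = overhead(35, young_interval(35, 120))
# SCR-style in-memory checkpoints: ~10-second writes every ~10 minutes.
scr_mem = overhead(10 / 60.0, 10)
```

With these inputs, the file-system case spends over a quarter of its wall time writing checkpoints, while the in-memory case spends well under two percent.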
Yeah. Well, there are sort of two components. There's the library component, which is what we've been describing, and there's also the set of scripts that run outside of the job, to manage restarting the job if that's necessary, or scavenging the last checkpoint from node-local storage to the parallel file system at the end of the job, if that's needed. We also have a capability to write data asynchronously to the parallel file system: some of the checkpoints that we cache locally on the cluster we want to drain down to the parallel file system, and we have processes that can run in the background to do that drain.

Now, what you've described is some asynchronous behavior. Two things come to mind offhand. Number one, you have to receive some kind of incoming control message that says it's now time to do a checkpoint, and all the MPI processes act on it more or less simultaneously. And number two, you've got this asynchronous writing to the file system happening. How do you do these kinds of things? Do you assume that the MPI is thread-multiple aware, for example?

No, no, that's not how it works.
The checkpointing done by the application is globally synchronous in the SCR model. All the processes stop, write their checkpoints wherever they're directed by SCR, to RAM disk or whatever storage, and then they complete their checkpoints with an SCR API call. After that point, if you want to do the asynchronous transfer to the parallel file system, there are separate daemon processes, which are not part of the MPI application, that pull the checkpoints from RAM disk and slowly drain them to the parallel file system in the background. There's a signal sent from the library to these daemon processes, and one of the ways we do that is through the file system itself: the library writes out a small file that the daemon looks for periodically, and when the daemon sees it, it has its instructions to flush the data down to the parallel file system.

One other point, on how the application knows when it should checkpoint: one of the six API calls informs the application whether it should checkpoint. The idea is that whenever the application gets to a point where it's easy for the programmer to checkpoint, he can make one of these calls to determine whether SCR thinks he should go ahead and take a checkpoint, based on the performance of the storage and the failure rate of the machine.

So what are some of the things SCR can't handle, like open file handles, things like that? What state do you really need to make sure your application is in before you say, okay,
I'm going to checkpoint now?

Well, that's up to the application itself. Since SCR is not doing system-level checkpointing, things like sockets or open file handles don't matter, in a sense. What matters is the data that ends up in the checkpoint files the application writes.

Yeah, from the application side there's not much it can't handle, because there's a set of semantics the application writer has to adhere to, and as long as he does that, SCR will recover the data fine. The bigger issue we've run into is that the real benefit of SCR is being able to save some of these checkpoints locally on the cluster, which is way faster, typically a hundred to a thousand times faster than writing them down to the parallel file system. But if you're always forced to restart from the parallel file system, SCR doesn't gain you much, and sometimes that happens. One of the main culprits is the parallel file system itself being faulty, so that we can't flush one of the checkpoints down to it. That creates a problem where you can't restart from the cache very often; you always have to restart from something on the parallel file system. It's also bad at squirrels.

What? I'm reminded of the movie Up at this point. What?
So every now and then, unfortunately, the computer center will go down due to a power glitch; sometimes squirrels play on the power lines. Whenever the whole center goes down, SCR doesn't recover from that.

Okay. So I'm curious how you at LLNL actually handle this, then, if you're saying restoring from the parallel file system doesn't work very well. You also talked about getting evicted every 12 hours because it's a time-shared system, and that's definitely the kind of environment I operate in. How do you handle the case where you don't necessarily know that you'll be put on the same number of processors, or the same set of processors, for those local checkpoints?

Right. When you lose your allocation and have to restart in a new one, you always have to restart from the parallel file system. On our systems, even if you got the exact same set of compute nodes in your next allocation, they would have been wiped clean after your job terminated, for security and privacy reasons. We're looking at adding optimizations to the resource manager so that between jobs, if you happen to get the same set of nodes, which actually turns up a lot in practice, you could restart from the cache rather than the parallel file system. But for that you have to start modifying the resource manager.

Yeah, so that's definitely future work. Other possibilities you could add, again with resource manager support: if you know in advance which nodes you're going to run on, you could start preloading the data from the file system to some sort of cache before the job starts, and similarly, at the end of the job, you could drain the data from the nodes in the background while the next job is already running.

So another question I would have is how intrusive the SCR library actually is. You said you need support from the resource manager.
So, you know, you said you need support from the resource manager Well, what if my code is something that I both run, you know on a large HPC center or but I may run on like say a PBS system I notice you didn't mention that I run PBS. Please make it work on PBS and then What if I also kind of like you know, I do prototyping or something like that just like on a desktop or a workstation Do I just need to skip all my checkpointing or like how is it active? It's on a system that doesn't technically support SCR S run Well, it just wouldn't work unless we did the port so SCR S run and SCR AP run if you're running on a cray Requires certain information from the resource manager during the run of the job And if it can't get it, it's most likely the scripts would fail Yeah, a lot of that you can sort of hack around it reads a lot of that information through environment variables And so if you just know which environment variables to set you can sort of fake it so that it looks like it's on an allocation That's as far as the library goes, which is probably what you would be prototyping on your desktop Anyway is is actually just getting the application to work correctly without the scripts. Yeah, that makes sense Yeah, the scripts are the most dependent on the the resource manager Okay, so if I don't want SCR to be doing all that's like local remote like hierarchy of stuff If I literally just want to be able to run it on my desktop and get checkpoints every now and then it it will still work Yeah, it should still work I mean you wouldn't really want to do that for real checkpointing Because you you probably aren't failing enough anyway for it to to be useful But you know as far as development if you're just trying to develop your application and and encode SCR into your application Yeah, you should be able to get that to work Okay, so if I write something using SCR, I'm no longer stuck on I can only run it on a cluster that has slurm or cray You know AP run. 
I can actually still take that code other places and it will still work; I just won't get the benefits that SCR provides.

Right, the benefits that the scripts provide.

Great. Going back before we get too far away, I want to ask a little bit about the I/O that you do. It sounds like the wins you get from SCR can be very crassly categorized as locality, right? When you write to a RAM disk, that's super local and super fast. Then you might write to an adjacent node's RAM, and that's a little farther away, so it's a little slower, but still pretty fast. The farthest away is the remote file system, and that can take a really long time. So in this sense it's just classic locality issues, with potentially very large amounts of data. One of the other traditional optimizations for large data going to remote locations is I/O coordination: okay, you five write now, then you five write, then you five write, because that's actually more performant than having all 20 of them write at the same time. Does SCR take advantage of any of that kind of stuff?

Well, yeah, we do when we write a checkpoint down to the parallel file system. All of those checkpoints are cached on the local storage, and we don't impose any kind of coordination on those. When we're writing to local storage
So when we're writing to local storage We assume that that's scalable And then when writing down to the parallel file system We do limit the number of writers that can can write at a time And that is up to the user to to set so it defaults to something reasonable But then the person running the job can always tweak that based on his system Okay, and then a follow-on to that is Parallel file systems tend to hide this but I don't suppose that SCR is networked apology aware at all Right, so it's like, oh, I know I'm close to you know The actual IO nodes or I'm far from the IO nodes and you can do a balance of these kinds of things Simply because you do have the distinction of near versus far already so near Ramdisk near local storage far Remote storage things like that. I was just wondering how deep does that go? Are you actually aware of the topology of the network? We haven't we haven't coded any of that logic and yet at least for IO purposes It's not hard to do though Like we've set things up so that it's possible to do that, but we haven't run on any systems where we've needed to Sort of a more interesting thing to consider is hardware topology. So Most of the time you'll just see a single node failure on a lot of our clusters Occasionally you'll get multiple nodes fail at once But even when multiple nodes fail it's often because one piece of hardware failed That all of those nodes depended on like a power supply or a switch a network switch something like that and so you do have to code some hardware topology or system architecture topology Into SCR for it to really be effective So when you're doing the node local, I assume that's file per process or file per node But then you kind of like do the drain off to the file system on these really large systems You may have you know, 10,000 nodes Do you actually end up with are you literally just copying a 10,000 files or does it kind of in that? Drain merge them back together with like MPIO. 
So I have one nice large Thing that doesn't beat up a metadata server Well the applications that SCR support Do write a file per process and the reason behind that is that's what our codes here at Livermore do typically It it has historically had the best performance out to the parallel file system for regular checkpoint restart so we have been doing some research into merging and compressing Checkpoints before pushing them out to the parallel file system But that's still in research prototype state. We have had much better IO performance using that model Right. Yeah, so it's right now We will write the 10,000 files out to the parallel file system and we have some options in there that you can combine Files into into a container and and we're we're doing some further research on compressing while also doing that combination Yeah, the pressing stuff I could see like, you know deduplicating ghost zones and stuff like that or something like that I mean, there are a lot of duplicated data sometimes in distributed memory applications Right, we haven't looked at that aspect yet We've been more looking at trying to put data that might be similar Across processes together. So for example each process might have a chunk of the temperature Array and if we merge or concatenate the different bits of the temperature array from say a group of 10 processes and then compress them We might get better compression than if we simply Compressed a single checkpoint from a single process. So that's the approach we're taking right now Yeah, you can imagine let's say like a weather modeling code where you've divided the space Above the US into two by two or two dimensional cells, right? 
And so processes that are neighbors in that grid are likely to have temperatures that are similar. The states of Washington and Oregon are likely to be similar, much more similar than Washington and Florida.

Now, along those lines, do you attempt any kind of data deduplication, again along the lines of optimizing far, remote file storage?

No, we haven't looked at data deduplication at all yet. We're hoping the application writers aren't writing a lot of redundant data to begin with, but we haven't looked at that.

Are there other organizations who use SCR, or is this pretty much a Livermore-specific technology?

I wouldn't say it's Livermore-specific, although we do use it here. I know it's been used over at Los Alamos; I worked on the port and installed it on Cielo and Cielito. Oh, and we have a mailing list, and we get...

I'm sorry, what are those? Cie-what?

Oh, they're Cray machines at Los Alamos, Cielo and Cielito. And on our mailing list we periodically get users asking about installation and running issues, so they must be doing something with SCR. It's all open source, of course, so people can download it and use it, and we don't know who all might be using it. We have gotten emails from users at different sites. Currently it seems to be mostly researchers looking at it, probably researching fault tolerance themselves. We're not yet aware of any other large-scale users, but we're happy to help people get it set up on their systems if they'd like to give it a whirl.

Have you had any requests for alternatives? You said right now it's memory or POSIX, but you're also doing a file per process, which looks very much like a bunch of objects I could stick into some sort of object store. How hard would it be to make one of those hierarchy levels stick things off into my object archive or something like that?
That is some research that we have going on right now with Argonne National Lab. In this case our object store we're calling containers: you would put checkpoints into containers, and then behind the scenes an I/O forwarding layer would move these containers between different levels of the storage hierarchy. So we're planning the system to work on future multi-tiered storage systems, where there would be maybe no local storage, and then burst buffers, and then finally maybe some other storage and the parallel file system. Right, because I could see someone wanting to say, like, every fifth checkpoint, because it contains more data I actually want to analyze, I could drain it straight off to an archive, some other type of archive. And having pluggable I/O modules, I can see people finding that useful. Right, those are things we're looking at, and we're also looking at trying to handle all different kinds of files. You know, for example, visualization files that the application might write and wants to go to the parallel file system at some point, but doesn't want to incur the overhead of writing them. Yeah, we're trying to extend the API to do sort of exactly what you're talking about, Brock, which is to be able to tag some output as having different properties and then treat it differently based on those properties. And like Catherine was saying, one of the things we want to do is handle any kind of large output set, because any of those are going to be expensive writing to the parallel file system. And so what we'd like to be able to do is cache that in fast storage, apply a redundancy scheme so that if a node fails we can recover the data, and then use the asynchronous transfer properties that we have in SCR to actually move that data from the cache down to the permanent storage where it's meant to go. Has anyone ever thought about doing the inverse of this? Like, say I have some data-intensive input, like a genomic set or something
like that, and I actually kind of want to stage different things to the local storage of each node in the background as the rest of the simulation runs? Right, there are people looking at that aspect as well. That's a different research group here at the lab, but it is an important problem, because that can be a huge overhead. It's also difficult when large-scale applications start up and they use shared libraries, because then they're pulling in a lot of files as well. And if you could stage those on the compute nodes or a burst buffer or whatever, you could save a lot of time. So you've talked about a bunch of forward-looking stuff: we're doing this kind of research, we're doing that kind of research. What other kinds of things have we not covered that you guys are working on? What's coming up in the future for SCR? Have we hit everything? I'm trying to squirrel-proof it. Yeah, that would be a very innovative feature, I think, worth a lot of papers. One other feature we're looking at implementing is something called CRUISE. This is a user-level file system that we've worked on with a student from Ohio State. It intercepts all of the POSIX I/O calls so that we can handle the data rather than running it through a file system. And so, for example, we're able to use this on systems that don't have RAM disk, which is a Linux way of implementing a file system in memory. For systems that don't have that, we can implement our own by just allocating a region of memory, and then on the write call we copy data from the user's buffer into this buffer of memory that we've allocated. The other thing that'll let us do is compress data on the fly. So during the write call we can compress data and then write it, in order to try to save space. And we can also spill over to other storage devices.
So for a node that has a limited amount of memory, we can write as much as we can to the memory and then, say, spill over to a local disk like an SSD, or even the parallel file system. So that piece of software will allow us to run on more systems and support more applications. And it's been released and can be downloaded and used currently, right? Yeah, it's really its own standalone piece of software, so we expect other people to make use of CRUISE outside of SCR, although it was implemented with SCR in mind. So what are some of the strangest uses you've seen of SCR? I don't know if we've seen anybody using it strangely, but there were a couple of interesting counterintuitive things we've run into while people were using it. I mean, one of the sort of funny stories was while we were implementing it. You know, we were focused on making sure that your job just restarts and always runs in the case of failure. Well, we forgot about the case where you actually want to kill the job because you want it to finish, and so we couldn't stop the job.
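The spill-over behavior described for CRUISE, writing into memory until it fills and sending the overflow to slower storage, can be sketched in a few lines. This is a toy illustration, not CRUISE's code; the capacity and file path are invented for the example:

```python
import os
import tempfile

# Toy spill-over writer: bytes land in a fixed-size memory buffer
# first, and anything past its capacity spills to a file on slower
# storage (an SSD or the parallel file system in the scenario above).
class SpillBuffer:
    def __init__(self, mem_capacity, spill_path):
        self.mem = bytearray()
        self.capacity = mem_capacity
        self.spill = open(spill_path, "wb")

    def write(self, data):
        room = max(0, self.capacity - len(self.mem))
        self.mem += data[:room]            # fast path: stays in memory
        if len(data) > room:
            self.spill.write(data[room:])  # overflow goes to disk

    def close(self):
        self.spill.close()

path = os.path.join(tempfile.gettempdir(), "spill_demo.bin")
buf = SpillBuffer(mem_capacity=8, spill_path=path)
buf.write(b"checkpoint")  # 10 bytes: 8 kept in memory, 2 spilled
buf.close()

print(bytes(buf.mem))           # b'checkpoi'
print(open(path, "rb").read())  # b'nt'
```

The real library does this underneath intercepted POSIX calls, so the application keeps issuing ordinary writes and never sees which tier the bytes landed in.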
We would run a test, and it would run to its completion, and then our scripts would just automatically restart it. And it didn't matter how many times you would try to cancel it, say with Ctrl-C or something; it just kept going. It was sort of like a zombie process, like Night of the Walking Dead or something; you just couldn't kill the job. So we had to go back and modify that so you could actually kill the job on purpose. Those kinds of things we didn't think about. There's another case that's sort of counterintuitive. We started buying SSDs for some of our clusters. Well, it turns out that SSDs increased the failure rate of the nodes, because it's one more component that you're adding to your node; it has its own failure properties, and so the failure rate of the machine actually increased once you added the SSDs, which you would think would be a bad thing for fault tolerance. But because of the increased speed that you can checkpoint at, rather than writing to the parallel file system, you could use the SSDs and still make better use of the machine, even though the failure rate of the machine had increased, which was a bit counterintuitive. There was another example of that where we found that, let's say you don't have SSDs.
You're only using memory. Well, some applications use so much memory that they can't fit a checkpoint in memory along with the working set for the application. But what they might be able to do is spread the job out and use more nodes. So they spread their working set out among more nodes and then free up enough memory so they can save a checkpoint. But by using more nodes, you're increasing the failure rate of the job, because there are more nodes that are likely to fail, and you might also even slow down the job. Some jobs don't scale very well as you make them too big, so they might actually run more slowly if you use more nodes. But even in those cases, we found that you can increase the efficiency of the machine: it can be running on more nodes, running more slowly, and failing more often, but you still get your answer done quicker by using something like SCR. That's pretty sweet. So here's a question I like to ask all software developers, just because I like to hear what the different answers are and the different reasons why they answer the way they answer. What version control system do you guys use for developing SCR, and why? Yeah, we figured that was your question, Jeff. Well, we're using Git, and I think just because we like it. We had been using SVN and transferred to Git, I'd say, early this year. Yeah, I mean, the real reason we switched over to Git was because, especially being at the lab, we maintain multiple repos, or multiple repositories, of the source code. So we have one internally that we use for development, and then we have one in the outside world on SourceForge right now that we like to push to, so that people can get a copy of it there, and we've also put one on GitHub. And so we really like Git because it allows us to easily manage the multiple repos in different locations, which is something that we couldn't easily do with SVN. Okay, so I think that's everything we had, guys.
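The counterintuitive tradeoffs Adam describes, where faster checkpoints can beat a higher failure rate, can be sanity-checked with a back-of-the-envelope model. This sketch uses the standard first-order estimate from Young's optimal checkpoint interval, where overhead is roughly sqrt(2 * checkpoint_cost / MTBF); all the numbers are invented for illustration, not Livermore measurements:

```python
import math

# First-order efficiency estimate at Young's optimal checkpoint
# interval: the fraction of machine time lost to checkpointing and
# recomputation is about sqrt(2 * checkpoint_cost / MTBF).
def efficiency(checkpoint_cost_s, mtbf_s):
    return 1.0 - math.sqrt(2.0 * checkpoint_cost_s / mtbf_s)

# No SSDs: checkpoints go to the parallel file system (slow, say 10
# minutes), but fewer components means a longer machine MTBF (24 h).
no_ssd = efficiency(600.0, 24 * 3600.0)

# With SSDs: much faster checkpoints (say 30 s), even though the
# extra hardware shortens the machine MTBF (18 h).
with_ssd = efficiency(30.0, 18 * 3600.0)

print(f"no SSD: {no_ssd:.3f}, with SSD: {with_ssd:.3f}")
```

With these made-up numbers the SSD configuration wins despite its shorter MTBF, which matches the counterintuitive result described in the conversation: cheap checkpoints buy back more time than the extra failures cost.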
Thank you very much for your time. And where can people find SCR? Shoot, I don't remember. Let's see, I'd have to log in, hang on... So our main web page is at computation-rnd.llnl.gov/SCR, and from there you can find a description of SCR and all the research directions we have going on, and all the software that is available for download, and also links to our mailing lists and all kinds of good information. Okay, guys, thank you very much for your time. Thank you. Thank you.