Welcome to another edition of RCE. This is your host, Brock Palen, and today we're going to be talking about cluster building. I have three people with us. First, my co-host Jeff Squyres from Cisco.

Greetings, this is Jeff Squyres. I work on Open MPI; you can find me on open-mpi.org.

And we have our two other guests. First, Jeff Layton from Dell.

How you doing, Brock? My name is Jeff Layton. I'm what we call the enterprise technologist for HPC at Dell. Rather than spend an hour explaining what an enterprise technologist is, we'll just go ahead and move on.

That's okay. And then we have Doug Eadline from ClusterMonkey.

Hi. ClusterMonkey is an HPC site, clustermonkey.net. Some of you may know me from Linux Magazine, where I write an HPC column every week plus some articles, and I also consider myself a general-purpose cluster jock.

Okay, well, thanks a lot for taking some time out with us, guys. What we're speaking about today is building clusters, and I think we're going to stick to the distributed memory, pizza-box, network, MPI kind of cluster. So first off, we want to discuss what kind of requirements are necessary to look at when specking out a cluster. Why don't we make this an open forum? Jeff Layton, you work with customers quite a bit, people who want to buy clusters, and Doug, you write a lot about this and have a lot of experience working with people who are doing this stuff in the real world. So why don't you both chime in: what kind of requirements do you see people bring, what are the pitfalls, the requirements people forget about, the gotchas?

Good, I'm going to jump in quick because I'm not shy. I see a range of requirements, and I'll tell you what I think are good requirements and then bad requirements. The good requirements are: "Hey, we're looking at HPC, these are the applications we're targeting, we want to talk to you about ways to optimize the system for a limited amount of cash." That's perfectly great; let's go in, talk to them, and figure out what's going on. The bad ones are where we get a laundry list of parts and "we want a quote." Then it becomes Dell versus IBM versus HP, who's going to give them the lowest price at that particular moment, and that's awful. From a vendor perspective, that's terrible. In general I think most people are pretty good, but we do get the "here's a laundry list, give me a quote."

To get into what I'm seeing right now, my personal bugaboo is HPC storage. We see just awful, terrible requirements for HPC storage. People are saying, "I want 400 terabytes and I have to have it now," and they don't talk about what kind of storage, why they need it, how they're going to manage it, how they're going to back it up. No discussion about any of that, and then it becomes a tar baby to get into. Anyway, that's what I'm seeing right now. Doug, what kind of stuff do people talk to you about?

Well, that's an interesting question. I've run into people where I explain what I do, you know, I come at it from clustering, and they go, "Well, what could I do with that?"
And for a lot of people the answer is: well, not much, because they're not in an HPC area, which is something like engineering, or science, or bioinformatics, or chemistry, or physics. The areas where HPC seems to be a win are for people who need to solve numeric-type problems, usually large-scale problems, and also non-numeric bioinformatics-type problems: things where the problem doesn't quite fit on your workstation, and where there are things that will supplement your research and so forth.

To piggyback on what Jeff said: at one point in my career I would go in and help people install HPC clusters, and I found the best approach was not to come in and lead with hardware, but to sit down with a pencil and paper and say, what is it you want to do, what type of applications do you want to run? And look at a couple of things: the type of processor that would be needed, the interconnect, and then the storage. As Jeff mentioned, storage is often the forgotten uncle in all of this. It's kind of, "We'll just use NFS, right?" And there's been more than one situation where they have the nice shiny cluster in there and the non-parallel nature of NFS comes back to bite them, and they basically have processors waiting on NFS, waiting on storage. So the biggest thing is to start by really getting a good grasp of what it is you want to do, what you want to accomplish. Again, it starts with application areas, and maybe we'll get a little more into that later. By the 80/20 rule, probably about 80 percent of cluster users use 20 percent of the top applications, so like I said, we can talk a bit about that later.

All right, that's good. Let me roll this back around to what Jeff was saying earlier. Jeff Layton, you said the worst kind of thing you can see is when a customer just gives you a laundry list of equipment, because it turns into a price war. But why exactly is that bad? Is it that they haven't thought through their requirements and they're just going for the latest, sexiest processor? Why exactly is that bad?

Well, you're absolutely right. It may not be the best idea because we may know of some trick or some optimization that lets you get away from that set of requirements. For example, we may know, "Hey, we've seen this with such-and-such application; if you go this route, we've seen very good performance," or very good performance per dollar per watt, or however you want to measure it. What a laundry list does is give up the freedom for the vendor to innovate. And that's any vendor; I'm not just speaking about Dell. I'm sure all the vendors have really great technical people who can come in and talk about options. So if you don't do that, you've painted yourself into a corner and you're going to get what you get.

So let's go ahead and talk about some of those options that are available. Probably the first thing many people think about when they want a cluster is that they have a compute problem, so they start thinking about the nodes right away.
Once they actually have the requirements on paper, should they start looking at nodes right away, or should they look at something like the network first?

Well, that's a good point, because ultimately, unless you're maybe the NSA or some government agency where, excuse me, you don't have a budget constraint, everybody's got a budget constraint. So the issue is: I need to optimize my hardware to solve my problem the best. What's going to give me the best performance? In many cases it can be a nodes-versus-network situation. You may have a set of problems and find that the standard on-board Ethernet works fine for your cluster, and you can invest more of your budget in nodes. Or you may find, and this is more common now with multicore, that you really need to use something like InfiniBand or 10 GigE, and so you need to invest some of your budget in that direction. It's really about finding the balance that's going to give you the best price/performance.

There can also be an issue with the node itself, in terms of: do I want to go with as dense a node as I can, a fat node with lots of processors and lots of memory, or am I going to be looking at spreading things out a little more because it just works better that way? Those are examples of the homework you need to do before you sit down and start ordering hardware. The thing you don't want to do is say, "I want the hardware that gives me X number of teraflops based on the HPL Top500 benchmark," because unless you're doing that type of computing, that benchmark isn't going to help you very much in your design criteria.

Doug, say it isn't so, that the HPL benchmark does not reflect real performance.

You didn't hear it from me.

Okay. No sarcasm in that comment, none at all. But my pet peeve is: when you make your statement that you're the ninth fastest computer in the world, also say "at running this benchmark," and then I'll say good.

But there was a time when it was "if we build it, they will come": we want to get on the Top500 list, here's the list of hardware we want. Which to me is the wrong way to go about building a cluster.

Absolutely. You have to ask the questions, like you said: start with what are you trying to do, and what are your applications? Another thing, too, is that clusters have been around a long time, so people are now coming around for their second, third, fourth round of clusters. One thing I always like to ask is: hey, have you been running clusters? If so, what are the problems you've seen, and what are the great things you've seen? You find out what the person likes and dislikes, because that rolls into the requirements as well. I'm going to get into a rathole real quick here, but the one I've seen that really creates problems is when you start talking about cluster management tools. That becomes a religious argument, like arguing which beer is better. You can argue which one's better on some basis, but at some point it stops being that kind of argument and becomes "I know this tool, so I'm very comfortable with it, so I'm going to stick with that," and I think that's a perfectly valid comment. So I think those kinds of comments in RFPs, "this has been our experience with clusters, so this is what we're looking for in our next cluster," are absolutely appropriate to put in there.
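A rough back-of-the-envelope sketch of the nodes-versus-network tradeoff Doug describes a little earlier in this exchange. All of the figures here (budget, per-node cost, per-node interconnect cost) are made-up placeholders rather than numbers from the conversation; the point is only that a fixed budget buys fewer nodes once a high-speed interconnect is added, and whether that is worth it depends on how communication-bound the target applications are.

```python
# Hypothetical numbers only -- plug in real quotes for your own comparison.
budget = 200_000.0          # total hardware budget ($), assumed
node_cost = 3_000.0         # cost per compute node ($), assumed
ib_cost_per_node = 1_000.0  # HCA + switch-port cost per node ($), assumed

nodes_gige = int(budget // node_cost)
nodes_ib = int(budget // (node_cost + ib_cost_per_node))

print(f"On-board GigE only : {nodes_gige} nodes")
print(f"With InfiniBand    : {nodes_ib} nodes")
print(f"Nodes given up for the faster interconnect: {nodes_gige - nodes_ib}")
# If the application scales poorly over GigE (many small messages, tight
# synchronization), the smaller InfiniBand machine can still finish jobs
# sooner; if it is embarrassingly parallel, the extra GigE nodes usually win.
```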
All right, good advice. So actually, rolling right into that, I'm going to ask about the hardware side of cluster management a little bit: blades versus pizza boxes. What are the advantages and disadvantages of both in HPC environments?

Well, what I've been seeing lately is that we may be at somewhat of a threshold, where the price premium that blades had is getting very close to the cost of doing pizza boxes. Given the choice between the two, blades do have an advantage in that they normally have shared cooling, power, etc., which makes them more of a green solution: they're usually more efficient on power, the cooling is usually a little more efficient because there aren't as many small fans running at very high speeds, and of course there are usually some built-in management capabilities in the blade chassis itself. Now, that said, one of the reasons people like pizza boxes is flexibility: whatever I can fit in there, I can fit in there, and blades cut down on some of that. So from a cost perspective blades are changing a bit, getting a little cheaper, and ultimately, as Jeff Layton alluded to, it's really what works for you, what kind of situation you have. If you need maximum flexibility, and at some point you're going to want to stick another hard drive in these things, maybe you'd better go with the pizza box solution. Now, Jeff Layton probably has some better insight than I do on that.

I don't know about better, but I agree: it all goes back to the application. Like you said, trying to pop in an extra couple of hard drives: blades usually don't have that many drive slots, and there are some applications that need a lot of local I/O where centralized high-speed I/O doesn't work, so you need a lot of local disk. Your application is going to start to drive you toward one configuration or the other. But yeah, I agree with Doug; I think blades are at a tipping point where they're going to become more popular. And there is the rip-and-replace idea with blades: you keep the network infrastructure there and you just put in new blades, and there is something to that. It also lets you mix and match within one chassis, which makes things a little bit easier, and some people like that. On the other hand, and Brock, I'll tease you a little bit since you're at a university: some of the universities actually like the pizza box approach, because after the cluster is kind of done with, let's say at three years, they like to take those nodes and pass them out: throw these over to the HR group, or over to accounting, which just needs a server to store some database stuff. So I've seen repurposing enter into it. But there's no "less filling / tastes great" kind of argument of blades versus pizza boxes; to some degree it just depends on the situation.

Well, actually, you bring up how we distribute stuff afterward. For us, it's actually hardware coming in
that's more of a problem: our funding models are weird. So far we've been talking about building a big monolithic system. We have situations where people come in with, say, 30 grand in a hardware grant. Well, that's not a full blade chassis, so they don't want to pay for the full blade chassis, and with pizza boxes you can get down to a very small per-unit price when you're working inside that funding box.

Yeah, no, absolutely, that's a great point; I forgot about that. I have seen one university, I don't want to name them, where what they're doing to counteract that is the centralized IT group buys all the chassis and the network, and then the faculty just buy the blades as their funding comes up. But the problem with that approach is you've got to plan for it, and that's not always easy. But no, absolutely, Brock, I agree with you. I forgot about that point; that's an absolutely great point.

So what about cooling these things? Blades are awfully dense, awfully hot; there's getting power to a rack. How do you see this moving in the future?

I guess I'll go ahead and jump in first and give you a couple of perspectives. We've seen machine rooms all over the place, from the guys where the cooling air goes maybe a foot up out of the vent and that's about it, or they're blowing fans on it, all the way up to guys like Mark Seager, and I'll call out his name because he's such a luminary in the HPC field, who's probably got the most gorgeous machine room on the planet, and it's going to be that way for the next ten years; he can handle anything that anybody's going to do. So we've seen everything, end to end. But in general, the problem with blades, like you pointed out, is density. A lot of facilities can't cool them; they can't get enough cooling air, they don't have enough pressure or volume to get it up to the top blades. In that case, density doesn't buy you anything. At the same time, I've seen a number of machine rooms that are just screaming for things like cold aisle containment or hot aisle containment, where just by enclosing one aisle or the other they can solve so many other problems, and it's not that expensive. So I think that's a good place to start, looking at aisle containment. Then I think we'll take the next step in density and go to these massively dense systems, blades, and then we're probably going to have to start talking about water cooling. Companies like APC and Liebert have some nice self-contained top-of-rack units to cool the rack, same thing with rear-of-door units, and APC has some InRow coolers that are pretty nifty. I've seen water cooling making a resurgence just because people already have it: they turned it off years ago when the old Crays went away, but it still works, so let's take advantage of it.

Yeah, I see things. I think the, what they refer to as the rear-door heat exchanger type of thing,
I think that's going to become a nice feature. And if they have chilled water, they can actually use these things to cool other equipment, in the data center or outside it. What I'm talking about is that many of the data centers, for some of the people I've talked to, just don't have the room or power to put in another cluster. So you may have some guy in a lab who wants to do some protein folding and wants to put, say, 20 nodes in that lab. Well, he can't; he doesn't have the power and cooling. But he definitely can have chilled water, and he can get one of these rear-door heat exchangers, plop the system in his lab, and not have to worry about maintaining a server room environment. Now, the computer systems people of the organization may not agree with that deployment model; the salespeople certainly like it. But that's one of the things I've seen: there's just an issue of having space and power for some systems.

The other thing I'm actually very happy about is that the processor vendors, primarily AMD and Intel, are taking power seriously now. Not that they didn't before, but now it's really become a bullet item in many of their presentations and part of the sales pitch. I'll even mention that the Nehalem, which just recently came out, has the capability to power down individual cores when not in use, and on a whole-system basis, if you're not using a host, to power down the memory and I/O controllers too, to, I believe, somewhere around 10 watts in a standby mode. That's very important to me, because one of the questions I've asked people when I see one of these big clusters running is: what's your utilization rate? And they might say, well, about 70 percent. So I say: so 30 percent of these, or if you have 100 pizza boxes, 30 of them, are sitting there basically doing nothing but acting like good space heaters? And they're kind of like, well, yeah. Certainly that's some place where we could handle the heat better and save money at the same time.
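To put a rough number on Doug's space-heater point, here is a back-of-the-envelope sketch. The 100-node count, 70 percent utilization, and roughly 10-watt standby figure come from the conversation; the 250-watt idle draw and the $0.10/kWh electricity price are assumed placeholders, and cooling overhead is ignored.

```python
# Rough estimate of what idle-but-powered-on nodes cost versus a deep standby state.
nodes = 100
utilization = 0.70            # fraction of nodes actually doing work (from the discussion)
idle_watts = 250.0            # assumed draw of an idle-but-running node (placeholder)
standby_watts = 10.0          # approximate Nehalem-era standby figure mentioned above
price_per_kwh = 0.10          # assumed electricity price ($/kWh)

idle_nodes = nodes * (1.0 - utilization)
hours_per_year = 24 * 365
wasted_kwh = idle_nodes * (idle_watts - standby_watts) * hours_per_year / 1000.0
print(f"Idle nodes: {idle_nodes:.0f}")
print(f"Energy that could be saved per year: {wasted_kwh:,.0f} kWh")
print(f"Roughly ${wasted_kwh * price_per_kwh:,.0f}/year before any cooling savings")
```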
All right, so that's another good segue into the next layer up: being able to selectively power down fits into the broader genre of base provisioning and base cluster management, getting the right operating system load on there and things like that. What do you guys see in this arena? There's a whole pile of tools out there that do these things, and everybody fronts features for X and features for Y; there are open source ones, there are commercial ones, and every vendor's got their own favorite tools. What do you see people using, and what do you see as the successes and failures in this area?

Well, this is Doug Eadline. From my standpoint, I'm a bit biased, and I want to be up front about that, and I think Jeff Layton probably is a little bit too. We prefer a diskless provisioning model. That basically means that when the node is booted, it gets its operating system image and any other files from some central server location, and there is no hard drive on the node responsible for holding an OS image. That does not mean the node cannot have hard drives; on the contrary, it can have a scratch disk or any number of hard drives. The other way of doing it, kind of the opposite, is where you have an image on every node on a hard drive, so each node is like its own little island and can boot no matter what, whenever you turn it on. Using a non-local boot option, diskless provisioning, means a server has to be up and running in order to get the image. The reason I like the diskless approach is that it lets you manage the entire software stack that goes on the node from one location. And again, this can be very controversial; a lot of people have different opinions about it. It comes from my experience of trying to manage clusters. I have a couple of rules: if there is available storage somewhere, like a hard drive, users will write to it. I don't know why; they just seem to find it and write to it. And inevitably the nodes develop what I call personalities, and oftentimes you run into trouble with cluster management trying to manage the personalities of the nodes: did this get upgraded on this node, did that get upgraded on that node? There are ways to manage that by pushing images around and re-imaging a node and so forth, and that's one way of solving the problem. But when you have everything concentrated on a server, you can make one change, and every node just needs to reboot and the change is there. So those are the two approaches people take, and I'm sure, Jeff, you have some input on that as well.

Oh, I absolutely agree with you, Doug. And by the way, Doug and I have been friends for a long time; I was actually one of his customers when he had his company. But no, I agree; I love it for all of those reasons.
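A minimal sketch of the "manage everything from one location" idea Doug describes: the central server decides which image each group of nodes boots, and changing that mapping plus a reboot is the whole upgrade procedure. The node groups, image names, and pxelinux-style paths here are hypothetical examples, not anything from the show; a real deployment would lean on a provisioning tool such as the ones mentioned later in the conversation (Warewulf, Perceus, xCAT, OSCAR).

```python
# Sketch: generate PXE boot entries from one central node-group -> image mapping.
# All node names, image names, and paths are made-up placeholders.
from pathlib import Path

node_groups = {
    "compute[001-128]": "rhel5-compute",   # hypothetical group -> image label
    "bigmem[01-04]":    "rhel5-bigmem",
    "viz[01-02]":       "suse10-viz",
}

images = {
    "rhel5-compute": {"kernel": "vmlinuz-rhel5",  "initrd": "initrd-rhel5-compute.img"},
    "rhel5-bigmem":  {"kernel": "vmlinuz-rhel5",  "initrd": "initrd-rhel5-bigmem.img"},
    "suse10-viz":    {"kernel": "vmlinuz-suse10", "initrd": "initrd-suse10-viz.img"},
}

out_dir = Path("pxelinux.cfg")              # hypothetical TFTP config directory
out_dir.mkdir(exist_ok=True)

for group, image_name in node_groups.items():
    img = images[image_name]
    cfg = (
        "DEFAULT cluster\n"
        "LABEL cluster\n"
        f"  KERNEL {img['kernel']}\n"
        f"  APPEND initrd={img['initrd']} root=/dev/nfs ro\n"
    )
    # In reality you'd write one file per node MAC/IP; one per group keeps the sketch short.
    (out_dir / f"{image_name}.cfg").write_text(cfg)
    print(f"{group} -> {image_name}")
# Upgrading "every compute node" then becomes: edit the mapping, regenerate, reboot.
```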
But one of the other pet peeves, which Doug didn't mention but which he and I always chat about, is this idea of taking job schedulers and, if a node isn't being used, just letting the scheduler power it down. The scheduler keeps track of it, so if a job comes in and the node is needed, it powers it back up. Doug and I have been wanting this for a long time, and I've heard of some possibilities, people who have kind of sort of developed this, but I don't know if it's been tested in a widespread way. Maybe Jeff Squyres knows a little bit more about that. But I think that's another avenue people can pursue, and it needs to be addressed as a community.

Yeah, to be honest, I would love to see something like that as well, because at Cisco my MPI development cluster is, you know, 50-some nodes or so, and there are times when I actually do have nodes idle, and I feel bad that I'm just burning electricity while they're sitting there waiting for the next job to come off the queue. The nightly jobs are done and they don't have anything to do until tomorrow night. Usually it's just a couple of hours' worth of waste, but it's still waste that could be better managed, and if someone's got a better solution for that, I would love to hear it.

Yep, absolutely. Doug and I have been threatening to do it and we haven't, but I'd like to see somebody do it, on any basis, commercial or open source. That's another one of our pet peeves: we're kind of open source supporters, but at the same time we see the need for commercial too. We'd just like to see somebody take the bull by the horns and do it.

If I have this correctly, I believe, and we're kind of stepping into batch systems now, which is fine, that one of the lighter-weight batch systems, called SLURM, is going to have power control as part of its next release. I know there are some people trying to work on that with Sun Grid Engine; I've read some things about it, but I don't know if they have anything implemented. I'm sure the other popular open source scheduler, Torque, is looking at that as well, and I'm absolutely sure Platform is looking at it with their commercial products. So I think it's something everybody's looking at. I don't think it's there just yet, and I would love to play with it. It does bring up some issues; it sounds like a simple idea, but there are real questions about how you actually implement it.
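A minimal sketch of the scheduler-driven power management the three of them are asking for here: periodically look for nodes that have been idle past some threshold with nothing in the queue that needs them, and shut them down out-of-band. The `nodes_idle_for` and `queued_jobs_need_more_nodes` functions and the BMC naming convention are stubs invented for illustration; the only external command assumed is ipmitool, and whether your resource manager exposes the needed state this cleanly is site-dependent.

```python
# Sketch of an "idle node power-down" loop, not a supported feature of any
# particular scheduler. The query functions are stubs you would implement
# against your own resource manager (Torque, SGE, SLURM, ...).
import subprocess

IDLE_THRESHOLD_MIN = 120   # power off nodes idle longer than this (assumed policy)

def nodes_idle_for(minutes):
    """Stub: return node names idle at least `minutes` (e.g. parse pbsnodes/sinfo)."""
    return ["node031", "node047"]   # placeholder data

def queued_jobs_need_more_nodes():
    """Stub: True if pending jobs could use the idle nodes (e.g. parse qstat/squeue)."""
    return False

def power_off(node):
    # Out-of-band power control via the node's BMC; the hostname convention is assumed.
    bmc = f"{node}-bmc"
    subprocess.run(
        ["ipmitool", "-H", bmc, "-U", "admin", "-f", "/root/.ipmipass",
         "chassis", "power", "soft"],
        check=True,
    )
    print(f"powered down {node}")

if not queued_jobs_need_more_nodes():
    for node in nodes_idle_for(IDLE_THRESHOLD_MIN):
        power_off(node)
# The matching "power up when a job needs them" half would mark the node offline
# in the scheduler first, then use `chassis power on` (or Wake-on-LAN) and return
# it to service once it finishes booting.
```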
The other thing we didn't mention about dynamic, or diskless, provisioning is that it's quicker than having a disk on the system: when you do these dynamic boots, reboots, and power-downs, the system can boot up a lot quicker. And one other advantage we didn't mention is that it's possible to have different system images that can boot on the nodes. So for instance, if you need a certain kernel version or package set, you can customize that for a specific node or group of nodes and boot those nodes with that image. You could potentially even have a SUSE set of images or a Red Hat set of images; it gives you a bit more flexibility.

So actually, this is a story I like to hear, because one of the trends in HPC I've been hearing about recently, and when I say trend I mean trendy, is people saying, wow, we should use virtualization in HPC, so that my job can have this node get SUSE, this one get Red Hat, this one get Caos Linux, this one get Debian, this one get whatever application load it needs, and we can just load whatever virtual machine we like. I've always been kind of dubious of that model, for exactly what you just mentioned there, Doug: just get yourself good diskless provisioning and then you don't have all the complexity of virtual machines; you can just pick which image to load, rather than trying to virtualize the network and virtualize this and virtualize that. A lot of those other layers just disappear and you get the same end functionality. Go ahead, Doug.

Yeah, the one thing, since you mentioned virtualization, and I think it's a great idea, I really like the idea, is that at the same time, in HPC, and Jeff Squyres and Jeff Layton know this very well, there are a lot of people who spend a lot of time getting software as close to the hardware as possible for the best performance, and virtualization does kind of the opposite: it virtualizes the hardware. And therein lies the rub, I think, with HPC at this point: we want to be as close to the hardware as possible, where virtualization wants to move us away from it a little bit. So I think there's going to be some virtualization play in HPC, but I don't see it the way most people do.

Yeah, no, absolutely, I think that's dead on. I love Jeff's idea that, as part of the job, you just say what OS and distribution you want, and an image gets created and fired off. But like a lot of things in HPC, the devil's in the details, because what you're doing now is basically re-imaging the nodes. So how does the job scheduler keep track, since the job scheduler usually puts a daemon on there and they communicate? You have that kind of problem. I'm not exactly sure how to solve it; there's got to be a good way to do it, or you just bite the bullet and tell the job scheduler it's a brand-new node with no idea of state. So I think that's a great approach, I like that approach, and like Doug said, this idea of virtualization has become so trendy. I can't tell you how many times people have said, "I've got a new idea for doing virtualization in HPC," and then you start drilling into it and you find out that, no, they can't. The big problem, like Doug said, is performance. From what I've seen, when you start running VMs and you run an application in the VM, you're taking something like a 30 percent hit in network performance; you're going to take a big hit in network and I/O performance. So the first thing you do is pay the VM tax and buy more hardware to get back to basically the same throughput. And then the next thing people usually like to do in a VM, the argument is: well, if the node looks suspect and may die, I can move the job while it's running to another node; in the VM world they call it VMotion, so I could VMotion the app to another node. Great idea, except if you've got a lot of I/O going on or a lot of network traffic, how do you do that? How do you intercept it all on the fly and remap it to another node? Do you just stop it and then move it?
So I think the devil's in the details on a lot of this stuff with virtualization. The best story I've ever heard is actually from my boss, Tim Carroll, and people probably know him; he's been around the HPC world for a while as well. His analogy is: with HPC, you're trying to take a lot of nodes and make them work like one, and with virtualization, you're trying to take one node and make it look like a whole bunch of machines. So they're kind of opposite approaches. I can see it maybe being used in certain circumstances, but as a general overall trend, until we solve a lot of problems, I don't quite see it coming out yet. I don't know, Jeff, does that kind of match up with what you've been seeing and thinking?

Yeah, pretty much. I should backpedal a little bit so I don't get in trouble with my employer here: virtualization is certainly great in a lot of environments and has a lot of really good uses. I'm specifically limiting my remarks to the HPC world, where I'm agreeing pretty much entirely with you guys: virtualization is a great technology, but not so much in the HPC arena. It introduces more problems than it solves. Like, Jeff, you were talking about some of the problems: all right, now the scheduler has to be involved somehow, and you've either got to have images with a new scheduler daemon on them, or, you know, there are a lot of conditions and things that need to be solved before this is even feasible, compared with what I mentioned before of just booting a new image for that particular job. My personal view is that those are easier problems to solve than the problems virtualization adds in an HPC kind of climate.

Thanks for that, uh, I don't know what you call it, that disclaimer that's usually at the end of the commercial, read at about 14,000 characters a second.

Yeah, I agree with this: this is all about HPC; virtualization is fantastic for certain fields. And by the way, I'm speaking for myself, not necessarily for my employer, so I just want to get that in there as well.

Let's ask one more question here before we digress further into the job schedulers, but before we completely leave the hardware arena: what's your take on multiple login nodes, multiple administration nodes, I/O nodes? I know a lot of this comes back to requirements, but I'd just like to hear what you see out on the street, what you see in the world. Do people just make do with one login node, or what do people do?
Well, from my standpoint, a lot of people don't know that they can have multiple login nodes, and that is also a function of what scheduler you use and so forth. And I want to back up a minute, because when we were talking about this I just thought: if I'm out there listening to this, and these guys are talking about schedulers and all this kind of stuff, some people may not realize how a user actually uses a cluster. We're all used to sitting down at our desktop or laptop, having complete access, running a program, and having things happen when we want them to. Well, on a cluster, we in a way take a step back to the old days of batch systems: you submit your job with a certain resource requirement, and then, when those resources are available, it runs on the cluster, and your results come back to you. So it's a different mode of operating than most people are used to. And it's required because we can't have, for instance, 20 people logging into the head node of the cluster, each trying to run a job and use different compute nodes, because they'll be stepping on each other and the nodes will be overwhelmed with jobs and so on. The scheduler basically takes care of that. I think that's a good thing to remember, and now, of course, I've forgotten the question Jeff Squyres asked.

Ha. What I was asking about was login nodes and administrative nodes, the non-core, back-end cluster nodes. What do you guys see? Because I imagine that's also largely a function of budget as well as size.

It is, and it's important. On a smaller cluster you can have one node that does all of that: it can function as the node that runs the administrative software, which in many cases includes Ganglia, which is a way to watch what's going on on the cluster; it can run the scheduling software; and it can function as a login node. As the cluster gets bigger, these tasks can be broken out onto separate nodes. In particular, administration nodes can be important for a couple of reasons: you want to keep that away from end users. Not that an end user would ever do something they weren't supposed to do, but some people think it's a good idea to keep the administration stuff completely on a separate node. The same thing with the scheduler: sometimes the batch scheduler is so busy monitoring jobs and watching the cluster that it really should have a separate node to handle things. And the same goes for logins: you may have lots of different people logging in, and when they log in they may be compiling on that node, getting their job ready to run, and then they submit the job to the scheduler, which runs it as the resources become available.

Yeah, no, that's absolutely true, and a rule of thumb I use for customers, and this is totally my rule of thumb and I know it's probably wrong: for about every 25 users on the system simultaneously, I recommend another login node. So if you've got 25 people doing whatever, throw another login node in there, because by that time the cluster is probably pretty big anyway, and one extra node isn't going to kill you as far as cost goes. But yeah, that's absolutely right.
You start small and grow it. For bigger systems I like to see a second node dedicated just to the job scheduler, so you can do failover: if that node dies, it fails over to the other one and the jobs won't die. Then a separate management node; I like to keep that separate for bigger systems. And then storage as well. Doug and I have our own secret basement clusters that we've been building for years, so we've got a head node that does everything, including acting as the storage node for NFS, but as things get bigger you overwhelm that node, so you've got to split that out too. At the same time, I think Jeff's probably got that tone in his voice of "okay, now I'm making my management problem about a thousand times more difficult," and that's absolutely true; at some point you're just going to get overwhelmed.

But this is my one and only Dell pitch: Dell and another eleven companies, including Sun and Intel and a few others...

Hey, and Cisco too.

Cisco, yeah, sorry. I thought Cisco was there, but to be honest, I can never remember. The idea is we all put in some money and a vested interest in collaborating with Lawrence Livermore to create this system called Hyperion, and what Hyperion is there to look at is scaling: scaling of the OS, scaling of management, scaling of the job scheduler. Because we're starting to look at systems with thousands and thousands of nodes; how do you manage that? You don't want 46 login nodes and a whole bunch of other nodes you've got to run around and manage. So how are we going to tackle that kind of problem? Mark Seager at Lawrence Livermore has led this project, and Cisco and Dell and Sun and Intel and a number of other companies, and I apologize if I missed your company, please let us know which one it is because I can never remember, and Red Hat, I'm sorry, I need to include Red Hat since I'm such a Linux fan, everybody has a vested interest in looking at this problem of how we tackle, I call it, the scaling of the admin. We always laugh that admins don't scale. We've got to figure out a way to tackle that problem, because the systems are just getting too big.

Okay, well, moving on: the concept of the scheduler has come up quite often; it really sounds like the cluster revolves around the scheduler. What kinds of schedulers do you see out there, and what are some of the more creative uses of a scheduling policy you've seen on a cluster?

Well, that's a good question. Some of my experience has been that users have difficulty with scheduling policies.

And that sounds very politically correctly phrased there, Doug.

When they complain about how they would like to see the cluster work, and you ask them to come up with a policy for how they want things to work, and that doesn't seem to happen, then it just kind of falls flat,
I guess that's the way to say it. And there are some cases where people do take advantage of cluster scheduling policies, which can be very, very complicated. Sun Grid Engine, for example, and I mention it because I'm most familiar with it, has some fairly sophisticated scheduling methodologies. It's the kind of thing that's very organization-dependent, and the trouble I see is that it gets into organizational politics and thereby creates some issues. By the way, the three main open source schedulers are Torque, usually with Maui, Sun Grid Engine, and the other one is SLURM, which is from Lawrence Livermore as well. SLURM is more targeted toward clusters, where the other open source packages are more organization-wide in a sense. And the other one I mentioned, Platform, has a scheduling product as well. So that's my politically correct answer to that question.

Let me put you on the spot a little bit here. You were talking at the 50,000-foot level, and I've got to take you back to a comment you made earlier. For someone listening to this podcast who has no idea about HPC clusters, that might be a little too high. I wonder if you could ground this in a couple of real-world examples you've seen, or maybe even some hypothetical examples, and give us a little more detail on what you were talking about: why policies are difficult, how things can get complicated, and what works best from what you've seen.

Well, here's an example. Some clusters I'm involved in come from various funding sources, so there will be a certain number of nodes that are owned by a certain group that has exclusive use of those nodes, and the exclusive use is dictated by the grant: they are not allowed to share the hardware. Then in other situations, groups will bring some nodes to the table in exchange for using other nodes in the cluster when their own nodes aren't available. So in that case your scheduling policy may have to say that certain people in this group can use only these nodes, or that they can use a lot of the nodes in the cluster, or you may have someone who says, we have higher priority on these nodes and we get them first, and if we're not using them, someone else can have access to them. That's an example of a situation that crops up in what I would call multi-sourced clusters, where schedulers come into play.

Another area where schedulers come into play is where you have separate hardware generations. For example, five years ago you bought a 128-node cluster, and last year you upgraded it with 64 more nodes. Now you have two different processors, different memory, and so forth. That becomes a resource, and with your scheduling software you can ask for certain resources, and that way you make sure you're not running half on old nodes and half on new nodes in a really off-centered kind of way. So the scheduler can also be used to sort out hardware issues.

In addition, there are some clusters that have fat memory nodes, where certain jobs just need gobs and gobs of memory. For those nodes, a user would have to submit as part of their job: I need this much, I need a node with this much memory. Then they would wait in the queue until that node was available, and it would be used.
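Here is a minimal sketch of what that kind of resource request looks like in practice, using Torque/PBS-style directives since Torque and Maui were just mentioned. The node property name "bigmem", the memory amount, the queue name, and the solver command are hypothetical, site-defined examples; real properties are whatever the admin has attached to the nodes.

```python
# Sketch: build and submit a Torque/PBS job that asks for a fat-memory node.
# The "bigmem" property, memory size, queue, and program are made-up examples.
import subprocess

job_script = """#!/bin/bash
#PBS -N bigmem_example
#PBS -q batch
#PBS -l nodes=1:bigmem:ppn=8
#PBS -l mem=64gb
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
./my_solver input.dat
"""

with open("bigmem_job.pbs", "w") as f:
    f.write(job_script)

# qsub hands the script to the resource manager; the scheduler (Maui here)
# leaves it queued until a node carrying the "bigmem" property is free.
subprocess.run(["qsub", "bigmem_job.pbs"], check=True)
```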
Now, I'm not going to get into the politics of it, but it does happen, and the biggest question a user, and as I just said, an admin, will have is: my job's in the queue, why isn't it running? The answer to that can be lots of different things, some of which are political, some of which are resource constraints, and some of which are user error.

Yeah, and I'll jump in, Doug, and also say, for you admins out there, this kind of question, "why isn't my job running?", is a chance for you to make a little bit of extra money: just say, well, for 50 bucks I can make it run a lot faster. I've used that argument occasionally; it's never gone anywhere. Having been a user for a number of years, I'll tell you one interesting situation, because I worked at a large aerospace company. We had a number of projects, and we all shared the same machine, but under each project you have various disciplines, and under each discipline you have various sub-projects, and the relative importance of those sub-projects would change day to day depending on the situation. So every day we had a manager who sat down with the team and juggled the policies: okay, who's got higher priority today versus this one? It was a nightmare for the manager. We didn't find a better solution at the time, so it can get into a quagmire, and it's awful. So I kind of default to first-in, first-out, just the old FIFO kind of thing, although the Maui scheduler itself is pretty good about backfilling. So if you've got a job that's blocking, let's say a job waiting for 18 nodes when you've only got 16 free, it's just waiting for those last two, but there's maybe a 16-node job that could run right away; Maui does a good job of backfilling for that kind of thing. It gets really complex, there are lots of research papers written about it, and there's no perfect solution yet; people are still looking at that.

Oh, by the way, I wanted to mention one other thing. Doug was talking about open source; LSF is a big commercial product, and it's very, very good. Cluster Resources has Moab, which has gotten a lot of traction lately. But the Platform people have actually open-sourced an older version of LSF, I think it's version four, and they call it Lava. So if you look at Platform's Open Cluster Stack, their OCS product, you'll find Lava in there, and it's totally open source. So if you like LSF and you're comfortable with LSF 4 and you can build it, it's totally out there for you. It's another option people can consider.
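A toy sketch of the backfill idea Jeff describes just above (a job needing 18 nodes blocked while only 16 are free, with a 16-node job sitting behind it). This is a simplified "conservative backfill" flavor: a smaller job may jump ahead only if it fits in the free nodes and its requested walltime won't delay the blocked job's expected start. Real schedulers such as Maui layer priorities, reservations, and fairshare on top of this; the job data here is invented.

```python
# Toy backfill pass over a FIFO queue. Invented example data.
free_nodes = 16
# Each job: (name, nodes_requested, walltime_hours), in FIFO order.
queue = [
    ("job_A", 18, 4.0),   # head of queue: blocked, only 16 nodes free
    ("job_B", 16, 2.0),   # fits now and is short enough to backfill
    ("job_C",  8, 6.0),   # would fit, but too long (see check below)
]

head_name, head_nodes, _ = queue[0]
# Estimate when the head job can start (here: assume the 2 missing nodes free
# up in 3 hours; a real scheduler computes this from running jobs' walltimes).
head_expected_start_hours = 3.0

if head_nodes > free_nodes:
    print(f"{head_name} is blocked ({head_nodes} needed, {free_nodes} free)")
    for name, nodes, walltime in queue[1:]:
        fits_now = nodes <= free_nodes
        wont_delay_head = walltime <= head_expected_start_hours
        if fits_now and wont_delay_head:
            print(f"backfill: start {name} ({nodes} nodes, {walltime}h)")
            free_nodes -= nodes
        else:
            print(f"skip {name}: fits_now={fits_now}, wont_delay_head={wont_delay_head}")
```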
So, Jeff, you tossed out a couple of names there, and we haven't done a good job on this in this conversation: can you distinguish between the scheduler and the resource manager?

I knew you were going to ask me that question, because you lectured me on this one time in an email, and you were absolutely right, because I blew the definition.

Yeah, there are two bits to the whole scheduling-and-running picture: there's one piece that actually does the scheduling, and there's one that does the resource management. So what you have is a scheduler that figures out which jobs should run next based on some criteria, and the resource manager's job is to take that information and actually make the application run on whatever hardware is assigned to it. That's the high-level version in my little brain. Jeff, is that pretty accurate?

Yeah, that's it: dividing it into those two things, the guy who makes the decision and the guy who actually does the action. The decision maker is the scheduler, and the action guy is the resource manager: he actually goes and launches your job and monitors it, and then, when your job is done, cleans it all up and asks the scheduler for the next thing to do, and so on. So we were throwing around a lot of names here; let's assign "scheduler" and "resource manager" to each one of them. Can you go through each of those things and say which is a scheduler, which is a resource manager, and which one is both?

Oh boy. I'll do my best, and then correct me; let's put it that way.

So this is Jeff's final exam.

For me, yeah. Don't get it wrong. Oh man, I already went to grad school; I don't want to do it again. Okay: Torque is a resource manager, but it has a fundamental scheduler called FIFO, first in, first out, so the first job to get in the queue gets to run first, and so on. You can tweak it and do some programming if you want to write your own scheduler. Maui is a scheduler that plugs into it, and that would supersede the real basic FIFO that Torque has. You can put in, say, Maui or one of the others that allows you to do something much more complex, but still use Torque as the resource manager.

Absolutely, that's perfect. And Sun Grid Engine can do both as well, and I think Maui fits into Sun Grid Engine; Doug knows that a bit better than I do.

Yeah, someone shoehorned it in and you can use it if you want to.

And let's see: Lava is the same way, and LSF is both a scheduler and a resource manager. The one I don't know as well is Moab, and I think Moab is more of a scheduler.

Yeah, Moab's a scheduler. So Moab and Maui are just pure schedulers.

Right. Okay, good, I didn't flunk that one. And SLURM is just like Torque, right? It's got a basic FIFO scheduler, but it's primarily a resource manager, and you can plug any of the pure schedulers into it. Although I do believe they've announced they're progressing more in the scheduler direction themselves; SLURM is going to contain more sophisticated things than FIFO in future releases.
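A small sketch of the division of labor Jeff describes above: the scheduler is the decision maker, the resource manager launches, monitors, and cleans up, then asks for the next decision. The classes and queue data are invented for illustration and aren't modeled on any particular product's API.

```python
# Toy model of the scheduler / resource-manager split. Everything here is invented.
class Scheduler:
    """Decision maker: picks which queued job should run next."""
    def pick_next(self, queue, free_node_count):
        for job in queue:                        # plain FIFO policy for the sketch
            if job["nodes"] <= free_node_count:
                return job
        return None

class ResourceManager:
    """Action taker: launches the job, watches it, cleans up afterwards."""
    def run(self, job, nodes):
        print(f"launching {job['name']} on {nodes}")   # e.g. spawn processes on the nodes
        print(f"monitoring {job['name']} ...")
        print(f"cleaning up {job['name']}")            # kill strays, release the nodes

queue = [{"name": "sim1", "nodes": 2}, {"name": "sim2", "nodes": 1}]
free = ["n01", "n02", "n03"]
sched, rm = Scheduler(), ResourceManager()

while queue:
    job = sched.pick_next(queue, len(free))      # scheduler decides
    if job is None:
        break
    queue.remove(job)
    rm.run(job, free[:job["nodes"]])             # resource manager acts, then loops back
```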
Yeah, and real quick, and I always forget this, so kick me in the head, because I've used it for years: PBS Pro. Altair owns that now. We talk about Torque as kind of the open source fork of PBS from years ago that's been developed since, and then there's PBS Pro as well. And then there are grid schedulers, but boy, we'd go down a rathole on that one, so I'll skip it. But Jeff has been very good and very polite, and one of his lectures to me, which was really good and I'm glad he gave it because I didn't understand it, and since he's leading the Open MPI project: a lot of times you need the resource manager tied into the MPI layer, because the resource manager may kill the job, or may not, and then you have zombie processes running around chewing up CPU cycles; it's a mess. So I agree that they need to be plugged into each other, and to some degree the MPI can even be the resource manager itself; I think that's maybe one kind of extreme example.

So: the end-user software stack. If you had a generic resource cluster, like at an average university where there are all sorts of applications, what does the site provide?

Well, the simple answer, in general, for users: you need compilers and you need MPI, which stands for Message Passing Interface, libraries, and that's at a minimum. If you have turnkey applications that can be pre-compiled, those should be provided as well. Beyond that there are things like debuggers and profilers that work in a parallel environment, which, by the way, is not an easy thing to do. And as we already mentioned, there's administrative software needed as well. From the end-user standpoint, beyond the open source compilers, the GNU compilers, there are, in the HPC world in general, and I may get some heat for this, the Intel compilers, which are good, and the other ones that have gained some favor in maybe the last two years are PathScale and Portland Group. Portland and PathScale are more targeted toward AMD processors, where Intel obviously targets Intel processors. And then MPI libraries; if anybody knows of a good open source MPI library, I'd love to hear about it.

Yeah, those are hard to come by, aren't they?

Yeah, they sure are. And while I have the floor for just a moment, I do want to mention something about open source software and clustering. In clustering, I believe that open source software is not about saving money; it's where we see the real value of free use and openness in software, where things can be changed and adapted to suit end users' needs. If there were ever an example of where the open source model fits, I would say it's clustering, and everybody involved would probably agree with me that we have made such progress in HPC clustering because, as I call it, the plumbing is open. You can do things in terms of file systems, operating systems, MPI libraries and so forth that you couldn't do, or that would be much more difficult, if everything were closed. And that's not to say that commercial software shouldn't be used; I believe it's important and it should be used. But open source software has really been a big value-add in the cluster world.

Amen, Reverend Eadline. Absolutely, and it is not a matter of cost, and I think pretty much everybody in the open source world understands, or should understand, that it's not a matter of cost;
it's a matter of being open. A quick historical note: one of the early cluster guys at NASA, Doug, I can't remember his name, actually found a bug in the TCP stack, and by changing a few things managed to improve TCP performance hugely, and his patch was in use for years. If it were closed, that would never, ever happen. So I agree: you need that software stack as a minimum, and you can mix and match commercial versus open source versus commercially supported open source, which I think we're starting to see more of as well. To go on to the next level, you can also add provisioning tools, you know, the OSCARs of the world, Warewulf, Perceus, xCAT, OCS from Platform; we could name thousands of them. That's a way to image the nodes or provision them; that's kind of another layer. The next layer is monitoring, finding out what's going on with the nodes, although I think people sometimes focus on that a little too much and start eating up their network; you don't need to look at uptime every five seconds, it's not going to change that much. And one thing we also probably didn't talk about is the OS itself: it can be open source such as Linux, of course, or it can even be closed source such as Windows. So you have to look at what the requirements are for the application and what the user requirements are as well.

Actually, I would add to that list: I always like to have the right BLAS library around for the hardware you're running on.

Oh yeah. Don't tell me, Brock, you're going to run the Top500, right?

Me? Heck no.

No, but I will throw in a plug. I mean, Doug and I, we were all teasing about HPL and the Top500, but I have used it in the past to help debug nodes, nodes that don't seem to be performing well. I've used it as a way to figure out which nodes in the cluster may have an issue, so I can focus on those. So it is useful in that way, and I do like to use a high-speed BLAS library so I can make sure I get the best performance and can start to look for those differences.

And for those who don't know their BLAS: the BLAS, the Basic Linear Algebra Subroutines, what we're referring to here, are very highly tuned mathematical subroutines that run really, really well on whatever your particular platform is, and there's a variety of different BLAS implementations out there tuned for different types of platforms.

And I want to go on record as saying I support the Top500 benchmark as it's intended to be used. I do believe it's a tremendous historical measurement; it has a lot of history in it, and it measures the market and the technologies. My aversion is that it gets overused in the wrong way. That's all I need to say.
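Jeff's trick of using HPL with a well-tuned BLAS to spot underperforming nodes can be approximated with an even simpler check: time a large matrix multiply (which calls down into whatever BLAS NumPy is linked against) on every node and compare the numbers. This is a rough sketch, not HPL itself; the matrix size is an arbitrary choice, and absolute GFLOP/s will depend on which BLAS is installed.

```python
# Quick per-node "is this node slow?" check: time a DGEMM through NumPy's BLAS.
# Run it on every compute node (e.g. via the batch system) and compare results;
# a node reporting far fewer GFLOP/s than its twins deserves a closer look.
import socket
import time
import numpy as np

n = 4000                                  # arbitrary size; big enough to be compute-bound
a = np.random.rand(n, n)
b = np.random.rand(n, n)

np.dot(a, b)                              # warm-up run (page in memory, spin up threads)

t0 = time.time()
np.dot(a, b)
elapsed = time.time() - t0

gflops = 2.0 * n ** 3 / elapsed / 1e9     # DGEMM does ~2*n^3 floating-point operations
print(f"{socket.gethostname()}: {gflops:6.1f} GFLOP/s ({elapsed:.2f} s)")
```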
And the other thing I wanted to say before we run out of time is a little bit of a plug: Jeff Layton has written an article on ClusterMonkey called "How to Write a Technical Cluster RFP," where a lot of the things we talked about today are carefully presented by Jeff, and it's really worth reading if you're interested in what's important in proposing a cluster, or in understanding what's in a cluster. To find it, just go to clustermonkey.net, all one word, type "rfp" in the search box, and it will come up as one of the options. And while you're there, we have tutorials on ClusterMonkey, including one very good MPI tutorial written by Jeff Squyres. It's a community resource; we have lots of good content, there's no registration or anything required, and you can go there and check it out.

Hey, neat. I think I'll include a link to that "how to write an RFP" article in the notes for the show when it goes out.

Oh, that's going to get some email. You'll be getting RFPs. I'm a tech guy; I don't do sales.

Hey, by the way, I wanted to make one quick comment: Jeff, when you were talking about BLAS, you used the word "subroutine," and I think you just dated yourself as a Fortran guy.

God, it's horrible, because I'm the guy, the poor schlub, who's tasked with taking care of all the Fortran in Open MPI. It's awful, awful, awful. But that is certainly our contribution to open source, because keeping up the Fortran APIs has absolutely nothing to do with selling Cisco hardware, I'll tell you that.

Oh, and I love Fortran. Do they have an Alcoholics Anonymous for Fortran users? It's partly because I used it, though I can also think in other languages. But I truly appreciate the fact that you guys still pay attention to Fortran.

Oh, there's quite a bit of Fortran still out there. Don't get me started on an MPI rant here, but this is why MPI defined a Fortran API: there's a huge amount of Fortran code out there, written by people who are not computer scientists, and not even necessarily very good computer programmers; they're physicists and biologists, and at the end of the day they just want to use their cluster. They just want to solve their problem. They don't know how the computer works, they don't care how the network works; they just want their stuff to run, and they want it to run in two hours rather than two weeks. And that's what it's all about. That's kind of everything we've been talking about here for the last hour, hour and a half: how do you get a resource like that, and the six million things that are necessary to enable that physicist to just run their job?

Absolutely, it's all about the users. That's why I like to ask: what do you want to do with the cluster? I want to know who's on it, what kind of applications, what they like. Good point.

Okay, guys, well, I think that's a good spot to wrap it up. Thank you very much for taking some time out and speaking with us. Again: clustermonkey.net, get ahold of Jeff at Dell, and check out rce-cast.com to fill out the nomination form for other projects you'd like to hear about.

All right, thanks, everybody. Yeah, thanks. Let me know. Thanks a lot.