All right, a lot of what I'm going to be talking about today comes from this web page, the SGE Basics page. We use a scheduler called Sun Grid Engine, or SGE; it's the open-source version of software that came out a long time ago, if you've used schedulers before.

The command to submit jobs is qsub — you can see it up here. What it does is take the command you give it and run it through our scheduler. The scheduler is a piece of software that figures out how best to keep everyone as happy as we possibly can. We schedule on a per-core basis, so you can ask for a certain number of cores and a certain amount of RAM, and the scheduler says, "Okay, this many cores and this much RAM have been spoken for on this machine. Can I fit this job there? If not, can I fit it over there?" That kind of thing.

Once again — and we highlight this just about everywhere — Beocat will not magically make your program use multiple cores. Your program has to be written to support multi-threaded operation. So you need to know how much memory you need and how many cores you need. We have some advanced options too, but we're not going to go through many of those today unless I get through this a whole lot faster than I think I will.

Let's go ahead and create a file. I'm going to do this in real time; I have two sessions open. I'm going to create a file here called myhost.sh, and all it's going to do is find out what machine we're running on and come right back, so it will take all of a couple of seconds to run. (Typing on a keyboard that is not my own is throwing me off.) hostname.
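A minimal sketch of the script being created in the demo — the filename myhost.sh matches the talk, and the body is just the one command:

```shell
#!/bin/sh
# myhost.sh - print the name of the machine this job runs on.
# This is the entire one-line demo script from the talk.
hostname
```

On the head node this prints the head node's name; when submitted as a job, it prints whichever compute node the scheduler placed it on.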
hostname is a command: if I run hostname on any machine, it just tells me what machine it's running on. Okay, I'm going to write that into the file. Now if I look, it says myhost.sh — this is what it looks like. Running hostname right now says I'm on Celine, which is the name of the head node. We have two head nodes, Eos and Celine; whichever one you log into when you first connect, you might get one or the other.

Now, I talked about permissions being tricky: this file has to be executable, so I'm going to make the file I just created executable. I didn't go through this part earlier because it would make your eyes glaze over and it's no fun to look at, but this is the command I'm using to make it executable. Now when I list my files, myhost.sh shows up in green, so I know it's executable.

So, how many cores do you think I'm going to need for this one? How much memory? A little bitty amount — not much at all. The default on Beocat is one hour, one core, and one gigabyte of RAM per core. That fits well within our parameters here. So I'm going to submit this to the scheduler with qsub and the name of the file we want to submit, and since it's probably going to run really fast, I'll see if it shows up in the queue that quickly. Darn, it went through too fast.

Since I used the defaults, it's telling me, "You have not requested memory for this job; the default of 1G has been selected for you." It will tell you this whenever you use the defaults. Same thing for runtime: I haven't requested a runtime, so I get the default of one hour. We don't get this question as much as we used to, but we used to hear a whole lot of "Hey, my job ran for an hour and then it stopped. Why is that?" Because you asked for an hour.
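The "make it executable" step is plain chmod. Here's a sketch of the same create/chmod/run sequence, using a throwaway copy in a temp directory since the real file lives on the cluster:

```shell
# Create the script, mark it executable, and run it --
# the same create / chmod / run sequence shown on screen.
tmpdir=$(mktemp -d)
printf '#!/bin/sh\nhostname\n' > "$tmpdir/myhost.sh"
chmod +x "$tmpdir/myhost.sh"   # without this, running ./myhost.sh is denied
"$tmpdir/myhost.sh"            # prints the current machine's name
rm -r "$tmpdir"                # clean up the throwaway copy
```

The green filename in the directory listing is just `ls` color-coding for the executable bit that `chmod +x` sets.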
You didn't tell it a specific time limit, so it ran until it hit an hour, and the scheduler said, "Hey, you already used the hour you asked for. Done."

The job was also marked killable, because it has a requested runtime of less than 12 hours. We have several machines here that are owned by a particular group — the Entomology group bought machines for us. Part of the deal is that they bought them and we manage them for them, which works out well for both of us. Part of that arrangement is that any time they want to use those machines, they have priority access: their jobs will kick off any jobs that aren't theirs so theirs can run. But in return, whenever they're not using them, anybody else can use them. That's why we have jobs marked as killable. Marking a job killable means it can run on those owned nodes, which means it will usually start and run a whole lot faster, because it can run anywhere. But you also have to understand that if the owners need the machines in the meantime, the scheduler is going to kill your job so their work can run.

We figure 12 hours is about the cutover point. If your job is under 12 hours, you're not going to care if it gets killed and restarted, because it will probably run again pretty quickly anyway. If it's over 12 hours, you're going to be really frustrated that your job ran for so long and you have no results out of it. You can force this either way: if you have something that needs to run for four days but you don't care if it gets killed — you just want it to start quickly — that's fine too.
A lot of jobs do what's called self-checkpointing, which means the program saves its state every so often along the way. If it gets killed and restarted, you don't care, because it just picks up where it left off — slick. In that case you might want to mark the job killable; you just don't want to waste a whole lot of compute time on work you'd lose.

So the output tells me this job is using killable. Then it tells me the requested resources for this job: h_rt — that's the hard runtime — of one hour, and memory of one gigabyte. When you start submitting multi-core jobs, you'll want to know that the memory request is per core. We've had a lot of people request an insane amount of memory per core because they thought it was per job. It's not per job; it's per core. And then it says my job number 2305517 has been submitted. So now I have a job that has gone into the queue as number 2305517.

The first thing I did after submitting was run a program called qstat. qstat is the monitoring program; it lists all the jobs that are sitting in the queue right now — a live look at the queue, since I have a couple of minutes. You can see my job here at the bottom, 2305517. Generally speaking, the jobs near the top are going to run first and the ones at the bottom are going to run last. However, since I had such a small job, I'm guessing mine is going to get picked up pretty quickly. So I'm going to run qstat and look for my name. This is another Linux-ism: grep means "show me only the lines that match this pattern."
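grep itself is easy to try anywhere. Here it filters a mocked-up queue listing the way `qstat | grep kylehudson` would — the job IDs and usernames below are stand-ins for the real queue:

```shell
# Filter a fake qstat-style listing, as "qstat | grep kylehudson" would.
cat > queue.txt <<'EOF'
2305101 bigsim.sh  alice       r   16
2305350 stats.sh   bob         qw  4
2305517 myhost.sh  kylehudson  qw  1
EOF
grep kylehudson queue.txt   # only the matching line is printed
rm queue.txt
```

If grep prints nothing, no line matched — which, in the demo, means the job has already left the queue.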
So I'm saying "grep kylehudson": take all this output and only show me the lines that contain "kylehudson". There aren't any, which means my job has already finished — which is what I thought it would do; it went really fast.

Now, back in my home directory — not where I was working, because I didn't tell the scheduler otherwise — I'm going to list my files, sorted by time; the -lrt flags sort by time. You can see that I have two new files out here. One is called myhost.sh.e2305517: myhost.sh is the name of the file I submitted, the "e" stands for the error file — any errors that would have come out of the job — and 2305517 is the job number it told me at the beginning. I have another one here called myhost.sh.o2305517; the "o" is the output file. The scheduler will always create these two files, the error file and the output file. The listing tells me the error file has size zero, so I'm not going to show you what's in there — we already know it's empty.

cat basically means "show me what's inside the file." Again, I'm using a lot of things here that we probably haven't gone over; I'm trying to show you things without bogging you down in the details, with the understanding that there are some details you're going to have to learn somewhere along the way. So I said show me what's in this file, and it says "elf38". We have mages, we have elves, we have heroes running right now — elf38 is the name of the node that ran the job. What happened is the scheduler said, "Okay, I can fit your job right now on elf38," elf38 said okay, and it ran the myhost.sh file I put in there. Does that make sense? It's a very simple example, but it goes through pretty much the same steps any job does.

Now, I showed you that qstat shows what's in the queue.
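The `<script>.o<jobid>` / `<script>.e<jobid>` naming convention can be sketched locally. These files are fakes standing in for what SGE writes once job 2305517 finishes:

```shell
# Mimic SGE's per-job files: <name>.o<jobid> holds the job's stdout,
# <name>.e<jobid> holds its stderr (empty here, as in the demo).
jobid=2305517
hostname > "myhost.sh.o$jobid"   # stdout: the node name the "job" ran on
: > "myhost.sh.e$jobid"          # stderr: zero bytes, no errors
ls -lrt myhost.sh.*              # -lrt sorts by time, newest last
cat "myhost.sh.o$jobid"          # show the output file's contents
rm "myhost.sh.o$jobid" "myhost.sh.e$jobid"
```

A quick habit worth forming: check the `.e` file's size first; a non-empty error file usually explains a job that misbehaved.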
There's another tool, because qstat is actually very tricky to get more than basic information out of. So Dave over here wrote a program called kstat, and kstat takes everything qstat does and makes it pretty and easy to use. This is another way you can monitor your programs: you can go through here and look for your name or whatever you like.

Any machine shown in red is not running, and we have a few of those right now; we just made some changes, so several machines need to be restarted, things like that. The listing starts with the elves. elf01 has 16 cores and 64 gigabytes of RAM; it's not running anything, and its queue is disabled. elf02, on the other hand — this is one of the things kstat is reporting wrong because of the change we just made — says it's running 6 cores, when it should say 16 of 16 cores in use right now. Next to that is the name of the job; if it were the job I just ran, it would say myhost.sh, the name of what I submitted. It tells you how many resources are being used, and it goes through and does that for every machine in here. It's a really good way of seeing how things are going, and anything in red is something you should pay attention to.

I wonder why that one says 59.2 out of 64. Normally, if you're using more memory than you requested, you'll see it in red over here. The "maxvmem N/A" means the program isn't even running, so there's something wrong with that one. This is one thing Dave does: he looks at this on a regular basis and says, "Hey, something's wrong with what you submitted," that kind of thing.
He's really good at keeping things running as efficiently as possible by pointing these things out, because right now those eight cores are reserved for a job that isn't running. The sooner we can get that fixed, the sooner those eight cores can be used for something more useful.

Going through all the machines: the ones in yellow are doing something unusual. Those are a couple of machines we reserve for Dave so he can do some testing; that's why they're yellow. Anything else here look particularly interesting? Yeah, we're getting there. Over here, the ones in red are the ones we talked about: that group bought those machines, so they have priority access on those nodes. Again, you see red where a job asked for 32 gigabytes and is using 34. That's a quick way we can take a look and see who's using more than they requested.

When kstat gets down past the bottom of all the machines, it shows what's sitting in the queue — jobs that are waiting for resources. Right now at the top of the queue is a one-core job. This marker right here means it's an array job, which is something I'm not going to go over today, but it's a way to handle a lot of iterations of something. This user apparently had 30,000 iterations of, probably, a statistical package — that's what most of those are: "I'm running the same program over and over, each time with a different set of inputs." You can submit all of that at once, so it's only one job sitting in the queue, but it runs over and over. kstat also tells you about the queue.
It tells you who submitted each job and what group they're in, so you can get a feel for what's ahead of you in line. Now, a lot of these — like my one-core job — got to jump ahead of the others, because the others were 16s, 12s, or 4s and the scheduler probably couldn't fit them anywhere, but it could fit my one-core job in pretty easily; it didn't need much in the way of resources. It also tells how much memory and how much time each job requested, whether it's killable, and any special requests that go with it.

So that's essentially how you submit your first job. Questions on that? Has anybody here not submitted a job on Beocat yet? Okay. Good, I'm glad to have you here then — that means I didn't waste my time doing all this part.

Another page I'm going to have you look at is linked from a few places on our support site. This is Ganglia. Ganglia is the name of a program that keeps track of a lot of statistics — historical data, that kind of thing. You can see on here how much network traffic we've got and how much memory we've got; right now we have a total of 28 terabytes of RAM across our nodes. You can see how much CPU is being used, and temperature, which you probably don't care much about — we care a lot about that. The yellow line here is basically how many cores are in the queue at any given time.
So right now it's sitting at about 3400. If you added up all those numbers from the queue — 1 plus 160 plus 32 plus 32 plus 16 plus 16, all the way down — you would come up with this number right here: about 2.8k, so about 2800 cores are being requested. But you can also see the trend lines. Over the last hour it's moving down a little; the last two hours, moving down a little; the last four hours; the last day. In the last 24 hours we started at about 30,000 in the queue, and we're now at about 2,800. And over the last week you can see what's happened: people submitted a whole bunch of jobs, the cluster worked through the queue, somebody submitted a few more, and now it's working its way down again — so it should be clear again fairly quickly.

Those trend lines can help about as much as anything in figuring out where you stand in line, because the question people always want answered is: how long is it going to take? This will give you a good clue, especially if you're not part of a working group. If you are part of a working group that has contributed to Beocat and you have reserved nodes for your group, your wait time is generally going to be really short; you only need to wait for other people in your own group, if anybody from your group is ahead of you in line.

So, do you want to go through more advanced stuff? Does that sound like a plan, or is everybody asleep and ready to go home? I don't know, Dave — how much will I overlap with what you're doing?

[Dave:] What it looks like from a hardware point of view to have multiple cores, how you program for multiple cores, and then, if you want to program for multiple compute nodes, what that code looks like. I'll walk you through some specifics.
[Dave:] I've got some sample codes that you can look at and compile, if you have a Beocat account, so you can get used to compiling small codes.

I'm going to go ahead and go to the advanced page so you can see some of the things we have, even if we don't go through examples of all of them. I don't think I will, because it would be too much work for not enough reward, but we'll talk about some of the things we can do here. Again, this is all primarily off of the wiki: resource requests.

We have a few people using AVX; that covers some of the newer chipset features. We have a few commercial programs in particular that need what are called AVX extensions, so our oldest machines won't run them — they're compiled for newer CPUs than that.

CUDA: we don't have any machines right now that do CUDA. CUDA is GPU programming. If you've never done CUDA programming — don't. I didn't say that out loud. It is a pain in the rear, but if your programs are a good fit for it, you can really get some fast acceleration out of it. If they're not going to work well for it — like I said, it's a pain to program for, so if you're going to do that, plan on spending some time to get there. But we do have some machines with GPUs that we've taken out of Beocat and are going to put back in.
That's still the plan, right? Okay — we've gone back and forth on that a few times.

h_rt: that's the hard runtime, which you can specify either in seconds or in an hours:minutes:seconds kind of format.

Infiniband: Infiniband is one of those things that sounds really cool, and we've had a lot of people request it when they don't need it. Infiniband is a technology that lets one node talk to another node. If you're running some big commercial codes, they will tell you how to use it; or if you're doing Dave's MPI material, that generally talks over Infiniband. It's a very, very fast connection — that was the high-speed networking piece I showed you on the node I pulled out during our tour. So don't request it unless you need it. If you don't know whether you need it, you probably don't, but if you think you might, ask Dave or send it to the email list and Dave will answer you, because he's about as familiar with it as anybody. This page is like a dump from the documentation of everything you can ask for; most of these you won't be using. Ah, there we are — Infiniband.

Other things you can ask for: parallel jobs. Parallel jobs are when you have more than one machine working at once, usually talking over what's called MPI, the Message Passing Interface — that's how messages get passed from one place to another. Dave is going to talk a lot about that on Wednesday if you're going to be programming with it, but a lot of pre-compiled software you already have will take advantage of MPI. It lets one node talk to another so you can have more than one machine working on a problem at once. If you're going to be doing that, be sure to take a look at this page.
We have all sorts of ways you can ask for parallel jobs. "mpi-fill" basically tries to fill up one machine as much as it can before going to the next one; "mpi-spread" tries to spread across as many machines as it can; and then there are fixed layouts — one per machine, two at a time, four at a time, 80 at a time, all the way down the line.

The very important thing if you're going to request more than one core — I've said this before — is that the memory request is per core. So if you want 80 gigabytes of RAM total across 20 cores, you ask for 4 gigabytes per core. Don't ask for 80 gigabytes, because the scheduler will do 80 times 20, and that won't fit on any of our machines.

Email: we have a couple of options here that a lot of people take advantage of, and I usually use them for my jobs too. You can have the scheduler notify you by email when your job starts, stops, or aborts. That's really handy: "Hey, my job started, so I can expect it to be done in about so much time," or "It broke," with a reason — that kind of thing. Very handy to have.

I told you earlier that the reason my output file and error file went to my home directory was that I didn't tell the scheduler I wanted them in the current directory. You can use -cwd in here for that.

As a matter of fact, I'm going to go through one quick job script, because these are good to know. I have it in my Beocat intro folder, and again, it's world-readable, so if you want to come look at it and steal it, feel free. Okay, it fits on one page, so we'll less it. I tried to document it pretty well, so you can see what's going on.

There are two ways of submitting job options, and you can actually combine them. You can put them on the command line — you saw what I did with mine: qsub myhost.sh. If I wanted to give it a memory requirement, I would say -l mem=40G, for instance.
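The per-core arithmetic is worth making explicit. A sketch — the `-l mem=4G` and `-pe mpi-fill` syntax follows the talk's examples, so check the cluster docs for the exact resource names:

```shell
# Want 80 GB total across 20 cores? Request memory PER CORE:
total_gb=80
cores=20
per_core=$((total_gb / cores))
echo "ask for ${per_core}G per core"   # prints: ask for 4G per core
# i.e. something like:   qsub -pe mpi-fill 20 -l mem=4G job.sh
# NOT -l mem=80G: the scheduler would read that as
# 80G x 20 cores = 1600G, which fits on none of the machines.
```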
I can also put those options inside the file. When you create a script file, anything that starts with a pound sign is ignored by the shell — that's why you have all these lines up here; they're notes for whoever's reading it. But if a line starts with a pound sign and then a dollar sign, SGE, our scheduler, takes the rest of that line as a command for itself. So if I uncomment one of these so it says "#$ -l mem=1G", the scheduler treats that just as if I had typed it on the command line. Same thing for the runtime, for Infiniband, for CUDA, all these kinds of things.

So feel free to take this — like I said, it's out in my home directory; I have a Beocat intro folder with a few example files out there — and modify it for your own needs. People generally find it a lot easier to put memory requests and time requests into a file like this, because that way you don't have to remember them from run to run after you've submitted. I know it's hard to believe, but things don't always run right the first time: you get an error in your data or in the way you ran it, things like that. All you have to do is change that part and resubmit, instead of having to figure out, "Okay, how much memory did I ask for last time, and was that enough?" You can edit it all from inside the file.

You can also name the job: instead of showing up as myhost.sh, I could name it "what host am I running on," and that's how it will show up both in the job queue and in your output file names. And there are some special things in here showing how to do the email notifications, and options for MPI, which Dave is going to go over. I'm doing this more to show you where to find things than to actually demonstrate them. Does that help? Okay — last chance for questions.

This is in my home directory, under the Beocat intro folder, so you can copy it.
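A small job script in the same spirit as the one being walked through, with "#$" scheduler directives. The specific flags (mem, h_rt, -cwd, -N, -m) follow what the talk describes, but treat the exact resource names as assumptions to verify against the Beocat wiki. Since "#$" lines are ordinary comments to the shell, the script also runs fine outside the scheduler:

```shell
#!/bin/sh
# Plain '#' lines are comments; '#$' lines are read by SGE as qsub options.
#$ -l mem=1G          # memory request -- remember, this is PER CORE
#$ -l h_rt=1:00:00    # hard runtime: one hour (hours:minutes:seconds)
#$ -cwd               # write the .o/.e files to the submission directory
#$ -N what_host       # job name shown in the queue and in output filenames
#$ -m abe             # email when the job aborts, begins, or ends
hostname
```

Submit it with `qsub what_host.sh` (the filename is a stand-in). Because the directives are just comments to the shell, `sh what_host.sh` on any machine simply prints the hostname, which makes the script easy to test before queueing it.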
I'm going to show you here — I'll put this on the screen. If you run that command, it will copy everything from my Beocat intro folder to whatever directory you're in. Also, if you're in something like MobaXterm, you see up here at the top it says homes, kylehudson: you can browse there too, then into the Beocat intro folder, and you can see all my stuff that I have out there. You can just copy it over into your own folder or wherever from there. Yes? Windows? Yes, absolutely. I can take this whole folder — as a matter of fact, let's go up one level. I'm going to take this Beocat intro folder right here and just drag it over. There it goes. Yes. Other questions?

Well, thank you so much for coming, everyone. I know that sticking around past five o'clock in a classroom isn't always the most fun thing, so I appreciate your time, and if you want to go on to the more advanced material, we'll see you with Dave here on Wednesday.