Hadoop is kind of its own beastie, and Adam's going to cover that, because I don't know enough to talk about Hadoop. The first thing I'm going to talk about is array jobs. I see Nora smiling over there at this one. Nora smiles at everything; that's all right. I was hoping she'd be here for this part, because I wanted to ask her how much time she saved with it.

The way an array job works: what we find is that a lot of people submit the same job over and over, just with different data. So instead of creating a job that says "run this program with this bit of data," then another that says "run this program with that bit of data," and so on, it's a lot easier to say "run this program with whatever I tell you next."

Here in my Beocat intro directory I have files I adapted from work Nora and I did together. I have a submit_array.qsub, and I'll show you what it does. It's just a quick script I put together, based on some R work I was doing with Nora. I set my memory requirement, my runtime, and the job name, and the output and error files are tagged with the task ID; the task ID is one of the variables SGE can put into the script. Then I tell it to run my program, array_example.R.

In array_example.R I set up some constants, including a job number. The job number is grabbed from an environment variable: SGE creates what's called a task ID when you submit an array job. Say I submit the job with numbers 47 through 102; the first time it runs, the task ID is 47. In bash you'd just read $SGE_TASK_ID. This happens to be R, so you have to pull it in a different way. Then I build a file name from it, setup-<job number>.R, and run the rest of my program.

That other file is where I keep the stuff specific to each job. What did I call it... there it is, setup-1.R. So if I had 47 through 102, I'd have setup-47.R, setup-48.R, setup-49.R, all the way down to setup-102.R, and each one can hold slightly different data. It could reference a data file, or it could be the data itself. There's no size limitation; this one just happens to be small because it worked out well.

Now, every time a job gets submitted it has to go through the verification process: yes, this is a valid program, you're a valid user, all that kind of thing. If you create a whole bunch of separate jobs for different data, it has to verify every single one that comes into the queue, and that can take a while. When you submit an array job, it does that one time and says, it's all there, we're all happy.

Nora, how much did that change your time? "Just the submission. I was submitting one job for each of the 120 things I needed to run, and just the submission took over half an hour. With this, I can now send five or six of those batches of 120 in no time. It really helped tremendously. Particularly with those dead, short stretches of time you can't do anything with: you have to stare at the screen for 15 minutes before you can move on to the next step, but it's not enough time to concentrate on anything else. So that was great."

So that was the user testimonial I was hoping to have here. We took a lot of her stuff and moved it to array jobs for exactly that purpose.
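To make that concrete, here is a minimal sketch of what a submit script along those lines can look like. The resource names (mem_free, h_rt), the values, and the file names are illustrative rather than the exact script from the demo; check the Beocat documentation for the resource names it actually expects.

    #!/bin/bash
    #$ -cwd                        # run from the directory the job was submitted from
    #$ -l mem_free=2G              # per-task memory request (illustrative value)
    #$ -l h_rt=4:00:00             # per-task runtime limit (illustrative value)
    #$ -o output-$TASK_ID.txt      # separate output file for each task
    #$ -e error-$TASK_ID.txt       # separate error file for each task

    # SGE sets SGE_TASK_ID to 47, 48, ..., 102 for the individual tasks when this
    # script is submitted as an array job with:  qsub -t 47-102 submit_array.qsub
    Rscript array_example.R

Inside array_example.R the task number comes from that same environment variable; in bash it's just $SGE_TASK_ID, and in R you'd read it with something like Sys.getenv("SGE_TASK_ID") and paste it into the setup-<number>.R file name.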
Array jobs don't make a whole lot of difference to the scheduler; it's not going to notice much either way. They do make a lot of difference on your end in how much time submission takes. And they also allow you to create 1.2 million jobs if you need to. We actually had a CIS grad student this last year who, at least once a week, and usually spread over two or three days, would submit 40,000 jobs. They were small jobs and they ran through really quickly, but you could watch our scheduler and see a whole spike of 40,000 at once, then they'd work themselves off, then another big spike. We like seeing that type of thing. But the time involved in submitting those one by one would be overwhelming; Nora talked about the time it took for 120, and 40,000 is another step beyond that. And it's all done with that one environment variable, SGE_TASK_ID, which you can use however you need within your job.

The way you do it: on the SGE page of the website, right at the very top, there's a section on array jobs. It tells you the limitations, provisos, quid pro quos (excuse any Latin in there; nobody ever taught me any anyway). And it tells you how to submit those jobs, which is -t n-m, n being the first task number and m being the last. There are step sizes and things like that you can do too, but most people will just say 47-102 and that's how it will run your program. It's a very handy tool, and if you know it can be done but the web page confuses you, just give us a call. We'd much prefer to see one job in the queue with lots of parameters on the end, because at least that way we know what's going on, as opposed to 100 jobs in the queue that may or may not be the same thing.

The next thing, also on that page, is the number of cores you ask for. (Come on, find it. I think you missed it. Did I? Fine, I have it on my slide. There.) Once again, it uses the NSLOTS environment variable. In this case I was using the single parallel environment, everything on one machine, but I said I can use 2 or 3, or 5 through 8, or 10, or 16 cores. Most of the time it'll be a power of two, or a range like 5 through 8, but this shows you the power of the option. Instead of putting down a single number, "I need 16 cores," you give a range, and the scheduler may be able to start your job a lot sooner, because it can say, well, he can run on 8, or he can run on 4.

This works really well with OpenMP. When you request cores that way, SGE sets the NSLOTS environment variable to however many it actually granted, and you can hand NSLOTS to OpenMP as its maximum number of threads, so OpenMP uses whatever it has available. It's not limited to OpenMP, but that tends to be the best use case. And again, it's useful because it may schedule your job more efficiently: if the scheduler happens to have 16 cores open it'll give you 16; if it only has 10, it'll give you 10.
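A rough sketch of what that looks like in a submit script, assuming the parallel environment is named "single" as in the talk; the range and the program name are made up for illustration.

    #!/bin/bash
    #$ -cwd
    #$ -pe single 4-16            # accept anywhere from 4 to 16 cores on one node;
                                  # a list like 2-3,5-8,10,16 is also accepted

    # SGE sets NSLOTS to the number of cores it actually granted, so hand that
    # to OpenMP and the program uses exactly what it was given.
    export OMP_NUM_THREADS=$NSLOTS
    ./my_openmp_program           # hypothetical program name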
So it's a way of getting your job started faster. Any other questions about this before we go to Hadoop? Yes? "If it starts when, say, 10 cores are available and 6 more become available later, will it increase to 16?" No, not once the job starts running; whatever it starts with, it keeps running with. However, when you couple that with array jobs, each task is scheduled on its own. So if you've submitted an array job with 100 different tasks and it schedules the first one with just a couple of cores, it can schedule the next one with more; the different tasks within that one array job can end up with different numbers of cores. Does that make sense? "Can you do the same thing when you request a range for, say, time?" No. That would make sense, but no.

Any more questions before we move to Hadoop? "Say I request 10 CPUs in my qsub, but in my program I actually tell it to run 16-way parallel. What's going to happen?" It will try to use 16, and you'll probably get some angry sysadmins killing off your job if it gets too aggressive. "Because sometimes I forget to change that." One of the things that will happen is, if you're not bothering other users, say you asked for four cores and your job accidentally used eight or twelve or sixteen but you're the only user on that node, we might just keep an eye on it and say, well, he's only hurting himself, let it run. But if we notice your job is harming others, and this happens a lot with memory, for instance, where somebody asks for eight gigabytes, starts using 200, and starts pushing out other users, then we will terminate it with prejudice and send you a nasty email.

Anything else? Oh, one quick thing: CUDA. We have material on the website for doing that. I'm not going to go over it here, primarily because it's a big pain in the rear and I've shed my share of tears over it too. GPU computing is very much like MPI: the cost of getting data into and out of the GPU is really expensive, but once it's in there, it's very efficient. You compile with the nvcc command, and you submit with qsub saying you need CUDA as one of your resources. Adam, okay, I'll turn it over to you. Here, you can even have the mic.

This should be interesting: I finally got Hadoop working on Friday, so I've done minimal testing. Our Hadoop setup is brand new. We've set aside 10 nodes, with about 10 terabytes of disk space, 160 cores, and 640 gigabytes of RAM, specifically for Hadoop jobs. Previously we were forcing users to submit a Beocat job that would set up an entire Hadoop environment within that job just for them. That was problematic, mostly because using multiple nodes in that kind of setup is really painful, getting the right number of tasks is really painful, and even making sure you're not stepping on other users is painful, because Hadoop likes to use as many cores as it can. It all becomes very painful in terms of scheduling and being useful in general.

For those who do not know, Hadoop is a MapReduce framework; that's its main paradigm. You write jobs that split and sort the incoming data by key into smaller sets to be processed, those sets get processed in parallel, and the result is a summary of the kinds of information you're trying to pull out.
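If it helps to picture the flow, a plain shell pipeline does the same three things on a tiny scale. This is only an analogy for the map, sort, and reduce steps, not something you would run on the Hadoop nodes, and input.txt is a made-up file name.

    # "map": split the input into one word per line, so each word acts as a key
    # "sort": bring identical keys together
    # "reduce": summarize each key, here by counting how often it occurs
    tr -s '[:space:]' '\n' < input.txt | sort | uniq -c

Hadoop does the same thing conceptually, except the map and reduce steps are your own Java classes, the sorting happens across many machines, and the input can be terabytes instead of one small file.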
Some people will be using this for network flows. You've got information about all of the networking traffic across campus; you can put all of it into Hadoop and try to pull out information about which direction it's going. Is it coming in? Is it going out? Is it heading off to the cloud? Are we doing more HTTP traffic than SSL traffic?

Within a MapReduce job you have a jar file; that's the job itself. It's all written in Java, and these are the technical terms for the pieces. You have a job class, which defines what you need: the configuration parameters, how many threads you want to run, how much memory you want each task to have, that kind of thing. You have a mapper class, which sorts the data by key, and you have a reducer class, which consolidates and summarizes all of your data.

In this case you've also got your own separate file system. Your home directory is not in Hadoop at the moment. To put data into it, you do a hadoop fs put and name the file; you also need to pull results back out with a get. And because we're limited to 10 terabytes, please clean up your Hadoop files. If something is no longer needed, get rid of it.

Let's start with a Hadoop example. Let me log in over here and get my session set up. Okay. Because this is a new setup, we actually have a separate head node for Hadoop. We've got Athena, we've got Minerva; to get into the Hadoop stuff, we SSH into Thea. I haven't had time to put all this up on the wiki yet. It will be there, but it's not there yet. So, log in to Thea. You can see your normal home directory is mounted there, but this is where you log in to do Hadoop things.

Now, first we make a directory: hadoop fs -mkdir (can you see that or not? this is better), and we make this directory, data.in. I can't type on this keyboard. Okay, we've now created a folder called data.in in my Hadoop file system. You can check that by doing hadoop fs -ls on /user/ followed by your username. If we take a look here, I have four or five things in there: a stage.in, a data.in, a dna.in, and an output folder.

Now we want to actually put a file in there, so we do hadoop fs -put, the file (dna-medium, if I can actually type), and data.in. I'm just putting a big DNA file in there. Hadoop currently runs with a replication factor of three, so it keeps three copies of any data you put into it. That's partly so you can always get to it, and partly because of the way it runs: it tries to start your map and reduce tasks as close to your data as possible, and more copies gives it more places it can start those tasks. So we've done the put; I'll look in data.in to make sure the file got put there, and dna-medium is in data.in.

So now, hadoop jar. We're getting ready to run our actual job. The jar is under /usr/lib with the Hadoop MapReduce stuff, the Hadoop examples jar; it's a really noisy path. In this case I want to run a word count on data.in, and since it asks for an output location I give it one as well. Word count is not capitalized in this case; sometimes it is, sometimes it isn't. So we're going to run a job.
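Pulled together, the sequence of commands from that demo looks roughly like this. The file names are the ones from the demo, the output directory name is illustrative, and the exact path to the examples jar is an assumption; check where it lives on Thea.

    # make a directory in the Hadoop file system (HDFS); this is separate
    # from your normal home directory
    hadoop fs -mkdir data.in

    # copy a local file into HDFS (stored with three replicas behind the scenes)
    hadoop fs -put dna-medium data.in

    # relative paths live under /user/<your username>; check that the file landed
    hadoop fs -ls data.in

    # run the example word-count job: input directory, then output directory
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount data.in data.out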
Here it reads in that jar file; that's just an example jar. It goes through, opens up the files, and counts all of the unique words in them. So it takes all that data, and you can see here it's doing the map phase, sorting all that data by key; in this case there will be about 112 of them, just because that's about how large this job is. And now it's starting to reduce and summarize that down.

Most of the people I've seen using Hadoop have been running their own code, so you're going to want to get in and learn the framework if you want to use Hadoop. I'd use this if you have I/O-intensive jobs, mostly because it tries to keep the computation as close to your data as possible, so you end up with less network traffic and less I/O overhead. That kind of thing is where this is at.

And it just finished; it took a couple of minutes to run through it. The medium DNA file was almost a gigabyte, and it processed that and gives a word count. When you're done with that, you do a hadoop fs -get, give it our output folder, and give it a name, wordcount.out for instance, and that pulls the entire folder out so we can look at what's in it as soon as it finishes. But remember, the Hadoop file system and your normal home directory file system are separate, so you really do have to move data in and move data out between the two. So now we pull down wordcount.out and we have this folder. There's a _SUCCESS file, which is just created to mark that the job finished, and part-r-00000, which is the output file this creates; it tells you the counts in the format they decided to output.
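To recap the output-retrieval end of the demo, the commands look roughly like this. The directory names follow the demo, the output directory name (data.out) is illustrative, and the delete flag differs between Hadoop versions, so treat this as a sketch.

    # copy the whole output folder out of HDFS into your normal home directory
    hadoop fs -get data.out wordcount.out

    # the folder holds an empty _SUCCESS marker plus the reducer output
    ls wordcount.out                    # _SUCCESS  part-r-00000
    head wordcount.out/part-r-00000     # one "word <tab> count" pair per line

    # HDFS space is shared and limited, so remove data you no longer need
    # (newer installs use "hadoop fs -rm -r"; older ones use "hadoop fs -rmr")
    hadoop fs -rm -r data.in data.out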
Do we have any questions? Anything else you want to know about Hadoop, or how to use it? As I said, my documentation isn't up to date yet, just because I finally got this working on Friday, but it will be up shortly.

"What jobs do you envision being able to make optimal use of Hadoop?" That's a good question. Generally, the types of jobs that have lots of data but a very low signal-to-noise ratio: there's a lot of noise in your data set, and you want to pull out just the stuff that matters to you. Like the network flows I was talking about earlier, where we can get data from all the network traffic across campus and where it's going, but we can't analyze it or make use of it in real time; it's just too much data for anybody to use. So you have to go through and parse out what's useful, and that's what this is supposed to be good at. Honestly, I've never used it for anything more than these tests, but we do have several people here interested. We've got at least one user in the department who teaches a class on this, and in fact the whole reason I'm setting this up is that the old approach of building a Hadoop environment inside a regular job was causing another user's jobs to crash, so he wasn't able to do his research.

Typically Hadoop is used for large-scale data mining jobs: you've got a lot of data, but each individual piece of data doesn't really need a lot of processing. Hadoop divides the work over the data, spreads it across a parallel file system so you get lots of bandwidth, because you're using lots of hard drives, and runs the processing right there.

Google, for instance, did one of the original implementations of MapReduce because they're mining links: where do these links go? And they've got petabytes and petabytes of data. Hadoop does great things for that. If you need to run 5,000 copies of something under a normal scheduler, the odds of 5,000 jobs simultaneously working together on huge amounts of data for a long period of time are about zero. Hadoop provides an environment where, if a task dies, it will automatically restart it, or it will say, hey, you're done now, go off and work on something else. It does a lot of the hard work of managing really big (not database, but data-intensive) calculations on a large scale.

Our system really isn't big enough to take full advantage of that, but a lot of our users, particularly in bioinformatics, are looking at it. Either a lot of the colleagues they're working with are using Hadoop, so we want an environment where they can be compatible, or they're saying, let's build model systems on this relatively small Hadoop system so we can go to partners with much larger systems that are using Hadoop to do these data mining, big data, data-intensive jobs. Beocat is set up more for compute-intensive tasks, and Hadoop is more for data-intensive tasks. So, for instance, if we get the NSF MRI I've proposed, where we're asking for four petabytes of disk and such, that's much more aimed at Hadoop-style environments, and we'd be at a scale where it'd probably be worth it. Right now it's mostly just a headache for Hadoop.

Any other questions? Any questions about anything up to this point, or about the other sections? I would say, now that you've had a chance to attach some faces to eIDs: so that's Hadoop, that's Kyle in the back there (have we swapped?), and I'm Dan. Please, we typically aren't terribly good at telepathy, so if you're having difficulties, let us know. We'll see if we can update our documentation, get together with you, help train other users, and solve issues. Sometimes we may have to say, well, we can't, but we'll sure try not to do that.

The other thing is that there will probably be a survey coming out in the next day or two, I'd guess, just asking, hey, what do you think, where can we do better? As I've said, feel free to be brutally honest, in both the positive and negative senses, and let us know, because this is the first time we've done this and we're going to be doing it again. If there's something we can do to make it better, that'd be good to know. So thanks, we appreciate your time; we don't take it lightly. We're always happy to update our documentation. I know it can be a wall of text sometimes, but it's written from where we sit, and it's hard to see it from a different perspective. So if you have any better ideas that would help you learn, or that kind of stuff, we would love to try to incorporate that.

And for professors and PIs and those writing papers: if you use Beocat significantly for a paper, please cite us and cite our grants; that makes my overlords at the NSF a whole lot happier. There are directions on the Beocat website for how to do that. And if you feel like you'd like more priority for your jobs, talk to your funding agencies and your professors and say, hey, how about sending a little cash Beocat's way; that helps your priority on the system.
In some sense, we are very corruptible. Thank you very much.