Welcome to another edition of RCE. I'm your host, Brock Palen, and I have with me again Jeff Squyres from Cisco Systems, the Open MPI project, and the author of the wonderful MPI performance blog.

Well, there you go.

It is a fabulous blog, and I truly recommend everybody read the pearls of wisdom that drip out of it. It's actually a great place to get a view into what's happening in the upcoming MPI standards — you post comments on the deep, dark, secret world of MPI. We have a link to that blog off of the RCE website at rce-cast.com, and from there there are RSS feeds and the regular iTunes subscription link, if you're into that sort of thing. There's also a nomination form where you can recommend people we should talk to or topics we should talk about. And that's actually relevant to today: normally we talk about a software project, but today we're talking about something all of us who run a cluster had to go through at one time — we're talking to brand-new high-performance computing sysadmins.

Yeah, we're changing it up a little bit today. It shifts the focus off learning about new and interesting things in HPC, and instead we talk to people who are new to HPC, and see how we as a community are doing at welcoming new people and getting them up to speed on education and hardware and software and so on. Because, Brock, you and I have been doing this forever, and it's so easy to get jaded and to forget that it's actually our responsibility to bring new people in.

Yeah. So we were supposed to have two admins who were nice enough to agree to join us, but I'm guessing one is unavailable — probably because of a problem with his high-performance computing cluster. We've all been there; we've even canceled one of these recordings before because I had a system not quite literally on fire, but it was very, very close. So the guest we have on right now — and if our other guest becomes available, we'll call him in later — is André Gaultier from Yale, if I have that right. So why don't you tell us a little bit about yourself, and specifically, why did Yale decide to get into the high-performance computing realm?

Well, first off, I've been in IT for over a decade. I was a sysadmin before HPC, and I worked for a research lab that had big-time, huge compute requirements. They were constantly running out of CPUs and storage, and they had multiple discrete servers, so I saw an opportunity to fix their problem, I thought of HPC, and that's sort of how I got started. They had dramatic requirements for CPU and storage, they were I/O bound, and because they had many discrete servers there was constant manual load balancing going on — users shuffling from one server to another. As you can see, that would probably be a good fit for HPC, because a lot of their processes were not necessarily parallel but sequential: they would do parameter sweeps, hundreds at a time. So it was perfect for HPC.

So what department, what kind of field is this in? Is this somewhere HPC just doesn't normally occur to people?
When I first started in this lab — it's a computer science lab at UMass that specialized in search engines — Google wasn't on the scene yet, and a lot of people weren't talking about HPC. So it wasn't something someone would think of instantly as a possible solution to their dilemmas.

So the requirements for this cluster, then, were the ability to handle large I/O and to run large numbers of serial jobs, or perhaps even embarrassingly parallel jobs. Is that right?

That's correct. And I knew there had to be a better way of doing this, rather than manually shuffling users back and forth and moving data back and forth. So I came up with the idea of stringing a few computers together, I started looking at schedulers and things like that, and then I came across a lot of HPC material and figured that was the way to go.

Being I/O bound — that's a little different. Most people always think about: what's your top LINPACK performance, how many CPUs and how many cores do you have, and what's their clock speed?

Yeah, I wasn't at that level of HPC yet, right? It was more about how to solve their problems so they got better throughput in their research, rather than always waiting on I/O or having to move stuff around — a practical solution, rather than "I just want high performance." I'd also solve their problem of downtime on servers that people relied on. When I first started we had a bunch of big Solaris boxes; people would log into those, do their research, and run their processes, and they would do that on multiple machines, and sometimes hog more than one. Sometimes one machine would be completely loaded and they'd want to move on to another one, so they would have to copy their data over. So I thought, first we have to centralize these resources — and NFS was way too slow for what they were doing, or so it seemed at the time. This came to fruition over time as I thought about it: OK, I want to centralize the storage, and I want to create easy access for the researchers so they don't have to jump from machine to machine. I was really naive about HPC and sort of fell into it, and I started putting together what our requirements were going to be for this thing.

What requirements did you end up with? What was the final set of things that you started shopping around with?

Well, the fallback was NFS — that was the worst-case scenario — and at best it had to beat our local drives, which at the time were not that great, so I knew I could do it. With NFS back then I was getting not-really-horrible performance; we were just getting into GigE, so it was something like 20 MB/s at first, I think. And the local disk was not that great either: we had RAID 5, SCSI-attached. I don't remember what the RPMs were, but we weren't getting more than 40 to 50 MB/s.
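For readers keeping score, here is a back-of-the-envelope version of that comparison — a sketch, not a benchmark. The NFS and local-disk numbers are the ones quoted above; the per-process rate and process count are hypothetical placeholders you would measure for your own workload:

```sh
# Rough I/O sizing sketch. Baseline numbers from the conversation;
# per-process rate and process count are illustrative placeholders.
nfs_mb_s=20        # observed NFS over early GigE
local_mb_s=50      # observed local RAID-5 SCSI, upper end
per_proc_mb_s=2    # hypothetical per-process streaming rate
nprocs=100         # e.g. one wave of a parameter sweep

demand=$((per_proc_mb_s * nprocs))
echo "aggregate demand: ${demand} MB/s (NFS: ${nfs_mb_s} MB/s, local: ${local_mb_s} MB/s)"
# Any centralized storage has to beat both baselines under this load,
# which is why a dedicated storage network or cluster file system comes up next.
```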
So the new infrastructure that would be put in place could easily beat that — that was my worst-case scenario. I knew that NFS on a dedicated network could beat that performance even if I couldn't get a cluster file system.

So, like you said before, your main focus here was being I/O bound, because you understood your specific problem. One thing we like to talk about is your experience with vendors — getting quotes, even knowing what to get quotes for, or bidding things out. Here's what they always ask me: "Well, what's your application? We'll give you this set of tunings." I have 500 unique users on the cluster I run, and none of them run the same thing.

Right, yeah. That was in the beginning. As we matured a little more into that space, we knew exactly what our parameters were for the given processes. For the next-gen HPC for our lab, I had, say, 5 MB/s per process; if we had eight cores per node, that would be about 40 MB/s per node, and if you had X number of nodes, you multiply by the nodes and that's what our file system would have to deliver. But after that — after it was a success where I was working — other people wanted to mimic what we were doing, so we started building more generic, general resources, and that's where I had to generalize the performance more.

Is the group you work with now a generic research computing resource for Yale?

Yes, that's right. It's the HPC group for all of Yale — for all the research scientists at Yale.

So what kind of resources do you make available? Do you have a couple of large-scale general-purpose clusters, then?

We do. We have two that are online right now, and a third coming online soon that's in beta, and then we have multiple clusters that we support for specific groups.

So why did you go the route of having separate clusters, rather than, say, confining them inside a reservation concept, so you have one management domain or one point of login?

Yeah, I find this is a common problem in academia. When I was at UMass we had this situation where I wanted to create a common HPC group, a common resource — a common cluster. It was really difficult to convince other professors or groups that had specific funding from NSF to co-mingle with other research. There were political considerations; there were considerations of "will my resources be available when I need them"; and then there were the legal considerations, because when it's grants from NSF — can you actually share that with anyone else if it's dedicated to a specific type of research?

That's funny, because at the University of Michigan, Ann Arbor, where I work, we're fighting with this right now in the College of Engineering. We've done the monolithic cluster: users get cluster nodes that they buy with their own funds, either startup funds or grant funds, and they're bolted on, but they are theirs. We're going to an allocation model and making sure that — oh, was it called section 17? Section 14? Something like that — basically, NSF funds can go toward it. That's a lot of work that's still in progress, and it's amazing how much work has to be done just to get past the grant process before you even get a hardware bid. Are you cutting out there for a little bit?

I didn't catch the first part of that statement.
Oh — we're doing the condo model: we have one big cluster where everybody who buys nodes gets bolted on, but their nodes are dedicated to them. We put a Moab standing reservation on them so only they can use them, but to us it looks like one cluster: it runs one software load, with one scheduler, one resource manager, one set of logins.

OK, I like that.

Yeah, and we're looking at maybe an allocation-based model in the future. But anyway, what kind of hardware do you have in your — let's call it two and a half clusters, right? Two clusters and one more that's coming out of beta. What kind of nodes, how many nodes, what kind of interconnects? What did you end up getting for your general-purpose HPC?

On the two older clusters we have DDR InfiniBand, I believe, for the network — actually, it's split up: one cluster has regular GigE and the other has DDR. We generally go with chassis and blade servers; we find the power consumption is better, the cooling requirements are a little better, and they're a little easier to manage. I'm more familiar with the newer stuff, because I just started at Yale this year, and when I came on board that's what we were going to start working on — the equipment was just starting to arrive. So we have one cluster that's general-purpose: 128 nodes, dual quad-core Nehalems, 32 GB of RAM in each node, and they're diskless. The interconnect is QDR InfiniBand; we're using a QLogic switch as the master switch, and each chassis has a Mellanox switch. That Mellanox switch is actually pretty interesting: it's a module that sits in the chassis with 16 ports that face inward to the nodes over the backplane, and 16 ports that face outward to the network.

It's kind of funny, because the first system you talked about, how you got started down this road, was an I/O-bound serial farm.

Right — but that was at UMass. That's how I gained my background in HPC, just over time, and at some point I knew I wanted to be in HPC, so I moved on to Yale. But when I was at UMass, after those resources were put in place, after those clusters were put in place, the requirements changed. It was no longer that way. It was that way in the beginning, where people would submit 10,000 jobs that were parameter sweeps, but as they evolved to use the resources better, a lot of processes started to become more parallel, and the world came to HPC. By the time I was getting ready to move on, we were adopting Hadoop.

So were there new things you had to learn to come into HPC — for example, InfiniBand and OpenFabrics and MPI — particularly coming from a serial job farm and parameter sweeps? How high was the learning bar you had to clear to effectively deploy, understand, and manage this kind of stuff?

Yeah, actually, I've thought about that, and that's a good question. One of my major concerns was the OFED stack — the learning curve on that is huge. I was sort of in the dark before coming to Yale, so that was the thing I had to focus on, and I really started learning more about IB as we had issues with it.
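As an aside for anyone facing that same learning curve, a few of the standard OFED diagnostic utilities are a good place to start getting your hands dirty. The command names below come from the common OFED tool set; exact output and options vary by stack version, so treat this as a sketch:

```sh
# Quick InfiniBand sanity checks using standard OFED utilities.
ibstat          # local HCA status: port state (want "Active"), link width/rate
ibv_devinfo     # verbs-level view of the same adapter
ibhosts         # discover the hosts visible on the fabric
ibdiagnet       # fabric-wide sweep for bad links and error counters
```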
So, you know, you really don't learn until you get your hands dirty. But yeah, I found it daunting compared to anything else you have to learn in HPC. MPI, OFED, and IB are probably, to me, the most complex pieces of the puzzle.

I'll tell you, it's very complex software to write, too.

Right, right. Even if you consider scheduling — you'd think that would be complex, and it's not trivial either. I was an SGE shop, moved to a PBS Pro shop, and am now moving to a TORQUE/Maui shop. If you come from the SGE world, it's really difficult to wrap your head around PBS Pro's or TORQUE/Maui's way of doing things, because it's kind of different.

Kind of different? Yeah, they're totally different.

Right. The way I look at it, SGE is like learning Java before learning C++. When you look at Maui and TORQUE, you have all these parameters, and it's more broken down for you. I think it's a little more complex in the beginning to set up and configure correctly, but it's more powerful in the end.

So in regard to your experience of learning all this new stuff and being able to deploy a large-scale system for general-purpose HPC, give us one positive and one negative from your experience — something that worked out really well, and something that was just difficult to overcome, or difficult to understand, learn, and deploy.

OK. I think what was really difficult for me was wrapping my head around the customization here at Yale. I wasn't used to that, because I would stand up clusters with Rocks, and that's a really easy way of doing it: with Rocks you can throw in a DVD, hit go, and have a cluster up relatively fast if you don't have to do too much configuration. So I had to wrap my head around a lot of customization; that took some time and was very difficult. As an organization, I would probably recommend sticking with something that's widely adopted. We've since moved to Rocks on one of our clusters, and that went pretty well; we're looking at xCAT as a possibility. So that, I think, is very difficult — and I think it would be difficult for anybody coming into a new situation. That sort of speaks to standards.

So, one of the things that went well — dig deep, you've got to find something for us.

Well, setting up the storage went really well. We went with a vendor this time, rather than building Lustre from scratch, so it took no time to get that up and going. That was pretty cool. And when we moved to the new clusters, we moved to a new scheduler. That was a challenge, but I enjoyed that challenge, and I got to learn a new piece of software, which was TORQUE/Maui. I enjoyed that a lot, and I'm still learning it.
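To give a flavor of the "all these parameters" Andre mentions, here is roughly what a minimal maui.cfg looks like. The hostname and values are illustrative placeholders; the directive names are standard Maui ones, but check the documentation for your version:

```sh
# Minimal maui.cfg sketch (illustrative values, hypothetical hostname).
SERVERHOST        head.cluster.example.edu
ADMIN1            root                # users allowed full admin commands
RMCFG[base]       TYPE=PBS            # talk to TORQUE via the PBS interface
RMPOLLINTERVAL    00:00:30            # how often to poll the resource manager

QUEUETIMEWEIGHT   1                   # priority grows with time in the queue
FSPOLICY          DEDICATEDPS         # track fairshare by dedicated proc-seconds
FSWEIGHT          1
FSUSERWEIGHT      1

NODEACCESSPOLICY  SHARED              # multiple jobs may share a node
```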
I still have a ways to go as far as the Maui portion goes. We're not going to full Moab yet, so you sort of have to work around the shortcomings of Maui. But in general I thought it was pretty exciting — I got to learn a whole new layer of hardware. I think we had a lot of difficulty standing up our new clusters because we're pushing the cutting edge of new equipment with the QDR stuff, and we had some bugs we had to work through there. But the way I look at it, all these challenges are interesting and fun to me.

So I should admit here that I ran into Andre and invited him to be a guest on this show at MoabCon, where, I assume, you were getting spun up on TORQUE and Moab and Maui.

That's correct.

And I can tell you, as a place that uses Maui on two machines and Moab on our big machine: I know it costs money, but go to Moab. Oh my gosh — all the problems that have existed in Maui for so long. And people are still working on Maui; it's not like they're intentionally keeping Maui dumb. It's just that Moab is more fleshed out.

Yeah. I mean, I have to solve an issue with floating licenses, which of course Maui doesn't have an option for — it doesn't integrate with FLEXlm.

Were you the one who asked about that on the Maui mailing list? That was you? OK, that's funny — I'm on the Maui list too.

Moving on from this: so you ended up using all these things, but it also sounds like you're responsible for building, tying together, and running your own networking. How much integration do you actually do, and how successful have you found integrating with your normal central IT?

Yeah, I think this is an academic issue as well. Traditionally your networks are maintained by your network services and your central ITS departments or groups, and trying to pull that back into your own domain is not easy to do. Again, that has to do with centralizing our resources and making sort of one big HPC domain. I think that would be a goal of ours in the future, but it's a challenge — especially when you talk to other groups who don't really understand yet what it is that you do. But over time — HPC is growing fast, right? — people are starting to understand what it is, what you have to do, what the networking is all about and how important it is. So the understanding is being disseminated out there.

So you mentioned a couple of packages already — Moab and the scheduling and things like that. What other software packages are you using, both middleware and applications? And, to keep on the theme of integration: how well did they integrate into your existing environment? Does your MPI talk to your scheduler? Does your scheduler talk to your enterprise authentication system, things like that? How well did it all mesh together, and how much did you have to glue and duct-tape yourself?
Yeah, that's a good question. We use a lot of different versions of MPI, but we primarily support Open MPI, and of course that all broke when we put TORQUE and Maui in place: we had to recompile everything against the TORQUE libraries, because previously everything was pointing at PBS Pro, to integrate with the scheduler properly. And MPICH, as you know, does not integrate with any scheduler, so you have to institute some sort of cleanup script — you have to clean up the lurking processes on the various nodes. That's one reason I would like to move away from MPICH.

Now, how about other stuff, like your enterprise authentication and accounting, and any other central Yale resources?

Right, that was another concern, and actually it wasn't too difficult for us — it really was simple. We didn't have to worry too much about it; everything just kind of worked out of the box, so we got lucky there. We have our own central authentication service that we point the PAM modules to, and it's just not an issue.

I can tell you, at Michigan I'm the main user-support software guy, and I have a software library of a couple hundred titles: multiple versions of MPI, math libraries, and then a bunch of commercial codes too. The end-user commercial codes have been the biggest fight. If they support a batch system at all, they wrap around mpirun — they never let you just run mpirun yourself — and if they support a batch system, they support NQS. I don't even know what NQS is. Maybe I'm showing my youth here, but what the heck is NQS, Jeff?

Yeah, I've never heard of that.

It's a much older batch system that I don't think anybody uses — well, never say never, I'm sure there's somebody still using it out there — but it's one of these schedulers from, gosh, I don't know, the 90s? The 80s? I never used it myself, but I remember it was around right before my time. And in my experience, every application has its own way of launching processes, and none of them use TM or anything like it — TM is PBS's process-launching mechanism, which means you don't have to clean stuff up afterward.

So what do you do with these multiple instances of Open MPI, and different versions of, probably, GCC and things like that? How do you maintain user environments?

We use modules.

Yeah, that's what we do too. Modules are good.

Yeah. I wasn't using it at UMass before I came here, and I really find it useful.

It's funny — I've found that Modules is kind of the best-kept secret in the world. It tends to be used very heavily in HPC environments, but I haven't really seen it used elsewhere, even though it's not specific to HPC at all, and the guys who developed it had nothing to do with HPC — well, no, that's not true, it originally came out of Cray — but the people who work on it now are not doing it specifically for HPC. Still, at least from my limited little view from my foxhole, that's where I see Modules used the most.

Yeah, I was really shocked that I'd never heard of it before I came here. It seems to be a very effective tool.
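Tying those two threads together — scheduler integration and per-version user environments — here is a sketch of one common way to wire it up. The paths, versions, and module names are hypothetical; the `--with-tm` configure flag is Open MPI's real hook into TORQUE's TM launcher:

```sh
# Build Open MPI against TORQUE's TM interface so mpirun launches ranks
# through the resource manager (no ssh, and dead processes get cleaned up).
# Paths and versions here are hypothetical.
./configure --prefix=/opt/openmpi/1.4-gcc --with-tm=/opt/torque
make -j4 && make install

# Expose each build through environment modules; users pick one per job:
module purge
module load openmpi/1.4-gcc
mpirun -np 16 ./my_app    # under TORQUE, TM supplies the host list
```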
I ran into an interesting problem where one of our software developers came across a user who was doing a bunch of system calls in his C++ code, and each one was loading his environment — loading all of the modules — every time. So his system calls were something like four times more expensive than on his desktop computer, and he couldn't figure out why. I thought that was really interesting.

Well, this is a great transition. Let's ask: how are your users reacting to these HPC environments? Because it's not only becoming mainline in the enterprise world, it's becoming mainline in academia as well — but it's still new to a whole wide range of people. So how do you educate your users, and how well do they utilize your HPC resources?

Actually, I can't speak too much about Yale's environment, because I haven't even been here a year yet, and my interaction with the end users has been pretty limited until now. But they seem to catch on really quickly; before you know it, they really get HPC, and they're digging down into MPI and doing all kinds of crazy things. So I'm really impressed with how fast the user community here gets up to speed on HPC. And we always get new users we just weren't expecting — people from forestry, for example. So the word is getting out there, and I think something we're moving toward now is doing more outreach. I think that will be part of the group in the future, and I think it should be part of every HPC group, because for now it's not something a researcher will have a background in when they want to do research and haven't experienced HPC. In the future, though, a lot of universities are starting to teach this — how to distribute your code — and even in the general sciences, I think they're starting to teach it a little bit.

Yeah, I teach the introduction classes to PBS, and even some concepts in MPI, and I just finished a round of sessions where I had 90 people come, at our institution alone. I do this three times a year — once every semester and then once for spring/summer. It's amazing, people's willingness to ask questions in person, and what a great way to really facilitate research. Because we have to remember, as admins, we're not just running boxes and keeping our heads down — we're facilitating research. In the wise words of my boss, who came up with this great line: we don't like computers, we like computing, and if we have to have some computers to do it, well, damn it, we'll have some computers. Everybody wants to have the big box, but if it's not enabling anything, what are we doing? Now, Jeff, your situation is a little different — we want you to have a bunch of computers and make sure this stuff works, so then we can use it to enable science.

Yeah.

So, Andre, we're running a little short on time here. Let me ask you one kind of wrap-up question: knowing what you know now, what would you recommend to somebody who's just starting down this path from the admin side? What should they learn? Where should they go look?
What should they pay attention to?

That's a good question. I think you have to be a very curious person to be in this field, and I think you have to have a sort of wide brush of knowledge of science — a general knowledge of science — because there's so much different research that happens on HPC, and having the ability to speak the language is a big help. When you're in college, I think chemistry is important — because of bioinformatics, chemistry and biology come into play — and your physics, and of course your computer science. I think those are important. And something that gets overlooked and taken for granted is your communication skills: there's a lot of writing that goes into our job — we do a lot of documentation and a lot of communication with the end user — and that's very important. Those skills tend to get overlooked, so to me they matter, along with your general curiosity and your willingness to learn new things, because HPC is always changing and evolving, and if you're not a curious person, you're just not going to grow in this field.

As far as what to look at: you need to know your networking — get strong on your networking. InfiniBand is completely different than TCP/IP, of course, and there really isn't a lot that translates between the two of them, but if you can come across some IB resources, that's great. And then, of course, your scheduling — I think learning Maui and TORQUE is really useful. Also think about, if you were to build your own cluster, how would you automate that? That's one thing you might want to think about. You can look at Rocks and how they do it, but having an understanding of how that works, or how you would want to approach it, is an important thing too.

OK. Well, Andre, thanks for your time and for sharing your experiences with us — how you transitioned from regular IT, what you found different in the HPC and research computing world, and how you progressed from a serial farm to a parallel, shared resource for your entire institution.

One thing, if you don't mind me adding: a great resource I've found, of course, is the mailing lists for all the different pieces of middleware we've talked about — Open MPI, TORQUE, SGE. They all have mailing lists that are great resources, full of people who have run into lots of problems. And of course there's always the Beowulf list, which is great for standard getting-started questions.

So, Andre, thank you very much. Jeff, thank you for your time. This show will be up soon. Thank you.

All right, thanks.