Welcome to another edition of RCE. This is Brock Palen, and I have with me Jeff Squyres from Cisco Systems, Open MPI, and also hwloc, which happens to be our topic today.

Yeah, we're kind of cheating on this one, aren't we?

Yes, a little bit.

We're interviewing one of the other developers from hwloc even though I'm one of the developers too. But if anybody looks at the Subversion commit log, you'll see that this guy has written a whole lot more of the hwloc code than I have. As a matter of fact, I believe it was part of his thesis work. I've been doing the ancillary build system and configure system, and he's been working on the heart of it, so he can speak much more intelligently about this stuff than I can. So we're cheating, but only sort of.

Before we introduce our guest: we have a website, rce-cast.com. There you can nominate new shows, pick up old ones, and subscribe through iTunes. There's an RSS feed. So please stop by there and pick up any back shows.

Also, I've got to throw in the plug for my own blog, the MPI Bcast blog. It's on blogs.cisco.com, and there's a link off the RCE-Cast site. I try to get about one entry a week out there, and I try to say something at least nominally interesting. Last week I talked about traffic and my favorite acronym that I'm trying to get people to use: NUNA, non-uniform network architecture. Because it's all about the networks these days. There are networks inside the servers, right? So that makes it more complicated. It's good stuff. NUNA: start using it in casual conversations.

Well, actually, the software package we're talking about today, hwloc, is for understanding the topology inside of a server, I believe. So let's go ahead and introduce our guest and he can give us a better idea. We have Samuel Thibault, who's from France. So Samuel, how about you introduce yourself, say what your work is and how you got started on hwloc?

Yeah, sure.
So I'm a researcher in Bordeaux, between the University and the INRIA institute. I'm working mostly on runtime support for parallel machines, be it multicore or accelerator-based. That's why I was interested in topologies and locality in machines. On a side note, I'm also a Debian developer focused on accessibility topics, in particular for blind people.

So I feel compelled at this point to give the standard disclaimer that we're ugly Americans and we're horrible at pronouncing other people's names, other languages, and so on. So Samuel, can you pronounce your name as you say it? Because I'm quite sure that we're not saying it properly.

In French we say "Samuel", with a "u" which is difficult to say in English, indeed.

Okay, we'll try.

No problem.

So Samuel, can you go ahead and give us a rundown of what hwloc is? We just mentioned NUNA and networks. Just give us a 10,000-foot view of what hwloc is and what it does.

Sure. It's actually a set of tools to detect how HPC components relate to each other: how processors share caches, memory, network boards, and things like this. And then tools to bind the HPC applications accordingly. It doesn't provide the intelligent part of the mapping; it provides just the portability of the tools to actually discover and pin things around. Although we do provide some examples for people to get inspired from.

So actually, let me extend that definition just a little bit. I think this is more widely applicable than just HPC. I think there are quite a lot of applications in the enterprise and elsewhere that, as we're going to more and more cores in a system, actually care about this stuff. So I think this is not just an HPC kind of thing. HPC is where we're going to cut our chops and make our name, but I think it's going to be wildly popular after that. Just my opinion, though.
Yeah, I mean, with modern machines you're now looking at 48 cores in a regular 2U server, so it's getting kind of crazy. So Samuel, can you tell us where the name hwloc came from? I guess this is an interesting story.

Well, it's actually a very long story. At the start, during my PhD, I was thinking about "libtopology", simply. The thing is, there was another libtopology that came to exist afterwards, so we had to change the name, and we had a lot of discussions. You can find on the list and on the wiki the various names we invented, some very difficult to remember, some very difficult to Google. Eventually we thought that just "hardware locality", abbreviated into hwloc, does Google quite fine, and it's good enough for what we needed.

So why exactly, and I guess Jeff, you could probably answer this question too because you've written a number of blog posts about this, are hardware locality and process affinity important in modern systems?

So mostly, for our part, we believe the important thing is portability of performance. There are, for instance, HPC libraries that like to bind threads to get better performance, but they need a portable way to do this because it's not standardized. So hwloc does provide it. More than that, you need to be aware of the architecture of the machine, because if you consider it as a flat set of CPUs, then you would get congestion on the links between processors. So it's better to split and distribute the application according to the underlying machine.
So be it for communication between processors, or spinlock strategies, or talking with network boards and things like this.

So, to give another disclaimer here, I think I'm going to go back and forth between being interviewer and interviewee. Let me add just a little bit more to everything that Samuel just said. With more and more cores in a machine, what is easy to forget is that there's actually a network inside that machine. Most people don't think too much about it when they're writing an enterprise-class application. They just write their application, they don't care, and it gets some baseline performance, and if they need to, they optimize it a bit to get the performance to where it needs to be. But with these larger and larger machines, it might be pretty easy to shoot yourself in the foot performance-wise, because you're not careful and you don't realize that there's actually communication going on just in writing a normal serial or even a multi-threaded app. That communication could get pretty intensive, and you could be really hurting yourself, or hurting the overall machine performance, if you're not aware of the locality of your processing and the locality of your data. So hwloc is really all about making the topology of the machine available to the programmer so that they can make some intelligent decisions about how they're going to run and how they're going to organize their data locality. It's just like real estate: location, location, location.

Yes. So does hwloc just give you information about topology, or does it actually give you the functionality to do affinity?
Well, it does both. The thing is, it separates the two notions so that applications can just sit between the two and decide according to what they need. The problem is that you cannot really build a tool that, say, distributes your application over a machine, because it doesn't know how the application is structured. So that's why hardware locality provides the structure information, lets the application apply whatever interesting heuristic it has for mapping between the application structure and the machine, and then lets it use the tools provided by hwloc to actually enforce some binding or some strategies for communications.

So what kind of information about the topology does it give you? Is it just the number of cores and which cores share a socket, or does it give you more information than that?

Well, the basic information that I wanted to put into it when I first designed it was the tree of objects in the machine. So basically the base structure for it is a tree of objects: the NUMA nodes, sockets, caches, things like this. Then there are some attributes, for instance the cache size or memory size. From that structure you can, for instance, enumerate the sockets, or know which cache is shared between two processing units, and things like that. So the idea was to have a very simple structure that people can use the way they prefer.

So how is this information delivered? If I'm a programmer or I'm a system administrator or something, how do I use hwloc? What's available there?

You have several ways to access it, actually. Either you can just use the tree: you start from the root, which is the whole system, and then you go down to find out there are some NUMA nodes, and then inside the NUMA nodes you have sockets, and so on. Or you can browse it by level. Say you look first at the system, then you see that there is a set of NUMA nodes, you know there are four of them, and then you see sockets.
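The tree described here, and the two ways of walking it (from the root down, or level by level), can be sketched in a few lines of Python. This is a hypothetical model for illustration only, not hwloc's C API: the class `Obj`, the helpers `levels`, `objects_of`, and `ancestor`, and the example machine (four NUMA nodes, sixteen sockets, one cache per socket shared by two single-threaded cores) are all invented.

```python
class Obj:
    """One node of the topology tree (invented model, not hwloc's struct)."""

    def __init__(self, kind, **attrs):
        self.kind = kind        # e.g. "Machine", "NUMANode", "Socket", "Cache", "Core", "PU"
        self.attrs = attrs      # e.g. size_bytes for a cache
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def build_example():
    """Invented machine: 4 NUMA nodes x 4 sockets, each socket has one
    cache shared by 2 cores, each core has 1 processing unit (PU)."""
    root = Obj("Machine")
    for _ in range(4):
        node = root.add(Obj("NUMANode", memory_bytes=8 << 30))
        for _ in range(4):
            socket = node.add(Obj("Socket"))
            cache = socket.add(Obj("Cache", size_bytes=4 << 20))
            for _ in range(2):
                core = cache.add(Obj("Core"))
                core.add(Obj("PU"))
    return root


def levels(root):
    """Browse by level: result[d] lists every object at depth d, left to right."""
    out, current = [], [root]
    while current:
        out.append(current)
        current = [child for obj in current for child in obj.children]
    return out


def objects_of(root, kind):
    """Jump straight to one kind of object, ignoring what lies in between."""
    return [obj for level in levels(root) for obj in level if obj.kind == kind]


def ancestor(obj, kind):
    """Follow parent pointers up to the nearest enclosing object of a kind."""
    while obj is not None and obj.kind != kind:
        obj = obj.parent
    return obj


if __name__ == "__main__":
    machine = build_example()
    for depth, level in enumerate(levels(machine)):
        print(depth, level[0].kind, len(level))
```

Keeping both parent pointers and a per-level list is what lets a caller ask either "what is inside this NUMA node?" or "which NUMA node is this processing unit in?" without caring about the levels in between.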
You see that there are 16 of them in total, and you do not necessarily consider the inclusion between NUMA nodes and sockets. Or you can directly access some level. For instance, a lot of applications may just want to know which processor is included in which NUMA node, so they will access the NUMA node level directly and then see from there which processors are available.

All right, so you're talking about data structures. So this is a C API, right?

Yes. It's basically structures with pointers pointing at each other, and an array of arrays to get access to particular levels of the same kinds of objects.

But say I'm not a programmer, I'm a system administrator, a scripting kind of guy. There are probably some tools available for that too, right?

Yes, yes. Hardware locality was initially meant to be a library, but it's also quite useful for us, for instance, when we buy a machine and we want to know the structure of the machine. So we have a tool that shows the structure in a graphical way. We also have tools to get a textual output, so you can mix that into scripts and get the CPU masks and then apply them to some processes, or use CPU sets, etc. So we have both aspects: the library for applications, and also tools for administrators and things like this.

Yeah, let me give a little anecdote. I was actually on a support call this morning with one of my customers. I can't tell you who the customer was, but they're a very large disk maker and their name rhymes with "Schmidt App". We were doing some things on some Cisco servers and they wanted to verify some processor settings. And I said, "Well, hey, you know, there's this great tool called hwloc." And they said, "Oh, what's hwloc?" So we downloaded it right there and built it, which took about 30 seconds, and we ran it. I have to tell you, their reaction was just almost comical to me. They said, "Holy criminy.
This is great. We need to get this installed on all of our machines." And they were just going off. They were like, "Wow, you can get it in XML too? Holy criminy, we can script this up." They were just thrilled with it, and it was just fabulous to see that kind of reaction.

I can think of some applications. VASP is the one that comes to mind. It's a Fortran code, but normally it's easy to make a wrapper around C. When you compile it, you need to tell it, through the preprocessor, how big your cache is for their FFT routine, and then it's a static thing. Could they just use hwloc to ask, "Hey, how big is the cache on the thing I'm running on?"

Yes. What you can do is just ask hwloc for the structure of the topology, and then you can browse it and see the various cache levels, L1, L2, L3, and whatever cache levels there might be in the future. You get the size and how many processors are sharing it, which can be important when you actually run some computation there.

So, a little bit of a clarification. With the affinity ability, using hwloc to pin, say, executable things on a Linux system: is it working with processes or with threads? Because they can be different.

Well, on Linux we support both, actually. There are some operating systems where we do not support all kinds of bindings, but on Linux we can either bind threads individually or bind all the threads of a process, be it the current process or another process.

So you said other operating systems. What operating systems does hwloc support?

Well, actually quite a lot, as many as possible, probably. Currently we support Linux, Solaris, Windows, AIX, Darwin, FreeBSD, OSF (that is, Tru64), HP-UX, and Kerrighed. So we are basically missing IRIX support, well, just because we don't have an account.
Maybe somebody listening to this would have an account for us. The thing is, our code is quite independent from the architecture itself; be it x86 or Itanium doesn't really matter. Except that in the x86 case we have some code for operating systems that don't provide enough information. In that case, on x86, we can request the information from the processor itself.

So that's actually pretty impressive, because the affinity stuff varies per operating system. Not every operating system uses a standard way of specifying affinity.

Actually, the idea of the library didn't come immediately. My PhD, which was about mapping trees of threads onto the tree of the machine, made me write that topology discovery thing. That was between 2005 and 2006. I wrote the portability code there because I needed to run that on other systems, and of course I quickly found out that it would be a good idea to make a library. So I added to my to-do file, "make a libtopology", and that happened only three years later, when people from my team put some demand on this for MPI process mapping. So we did it eventually during 2009, and then it became hardware locality.

So what kinds of entities can hwloc detect? We've mentioned a couple already, NUMA nodes and caches and things. What is the full list of things that it can report in its tree?
So the full list would be NUMA groups, then NUMA nodes, sockets, caches, cores, and the single threads within the cores. For now we only handle the objects that matter for processing units, that is, objects that contain processing units. In the future we want to add network boards and accelerators and things like this, which will be other kinds of objects, as well as the PCI buses and switches, etc.

So this is a phenomenal step forward. There was a prior project that I was involved in called PLPA, Portable Linux Processor Affinity, and hwloc is just way better than PLPA in so many regards. Number one, PLPA was Linux only, and PLPA only understood sockets and cores, and hwloc surpasses both of those in so many ways. So we're very definitely deprecating PLPA in favor of hwloc, simply because it's just better in every single respect. As a matter of fact, I'm right in the middle of integrating hwloc into Open MPI, and that is actually the current hold-up in getting hwloc 1.0 out the door, because as I'm integrating it I run across a little thing and I go fix something in hwloc, and so on. So I think the only thing that's between hwloc 1.0 rc3 and the final release is the fact that I've got one little thing that I need to fix that I discovered while integrating with Open MPI last night. So we're getting there.

So how well does hwloc understand brand-new equipment when it hits the machine room floor, equipment you guys haven't even seen yet as hwloc developers? Will it understand new equipment pretty well, or does hwloc itself need to be updated?
Well, actually we had quite a few good surprises. The thing is, mostly the problem is the operating system support. If the operating system layer that expresses things in the machine is already quite orthogonal, it works quite well. For instance, the Magny-Cours processor from AMD actually put the NUMA nodes inside the socket, so any interface that assumes that a socket is inside a NUMA node would be broken. Linux doesn't assume this, and actually somebody on the list reported that he ran hwloc there and it just worked fine without any modifications. Well, that's for new kinds of arrangements of objects. Of course, if a new kind of object is introduced in the market, we would need to add it to hardware locality. However, in some cases, and I'm thinking about AIX, the operating system provides an interface to express objects in a very generic way, right at the operating system layer. When hardware locality encounters these unknown kinds of objects, it just takes them and expresses them as "miscellaneous". So you actually have the object. You don't know what it is, but it at least provides you the structure information.

Yeah, there's no iPhone or Android app for hwloc yet, but, you know, maybe someday.

So Jeff, a question for you. I know you work on hwloc, but you also spend most of your time working on Open MPI, so I'm curious: how are you using it, and how, as a user of Open MPI, will I see hwloc at work?

Ah, so that's the beauty of it.
You won't. As a user of MPI, you're just going to run. In Open MPI we already support a couple of command line parameters to mpirun, like bind to core, bind to socket, and things like that, and you won't notice the difference in our upgrade from PLPA to hwloc. But what you will be able to see is that this will enable us to support hyperthreads. With the Nehalem series, some of the higher-end Nehalem chips, hyperthreading is now somewhat interesting to HPC. Before, everybody just turned off hyperthreading because it was completely useless, and now the picture is a little more gray. There are actually some HPC apps that benefit from hyperthreading, and so Open MPI's mpirun will probably grow a bind-to-hardware-thread kind of command line option, and hwloc will allow us to support that.

Internally, we're going to use it for two purposes, exactly what Samuel already talked about. Number one, the process binding, and number two, just looking at the topology of the machine. So when we do our shared memory communications kinds of things, we'll be able to be a little more intelligent about it. Like, "Oh, this guy's local to that guy, they share a cache, we can do some stuff with that." Or, "This guy's just completely remote from that guy, and we have to be a little more careful in talking to him."

So wait a minute.
Let me step back there. You can tell the difference between a real core and a virtual hyperthreaded core now with this library?

Absolutely.

Okay, that's actually really nice.

Yes. If you want just cores, then you ask hwloc for the core level, and then you know how many there are and their CPU masks. When you have hyperthreading enabled, you see in the CPU mask that there are two CPUs for that core. On the contrary, if you want to address hyperthreads, then you just request the processing unit level, and then you will really get the threads if they exist, or cores if there are no threads.

I have to say this is one of the things that was very exciting to the customer I was talking to this morning: the ability to distinguish between hyperthreads and cores. It was a difficult thing before, and now they have a tool that can just spit out some XML, and they can parse it and put it in their scripts. They were thrilled with that.

Yeah, sometimes you're not really able to know whether you have enabled hyperthreading or not just from the BIOS settings or from what the operating system reports. It's quite useful for that.

Yeah, this particular customer has hundreds and hundreds of machines, and in some cases they want hyperthreading on and in some cases they want it off. They were not excited at all about going and manually tweaking the BIOSes of all the machines to turn it off. So they were like, "Wow, we can use hwloc and just take the second hardware thread offline in Linux, and therefore I can script it up into an init.d script," or however they want to do it.

So the next thing I need is to actually have this embedded in a resource manager. I need to be able to tell PBS:
I want one node with eight physical cores, or I want one node with all the hyperthreaded cores also, and have the resource manager somehow tell that to whatever MPI library or OpenMP library I'm using so that it does the right thing. That would actually be a really killer integration.

All right, I'll take this one, because a lot of my contribution to hwloc has actually been the ability to embed it in other software, and like I said, I'm working on embedding it into Open MPI right now. So we tried to make the library itself very friendly to dumping it into other software projects, whether you just link to it because hwloc is installed on your system, or you actually have a small reference copy of hwloc inside your software itself, which is sometimes useful to do. That is actually the bulk of my contribution so far to hwloc.

But to answer your question: yes, that is very definitely a target. We want applications out there like resource managers to use hwloc. We want SLURM and Torque and all the rest to do it, and I know that SLURM is actually fairly excited about using hwloc. Number one, because they use PLPA today, and they want to be able to upgrade to hwloc and be able to see things like hardware threads and whatnot. But they're also very much looking forward to being able to detect NICs and other PCI devices and accelerators and things like that. They just pinged me the other day asking when that would be ready. So I think there are some real-world HPC applications that will be using hwloc behind the scenes in the near future. I know that the MPICH crew has also embedded a pre-release version of hwloc in their code so far as well. And along those lines too, since I'm Mr.
Corporate Suit here, I have to mention that we very specifically made the license of hwloc BSD, so it's friendly. We want people to use this as much as possible, with no corporate lawyer fear or anything like that.

Yes, that's exactly the choice we had made.

So let's talk about actual performance gains from using something like hwloc. Do you guys have any examples of a case where you started using affinity with an HPC application and saw a significant performance improvement?

Yeah, sure. Well, it depends a lot on the application, of course. The thing is, merely binding threads on the processing units can sometimes already give you 10 or 30 percent in performance. But if you bind erroneously because you assume some processor numbering, you might get it really, really wrong, just because the numbering is quite often interleaved, and so you actually put your threads on separate sockets all the time. So sometimes, if you just use hardware locality to know the proper numbering of CPUs, you can get 50 percent or 10 times better. That depends of course on the application.

What I could note during my PhD is that, going from a random binding to a hierarchical binding according to the structure of both the application and the hardware, I could get something like 30 percent or 100 percent better performance. In some cases it's actually two times or five times better than NPTL, just because we do know the structure of the application. Brice tried this morning a data broadcast inside the machine, and using the topology information you could get 25 percent better on a four-NUMA-node, four-core machine. And we had some measurements for network bandwidth: we get twice the bandwidth just thanks to getting the threads near to the network board.

Yeah, just an editorial note on this. Remember now, it's NUNA, right?
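The interleaved-numbering pitfall mentioned above can be sketched with a toy machine in Python. Everything here is an assumption made for illustration: the machine (2 sockets, 2 cores per socket, 2 hardware threads per core), the OS CPU numbers, and the helper names are invented, and none of it is hwloc's API. The same table also shows how a core's CPU mask contains two CPUs when hyperthreading is enabled.

```python
from collections import defaultdict

# Invented machine: OS CPU number -> (socket, core within that socket).
# The numbering is deliberately interleaved across the two sockets,
# and each core appears twice because it has two hardware threads.
OS_CPU_LOCATION = {
    0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1),
    4: (0, 0), 5: (1, 0), 6: (0, 1), 7: (1, 1),
}


def cores():
    """Group OS CPU numbers by physical core: one CPU set ("mask") per core,
    ordered by socket and then by core."""
    by_core = defaultdict(set)
    for cpu, location in OS_CPU_LOCATION.items():
        by_core[location].add(cpu)
    return dict(sorted(by_core.items()))


def sockets_used(cpus):
    """Which sockets does a list of OS CPU numbers touch?"""
    return {OS_CPU_LOCATION[cpu][0] for cpu in cpus}


def naive_binding(nthreads):
    """Assume CPUs 0..n-1 sit next to each other (wrong on this machine)."""
    return list(range(nthreads))


def compact_binding(nthreads):
    """Use the structure: one hardware thread per core, filling socket 0 first."""
    one_per_core = [min(cpu_set) for cpu_set in cores().values()]
    return one_per_core[:nthreads]


if __name__ == "__main__":
    print("cores:", cores())
    print("naive  :", naive_binding(2), "sockets", sockets_used(naive_binding(2)))
    print("compact:", compact_binding(2), "sockets", sockets_used(compact_binding(2)))
```

Whether keeping threads together on one socket or spreading them out is better depends on the application; the point of the sketch is only that the choice cannot be made from the OS numbers alone.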
There are networks inside the machine and there are networks outside of the machine. So you really do have a non-uniform network architecture when you're dealing with, say, an HPC cluster. We've seen optimizations for years where people are saying, "Well, if I do a broadcast that's aware of the network topology, I can cut down the traffic through the core, and you get better performance that way." The exact same issues are true now inside the server, and you need to be aware of the topology inside the server in order to be able to optimize that. I think that's exactly what Samuel was referring to.

Indeed. Actually, that's Brice's work on what we call NUIOA, which is non-uniform input/output architecture: the communications within the machine, and between the external network and the insides of the machine as well.

So this is spawning a whole new class of NU-prefixed acronyms.

So the hwloc 1.0 release candidate is out right now. What is significant about 1.0 versus the previous releases?

Well, the thing is, I don't think there is anything really fancy in it. Mostly it's a clean revamp of the whole thing, be it the API, which has changed quite a bit, or the output of the tools. A few details, like not using the abbreviation "proc", because it would mean processor, but what is a processor? Is it the socket or a processing unit? So we just use "processing unit". We revamped the documentation a bit. There are also some new things, like the FreeBSD and x86 ports. I also added the notion of online and allowed processors so that you can show them in management tools. One new thing is getting the current CPU binding. The issue is that far from all operating systems provide it, but, well, at least Linux does provide it.
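To make "getting the current CPU binding" concrete, here is a minimal sketch that does not use hwloc at all. It relies on Python's standard `os.sched_getaffinity`, which exists on Linux but not on every platform, echoing the point that not all operating systems provide this query. The helper name `current_cpu_binding` is invented for this example.

```python
import os


def current_cpu_binding():
    """Return the set of OS CPU numbers the calling thread is allowed to run on,
    or None when the platform offers no such query through Python's os module."""
    if hasattr(os, "sched_getaffinity"):
        return set(os.sched_getaffinity(0))  # 0 means "the caller"
    return None


if __name__ == "__main__":
    binding = current_cpu_binding()
    if binding is None:
        print("This platform does not expose the current CPU binding here.")
    else:
        print("Allowed CPUs:", sorted(binding))
```

The returned numbers are OS CPU numbers, so interpreting them (which core, which socket) still needs topology information of the kind discussed throughout this episode.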
So it's useful. And of course there is all the embedding work from Jeff, which doesn't really provide features, but which helps a huge lot for integration into existing projects.

Yeah, and I do want to stress one point here, which is the documentation. There is a nice glossy PDF that you can download and it tells you everything about hwloc. I mean, documentation can always be improved, right? But the fact that there's a nice PDF that details the rationale, the strategy, the overview of hwloc, and then shows you every single public function and constant and things like that, that's there. It's a programmer's guide, it's a man page guide, it's everything all in one nice PDF. I think that's actually tremendously important and will help a lot of people start from ground zero with hwloc.

And I have to completely agree with what Samuel was saying. hwloc provides some of the same features that prior projects like PLPA have, socket and core determination, but it provides so many more features, and in such a better wrapped-up package. There are so many more things that you can do in a nice way that we expect people will want to use it to get that information. It's just a fundamentally better package than anything that has come before, and I say that with absolutely no bias whatsoever. It provides all this information in convenient ways that we think will be genuinely useful to a wide class of applications. So that's what we think the value of 1.0 is: it's nicely tied up, there's great documentation, and there's a really nice set of features surrounding the information that you want and need to get.

So what's coming after 1.0?

Well, there is one thing that we've been pushing aside for now, which is memory binding. The thing is, for CPU binding the interface is quite simple.
You just set a CPU binding on a thread. For memory binding, you have to know whether you want to migrate some existing memory or allocate new memory, what the functions should look like, etc. So we just pushed that aside to get 1.0 out as soon as possible and do this later. We expect to do this for 1.1. We think we should also provide the NUMA distance between NUMA nodes, since it will be useful for most heuristics. There is also a feature which is dynamic CPU sets. For now, the number of processors inside the CPU set is fixed to 4000. Brice worked on making that dynamic, so it's already ready but needs some testing, so we pushed that to 1.1. What is expected as well is the PCI topology, to know where the network NIC is. It is already quite ready. We have a branch that exists and that does work. Maybe it will be in 1.1, I don't know yet. Brice worked a lot on this, so he would be better at telling about it. I'd also like to push later, maybe not for 1.1, the topology of the NUMA nodes and the topology of the network around the machine. That is, the output of hardware locality wouldn't be limited to the local machine but would also cover the other machines around it, for instance in the cluster, to express the connections between the machines and the switches, hierarchies of switches, things like this. Of course, that would require external information from tools that we don't really maintain.
So that would probably be through some plugins. And there is also a crazy idea, which would be to detect the USB trees and all the devices.

Cool stuff. I'm sure we will have some fine Cisco-based plugins for reading things out of Cisco switches and whatnot as well. Just had to throw that in there.

One thing we didn't really touch on is the fact that hwloc is now an official sub-project of the Open MPI project. It really has nothing to do with MPI, but Open MPI is actually kind of an umbrella for a couple of projects. Open MPI itself, the software package, is probably the most well known, but there are a couple of smaller things, and hwloc, we think, is actually going to become just as well known as Open MPI. It's a nice little standalone utility in itself.

But this is kind of a tongue-in-cheek question. Brock is asking me here on IM, "Well, you normally ask all these open source projects what they use for source code control. Do you want to ask yourself these questions?" So I'll say that we use Subversion as our main tree, but I do almost all of my work in Mercurial, and I know that Brice does Git. Samuel, actually, what do you use?

Well, actually, for hwloc I just use SVN. But for other projects I just use anything that the project uses.

Cool.

Okay, so is there a website or a mailing list? Where do you go for information, for support, and for getting involved with hwloc?
Yeah, well, it's the Open MPI website. So if you go to open-mpi.org or .com or .net (we have all those domains), on the left-hand side there's a little navigation for sub-projects, and the first project in the list is Hardware Locality. If you go there, it has our mailing lists, links to download the PDFs and the HTML versions of the documentation, the latest tarballs, and everything. And just like anything else, we love to see other people get involved, particularly to support architectures and platforms that we don't natively have access to. So reports from the wild are greatly appreciated.

Okay, well, thank you very much, Samuel and Jeff, and we will announce this on the hwloc list when this show comes out. Thank you.

Thanks.

Thanks for your time, Samuel.

Thanks.