Welcome to another edition of RCE. Again, this is Brock Palen. You can find us online at rce-cast.com, where you can find all the old shows, links to our blogs and Twitter accounts, things like that. I also have Jeff Squyres on the line, the usual co-host here, from Open MPI at Cisco. Jeff, thanks again for your time.

Hey Brock, this is always good stuff, as usual. This is an opportunity for us to learn about some things that we don't normally get exposed to. Sometimes one or the other of us knows a little bit about the project that we're talking to, and today I think that's you: you know a bit about our guest and the project. So why don't you go ahead and introduce them?

Our project today is the ATLAS project, not to be confused with the previous ATLAS project we had on the show, which had to do with the Large Hadron Collider. This is the Automatically Tuned Linear Algebra Suite; building it was one of the very first things I did as a sysadmin. Our guest is Clint Whaley, who's at the University of Texas at San Antonio. Clint, why don't you take a moment to introduce yourself?

I'm Clint Whaley, the lead developer of ATLAS, and that's Automatically Tuned Linear Algebra Software.

Oh, sorry about that. See, this is why we always ask. Next: what is ATLAS?

So, Automatically Tuned Linear Algebra Software. It's a package that attempts to use empirical techniques to auto-adapt itself to run very efficiently on any set of hardware, without knowing a priori what the best techniques are. It tries a bunch of things, picks out the best ones, and produces for you a highly tuned library of linear algebra kernels: mainly something called the BLAS, the Basic Linear Algebra Subprograms, as well as some routines from LAPACK, the Linear Algebra PACKage.

So who should use these routines? What are these routines, and why are they of interest?

Well, a whole boatload of people use them without being aware that they use them. Linear algebra underlies most of scientific computing. I'll get a lot of hate mail for that, but it's roughly true: most scientific modeling uses either sparse or dense linear algebra, and ATLAS is what helps people with the dense linear algebra part. As for the ways in which people use them, there are hundreds of thousands of users out there, but most of them don't realize they're using it. For instance, it's built into OS X, where Apple uses linear algebra routines to do spam filtering and some graphics rendering; those call ATLAS underneath. If you use MATLAB, Maple, Octave, these kinds of problem-solving environments, chances are you've used ATLAS or one of its proprietary counterparts, which they link into the executables for different systems. There are some proprietary packages, like MKL, which provide substantially the same features as ATLAS in a proprietary package.
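For readers who have never seen it, this is what calling the BLAS that Clint describes looks like through the CBLAS interface ATLAS installs. A minimal sketch with toy sizes; the exact link line varies by installation, but something like `gcc demo.c -lcblas -latlas` is typical.

```c
/* Minimal sketch: C = alpha*A*B + beta*C through CBLAS, the C
 * interface to the BLAS that ATLAS provides. Toy 2x2 example. */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double A[2*2] = {1, 2, 3, 4};   /* row-major 2x2 matrices */
    double B[2*2] = {5, 6, 7, 8};
    double C[2*2] = {0, 0, 0, 0};

    /* C = 1.0 * A*B + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```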
So you've covered that if you use one of those problem-solving environments, you've most likely used it. What kind of performance do you see for an ATLAS routine versus, say, the reference implementation? Let's take the normal dense matrix multiply, the classic benchmark that everybody uses.

Yeah, for dense linear algebra, if you do a dense matrix multiply of a large size, let's call large two thousand by two thousand or bigger, something like that, then you can expect speedups that range from factors of three to factors of a hundred, depending on your system. To put it another way, ATLAS will often run very near the theoretical peak limit of the machine for a very large square matrix multiply. For instance, it's not uncommon to get 90% of theoretical peak with that kind of operation.

So then how much time does it take to hand-tune one of these operations for a single hardware platform, if a user were going to do this themselves?

Well, the problem is that it's an indefinite length of time, because optimization is never done. In the past, when ATLAS was first beginning, it was very common that ATLAS was faster than any proprietary package. And what's the reason for that? Well, the proprietary packages didn't have a lot of competition in those days. Intel didn't produce one for the PowerPC, for instance, so IBM only competed against themselves. When ATLAS came along, you can't lose to a free piece of software, and the people hand-tuning realized, oh, there's more hand-tuning to be done, and so they would improve even further. So how long it takes varies widely depending on the platform, and when do you call it quits? Unless you have some reason to know there's more to be achieved, people often give up long before they reach good performance. But in the past, when hardware would come out, it might take as much as a month before an optimized version was finalized. So it could take quite a long time to tune these routines.

Now what kind of optimizations and tunings do you do? What do you do that makes these operations so fast, other than the canonical three-loop matrix-matrix multiply that you see in textbooks? Why does that take so long, and how do you do it better?

Well, the gateway optimization for all of dense linear algebra is blocking, or, if you're in the compiler community, you call it tiling, where you basically break the problem up into smaller problems that will fit into the cache. The reason, of course, as everyone knows from architecture, is that memories are effectively hundreds or thousands of times slower than processors now, so most implementations, including the normal three-loop implementation of matrix multiply, run at the speed of memory, which is much, much slower than the speed of the processor. So you block them first, and until you've blocked the operation, no other optimization matters. It doesn't matter whether you do loop unrolling or anything else, because you're running at the speed of memory regardless.
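A sketch of the blocking transformation Clint describes: instead of streaming whole rows and columns through memory, work on NB-by-NB tiles that fit in cache, so each element gets reused before it is evicted. The NB of 64 here is an illustrative guess; finding the right value empirically is exactly what ATLAS is for.

```c
/* Cache blocking (tiling) of square matrix multiply, row-major.
 * NB = 64 is a placeholder; ATLAS searches for the best value. */
#define NB 64

void dgemm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int i0 = 0; i0 < n; i0 += NB)
        for (int j0 = 0; j0 < n; j0 += NB)
            for (int k0 = 0; k0 < n; k0 += NB)
                /* one NB x NB tile: C[i0..,j0..] += A[i0..,k0..] * B[k0..,j0..] */
                for (int i = i0; i < i0 + NB && i < n; i++)
                    for (int j = j0; j < j0 + NB && j < n; j++) {
                        double cij = C[i*n + j];
                        for (int k = k0; k < k0 + NB && k < n; k++)
                            cij += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = cij;
                    }
}
```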
Once you've done blocking, your theoretical peak is instead held down by the speed of your computational units, and there's a whole host of optimizations that now matter. You have to apply loop unrolling, unroll-and-jam, register blocking, instruction scheduling, all these kinds of optimizations that become important only after you've removed the memory bottleneck. So ATLAS does all of those things, and it does them in several phases, depending also on the different regimes we're talking about. I can go into more detail if you'd like to know more about that.

Sure, I'd love to hear more. So, for example, what is this whole "automatically" part of your name? How are you different, for example, from Intel's MKL, other than just being free?

Right. The answer is we're not as different as we used to be, but that's because they've become more like me rather than the reverse. Originally, the proprietary people did mostly hand-tuning. They had teams of very good programmers, usually assembly programmers, who would do all the optimizations that I just briefly outlined. But when I talk to the Intel guys now, they all use a lot of empirical techniques. For instance, one Intel engineer described to me that one of the issues you have is trying to find the best scheduling for all the instructions at the assembly level, and x86 assembly is old and has a lot of tricks in it; there are maybe ten different ways to do the same operation. So what they actually have now, according to this engineer, is that when they come out with a new chip, instead of trying to hand-tune and find the best assembly sequence, they have a big generator, basically what you can think of as a massive super-compiler, and they run it for weeks to find the exact sequence of instructions that will drive that hardware at its peak rate. Now, that's a level of empirical tuning that ATLAS just doesn't have, because I can't concentrate that much on just one machine like they can, but it shows you the power of the empirical techniques. I'm guessing almost all the big vendors now use something similar, so I think most people are empirical these days, in an automated way.
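To make the earlier list concrete, here is a sketch of register blocking with unroll-and-jam applied to the inner kernel: compute a 2x2 block of C per iteration, keeping the four partial sums in registers so each load of A and B is used twice. Real ATLAS kernels push the shape much further, and the generator picks it empirically; n is assumed even here for brevity.

```c
/* Register-blocked (2x2) matrix multiply inner kernel, row-major.
 * Unroll-and-jam: the i and j loops are unrolled by 2 and their
 * bodies fused, halving the loads per flop. Assumes n is even. */
void dgemm_2x2(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i += 2)
        for (int j = 0; j < n; j += 2) {
            double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
            for (int k = 0; k < n; k++) {
                double a0 = A[i*n + k], a1 = A[(i+1)*n + k];
                double b0 = B[k*n + j], b1 = B[k*n + j + 1];
                c00 += a0*b0; c01 += a0*b1;   /* four results reuse   */
                c10 += a1*b0; c11 += a1*b1;   /* two loads of A and B */
            }
            C[i*n + j]       = c00; C[i*n + j + 1]     = c01;
            C[(i+1)*n + j]   = c10; C[(i+1)*n + j + 1] = c11;
        }
}
```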
So what kinds of things are you looking for? Are you looking for cache size, number of floating-point units, availability of vector instructions? What are some of the things you look for?

Right. As I said, the most important thing ATLAS does first is try to find a good blocking factor, and in order to do that it probes for your L1 cache size. Now, ATLAS blocks primarily for the level-one cache. It does multiple levels of blocking, but our first-level blocking is for the L1. Some other BLAS, for instance the GotoBLAS, block more for the L2, which on modern systems is actually usually a good idea, because the out-of-order execution engine can handle the scheduling part. In other words, you can hide the latency to memory, but you can't hide the throughput costs, and Goto wrote a paper about this: you can get enough throughput out of the L2 cache on most modern systems that you don't need to block for the L1. ATLAS is a little conservative, worrying about systems where this is not the case; historically there were a lot of systems with a big difference in throughput between the L1 and the L2. So we block for the L1, which also has some nicer effects when you're doing applications like LU and QR, because your blocking factors can be smaller, but it does sometimes affect the asymptotic speed of the system.

So, let's see if I can remember the question. I believe you asked what kinds of things we look for. We try to find the cache size, and that allows us to infer a range of blocking factors we can choose from. We then search through those blocking factors, because the cache size is just a bound on how big you should make the blocks, and we find what an actually good blocking factor is. Then ATLAS does a series of different searches, all doing various things. The primary one, the one that's been around since ATLAS version 0.1, is a generator: a C code that generates other C codes, and what it generates is differing implementations of matrix multiply that have had various techniques applied to them. There's a whole host of techniques in that generator, but the most important ones are register blocking, loop unrolling, and unroll-and-jam. Then there's a host of smaller ones, but they usually account for less than 5% of the speed you're seeing. Oh, I forgot prefetch: software-controlled prefetch is also important on some systems. So that's the main search of ATLAS, and then ATLAS has some subsidiary searches that do various things. For instance, we have a code generator specifically for SSE now, and we have a series of hand-tuned cases that ATLAS tries, just picking the best one empirically, and so on. I guess I'll stop there; you can ask me more questions if you're interested in more details.
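The flavor of those searches, in miniature: time the same operation at several candidate blocking factors and keep the fastest. The real ATLAS searches are far more elaborate, since they search over generated code variants, not just one parameter, but the principle is this; `time_kernel` below is a hypothetical helper standing in for building and timing one candidate.

```c
/* Toy empirical search: pick the fastest blocking factor by timing.
 * time_kernel(nb) is a stand-in: assume it runs the kernel blocked
 * at nb and returns elapsed seconds. */
double time_kernel(int nb);

int pick_blocking_factor(void)
{
    int candidates[] = {16, 24, 32, 40, 48, 56, 64, 80};
    int best = candidates[0];
    double best_t = 1e30;
    for (int i = 0; i < 8; i++) {
        double t = time_kernel(candidates[i]);
        if (t < best_t) { best_t = t; best = candidates[i]; }
    }
    return best;   /* the winner gets compiled into the library */
}
```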
So you mentioned the Goto library in there. Their write-up is always that they're focusing on TLB misses, the translation lookaside buffer. Are you guys doing anything with that? What's your feeling on it?

Yeah, they make a big deal out of that, but it's actually not true. They wrote a paper originally where they claimed it's all about the TLB, and that's why they're better than, for instance, ATLAS. They used some systems where they were beating ATLAS by quite a bit, I want to say 20 or 30 percent; ATLAS was running at something like 80 percent of peak and they were running at 95 percent, something like that. It was quite a startling difference, and they claimed it was from this TLB issue. But of course ATLAS handles the TLB perfectly; it's nothing to do with the TLB. They would beat ATLAS by a similar amount for a matrix that fits in one page, so it's nothing to do with the TLB. Now, I don't think it's a completely wrong point. You do have to realize that if you use larger blocking factors, the TLB can limit your effective L2. That was one of the big points of their paper, once they'd fixed the claim that the TLB made all the difference: it is true that if I'm blocking for the L2, the size of the L2 I want to use is not necessarily the L2 itself, it's the size of the L2 that the TLB can index. But ATLAS handles all that correctly. The real difference between me and them on that particular machine they originally talked about was that their assembly was a heck of a lot better than mine.

So it sounds like you can really be in an arms race, given a particular platform, a particular compiler, even a particular chip. How do you stay ahead of this?

Well, the reality is I usually don't. Unlike most other efforts, all my stuff's open source. I usually beat my competitors for one release of their software; when I say competitors, I mean people who also provide the BLAS. Then they go and look at my software. If I have a great idea, I publish a paper on it, embody it in software, and give it to everybody. So the only reason I beat anyone after I've done that is that they haven't looked at the software yet. Like I say, in the past it was very common for ATLAS to be noticeably faster than the vendor BLAS. These days it still happens, because they don't tune everything, and ATLAS tends to tune everything. They'll tune the things people normally benchmark, and beat ATLAS by a substantial amount on those, but then you take a transpose setting, or some kind of weird, non-regular shape, and ATLAS suddenly wins. Back in the day I had users where I knew I was losing by, let's say, 10 or 20 percent to the vendor BLAS, and I'd tell the guy, you ought to use the vendor BLAS, because I'm losing by 20% on the platform you're asking me about. And they'd say, no, I tried it, and you're five times faster, because they were using some strange case, even though I was actually much slower for the asymptotic one. So anyway, what most people know about matrix multiply is a single number: the speed you get on a gigantic square matrix. That's what most people will know from a benchmark, but it's usually not at all related to the performance you'll see in your application. So if you really want to evaluate which one you should use, you really want to just link each one into your application and see which runs faster for you, because it's so complicated that benchmarking it externally can be quite misleading.
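Clint's suggestion, in sketch form: time the library inside your own application, once per candidate BLAS, and compare. `run_my_workload` is a stand-in for whatever code path you actually care about; relink against each library and rerun.

```c
/* Minimal wall-clock harness for benchmarking a BLAS inside your
 * own application. run_my_workload() is assumed: your real code. */
#include <stdio.h>
#include <time.h>

void run_my_workload(void);

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    run_my_workload();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    printf("workload time: %.3f s\n", secs);
    return 0;
}
```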
That's quite a familiar story coming from the MPI side of the world. I think it's a familiar story across HPC, right? You hear a great number, you buy the machine, and on your stuff it's just terrible, right?

And then, if you actually want to understand it, you have to spend, you know, three weeks studying it. Then you find all the things that separate you from whatever benchmark you were looking at, and then you realize what's going on. But it's an ugly truth.

So is ATLAS generating parallel or serial code? There are a couple of directions to go from here; let's start with that one.

Most of the generation is done for the serial code, and we then leverage the serial code in order to build the parallel codes. For the parallel codes, we've also just started doing empirical tuning on the parallel side, in other words trying to figure out the most effective way to spawn threads, that kind of thing, and applying that. We're just now working on that. Now, there is some talk, long term, about an idea: if architectures continue to scale like they are, the serial case will be interesting only in the sense that you use it inside a parallel program. So there's some talk that we might want to start switching to the opposite, where we tune for the parallel case directly, and then we give you a serial interface, not because you're going to call the serial BLAS, but because you're going to write your own parallel routine and you want to handle the parallelism yourself and not have ATLAS do it. You don't want ATLAS to make its own threads, for instance. In either of those cases, it might make more sense in a more heavily parallel world to tune everything using parallel code. The reason you might want to do that is that parallel code is more memory-starved than serial code, and when you do empirical tuning, what you're doing is timing something and asking, did it get an improvement? Now let's pretend we have two implementations, one of which has better register blocking, so it uses fewer loads from memory. If you time them in serial, you might not see any difference between that and the lesser routine; they might run at the same speed, because you're not saturating the bus with only one thread running at a time. Whereas if you had 24 threads running at a time, you would see a startling difference between the two implementations, because with one thread the memory system was never being exceeded, but in the parallel run it is. So there's an argument that ATLAS should reverse its installation scheme and do all the tuning in a parallel mode, where you artificially create runs that are happening on all the threads; then, when you're starved for memory, we can distinguish better between the candidates. That's something we're looking at, but, as you might imagine, it requires a substantial rewrite, so we haven't done it yet. And so far it's not critical, because we're usually talking about eight-way parallelism now, and some lucky people have 24-way parallelism, but when we start talking about hundreds-of-way parallelism, it's going to be critical, I think.

So are you focusing more on parallelism inside the server, then, so multi-core kinds of acceleration?

ATLAS does only shared-memory parallelization. If you want distributed-memory parallelization, you take a package like ScaLAPACK, and it calls the BLAS, which is ATLAS, right? That's the way you do that.
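A sketch of the memory-starvation effect Clint described a moment ago: run a candidate kernel on one thread, then on many threads at once, and compare. A kernel that wastes memory bandwidth can look identical to a thriftier one in the serial run and much slower under parallel load; `kernel_candidate` is a stand-in for the code variant being timed.

```c
/* Time a candidate kernel alone and under parallel load, to expose
 * bandwidth-bound behavior that serial timing hides.
 * kernel_candidate() is assumed: the code variant under test. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

void kernel_candidate(void);

static void *worker(void *arg)
{
    (void)arg;
    kernel_candidate();
    return NULL;
}

static double run_with_threads(int nthreads)   /* nthreads <= 64 */
{
    pthread_t tid[64];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

int main(void)
{
    printf("1 thread:  %.3f s\n", run_with_threads(1));
    printf("8 threads: %.3f s\n", run_with_threads(8));
    return 0;
}
```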
So yes, we mostly do threading inside of ATLAS, and we're doing a lot of research now on the proper ways to do threads. ATLAS is a very targeted systems package, so we wind up having to do a lot of research that no one thinks is research, like finding the proper way to spawn threads, for instance. In work that Tony Castaldo, a PhD student who just graduated here, did, we found out that matrix multiply was running a factor of two slower than it ought to in parallel, and the reason turns out to be that some OSes just do a terrible job of scheduling, at least for HPC applications. That can actually kill the parallel performance of something like matrix multiply, even though it should get almost P-fold speedup on P processors. Just by changing the way we spawn threads, we can more than double the performance of the threading inside an actual application, or what I would call an application, which is something like an LU factorization or a QR factorization. So we have to look at some very close details on this sort of thing, and we're still looking at how we can improve it so that we can scale with the architectures.

Have you looked at any of the related work? We had Jack Dongarra and one of his students on, talking about the PLASMA project, which was a different type of parallelism for doing the BLAS, as well as multi-precision, doing mixed precision. Have you looked at any of that work, and have you implemented anything like that?

Oh, I do talk with the PLASMA people occasionally. We've actually written some grants together in hopes of supporting a collaboration. Their groups, the PLASMA and MAGMA groups at the University of Tennessee in Knoxville, do a lot of work in this area, and so does Robert van de Geijn's FLAME group at UT Austin. Both of those groups are looking at what happens when you're thinking about exascale: huge numbers of processors, thousands, say, something like that.
The techniques we've been using for parallelization essentially break down at that scale, so you have to break the problems up. You basically have to use some new math and break the problems up in ways we haven't done before, so that things that were not bottlenecks at a scale of 16 or 32 processors, but become real bottlenecks at thousands, can be broken up too. So they're looking at that. Now, those packages have to rest on top of good kernels underneath, and just for historical reasons, the only good kernels out there are basically the BLAS. But these packages, once you break the problems up, would actually do better if you had a specialized BLAS; in other words, a BLAS where you know you're being called with small sizes. The traditional BLAS do better if you call them with large problem sizes, which essentially means less fine-grained parallelism than the techniques they're using allow. So those groups, both of them, will tend to get much greater benefit from high-level parallelism than, for instance, a normal LAPACK would, if you take PLASMA as the example, but their peak performance per node is still kind of low, because they're not extracting, you know, 93 percent of theoretical peak like a big fat DGEMM call would. The solution to that is to auto-tune specialized kernels. Some of them may actually be matrix multiply, but a matrix multiply where you have a known format coming in, for instance. So there's a lot of work I think can be done there, so that those packages, which already have great scaling, can also say: now, within each processor, I want to get very close to peak. And that's where something like ATLAS would come in, generating specialized kernels that can get close to serial peak.

So a lot of the vendor-provided libraries also provide LAPACK, and I noticed that ATLAS does not provide a LAPACK. Is there no benefit to optimizing LAPACK itself?

Well, ATLAS has always, for the last five years or so, provided some parts of LAPACK. LAPACK is a gigantic package, and the ATLAS group is typically me and a couple of other guys helping me out, so implementing all of LAPACK is not going to happen. But from probably six or seven years ago, we added a few of the factorizations, LU and Cholesky. That was basically based on work done by Sivan Toledo and Fred Gustavson, where they showed that with recursion you can beat LAPACK substantially. So ATLAS provided those. In other words, what ATLAS does is: when we think we can provide something better than LAPACK, and we can do it reasonably, without killing ourselves, then we provide it. So we've had that for a long time with LU and Cholesky.
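The recursive idea Clint credits to Toledo and Gustavson, sketched here for Cholesky: split the matrix in half and recurse, so most of the flops land in large BLAS calls (dtrsm and dsyrk below) rather than in unblocked code. This is a minimal sketch, not ATLAS's implementation; column-major storage with leading dimension lda, lower-triangular factor, no error checking.

```c
/* Recursive lower Cholesky: A = L*L^T, column-major, in place.
 * Most work goes through level-3 BLAS calls on big blocks. */
#include <cblas.h>
#include <math.h>

void chol_rec(int n, double *A, int lda)
{
    if (n == 1) {                       /* 1x1 base case */
        A[0] = sqrt(A[0]);
        return;
    }
    int n1 = n / 2, n2 = n - n1;
    double *A11 = A;                    /* top-left  n1 x n1 */
    double *A21 = A + n1;               /* bottom-left n2 x n1 */
    double *A22 = A + (size_t)n1*lda + n1;  /* trailing n2 x n2 */

    chol_rec(n1, A11, lda);             /* A11 = L11 * L11^T */
    /* A21 = A21 * L11^{-T}, becomes L21 */
    cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                CblasNonUnit, n2, n1, 1.0, A11, lda, A21, lda);
    /* A22 = A22 - L21 * L21^T */
    cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                n2, n1, -1.0, A21, lda, 1.0, A22, lda);
    chol_rec(n2, A22, lda);             /* recurse on the trailing block */
}
```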
And then recently we've been working on QR, to get it with recursion and a bunch of other techniques, and we've also developed a new technique for parallelization, something similar to what the PLASMA group does but without new math, where you just exploit the hardware in a superior way. We showed, for instance, that it's better than what they do for certain types of problems. So ATLAS has some stuff like that in there for LAPACK as well. Then, ATLAS has had for a while now the ability to tune LAPACK by tuning something inside LAPACK called ILAENV, which basically supplies the blocking parameter, and that can make a huge difference in performance. ATLAS has already been doing that automatically, and we're looking at extending it further. Finally, ATLAS will now, if you give it the Netlib LAPACK and point ATLAS at it during the install, automatically build it for you. So you can get a full LAPACK interface on the Fortran side, which is what the Netlib stuff provides, and a partial interface on the C side, which is what ATLAS provides natively.

Further in the conversation: are you looking to have ATLAS support GPUs and the other popular types of accelerators today? Do these architectures lend themselves well to dense linear algebra?

Yes, these architectures are pretty much ideal for dense linear algebra. People have reported huge speedups by using GPUs, particularly if you do single precision. I'm sure you're aware that Jack's group at the University of Tennessee did some work where they looked at using the accelerator in single precision, then using iterative refinement to get a double-precision answer, and by doing that you can get, on some systems, a hundredfold speedup over using the CPU. So these systems are very good at dense linear algebra, in single precision especially.
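A sketch of the mixed-precision iterative refinement Clint mentions: do the expensive solve in fast single precision, then recover double-precision accuracy by iterating on a residual computed in double. `solve_in_single` is a stand-in for an sgesv-style solve, on the GPU or CPU; convergence checking is omitted for brevity.

```c
/* Mixed-precision iterative refinement for A*x = b.
 * solve_in_single(n, A, r, x) is assumed: solves A*x = r using a
 * single-precision factorization of A. */
#include <stdlib.h>
#include <string.h>
#include <cblas.h>

void solve_in_single(int n, const double *A, const double *r, double *x);

void refine(int n, const double *A, const double *b, double *x, int iters)
{
    double *r  = malloc(n * sizeof *r);
    double *dx = malloc(n * sizeof *dx);

    solve_in_single(n, A, b, x);             /* cheap first solve */
    for (int it = 0; it < iters; it++) {
        /* r = b - A*x, computed in double precision */
        memcpy(r, b, n * sizeof *r);
        cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                    -1.0, A, n, x, 1, 1.0, r, 1);
        solve_in_single(n, A, r, dx);        /* correction, single prec */
        cblas_daxpy(n, 1.0, dx, 1, x, 1);    /* x += dx */
    }
    free(r);
    free(dx);
}
```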
I am interested in them, but I have not actually done much work on them, and the reason is that there are already a lot of groups with that expertise. I myself am not a GPU guy, and the problem with the GPU area is that they use all the same things that people in the CPU world, as I'd call it, use, but they rename everything. So it's very hard to just go in and use your ten years of detailed architecture experience on the GPU, because they don't call it a cache, they call it a scratch area; they don't call it a vector unit, they call it a multiprocessor. You have to spend hours poring over documents to get a mental handle on the stuff, and then you need to adapt your software to run on it, which involves a whole different set of criteria that is hidden from you. Like, you can't write assembly on most of these systems, right? And even the things they give you with assembly, the assembler does register assignment for you, and if somebody else does register assignment for you, it's a disaster. You can't do anything with that, essentially, so you wind up having to manipulate it from a very high level. So there are all kinds of problems with it that require a lot of detailed understanding to fix, and I don't see myself right now having the time to dedicate to that. I did have a student work on it a little bit, but they needed more direction than I was able to give them with my own present level of knowledge. So right now I feel like there are other groups doing that better, and other people, because ATLAS is open source, can roll their own mixture between ATLAS and, you know, cuBLAS or something like that, if they like.

So if there are all these vendor implementations out there now, and all these different groups with expertise on GPUs, what is the market for ATLAS? Why work on ATLAS?

Well, ATLAS is still your best friend, even if you've never used it, right? Let me explain why I began ATLAS. I wanted to run a piece of ScaLAPACK code, thinking about it that way isn't quite right, but it's close enough, on a cluster, and I had to use HPF to do it. But Sun at that point had just made their BLAS package proprietary, and you could only use it if you used their compiler, which meant that my 32-node cluster ran slower than a single processor. That's why ATLAS actually began: my parallel programs were no good, because the vendors were not allowing me to call their optimized libraries unless I used their tools, which I couldn't because of certain parallel constraints. What ATLAS does is allow everyone access. It's certainly free as in price; it's also free as in freedom: ATLAS has an open-source license, so everyone gets to use it however they like. A lot of people use ATLAS, I think, just for the freedom. But also, as I said, when ATLAS first came out it was actually beating the vendors almost all the time, because they didn't have any competition. So even if all you ever want to use in your life is MKL, the fact that ATLAS is there makes MKL faster. Obviously, you're not going to make a lot of money selling a library, if you're selling it, that's slower than something available for free. So the speed of your MKL code is pushed up by the fact that ATLAS is there. And even if the vendor libraries were all incredibly efficient, there's still a huge place for something that's open and free. As I said in the prior question, ATLAS still often wins on certain cases by a large amount, just because they haven't tuned those cases.

So let me ask you a question about binary portability here. If I have two platforms that are binary compatible with each other, let's say a Nehalem-based machine and a Westmere-based machine, and they're running the same operating system and the same support libraries and all these kinds of things, and I compile ATLAS and my application on the Nehalem, can I just take it over to the Westmere? Or are there potentially optimizations that I'm missing on that new chipset? How does that kind of portability generally work, or not work?
It generally does not work, right? I mean, the whole point of ATLAS is that in that automatic tuning process it is tailoring itself to the exact system it finds. Take a historical example: when you tune for the P4 and then you want to run on the Core 2, the P4 code is almost certain to do better than the reference BLAS, but it's not going to do nearly as well as ATLAS reinstalled on the new architecture. One of the worst things that can happen is if they change the L1 size. To give you a historical example: the original P4 had an 8k L1 cache, but the P3, which came before it, had a 16k cache. So if you took the P3 kernel, with its cache blocking for a 16k cache, and ran it on the P4, you would not get L1 blocking, you would get L2 blocking, because that blocking overflows the little 8k cache, and therefore your performance would be drastically reduced. And this is true everywhere. The vendor BLAS sometimes gives you the illusion that you can run it on any system; MKL is very good at this, where you can link the same library everywhere. But the reason is that they've actually got different versions of that library internally, and when you make a function call, it probes your system in real time and selects which variant to call. They basically have five libraries in one, and that's how they manage to build something that works well across architectures. But the whole point of tuning is to make it specific to the architecture.

So, because ATLAS is automatically tuned when it's compiled, should special care be taken when building ATLAS to make sure that nothing else is running, that there are no other users on that login node?

That's the ideal case: in the ideal case, you run ATLAS alone on the machine. ATLAS is built, by default, kind of at the lowest common denominator. If you install ATLAS with the normal flags, it uses a CPU timer for all the non-parallel work (you can't use a CPU timer for the parallel work), with the idea that you may not be able to run it on a machine alone. When you are able to run on the machine alone, you can look in the ATLAS install guide, which tells you how to do this: you can tell it to use a wall-clock timer, which has much greater resolution and will give you a much higher quality install. So ATLAS will work almost regardless of the load, but if the load gets high enough, the timings it's doing for itself become essentially random number generators, and then it's not going to do a very good job. It will still be better than the reference BLAS when you're done, but it will be far from optimal. Another related thing is that you have to turn off CPU throttling, which most OSes now do for desktops, because that really makes all timings completely random, to the point that no two timings in a row look the same. So you have to turn that off to get a good ATLAS install.
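The timer distinction Clint draws, sketched: process CPU time mostly ignores other load on the machine, while wall-clock time has higher resolution but is only trustworthy on an unloaded machine. On a busy box the two diverge visibly.

```c
/* CPU time vs. wall-clock time for the same piece of work. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec w0, w1, c0, c1;
    clock_gettime(CLOCK_MONOTONIC, &w0);           /* wall clock */
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c0);  /* CPU time   */

    volatile double s = 0;                         /* some work  */
    for (long i = 1; i < 50000000L; i++) s += 1.0 / i;

    clock_gettime(CLOCK_MONOTONIC, &w1);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c1);

    printf("wall: %.4f s  cpu: %.4f s\n",
           (w1.tv_sec - w0.tv_sec) + 1e-9*(w1.tv_nsec - w0.tv_nsec),
           (c1.tv_sec - c0.tv_sec) + 1e-9*(c1.tv_nsec - c0.tv_nsec));
    return 0;
}
```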
So what language is ATLAS written in, and what languages do you support for applications? I heard you mention C and Fortran earlier; is the scope larger than that?

Well, ATLAS provides interfaces to the BLAS and LAPACK libraries in C and Fortran 77. Now, ATLAS itself is implemented entirely in ANSI C, and there are install scripts and so on written in shell, and a lot of makefiles. So the dependency tree for ATLAS is a Unix-style shell, Unix-style make, and an ANSI C compiler; if you've got those, you can pretty much install ATLAS. Now, it's very helpful, actually it's very hard these days, to install ATLAS without GCC. What GCC provides is this: ATLAS also optionally has a whole boatload of assembly files, which it tries on your system, and we like to use GCC as our assembler because we can use CPP macros to write much cleaner assembly that way. So those are really the dependencies you have. Now, ATLAS is actually callable from almost every language, but that's not done by me. For instance, SciPy, Scientific Python: they did a lot of work integrating ATLAS into their library. As a matter of fact, there was a guy, Pearu Peterson, who did a lot of work on ATLAS 3.6 in order to make it callable; he helped me with a lot of things, like figuring out how to use dynamic libraries, which Python demands. So you can find an API for other languages besides C and Fortran, but that's done by other people, not me.

To follow up on that, then: do you ever plan to expand your Fortran support to have, say, a Fortran 90-style module, or any of the newer constructs available in Fortran 2003 and the upcoming 2008 and so on? I know that's not strictly BLAS, but there are some nice things available there for application programmers.

I have not thought to do so, just because I don't think I'm any better suited for it than anybody else. The only optimization-related thing I can think of that's nice about Fortran 90 modules, for instance, is this: people talk a lot about wanting to do a tiny BLAS, in other words, a BLAS that's optimized for, let's say, a vector copy of only 40 elements. With a module, you can basically do overloading, so you can have a special-case code for that size that avoids a function call. There are some tricks like that you can pull for optimization, but otherwise it's really a packaging issue, a matter of what's pretty. And I don't feel like that is either my expertise or something I have any reason to think I would do well. So I tend to concentrate on the other end. I doubt I'm going to do it unless someone outside of ATLAS does it, other people start using it, and then I fold it in; that might be a way I'd bring in something higher level than what ATLAS is doing. But I concentrate on the low level, because that's really my job.
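The tiny-BLAS trick Clint describes is in Fortran 90 terms, but the analogous idea in C, the document's language, is a static inline special case for a compile-time-known size, so the compiler can fully unroll and no library call is made. A hypothetical helper, not an ATLAS interface:

```c
/* "Tiny BLAS" special case: copy exactly 40 doubles. The fixed trip
 * count lets the compiler unroll completely, and static inline
 * avoids the call overhead of a general dcopy. Hypothetical helper. */
static inline void dcopy_40(const double *restrict x, double *restrict y)
{
    for (int i = 0; i < 40; i++)
        y[i] = x[i];
}
```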
So if you're targeting parallelism inside the machine, multicore and things like that, how much do you investigate the effects of memory and processor affinity, and how much code have you had to write to support that kind of thing?

Well, as I mentioned earlier, that was one of the topics my PhD student Tony Castaldo did for his dissertation. The thing I talked about, how to spawn threads, is very tied to affinity, and what we're finding is that if you don't have affinity, then on problems that aren't huge, the OS has an enormous impact on performance through how it schedules those threads. As a matter of fact, you can more than double the speed of an application simply by making the scheduling good. So Tony did that research, and he found a solution by using affinity. And it's not just affinity: you have to look at when you start processes up, and on which processors, so that they never interfere with each other; that's the basic idea. Affinity alone doesn't solve all your problems, but you have to have it to solve your problems. So we've been doing a lot of research there; we published a paper on it in IPDPS. What we've been trying to do since is find some way around the fact that if you're on a system like FreeBSD and its derivatives, like OS X, they don't have processor affinity, and therefore they're losing that factor. They do seem to do the same terrible scheduling that we observed in some versions of Linux and in Windows. And when I say terrible scheduling, I want to qualify that: I'm sure what they're doing makes sense for somebody, but for HPC it's terrible scheduling, because what they do is often start threads on the same processor that threads are already running on. So if you spawn off eight threads, you'll only be using six processors, or something like that.
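A sketch of the affinity fix Clint describes: pin each spawned thread to its own core so the OS cannot stack two workers on one processor. This is Linux-specific (pthread_setaffinity_np and friends are GNU extensions); `worker` is a stand-in for the per-thread computation, and error handling is omitted.

```c
/* Spawn nthreads workers, each pinned to its own core. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

void *worker(void *arg);   /* assumed: the per-thread computation */

void spawn_pinned(pthread_t *tid, int nthreads)
{
    for (int i = 0; i < nthreads; i++) {
        pthread_attr_t attr;
        cpu_set_t set;
        pthread_attr_init(&attr);
        CPU_ZERO(&set);
        CPU_SET(i, &set);                  /* thread i runs on core i */
        pthread_attr_setaffinity_np(&attr, sizeof set, &set);
        pthread_create(&tid[i], &attr, worker, (void *)(long)i);
        pthread_attr_destroy(&attr);
    }
}
```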
So along the same lines, then: there's Intel's hyper-threading technology, and other CPU vendors have had similar things, and they've been getting more HPC friendly; it got better with Nehalem and a little better again with Westmere. Do you see any benefit, with really careful scheduling, from hyper-threading, or is it still pretty much either a wash or a negative?

Well, with Intel's version of hyper-threading you don't see a lot of benefit in linear algebra, and the reason is the following. What their hyper-threading does is say: most applications cannot drive the back end of the architecture at its theoretical peak, therefore we have basically idle horsepower to be used. So, to get around that, they spawn multiple threads on the same processor and mix their instruction streams at the architectural level, and since nobody was running at the peak of the machine before, the excess slack left over on the slowest piece of the back end you depend on can be absorbed by this extra thread. Now, the problem with that in dense linear algebra is that dense linear algebra is typically already running, with only one thread, at essentially the theoretical peak of the FPU; you're driving the FPU at 92% of theoretical peak, which is just about as fast as you can drive it. Therefore, when you add the extra hyper-thread, all it does is stomp on your cache. So you wind up with something that, in an application anyway, does not give you much of a speedup for a dense linear algebra code. For Intel hyper-threading, that's what I've seen: because the serial code can already drive the FPU at its maximum rate, it's not that helpful. Now, there are some other systems where a single thread cannot drive the FPU at anywhere near its rate. The one I'm aware of is Sun's Niagara: even though it has a very low floating-point peak, you need to spawn multiple threads on the same processor in order to reach that peak. It's one of the only systems I'm aware of like that. Normally, when you run multiple threads on the same processor, throughput goes up but actual time to completion gets worse, right? But on that system that's not true: you can actually get bigger megaflops if you run two threads than if you run one; in other words, parallel efficiency is better with two threads than with one. So anyway, that's why hyper-threading, Intel's version at least, has so far not been a big issue. Like I say, Sun and IBM have a different take on something similar to hyper-threading, and that's trickier.

So what's the contact point for ATLAS? A website, a place to download it?

Yeah, math-atlas.sourceforge.net is ATLAS's main home page. I just recently moved all of the base files, in other words the development stuff, so the development repository is on GitHub these days, due to an extended SourceForge outage and my needing to use a more modern tool; I was using CVS before. So that's where that mostly is. There are several ATLAS mailing lists you can sign up for at SourceForge. Most of the support is done with the SourceForge tracker, and you'll typically talk to me eventually, when you get down to the bottom of that.

Okay, well, thank you very much, Clint. This show will be out soon, and we'll talk to you again soon.

All right, thanks.

We appreciate your time.

All right. Thanks.