Welcome to another edition of RCE. I'm your host, Brock Palen, and I have again Jeff Squyres from Cisco Systems and the Open MPI project. Jeff, once again, I hear you have a blog.

Again, just like you were prompted to say that — yes, I do have a blog, where I muse on MPI and high-performance networking kinds of things, and we'd encourage everybody to go have a look at it, because it makes me look good among my fellow Cisco bloggers.

Yes, and we have a link to that straight off of our front page at rce-cast.com. We have all of our old shows, an iTunes feed, and an RSS feed, and you can download the MP3s directly from the site for any old show, as well as see who we've had, see who we plan on having, and also submit a nomination for any other topic you'd like to have on the show.

Yeah, we might as well mention the whole social networking thing too, Brock. You have a Twitter account there, and we accept questions for upcoming RCE podcasts from random people. So if there's something that you want to ask somebody on a future interview, please tweet that to Brock.

Yes, my Twitter is Brock Palen, BROCKPALEN, and that again is on the RCE website. So let's go ahead and get into our topic for today. Today we have two people from the University of Tennessee at Knoxville: Jack Dongarra and Jakub Kurzak. They're both from the PLASMA project, and I'll let them explain what that is specifically, but I believe it's a lot of work for doing parallel linear algebra, and they can go into that specifically.

Sure, Brock. This is Jack Dongarra. I'm a professor here at the University of Tennessee at Knoxville, and I'm in the electrical engineering and computer science department. I also have a position at Oak Ridge National Laboratory, and I occasionally spend some time at the University of Manchester in England. I'm interested in high performance computing — designing algorithms and software for those systems. I also have an interest in performance analysis and in effective parallel programs and applications. I've been here at the university for about 20 years now. Before that I was a researcher at Argonne National Lab outside of Chicago, and before that I was a student at the University of New Mexico, in the applied math program there.

This is Jakub Kurzak. I've been working with Jack for four years now in the linear algebra group, on software like LAPACK, ScaLAPACK, and most recently PLASMA and MAGMA. Before that I did my PhD at the University of Houston, mostly doing large-scale problems in molecular design.

Okay, so let's move right in. PLASMA — what is the motivation behind it, and what is it, specifically, first?
Right, so PLASMA stands for Parallel Linear Algebra Software for Multicore Architectures. Here at the University of Tennessee, along with colleagues at various places — Jim Demmel at Berkeley, and colleagues really around the world — we have been involved in the design and implementation of algorithms and software for linear algebra, and in my case this goes back to really the late 1970s, when we designed a package called LINPACK. LINPACK was a package of linear algebra software for solving systems of linear equations, and when it came out it was sort of revolutionary. It relied on a small package called the Basic Linear Algebra Subprograms, the BLAS, and at that time there was only one level of the BLAS, namely vector operations. So LINPACK was the beginning of, I'll call it, a standardization process in linear algebra software.

It quickly became apparent, with the advent of new machines — different kinds of architectures, in particular not vector machines but shared memory parallel systems, which relied on cache as their basic mechanism for exploiting performance — that we had to redesign the algorithms. So that's when Jim Demmel and I designed a package called LAPACK, the Linear Algebra Package. That was done in the 80s — in the late 80s that package was created — and it has been in use for many, many years. It contains software for linear systems of equations and eigenvalue problems, and it has enjoyed, again, widespread use; it became a standard and was adopted by almost all the vendors and commercial software companies.

But it became apparent, with the advent of multicore architectures, that some changes were necessary in the fundamental design of that package, and that brought us around to PLASMA. So PLASMA was designed specifically for multicore — initially for shared memory multicore architectures — and the design basically took a step back, looked at the overall structure of the algorithms, and tried to understand what the issues were in terms of exploiting multicore and trying to gain high performance. In light of that, it's clear that multicore is more than just two, four, eight cores; it's going to go on. I guess we would see somewhere around 100 cores in the near future, and potentially even greater numbers, perhaps up to a thousand cores per socket, in the foreseeable future. We wanted to design a package that would take us through tens to hundreds to thousands of cores on a socket, and that's what PLASMA is attempting to do.

So what's your main method that's different for extracting performance from many cores on a single linear algebra solve with PLASMA, versus something like LAPACK, which many people have used?

Right, so LAPACK is based on a very common technique for exploiting parallel processing: the fork-join model.
So you have a sequential thread that comes to a point where you fork off a number of things in parallel — think of a loop being run in parallel, a number of independent tasks being solved — and then you wait for all those processes to finish. Many of our algorithms can be structured in a very simple way to fit into that basic fork-join kind of parallel processing, and LAPACK is built around that, in terms of splitting things off for the operations that deal with the Level 3 BLAS — the one that's engaged most often is matrix multiply. So we sequentialize part of the code, and then we encounter a matrix multiply operation, where we think about forking off a number of tasks to do that matrix multiply, and then we wait until all those tasks are finished and carry on with the computation. So there's a fork-join, bulk-synchronous kind of operation that goes on, and that was okay for a certain level of parallel processing, but with multicore that breaks down very quickly. Of course it works, but the performance suffers — you see nowhere near the kind of performance that we would hope to see out of the current generation, and presumably the next generation, of multicore architectures. So we want to break the fork-join parallelism.

We've been doing that, experimenting and looking at various ways people have expressed parallel processing in the past, and so today we're using a technique which is not a new technique — it's an old technique — but we're expressing the algorithm in terms of a directed acyclic graph. We're looking at small tiles of work, small blocks of computation that can be done, and basically unrolling the algorithm at a very high level to expose a lot of parallelism — to get a DAG which is very wide and which exposes a great deal of parallelism. So at a very high level our algorithms are expressed in terms of DAGs, and we execute the DAG on a multicore architecture.

So you break them up into tiles and so on — how exactly do you map that to multicore architectures? Are you actually doing things like examining cache sizes, examining shared caches, and examining the memory hierarchy as well? How do you assign that work?

Well, so there is a stage in — I'll call it the deployment of the software — which looks at things like cache size and tries to, quote, auto-tune the software to fit optimally with respect to the cache size. So there's a pre-processing step, let's call it, which does that, so we already know what the right cache size is, what the right tile size is, and now we're in a phase where we're generating work. And again, these are old ideas: the work is going to be generated in the form of dependencies, so before a task can execute, its parents have to complete, and when a given task has finished executing, it tells its children about the completion. So think of a DAG, and think of a node in the DAG as a computational component; that computational component is dealing with tiles of the matrix — think of them as blocks of the matrix. Each of these work units has some parents and some children, and before it can start its work it waits for the parents to finish and give notification, and then the given task can be put into a queue and told that it can start its execution.
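To make that dependency bookkeeping concrete, here is a small hand-written sketch in C — not PLASMA's actual scheduler, just an illustration of the counting scheme described above: each task records how many parents are still pending, a finished task notifies its children, and a child whose count reaches zero becomes ready to run (a real runtime would push it onto a work queue for a worker thread rather than run it recursively).

```c
#include <stdio.h>

#define MAX_DEPS 8

/* One node of the DAG: a unit of work on a tile of the matrix. */
typedef struct task {
    const char  *name;                 /* label, e.g. a kernel on a tile     */
    int          pending_parents;      /* parents that have not finished yet */
    struct task *children[MAX_DEPS];   /* tasks that depend on this one      */
    int          num_children;
} task_t;

static void task_complete(task_t *t);

static void task_run(task_t *t)
{
    printf("running %s\n", t->name);   /* real code would call a BLAS kernel */
    task_complete(t);
}

/* When a task finishes, tell its children; any child with no remaining
 * parents is now ready and can be scheduled. */
static void task_complete(task_t *t)
{
    for (int i = 0; i < t->num_children; i++) {
        task_t *child = t->children[i];
        if (--child->pending_parents == 0)
            task_run(child);           /* real code would enqueue, not recurse */
    }
}

int main(void)
{
    /* Tiny DAG: A must finish before B and C; both must finish before D. */
    task_t D = { "D", 2, { 0 }, 0 };
    task_t B = { "B", 1, { &D }, 1 };
    task_t C = { "C", 1, { &D }, 1 };
    task_t A = { "A", 0, { &B, &C }, 2 };

    task_run(&A);                      /* A has no parents, so it is ready */
    return 0;
}
```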
Jakub, do you want to add to that?

Not really much to add, except maybe to emphasize the fact that the auto-tuning component is kind of an important part of that. So instead of setting fixed sizes, we would rather run it through automatic machinery that makes a sweep through the parameter space and finds the optimum.

And let me just make a comment here about what the user sees. The user doesn't see any of this. The user makes a call to a routine, and something happens inside of that call where all the scheduling and dependencies and all the other things that we've just mentioned take place. So as far as the user is concerned — let's call it a naive user — he sees a call to a Fortran or C routine that gets invoked in a conventional way, and then something happens to engage this parallel computation across the multicore architecture.

So they magically get their answer faster, essentially?

That's the goal — that they magically get their answer faster, and we're going to take it on ourselves to optimally schedule things across that ensemble.
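For a sense of what that looks like from the user's side, here is a minimal sketch of a call into the library. It assumes the PLASMA 2.x C interface — the routine names PLASMA_Init, PLASMA_dpotrf, and PLASMA_Finalize and their argument lists are my reading of that interface and should be checked against the PLASMA documentation; the point is simply that the tiling, DAG construction, and scheduling all happen behind one conventional-looking call.

```c
/* Sketch only: assumes the PLASMA 2.x C interface; routine names and
 * signatures should be verified against the PLASMA documentation. */
#include <stdlib.h>
#include <plasma.h>

int main(void)
{
    int n = 4000;
    double *A = malloc((size_t)n * n * sizeof(double));

    /* ... fill A with a symmetric positive definite matrix ... */

    PLASMA_Init(8);                       /* use 8 cores on this node        */
    PLASMA_dpotrf(PlasmaLower, n, A, n);  /* Cholesky factorization of A;    */
                                          /* tiling, DAG construction, and   */
                                          /* scheduling happen in this call  */
    PLASMA_Finalize();

    free(A);
    return 0;
}
```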
Okay, now, you were saying that these are old ideas, and it sounds similar to — I wonder if you could comment on this — Apple just kind of popularized these ideas too. Again, nothing new, but they just debuted their whole Grand Central kind of model in Snow Leopard. Is there any relation, or do you have any plans to utilize that on Apple machines, or things like that?

Well, these topics are going to be very relevant, and we're going to see more and more things like this coming out in the near future, and we've seen a lot of work already take place. The idea of expressing algorithms in terms of DAGs — that's an old idea. The idea of executing or scheduling work based on dependencies — that's an old idea. There are projects that have gone on in the past: there's a project at MIT called Cilk which used similar kinds of ideas; there's a project at Barcelona today called SMPSs which uses similar things; we know about projects at Intel doing similar things, called Ct; and Microsoft has an internal effort where they're looking at similar things in terms of scheduling on multicore. So there's a whole wave of things which are coming, and as you point out, Apple's Grand Central model is very similar at a high level for the scheduling of parallel tasks. What we're trying to do is come up with a model that fits within our structure of numerical libraries. We would want to use a standard, to be honest, but there's no standard today — there's no standard which is easily accessible by the community. So we are in a mode where we're experimenting and implementing something, and as soon as a standard becomes available, we would target it to that.

You remember MPI in the old days — this is exactly what we did. That is, we developed a precursor: we developed some software, before MPI, called PVM, and that was a message passing library. It was something that we put together here in Tennessee with a colleague from Emory, and it became a message passing environment. Then, when the community was ready to develop a standard, we embraced that standard for our software and contributed those ideas, of course, and we used that standard from that point on. So this is a time in the community when experimentation is taking place, and when that standard is ready, we would sit within that standard and utilize it for doing the scheduling.

I had the benefit of actually seeing you speak at SC a couple of years ago, at a Microsoft lunch, where I was introduced to PLASMA — this is how I first found out about it. And while we're doing audio here: if you go to the PLASMA website, there's a presentation which shows how you pack together these tiles in a schedule across multiple cores, and how much more dense it is compared to the current LAPACK-with-a-threaded-BLAS-underneath model. The impression I had is that, because of your tiles, you're kind of pulling the parallelism up to a higher level — the linear algebra portion is much more aware that it's running in parallel and that it's running these tiles. Is that correct?

Yeah, I think that's a reasonable characterization. That is, if you take a look at a timing profile of the algorithm — in terms of parallel blocks, how they fit together, and how effectively we're utilizing the underlying hardware — we get a very dense structure, and by dense I mean there's very little idle time for those processes. As a result, things run much faster than the conventional way of doing the fork-join kind of parallelism. So that's what we're after, and we want to do that. Initially we're doing this for shared memory systems, so today all of our software that's released runs in a shared memory environment. We're now experimenting with and designing the distributed implementation of those ideas and concepts, so that we could do what's called message passing, and have the scheduling go on and have this DAG representation and execution take place on a coordinated parallel distributed system.

Okay, so that would be like a successor to ScaLAPACK, or are you targeting different things?

Yeah, that's exactly right. So ScaLAPACK is the counterpart project to LAPACK — LAPACK is for shared memory, ScaLAPACK is for distributed memory. Today PLASMA works with shared memory and will eventually do distributed memory. I don't know that we would call it something different; internally we call it PLASMA-D, but that's just a code name we have here, and we have nothing to release in terms of the distributed version — we're still experimenting. We have some software for PLASMA, of course, and that software we're happy to let people use, experiment with, and, quote, support; but eventually it would move to this different environment. And let me just mention that one of the things in the computational science area which is very important today, and is going to have more importance in the future, is accelerators, and our accelerators today are GPUs. So we're designing a package around that concept as well. That's a project
we have called MAGMA. MAGMA is an effort to look at Matrix Algebra on GPU and Multicore Architectures — that's what MAGMA stands for. The idea is that we have a hybrid system now, and we want to exploit the same kinds of concepts — a DAG-based approach to executing across this hybrid architecture — and exploit both the multicore and the accelerator, or the GPU, as effectively as possible.

Jack, it's almost like you know exactly what our questions are going to be before we even ask them.

That's the amazing psychic power that we have.

So even though PLASMA-D is not released at this time, you can use the current PLASMA in a hybrid MPI / shared-memory-parallelism kind of application right now — there's nothing stopping that?

Well, okay, but let me be very clear: the PLASMA that's released today does no message passing. So you could use it, but you would then have to fit something on top of it to do the message passing between the nodes in your parallel system. PLASMA just works on a multicore configuration, and by that I mean a shared memory node — so if we have multiple sockets on a node sharing memory, PLASMA works fine in that arrangement.

Is there any benefit to using PLASMA in a serial case, where you only have one core dedicated to whatever is spawning PLASMA, over existing LAPACK?

Well, I'll let Jakub answer that — he has a little bit more experience in that situation.

I think there might be. We rarely run PLASMA on a single core. I would say that LAPACK with a really good vendor BLAS underneath, like MKL, is probably going to run a little faster on a single core for larger problems, and PLASMA is probably going to run a little faster on smaller problems, smaller matrix sizes.

So, a follow-up question on that. Intel is talking all kinds of good stuff about the whole Nehalem line and whatnot, and they're really trying to hype the fact that hardware threads should come back into vogue, even for HPC — or at least that some class of HPC applications will actually benefit from their hardware threads. We've been talking throughout this conversation about cores. Have you experimented with hardware threads, particularly on some of these newer Intel, or even AMD, hardware-threading architectures?

So far, unfortunately, for dense linear algebra it's the same old story: hyper-threading really kills the performance. That's the short answer. Unless the model of hyper-threading changes, there are really no performance benefits. There's a huge performance drop if you try running something that's compute-intensive using hyper-threading. It gives benefits for memory-intensive codes, but so far it really annihilates performance for compute-intensive codes.

What's the problem? What is it that kills the performance?

There is no performance benefit, first of all, because the hardware threads don't really multiply your FPUs, the floating point units, so you don't get any advantage. And on top of that you get context switches, which, no matter how optimized they are in hardware, introduce an overhead, and they probably also mess up the memory accesses and the cache system.

If you're talking about cache and the multiple floating point units, what do you see as the biggest downside right now with these CPUs we're getting, and the current design of the memory structure for actually accessing data in memory?
I Don't really see any any major flaws In the way that the caches are designed. I have a feeling that compute intensive codes like like our codes are probably happier with dedicated caches instead of shared caches But Plasma can exploit the cache hierarchy quite well So it's not a big concern for compute intensive codes We're happy with with the current line of Intel processors So right now you're happy with the performance as long as things are based in cache So if a if a user is writing a code from scratch and not using something like plasma for the algorithm if they're not effectively Using cache and relying more on main memory How do you see performance going currently then with the current systems? Oh You absolutely have to use some kind of optimized software for this kind of problems There is absolutely no way that Any kind of naive code can get any decent performance You have to use Alders like like plasma style algorithms and underneath you really need a good implementation of blasts Not only because you want to utilize your caches, but also another fact another important factor is these days No Intel lines of processors and AMB and others Rely heavily on vector extensions for performance Like SSC So if you really want to get your performance from your compute intensive Codes You really have to use good implementation of blast that Does the right things in terms of cash blocking and uses the vector of the Cindy vector extensions like SSC So if you're trying to write naive code from scratch, you know numerical recipes kind of code you're totally toast Yeah, that was actually a loaded question. This is something I harp to all of my users all the time So I wanted somebody who has more club than me to point that out to people So a couple other questions and so you're planning on having an MPI version soon Do you plan on doing a mixed model where there will be spawning? Threads on the multi-core systems with MPI between them or do you actually want to distribute these tiles over MPI to serial processes? Each being a unique MPI rank. I Think that The MPI model Stays the same as as it is right now So so I imagine it as a hierarchical model where at the MPI level It's sort of business as usual and and then we step in and do smarter things within the node and you know do things like like multi-core and DAG scheduling and also utilize accelerators and the MPI version is at some level also Using the DAG scheduling model But nevertheless, it's it's more of a hierarchy. It's more of a two-level approach than a flat level approach So extending that a little bit. I want to circle around to the accelerators bit here Do you see accelerators then adding a third level to it or would would you use accelerators in lieu of? The main processor I guess I'm what I'm asking is you know Where where do you guys see it going in in magma and how is the whole accelerator model working out? 
So, a couple of other questions. You're planning on having an MPI version soon. Do you plan on doing a mixed model, where you'll be spawning threads on the multicore systems with MPI between them, or do you actually want to distribute these tiles over MPI to serial processes, each being a unique MPI rank?

I think that the MPI model stays the same as it is right now. So I imagine it as a hierarchical model, where at the MPI level it's sort of business as usual, and then we step in and do smarter things within the node — things like multicore and DAG scheduling, and also utilizing accelerators. The MPI version is, at some level, also using the DAG scheduling model, but nevertheless it's more of a hierarchy — it's more of a two-level approach than a flat approach.

So, extending that a little bit, I want to circle around to the accelerators bit here. Do you see accelerators then adding a third level to it, or would you use accelerators in lieu of the main processor? I guess what I'm asking is, where do you see it going in MAGMA, and how is the whole accelerator model working out?

Right, so today accelerators are connected over a very slow link to the main processor — a PCI Express connection — and that limits what we can do, what we can extract from it, and how we operate. But it's clear to me that in the future the accelerators are going to be part of the socket, in terms of what we see on the chip itself. So we'll have a combination of conventional cores and accelerator cores on the socket itself, and we'll have very fast communication between them — much faster than we have today. It'll be sort of like — well, today we actually have a similar situation. Jakub just mentioned the SSE functions on the x86 architecture: those are embedded on the chip, and we extract performance because we use that accelerated part of the chip, which was put there to do graphics. In a similar way, in the future we would see architectures being designed which are multicore, having certain cores devoted to conventional kinds of instruction sets and other cores devoted to more computationally intensive things — more numerical or more graphical in nature — and we want to be able to exploit that in some way. That's really where these MAGMA hybrid architecture ideas are going — towards that direction, trying to exploit as much as possible both sides of that architecture, using not just one but both sides; that is, trying to assign work to the multicore as well as to the accelerator part, and to do it in an intelligent way so they're both kept busy for roughly the same amount of time. Again, if we think about that computational graph, we're trying to assign work so as to compress as much as possible any of the idle time — all of that scenario.

So let me ask about the other direction, then. I asked a second ago about extending downward into accelerators, and you're saying they're going to be brought up into the main architecture itself. What about the other direction, for networks? Networks — well, they throw in a whole new range of variables, with latency and jitter and contention and all kinds of things like that. Do you see similar efforts there, in that networks will become closer to processors, or will it just be addressed by scaling out the multicore side of things? What's your vision on that?
Well, I would see the networks getting faster as well — the bandwidth and latency improving, so latency going down and bandwidth going up — but there will still be a great separation between that and what happens on the socket itself, or on the node itself. So we will still have to face multiple levels in the hierarchy, where everything is improving, hopefully at some rate that makes sense, so that we can exploit things; but there will be a difference, so we will still have to use a distributed model in terms of our computation. I would say that we're going to have the same programming model for the next five years — that is, a programming model where we do something locally in a shared context, maybe using something like OpenMP or some threaded model for exploiting the parallelism, and then MPI as the basis for going further afield, in terms of the distributed or non-uniform nature of the memory hierarchy. What happens five years beyond today, I really can't predict. People talk about PGAS languages and things of that nature, which are very interesting, and we would look forward to engaging in that, but to have the whole community switch and change overnight — I don't think that's going to happen. So for the next five years I would forecast still being MPI, still using some mechanism for exploiting shared memory parallelism at the level of threads or something like OpenMP.
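A bare-bones sketch of that two-level model — MPI between the nodes, something like OpenMP for the threads within a node — might look like the following; the work shown inside the parallel region is purely illustrative.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* MPI for the distributed level; request thread support so the
     * node-local threading can coexist with message passing. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* OpenMP for the shared-memory level within the node. */
    #pragma omp parallel
    {
        printf("rank %d of %d, thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
        /* ... node-local tiled / DAG-scheduled work would go here ... */
    }

    MPI_Finalize();
    return 0;
}
```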
So, in a talk you had given that I'd seen, you talked about using mixed precision in some of these kernels, where part of the algorithm is not mathematically sensitive enough to actually require double precision, and you use single precision there. Is this something you're still looking at doing? I played with the current PLASMA, and it didn't look like it supported that.

So yeah, the current version does actually support it. Yes, we are doing this, and the motivation comes about because of the difference between 32-bit and 64-bit floating point arithmetic. Even on conventional, x86-based architectures, there's a factor of two: the floating point unit runs twice as fast in single precision, 32-bit arithmetic, and because you're transferring less data — 32 bits instead of 64 bits for each data item — you gain in terms of memory bandwidth and cache utilization. So you're getting a boost in performance there on conventional systems. We were originally motivated to look at this because of the IBM Cell processor, where between single and double precision, for the original Cell chips — the chips that are in the PlayStation 3, for example — it was a factor of, I think, 10. So there's an order of magnitude improvement that you can exploit, and there are some algorithms that you can put together which can benefit from that.

In linear algebra there's a well-known algorithm called iterative refinement, where you factor the matrix and then refine the solution that you get out of that factored matrix. We've extended that to factor the matrix in 32-bit arithmetic and then do the refinement step in 64-bit arithmetic. The factorization part, which is done in 32-bit arithmetic, is order n cubed, and the refinement part is order n squared, so you get a big boost in performance. We've used that on conventional processors — we implemented it in LAPACK, we implemented it in PLASMA, we implemented it in MAGMA. With the GPUs we're using today — NVIDIA's current release, the Tesla board — I think it's a factor of eight between single and double precision in terms of execution rates, and we see a big boost in performance when you use an algorithm which exploits both single and double precision. I think there's a lot of promise for these algorithms in the future, even if it's only a factor of two that you're gaining; but when you see things like the accelerator giving a factor of eight or ten by using these techniques, it really becomes a big bonus.
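For readers who want to see the shape of the algorithm, here is a schematic version of mixed-precision iterative refinement for Ax = b. It is deliberately simplified — dense LU via the single-precision LAPACK routines sgetrf/sgetrs, the residual and update in double precision, and a fixed number of refinement steps instead of a convergence test or a fallback to a full double-precision solve. LAPACK itself ships a ready-made driver, dsgesv, built on the same idea.

```c
/* Schematic mixed-precision iterative refinement for A x = b.
 * Factor once in 32-bit (O(n^3)), refine in 64-bit (O(n^2) per step).
 * Assumes a Fortran LAPACK is linked (e.g. -llapack); the underscore
 * naming convention for the Fortran symbols is assumed here. */
#include <stdlib.h>

extern void sgetrf_(int *m, int *n, float *a, int *lda, int *ipiv, int *info);
extern void sgetrs_(char *trans, int *n, int *nrhs, float *a, int *lda,
                    int *ipiv, float *b, int *ldb, int *info);

void refine_solve(int n, const double *A, const double *b, double *x, int iters)
{
    float  *As   = malloc((size_t)n * n * sizeof(float));
    float  *rs   = malloc((size_t)n * sizeof(float));
    int    *ipiv = malloc((size_t)n * sizeof(int));
    int     info, one = 1;
    char    no_trans = 'N';

    /* 1. Factor a single-precision copy of A: the O(n^3) work. */
    for (int i = 0; i < n * n; i++) As[i] = (float)A[i];
    sgetrf_(&n, &n, As, &n, ipiv, &info);

    /* 2. Initial solve in single precision. */
    for (int i = 0; i < n; i++) rs[i] = (float)b[i];
    sgetrs_(&no_trans, &n, &one, As, &n, ipiv, rs, &n, &info);
    for (int i = 0; i < n; i++) x[i] = (double)rs[i];

    /* 3. Refinement: residual and update in double precision, O(n^2). */
    for (int it = 0; it < iters; it++) {
        for (int i = 0; i < n; i++) {            /* r = b - A*x (column-major) */
            double s = b[i];
            for (int j = 0; j < n; j++) s -= A[j * n + i] * x[j];
            rs[i] = (float)s;
        }
        sgetrs_(&no_trans, &n, &one, As, &n, ipiv, rs, &n, &info);
        for (int i = 0; i < n; i++) x[i] += (double)rs[i];   /* apply correction */
    }

    free(As); free(rs); free(ipiv);
}
```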
So did Brock not notice that this was happening because it's part of the magic that just happens underneath the covers, and not something you actually even need to advertise — it's just the magic, this is why it goes fast?

Actually, no, it's not just part of the automatic nature — you have to call a certain routine, and it hasn't always been in PLASMA, so I'm not sure which version Brock looked at. I know it's in the current release, which is version 2.1.0; it may not have been in a former version of PLASMA. But today's version of PLASMA has systems of linear equations, least squares problems, mixed precision; it has an implementation of static scheduling; it has interfaces which look a lot like LAPACK; its threading is done in a thread-safe way; and it runs on Windows, Linux, AIX, and on Mac systems. In general, the version that we have today is a pretty fully functional small numerical library, which we're building out. It's going to be built out in a way which would hopefully encompass what we're doing in LAPACK, and then go beyond that, of course.

So, in terms of the maturity of PLASMA, can it be used as a drop-in replacement for LAPACK, or does it not quite have everything that LAPACK has right now?

So today the functionality is not complete coverage. It has a number of routines which are the most used routines — let me say, the most used paths within LAPACK — and PLASMA will eventually provide coverage for all of what we're doing today in LAPACK, and hopefully can go beyond that and take on more of the functionality that's in linear algebra, and do it in a way that effectively exploits multicore, and in a way that effectively exploits accelerator-based things as well.

So let's get a little bit of information on the license PLASMA is under — basically who can use it — and the location we can download it from and build it.

Right, so as far as the license goes, it's a modified BSD license — it's the license which has three clauses, three statements; it goes by the name of the standard BSD, I think, today. So it's freely available, and you can embed it within commercial packages — we're happy to have you use it. You can download it from our repository, which is held at www.netlib.org/plasma, and if you go there you'll see our software. The other package, MAGMA, is there as well — www.netlib.org/magma — and you'll see that package there. Again, they're completely open and freely available; we invite people to use them, we welcome feedback on the software and documentation, and we're happy to provide all the support we can.

Okay, well, thank you very much, Jack and Jakub, for your time today. We'll probably have you guys on for some other stuff in the future, but the show will be up this weekend, and we'll talk to you later.

Thanks, guys. Thanks for having us.