Welcome to another edition of RCE. Again, this is Brock Palen. You can find our shows online at rce-cast.com. You can also follow me personally, where I tweet about HPC and generally computer-related stuff, at BrockPalen, all one word. Also, I have Jeff Squyres from Cisco Systems on loan to us again. He's also a major developer of Open MPI. So Jeff, thanks again for your time.

Sure, Brock. Man, I'm always happy to do these things; I always learn new things. By the same token, on my blog recently, on blogs.cisco.com, I've been trying to demystify a bunch of the internals of MPI kinds of things. So I'd love to hear what your questions are. If people have questions, please feel free to email them to me, tweet them to me, whatever, and I'll answer them on the blog.

Yeah, I've really been enjoying the last string of demystifications of some of the things that MPI libraries actually do in practice. That's been quite an enjoyable read. Okay, what are we talking about today?

Today we're actually talking to a guy who works on one of the very first libraries I ever used, which is the AMD Core Math Library, AMD's implementation of the BLAS and LAPACK and some other things like that.

Yeah, our guest today is Chip Freitag from AMD, and he's calling in on a phone, so his audio quality may not be that great. But Chip, why don't you go ahead and introduce yourself and let us know what your position is.

Sure, Brock, and thanks for having me on the show. I'm a developer in AMD's math library section. I've grown up with the group; we now actually have a group of people working on math libraries at AMD, but back when I started it was me and one other guy, Tim Wilkins. I started at AMD 18 years ago, in a marketing position actually, a technical marketing position. I have a degree in aerospace engineering from the University of Texas and have been in the computer industry for about 30 years, working on a variety of embedded, communications, and networking types of applications, and then eventually here at AMD working on floating-point math libraries, which is the closest I've gotten to using my aerospace engineering degree.

So why don't we dive straight into this: what is ACML?

ACML is the AMD Core Math Library. It's a collection of floating-point math routines used in scientific and engineering types of applications, so it's relatively popular in the high-performance computing industry. It consists of the BLAS, the LAPACK linear algebra package, a set of FFTs, and a collection of random number generators. As you're used to with BLAS and LAPACK routines, the library is available in all the precisions you would expect: single, double, single-precision complex, and double-precision complex.
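To make that concrete: since ACML exposes the standard BLAS interfaces, a DGEMM call looks the same as against any other BLAS. Here is a minimal C sketch through the Fortran-style interface; the trailing-underscore symbol name and a `gcc dgemm_demo.c -lacml`-style link line are assumptions about a typical Linux toolchain and install, so check your own setup.

```c
/* Minimal sketch: C = alpha*A*B + beta*C through the standard Fortran
 * BLAS interface that ACML implements.  The trailing-underscore symbol
 * (dgemm_) is the common Fortran name-mangling convention and is an
 * assumption here; ACML also ships a C interface in acml.h. */
#include <stdio.h>

extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

int main(void)
{
    /* 2x2 matrices in column-major (Fortran) order. */
    double a[] = {1.0, 3.0, 2.0, 4.0};   /* A = [1 2; 3 4] */
    double b[] = {5.0, 7.0, 6.0, 8.0};   /* B = [5 6; 7 8] */
    double c[] = {0.0, 0.0, 0.0, 0.0};
    int n = 2;
    double alpha = 1.0, beta = 0.0;

    dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);

    printf("C = [%g %g; %g %g]\n", c[0], c[2], c[1], c[3]);
    return 0;
}
```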
So what's the history of ACML? How did it come into existence? Why is AMD offering it, that kind of thing?

That's an interesting question. Back in, well, gosh, I guess it was about the '96 time frame, it seems like. I'm sorry, it was around the year 2000, when AMD was inventing the AMD64 architecture. We had taken the Intel x86 32-bit architecture and extended it to allow 64-bit operands and a 64-bit address space. Back when we were working on that as a skunkworks project, we realized that high-performance computing types of programs would be willing to take advantage of this in a big way. At the time, the only vendor-driven math library that existed was of course MKL, but the BLAS and LAPACK routines that were out there were only available in 32-bit flavors, and the C and Fortran compilers available at the time would only emit 32-bit instruction streams. So as part of a larger effort to develop a mature set of OS and compiler offerings that the industry could use to run on our machines, we realized that we also needed to do math libraries.

So a project was developed, and at the time Tim Wilkins had just started at AMD. He had a propensity for taking compute kernels, especially DGEMM, and tweaking them for x87 and, in those days, the 3DNow! instruction set, making them run as fast as you can make them run. We turned that into a set of library routines that we were able to make available as a product offering. In addition, at the time we were working on a set of transcendental math functions, you know, sine and cosine and exponential and log and those sorts of things. I personally worked on that, and worked with NAG to turn those into a product and get them ported for the 64-bit architecture we were going to bring to market. Those two efforts converged such that when we were finished with the transcendental functions, it was a natural extension for us to then go to NAG and have them start collaborating with us to produce the ACML library.

I noticed in the ACML manual, and you just mentioned them there a few times, that NAG comes up. How close is the relationship with NAG, the Numerical Algorithms Group from the UK?

That's correct, and the relationship with NAG is reasonably close. We work with a fairly small group of people there. As many people probably know, NAG offers a relatively popular numerical and statistical library that's used as a reference by many people. We ended up licensing portions of ACML from NAG. For instance, we license their LAPACK code, and that's what we use rather than the standard Netlib LAPACK code. They've done an extensive amount of work with that, solving some of the bugs that you find in the Netlib code, but also introducing OpenMP pragmas to make it multi-threaded. So we license that code from them, and that's what we use for LAPACK in the ACML library.
We also work with them, and have worked with them very closely, especially in the beginnings of the library, to productize the BLAS library and the FFTs, and we also licensed the random number generator code from NAG. We continue to have a fairly close relationship with them, and we use them as technical experts on a variety of subjects.

So who uses ACML?

The ACML user base is pretty much anybody who's got an AMD computer, although it's not limited to just AMD processors: anybody who's got an AMD processor and needs to run a bunch of math code. The typical user is anywhere from a university student on up to the managers at the national labs, the people who have to maintain the software base for the labs. Invariably, if they've got machines such as Crays that have AMD processors in them, they've got ACML in their bag of tricks for their user base. In addition, oil and gas customers are big users of ACML; they utilize our FFTs quite a bit. Financial institutions use us for the random number generators that we have. It's a wide variety. When I look at the download information for the library, it's a who's who of practically every university in the world and practically every Fortune 500 company.

So you mentioned in there that this is focused on AMD processors, but it works on non-AMD processors. What's the reasoning behind that, and how much work is that for you?

Well, the reasoning behind that is because I have unfortunately yet to see a shop that has only AMD processors. A common user request is that they want to have one executable that runs on all the various machines that they have. They also need to make sure that they get similar answers on different machines, and running the same library pretty much ensures that you're going to get similar results on the different machines. It's a little bit of effort for us. Obviously we have to maintain a set of Intel machines that we do our testing on, and we have actually implemented specific kernels targeting Intel processors. For instance, we have specific Woodcrest and Nehalem kernels for our matrix multiply, and we have to test those and make sure that they work in various applications. So we do spend some effort to make sure that that works the way we expect it to, and we've gotten some good feedback about the performance that we have on other machines.
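To give a flavor of how one executable can still run, say, a Woodcrest kernel on one box and something else on another, here is a minimal sketch of the general run-time dispatch pattern (not ACML's actual code): detect the processor's capabilities once, then route calls through a function pointer. It assumes GCC's __builtin_cpu_supports (GCC 4.8+ on x86), and the two kernels are trivial stand-ins.

```c
/* Not ACML's code: a minimal sketch of run-time kernel dispatch --
 * detect the instruction set once, then route every call through a
 * function pointer to the best kernel. */
#include <stdio.h>

static void daxpy_generic(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];            /* portable fallback */
}

static void daxpy_avx(int n, double a, const double *x, double *y)
{
    /* stand-in for a hand-tuned AVX kernel */
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

typedef void (*daxpy_fn)(int, double, const double *, double *);

static daxpy_fn pick_daxpy(void)
{
    __builtin_cpu_init();            /* required before the checks */
    if (__builtin_cpu_supports("avx"))
        return daxpy_avx;            /* a real library could also branch
                                        on CPUID family/model, e.g.
                                        Woodcrest vs. Nehalem */
    return daxpy_generic;
}

int main(void)
{
    double x[4] = {1, 2, 3, 4}, y[4] = {0, 0, 0, 0};
    daxpy_fn daxpy = pick_daxpy();   /* decided once, at run time */
    daxpy(4, 2.0, x, y);
    printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);
    return 0;
}
```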
I would imagine that over time that's going to entail quite a bit more software engineering for you, because your platforms, AMD versus Intel, are diverging a bit in terms of features and chip functionality and things like that. Is this a trend that's expected to continue?

Well, you're right there; it's not going to get any easier, that's for sure. In the most recent generation of products that we have, everybody's aware of the AMD Bulldozer architecture that's out there. We haven't actually launched the product yet, but it's a well-known quantity now. In Bulldozer we have not only AVX instructions but FMA4, a four-operand fused multiply-add instruction, and we are currently working on an ACML for that, using the FMA4 architecture. Our competitor has chosen a different path for its future products, FMA3, and we will eventually support that as well, and at that time we will have to do work associated with that. So we'll end up with two code paths supporting FMA architectures. And of course it doesn't make the library manager's life any easier, because the matrix of test cases just doubles for cases like that.

So why implement these things as a library? Why not add this to, say, compiler optimizations and work with some of the compiler vendors?

Well, working with the compiler vendors certainly makes life easier for the guy building the application. If he can code something and the compiler recognizes it and just does the right thing, that's the perfect world for the guy who's writing software; he doesn't have to think too much about it. Unfortunately, I think the opportunity for that is relatively limited, and it tends to be more useful in benchmarking than anything else. I think in the real world a lot of people are targeting these specific routines and expect to link to a library. So that's the path that we've chosen, and with a library you can cover many, many more cases than we could with compiler heuristics, I think.

So compared to, say, compiled matrix multiply code that a generic user would write, how much better performance would you expect on the current top-of-the-line AMD core using ACML versus that user's code?

Well, that's a difficult question to answer, because you have to give me a baseline. What I will say is that ACML is able to achieve on the order of, in the range of, 90% efficiency on a matrix multiply kernel, depending on the machine you're running it on. If the compiler can do that well, then great; and if it can do that well for a variety of problem sizes, even better. ACML is tuned to run just about as well as you can get out of the machine for a variety of different problem sizes.

Now, is ACML multi-threaded and NUMA-aware, or is this strictly serial stuff?

We provide the library in both flavors right now. We have a single-threaded version for people who want to do threading in their own applications and don't want to have to worry about the library oversubscribing threads; they can call the single-threaded version and it will do what they expect it to do. We also provide a multi-threaded version, and it is to some extent NUMA-aware; it is written using OpenMP pragmas. If you have a program, and you've got some really large problems that you want to solve, and your program is not multi-threaded, say you had a lot of 3D FFTs that you want to run, you could call the OpenMP version of ACML and it will multi-thread the one-dimensional transforms that you need to do, and take maximum advantage of the machine that way.
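A short sketch of those two usage models with OpenMP. The entry point acml_heavy_routine() is a hypothetical placeholder for any threaded ACML call (the calls are commented out so the sketch builds stand-alone with `gcc -fopenmp`); with a typical ACML install the multi-threaded and single-threaded builds are linked as -lacml_mp and -lacml respectively, but check your installation.

```c
/* Sketch of the two threading models described above.
 * acml_heavy_routine() is a hypothetical placeholder, not a real
 * ACML entry point. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Model 1: serial application, multi-threaded library.
     * Link the OpenMP (_mp) build and let the library fan out
     * across the whole machine; OMP_NUM_THREADS also works. */
    omp_set_num_threads(omp_get_num_procs());
    /* acml_heavy_routine(); */

    /* Model 2: the application does its own OpenMP threading.
     * Link the single-threaded build instead, so each application
     * thread doesn't spawn a team of its own and oversubscribe
     * the cores. */
    #pragma omp parallel
    {
        printf("app thread %d does its own slice of work\n",
               omp_get_thread_num());
        /* acml_heavy_routine();  -- single-threaded build here */
    }
    return 0;
}
```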
Now, this is interesting, because this kind of blends into Brock's earlier question of compiler versus library. So if you're using OpenMP, do you have any influence on the compiler group who designs this stuff, or are you strictly a user of OpenMP and not so much an implementer?

We have people here on the same floor that we work with in both the GCC and Open64 compiler groups, so we do have the opportunity to provide feedback. We also have a relatively good relationship with PGI, and when we find bugs, for instance, we get pretty quick turnaround on things like that. So we do have the opportunity to provide feedback. In the vein of the earlier question about specific optimizations, we haven't really done very much of that, but certainly if we find issues or code generation problems or something of that nature, especially with some of the earlier releases of the compilers, we're able to talk to people directly about those types of things.

So far we've been talking mostly about using the library to get good performance out of these processors. Are there any other reasons it would be a good idea for a user to use ACML, or some sort of library, over writing their own code?

Sure. ACML provides standard interfaces for the BLAS and the LAPACK, so you can use it and be assured that you're going to get relatively good performance out of the library, and also that you're going to get the correct answers, the same answers that you would get if you were using the standard libraries. With the random number generators, as an example, we provide a lot of functionality there that you probably wouldn't want to write on your own unless you have a PhD in statistics; that stuff's really hard to figure out. On the FFTs we've done an extensive amount of work, and I'm sure a user may have his own special needs for FFTs, and that might drive the need to write their own algorithms, but I think they'll find that what we've done is very usable in a wide variety of applications.

You touched lightly earlier on some of the optimizations that you do; you mentioned vector processing units and OpenMP and things like that. Could you go into a little more detail? Obviously without going into any proprietary information, but what do you do above what the casual physicist, for example, would do? Understanding that this stuff is kind of complicated, give us a taste of how complicated it is.

Okay, that's a good question, and allow me to give kudos to one of my co-workers. Dr. Tim Wilkins is one of our engineers; he's based in Sunnyvale, California, and I mentioned him earlier in the show. Tim's been working for a long time doing optimization. Tim has a PhD in physics and was interested early on in extracting good performance for some of his applications, especially on AMD processors but also on Intel processors. Tim's our secret weapon, to be quite honest, and Tim has an intimate knowledge of the architecture of the part. Currently he's actually working in the architecture group, providing workloads to the guys that design the chips so that they can see what kind of performance they get back when the design is in the VHDL stage, the modeling stage. So Tim knows the application extremely well.
He knows FFTs and GEMMs and what they're used for, and he's very familiar with the architecture of the chip, so he's able to write this code in a way that extracts the maximum performance from the part. After he's written the basic software, we're able to take that, make it a general-purpose routine, and package it with all the interfaces that the standard BLAS and LAPACK routines need. We're able to do tuning at a block level to make sure that we take maximum advantage of cache hierarchies and things like that. We do a fair amount of testing on the machines we have access to, to make sure that things work well with the multi-core architectures that we have. We have some machines now that are four sockets with eight cores each; that's 32 cores on one machine, so it takes a fair amount of effort to make that run well. So that's the basic flow that we have. Generally, we understand the architecture much better than the typical physicist is going to have time to, and so they're able to take advantage of our expertise.
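As a generic illustration of that block-level tuning (this is not ACML's kernel), here is a cache-tiled matrix multiply in C. The tile size NB is the knob a library would tune per processor so the tiles in flight stay resident in cache instead of streaming whole rows of B through memory for every element of C.

```c
/* Not ACML's kernel: a minimal illustration of block-level tuning.
 * Working on NB x NB tiles keeps each tile cache-resident while it
 * is reused; NB would be tuned per processor in a real library. */
#include <stddef.h>

#define NB 64   /* tile size, chosen so the tiles in use fit in cache */

/* C += A * B for square n x n row-major matrices. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += NB)
        for (size_t kk = 0; kk < n; kk += NB)
            for (size_t jj = 0; jj < n; jj += NB) {
                size_t ie = ii + NB < n ? ii + NB : n;
                size_t ke = kk + NB < n ? kk + NB : n;
                size_t je = jj + NB < n ? jj + NB : n;
                for (size_t i = ii; i < ie; i++)
                    for (size_t k = kk; k < ke; k++) {
                        double aik = A[i * n + k];
                        for (size_t j = jj; j < je; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
            }
}
```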
So let me ask you about this, then: both Intel and you have come up with concepts, I think you have different names for it, but hyperthreading, right? Multiple threads of execution on the same core. Do you guys take advantage of that? Are there any kernels that benefit from it, or is the conventional HPC wisdom of "turn off all forms of hyperthreading" a better adage here?

Our product offerings today actually do not provide any hyperthreading. As everybody's aware, Intel's machines do provide hyperthreading, and have for ages. The concept behind hyperthreading is that you had resources on the machine that were underutilized by your application, so if you ran more than one thread you could more fully utilize those resources, and the typical case was that the floating-point unit was underutilized. As an example, say you're running a bunch of sines and cosines: the instruction mix on sines and cosines is relatively add-rich, not very many multiplies, and there are also a lot of branches and things like that, so the floating-point unit is relatively underutilized. If you run multiple threads on that same machine, hyperthreading them, all of those integer threads can probably run at full speed and still have plenty of floating-point bandwidth left over. That was the original impetus behind hyperthreading. In our case we have not done that. The Bulldozer architecture introduces a new feature that's not strictly hyperthreading, but it is a sharing of the floating-point unit between two integer cores. We don't do anything specific to tune for hyperthreading, and in fact on Intel machines, if you're running a workload that really uses the floating-point unit, DGEMM is the classic example, our recommendation is to turn off hyperthreading.

So back on the library: when people think about really crazy optimizations, I think about assembly a lot. How much assembly is actually being used, versus how much higher-level language, you know, C, Fortran, C++?

Okay, that's a great question. At the high level, of course, the majority of ACML is written in Fortran. All of the higher-level BLAS routines, the FFTs, and the random number generators, those are all written in Fortran, so we require a Fortran compiler for ACML. The heavy lifting, the tuned portions of ACML, those are handwritten in assembly language. The important parts of that are the one-dimensional FFTs and the GEMM kernels that we have, in all four precisions. Things like DTRSM and DTRMM, those higher-level BLAS routines, are written in Fortran themselves, but they end up calling the assembly-language GEMM kernels to get the real work done. So if you were to profile a typical application that's really taking advantage of ACML, you'll see some Fortran entry points in there, but the library will spend the majority of its work in an assembly kernel.

So is ACML polymorphic, in the sense that it senses what processor chip is being used underneath and selects the appropriate kernel at runtime, or is that a link-time decision?

It is a run-time decision, or traditionally has been a run-time decision; ACML 5, which we're working on right now, is going to have an exception to that, which I'll discuss. With the SSE code paths we look at two things. We do look at the processor ID, to know if we need to run, for instance, a Nehalem kernel or a Woodcrest kernel or something of that nature. But we also look at the instruction set available: if a processor has SSE3 instructions, then we know we can take advantage of those. With the new architecture we have a break from that. With ACML 5 we will introduce a new set of libraries that are specific to AVX and FMA4, and we will continue to have a version of the library built to support the SSE and SSE2 code paths. So in that case it's a link-time decision.

Along the same lines, you were talking about auto-sensing and having hand-tuned assembly. How closely do you track hardware releases with your software releases? A new chip comes out with new features that are really great for you, you've mentioned the new Bulldozer and whatnot, but in general, how much do you have to keep them in sync?

Well, it's crucial for us to keep relatively well in sync with our hardware releases. We have customers who are buying these machines to extract good performance out of them, so they need a library, they need software, that works well on them, and of course the software won't work well on them unless we have libraries to support it. So we tend to time our major library releases for a new hardware release as well.
So there's been talk about some other stuff you guys have been doing. Have you been doing much work optimizing for non-traditional CPU platforms?

Yes, that's a great segue. If you go on our web page and look, and I guess I can put in a plug for our Developer Central web page: if you went to amd.com and searched for ACML you would find us, but there's also devcentral.amd.com, and you can get there through that link as well. You will see that we've been busily working on a new set of libraries to support our OpenCL software development kit offerings, and this library is designed to be used with our GPU products. As pretty much everybody's aware now, GPUs are a big up-and-coming thing in high-performance computing, and it just so happens that AMD has GPU product offerings. So we've been working on libraries to support the use of these GPUs in high-performance computing. The tack that we've taken is to support this through the OpenCL programming environment, and in fact we just released version 1.4 of the AMD Accelerated Parallel Processing math libraries to support our OpenCL SDK. The library currently has a set of level 3 BLAS routines and a set of single-precision and double-precision complex FFT functions. If you go look at the download pages for this, you'll find the 1.4 version of our OpenCL math libraries.

So on a previous show we had Jack Dongarra and some of his lab on here talking about the PLASMA library and its add-on, MAGMA. Have you guys looked at any of these other, weird approaches to doing the BLAS and LAPACK?

That's another good question. I'm not sure I'd call them weird. I would say that, given the complexities of programming with heterogeneous compute nodes, and with the many cores available in GPU types of products, you have to start taking approaches such as those. And of course the answer to the question is yes, we have been looking at those, and in fact we are collaborating with the University of Tennessee guys to have an implementation of MAGMA supporting our OpenCL libraries and our GPUs.

That sounds pretty cool. Let me ask you a follow-on question to that, then, in terms of polymorphism again: are these libraries going to be polymorphic in the sense that they can determine not only what kind of CPU is there, but also whether there is a GPU present or not? And the reason I ask is because you said earlier in the conversation that your customers like having one executable no matter what kind of machine it is. Does that extend to this realm as well, that it auto-senses whether a GPU is there and does GPU things, or falls back to optimized CPU things?

Yes. Our OpenCL library, which is of course designed to be run on GPUs, can also generate code that will run on the CPU. The OpenCL compiler that we're providing can generate code for either CPU or GPU, and the library supports both as well. For instance, in the example applications you can control which of the processors you run on via an environment variable: you can restrict it to running on the CPU or on the GPU. In a perfect world, of course, we would divide workloads between both, and take advantage of all the CPU cores that are available and all the GPU cores. We're not quite there yet, but that's certainly a goal that we would like to work towards.
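A minimal host-side sketch of that GPU-first, CPU-fallback idea, using the generic OpenCL API rather than AMD's library code (error handling trimmed for brevity; build with -lOpenCL):

```c
/* Ask OpenCL for a GPU first; if none is present, take a CPU device
 * instead -- the same kernels can then run on either. */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    char name[128];

    clGetPlatformIDs(1, &platform, NULL);

    /* Prefer a GPU; fall back to the CPU if the lookup fails. */
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL)
            != CL_SUCCESS)
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);

    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("running on: %s\n", name);
    return 0;
}
```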
So what's the licensing for ACML?

Licensing for ACML is relatively straightforward. We do have a proprietary license for ACML, but we basically allow you to use it as long as you're not in North Korea or other such blacklisted countries. We provide it for free; you can download it from our web page, and we have a click-through license agreement that you're presented with when you install the product. We also changed our licensing recently to allow click-through redistribution agreements: we now grant redistribution rights for people who want to use ACML and bake it into products that they need to ship to other customers.

So you've mentioned a couple of upcoming features, like ACML 5 and things like that. Are there any other things you can publicly discuss that are on the roadmap and upcoming?

The primary focus for ACML 5.0 is support for our Bulldozer processors. We'll have tuned DGEMM and SGEMM kernels in this release, and also complex one-dimensional transforms. In a future 5.1 release we'll be adding more of the BLAS kernels; I'm personally working on the ZGEMM kernel at the moment, and we'll also round out all the precisions in the GEMMs, which will take care of all the level 3 BLAS that are available. We will of course be extending our OpenCL libraries in the future and adding more support for our new GPUs.

Okay, well, Chip, thanks very much for your time. What's the website where we can find more information on ACML?

Sure, the best place to find ACML is at devcentral.amd.com, and of course you can also get there if you go through amd.com and do a search for ACML.

And thanks for having me on the show, guys. Appreciate it. This was great. Thank you very much.