Before we get started, I'd like to say that for the holiday season both Jeff and I will be taking a break from the show. We'll be back in the future with more episodes, so RCE will still be here.

Yeah, so we just wanted to make sure that if you're watching the website or the RSS feed, you'll see a gap of a couple of weeks before episodes start up again in January. It's kind of an unwritten rule that you're not supposed to put time-related notices within a podcast, but we just want to make sure that you're all aware that we'll be gone for a couple of weeks and then back after the holiday season.

Welcome to another edition of RCE. I'm your host, Brock Palen, and again I have, from Open MPI and Cisco Systems, Jeff Squyres. Jeff, thanks for joining us from your office halfway across the country.

Good morning, Brock. How's it going?

Not bad today. We've been trying to get a show together for a little while, and luckily we found some time to get our guest on. We have David van der Spoel from GROMACS, which I believe is an MD package. I have a number of users that use GROMACS, so it's pretty familiar to me, and I believe it's involved with some other projects out there. He's at Uppsala University in Sweden.
So we're actually six time zones apart here, so thanks, David, for helping us out at such a different time.

And I want to give the standard disclaimer here up front. A little-known secret: we do a little coordination before we start recording these things, and we have definitely determined that we cannot pronounce David's name properly, or his university, or his town, because we're ugly Americans. So we give all the standard disclaimers and apologize for all of that in advance.

Well, I appreciate that you guys are trying anyway.

Well, why don't you introduce yourself and tell us your involvement with GROMACS?

Yeah, so I'm a professor of computational molecular biophysics in the biology department at Uppsala University, and I have been working on GROMACS for close to 20 years now.

So let's roll right in. I'm sure not everybody listening knows what GROMACS is, or uses something similar — there are a number of MD packages out there. Is GROMACS an MD package, or am I completely off base here?

No, you're absolutely right. It's a molecular dynamics simulation package, and the abbreviation stands for Groningen Machine for Chemical Simulations, which is a quite ugly name, but GROMACS is easy to recall, I would think.

What is a little bit of the history behind GROMACS? Where did it come from? What was its intended purpose from inception?
So in the late 80s people started experimenting with parallel computing, and since molecular dynamics is a well-known resource hog, parallel computing was an obvious thing to do. Basically we started from scratch writing a code in C, based on the older codes that were around, with the functionality that we knew, and we started doing stuff in parallel on transputers, and then moved on from there to more common things like workstation clusters.

Transputers?

So this is a special kind of chip, made by an English company called INMOS, who built a chip that had built-in communication hardware 15 years before the Opteron. It was very ingenious, but alas, computing and programming on it was a bit cumbersome, so it was hard to use in effect.

So I have to ask: if you were using transputers back in the early 90s and late 80s, did you happen to use LAM, perchance?

I don't think we used LAM, no. We had the things in a box hooked up to a Sun computer, and we were programming in Occam, and Occam was hell because of the weird rules for writing the code — you have fixed indentation. And just when I thought I'd gotten over that bit with fixed-indentation coding, Python came along, and I thought, well, bad things never go away.

Oh wow. Yeah, the reason I ask — because it's a little-known fact, and today's the day for secrets and little-known facts — is that LAM's very first target was transputers, back in the early 90s. And it was actually a publicity stunt to add the MPI layer, which eventually became the dominant and most used part of it. But I just haven't heard the word transputers in many years. So small, small world.

Well, that's the history, yes.

Excellent. Well, so who uses GROMACS, and why do they use it? You talked a little bit about the history, and that parallelism was an obvious target. Could you give us some examples, too, about why you needed to spread onto multiple machines, and how did that spread throughout the community?
So who uses GROMACS — let's start there. I think chemistry and biology is our main target. We also have quite a few physicists out there, and people use it for anything that you can use this kind of classical model for — anything from biomolecules to carbon nanotubes to gases. I guess there are about five to ten thousand academic users, and quite a few industrial users as well.

Cool.

Yeah, so we have a mailing list with two thousand people on there, and they're willing to stand being sent 40 emails per day, so there's quite a bit of spam there. So there are probably more users than those that are subscribed to the mailing list.

You know, we always find it's very hard to tell, particularly in open source — how many users do you have? You have no idea, because the silent majority of people just download and use it. But it's nice to hear you have so many subscribers and such an active mailing-list community. That's a very good indicator.

So, I'm sorry, I did ask you a barrage of questions there. Let me go through a different one of those again, because I asked you all at once. When you started GROMACS, back in the history part there, how did you do the parallelism itself? You mentioned some of the models that you used, but back in the early 90s, before even MPI was standardized — PVM was just starting to come out — there were dozens of different parallel models before the community settled on one. So how did you develop this? What models did you end up using, and why?

Yes, well, apart from Occam, we tried a few things. This was basically before PVM, even. We had a homebrew compiler that did parallel Pascal, made at the computer science department at our university in Groningen in the Netherlands. We played with that as well, but we found it wasn't stable enough.
So the first try was to go for C with a commercial library for parallelism, and I think we used that for a little while, and then finally we moved on to PVM, and later on MPI.

If it's of interest, we actually also built a machine with another kind of hardware — Intel i860s, which used to end up in printers and stuff because they were good for that — with special communication hardware, which was also made by a company for us. And we had a big rack full of 32 of these boxes — 32 processors, each of them five megaflops or so. So this was really breakthrough performance in the day.

So you guys really used a lot of specialty hardware. You were really digging to find every little bit of performance you could from the technology available at the time.

Yes. Initially it was even worse — or even more ambitious, I should say — in that we were planning to implement the algorithms in hardware, basically. But we soon found out that this was beyond our capabilities, and this is when we moved on to a programmable chip with special-purpose communication — but at least a programmable chip for the main part of the code.

And that was the transputer? I'm not familiar with them.

No, this was the i860 from Intel.

Okay.

Yeah, this was quite a bit faster than the transputer, actually.

So when you say programmable, you mean software-programmable, not hardware-programmable like an FPGA or something?

Exactly. This was just a normal processor. This was basically Intel's first try at a fast, RISC-like processor — like they did later with the Itanium — but they basically gave up on it.

Okay, so moving into how you derive modern performance. When I was building GROMACS — and I've built several versions of it for several users, with different options on and off — where are you deriving most of your performance?
I notice there's a lot of assembly in there — it's pretty much all that. Is a lot of GROMACS in assembly?

Well, it's funny that you should ask this, because from version 3.0, which was released in 2001, we had lots and lots of assembly, which was hand-coded by a PhD student, Erik Lindahl, who basically decided one beautiful day that he had nothing better to do. So he sat down for half a year and programmed basically a million lines of assembly. Now, it sounds worse than it is, because many of these lines were actually copies with small modifications, but this has served us very well, and this is to a large part where the performance comes from.

But very recently we actually dropped all the assembly. So we went from a million to zero lines of assembly, and what we're doing now is we have normal C code with assembly intrinsics, which are portable and in that sense much easier to use. So basically we're moving away from that.

Interesting. So this is like a standard thing where you can basically call C functions that call the assembly for you, similar to SSE intrinsics?

It's a bit like that, yeah — it is SSE intrinsics, actually; that's the real name, and we're just using these. And actually we're generating our code as well. We're not writing down all the code ourselves; we're generating all our loops, and that makes it much more manageable.

Another thing that we have been using extensively in order to get performance is not calculating anything if we know that the result is going to be zero, and there's actually quite a bit of that in MD, and I presume in many codes. So if you're going to multiply two numbers and one of them is zero, then you might as well not do it, because you know the result beforehand.

Interesting. So you can actually do a compare and branch, and that's faster than doing a multiply — is that what you're saying?
Well, the thing is, we know beforehand. We load up our molecules, and we know beforehand the properties of these things, and not all atoms in our molecules have a charge. Now, if an atom doesn't have a charge, then we don't have to calculate the charge-charge interaction — the Coulomb interaction — and we basically mask out these atoms for the rest of the calculation. So we do the masking once, and then we can use that mask for the rest of the calculation, which can run for days. So that's easily gained.

I see, so it's not that you're doing a test and compare before every single multiply. It's more of an intelligent, macro-level kind of thing where you can mask out, like you said, entire atoms and molecules.

Exactly. And the same actually applies to much of our code. Basically we're trying to use smart algorithms and also avoid if-statements. Anyone who's ever tried to optimize any code knows that if you have an if-statement in a loop, the processor is going to choke on it. So we just take the if-statements out of our inner loops and basically multiply the loops — and then you have ten loops instead of one loop with ten if-statements, but they go a whole lot faster.

Cool. So you mentioned — well, Brock and you both mentioned SSE. Do you use any of the other intrinsics, like MMX and some of the other advanced things that are great for math?

Actually, yeah. So this is not my personal work — this is Erik Lindahl's — but he has implemented the stuff for MMX, 3DNow!, and VMX on PowerPC. We got some help with tuning on Blue Gene. Also, we had Itanium assembly at one stage, with help from HP, I think, if I remember correctly. So people are also eager to help us with optimization.

Cool. Do you have optimizations for some of the newer chips, like the Nehalems, and — I'm sorry, I forget the name of the most recent AMD. Is it Shanghai?

Istanbul.

Istanbul, thank you.

Yeah, so I think we're not using anything above SSE2 or 3.
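The mask-once-then-reuse idea David describes can be sketched in a few lines of C. This is a toy illustration under my own assumptions, not GROMACS source: a hypothetical `build_charged_mask` records the charged atoms once, so the hot pair loop never re-tests charges (the if-statement is hoisted out of the inner loop, as he describes).

```c
#include <stddef.h>

/* Toy sketch (not GROMACS code): build the mask of charged atoms once,
   before the simulation starts. */
static int build_charged_mask(const double *charge, size_t n, size_t *idx)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++)
        if (charge[i] != 0.0)
            idx[m++] = i;       /* remember only the charged atoms */
    return (int)m;
}

/* Coulomb-like pair energy over charged atoms only (toy 1-D distances).
   Note there is no charge test inside this loop: the mask did it once,
   and this loop can then run unchanged for days of simulation steps. */
static double coulomb_energy(const double *q, const double *x,
                             const size_t *idx, int m)
{
    double e = 0.0;
    for (int a = 0; a < m; a++)
        for (int b = a + 1; b < m; b++) {
            size_t i = idx[a], j = idx[b];
            double r = x[j] - x[i];
            if (r < 0) r = -r;          /* |x_j - x_i| */
            e += q[i] * q[j] / r;       /* q_i * q_j / r */
        }
    return e;
}
```

With three atoms of charge +1, 0, and -1, the mask keeps only the first and last, and the uncharged atom costs nothing in the pair loop.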
I'm not entirely sure. I think the Nehalems and things like that have more cores, and they are better at predicting and such. But what is actually quite fun about the Nehalems is that we can now, for the first time, use hyperthreading. So if you use two processes on one core with hyperthreading, we get about an 80% gain in throughput.

That's fascinating, because in the prior generations — the hyperthreading in the Harpertowns and the other Xeons and whatnot — really wasn't all that effective for HPC. But this is neat to hear, because Intel has told me that they really like the hyperthreading in the Nehalem and that we should be trying it for HPC applications, and it's really refreshing to hear from out in the real world that people are seeing some improvements. That's cool.

Yes, it's basically for free. And the other thing that we've gotten for free a couple of times now is the widening of the bus. So initially it was 32-bit, then in turn 64, now 128 bits, and each time that happened it has actually given us almost a factor of two in performance for the part of the code where we are actually using it, because you're basically doing four instructions simultaneously instead of one.

So a quick question, though. I was under the impression that most MD codes were memory-bandwidth bound. Are you doing intelligent blocking on cache sizes on these different CPU types, and trying to use memory intelligently?
I understood they kind of operated at a fraction of the processor's theoretical performance because the memory couldn't keep up.

Yes and no. With infinite cache you're CPU bound, but it's not infinite cache that we have to live with. Memory is important, but you can do quite simple workarounds there. You can basically sort your problem — sort your coordinates — such that you are in a better order for the memory, and in that way actually also gain like a factor of two. So by sorting our calculations such that they stream through memory faster, you gain almost a factor of two in performance.

Okay. Yeah, because I'm involved with a project here at the University of Michigan that uses GPUs and their high memory bandwidth to do MD calculations, and they're having good results with that. Are you guys looking at using any of this new, you know, strange hardware that's becoming available again?

Absolutely. We are collaborating with Vijay Pande at Stanford, who's doing the Folding@home project, and they have a couple of people in computer science at Stanford looking into GPUs — and they have for a long time, actually. So the streaming computing is quite big, but it has its drawbacks right now. It's not really competitive yet, but it may get there, absolutely.

What kind of drawbacks are you running into that we should be aware of?

Oh, well, until very recently you couldn't have integers, and that makes life quite miserable for a programmer. But this has been solved now with the latest — what is it — CUDA
and OpenCL. Also, I think only very recently have the first cards become available that allow you to do double-precision calculations. But the biggest drawback is basically the interface between the CPU and the GPU, and this makes it prohibitive to do a lot of communication. So basically you have to move your whole application to the GPU, calculate for a long time, and then move some stuff back to the CPU and store it. And this also makes scaling very difficult at the moment, so it's difficult to scale beyond the couple of GPUs that you have in the box.

Wow. So you're not doing the typical CPU-plus-GPU kind of thing — or at least typical to me. You're actually moving the bulk of your code down into the GPU and deliberately limiting your CPU-GPU communication. Is that an accurate characterization?

Yes. If we don't do that, then it's pointless, basically. So the GPU is, say, two to ten times faster than the CPU — but that's one core of the CPU, for us. So if I have a box with eight cores, then that is faster than one GPU for sure, and depending on the problem a bit, it could also be faster than four GPUs. So it's only for certain problems that it's actually useful.

That's interesting, too, because that is kind of the basis of Intel's argument about why they haven't really made much of a foray into the accelerator market: they're saying, well, we'll just put more and more cores on there, and that'll do effectively the same thing. And so you're kind of confirming that, or at least saying that the picture is very gray, and it's dependent on your application as to what you're going to see. Hmm, fascinating. I always love hearing confirmation of what the marketing guys say from the people who are actually really using the products and doing real science.

Okay, moving on beyond the accelerator thing.
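Earlier, David described replacing the hand-coded assembly with portable SSE intrinsics — plain C calls that the compiler maps directly to vector instructions. A minimal sketch of that style, again my own toy example rather than anything from GROMACS:

```c
#include <xmmintrin.h>  /* SSE intrinsics: vector instructions from plain C */

/* Toy illustration of the intrinsics style: multiply two arrays of four
   floats with a single SSE multiply, no hand-written assembly needed. */
static void mul4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);            /* load 4 unaligned floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb)); /* 4 multiplies in one instruction */
}
```

Because the intrinsic names are the same across compilers on x86, code like this stays portable in a way the old per-platform assembly files were not — which fits the "million lines to zero" move he describes.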
Let me throw another question in here: the effects of NUMA. So, you know, before we had these accelerator technologies, you must have been dealing with NUMA effects for a long time — not just with Opteron, but with things even earlier than that. How does your code take advantage of that, or exploit that, or have to be aware of that?

Until now we have basically relied on the MPI library to do this kind of stuff for us. So we basically haven't bothered so much, and that has worked quite well. I mean, Open MPI, which we have used a lot, and LAM/MPI before that, have quite good support for SMP communication, and we've just used that as is.

We are thinking about going into mixed parallelism, though — and it's not entirely what you're asking for — where we basically combine SMP computing on one node with MPI communication between nodes. And the obvious reason for that is that you're basically streamlining the communication between nodes, such that you have larger but fewer messages between the nodes than you would if you had one MPI task per core, which was, I think, the more common paradigm for parallel computing.

All right, very cool. So actually, with that model, the NUMA effects might become even more important. Is that something you anticipate looking into?

I'm not really sure how we could take advantage of NUMA.

I guess, let me rephrase my question. It's not so much to take advantage of, but to make sure not to hurt yourself because of NUMA. Or are you just basically relying on the OS, and the OS does a good enough job of the memory placement and whatnot?
Well, I think this was an issue with Linux previously: processes were not tightly coupled to a CPU, and if you had an operating system event going on, the processes could all be shifted to the next core or something, and that would of course hurt cache performance quite a bit. Which is, by the way, one of the reasons why Cray is developing their own Linux version, which basically takes out the operating system noise, as they call it, in order to streamline performance even more. But I think nowadays this is quite well handled in Linux, so it's not a big deal.

Okay. Here's another curiosity — and I'll explain myself in a minute here — but do you actually probe Linux and whatnot to find out what the cache sizes are, or do you just kind of have hard-coded defaults in there?

No, we don't do anything like that. The only thing that we sometimes do is use the FFTW library for fast Fourier transforms, and they have these built-in algorithms for testing which version of their algorithm is fastest. So that can be done, and I'm not sure what they are doing under the hood. So for us, the performance is really dominated by what kind of problem you happen to do, and we're not taking advantage of more or less cache explicitly.

Gotcha. The reason I mention it — and I'm sorry, this is a little bit of an advertisement — is that there's a new sub-project under Open MPI called hwloc, hardware locality, that can report to you exactly what your cache line sizes are, and which processors share which cache, and so on. So I didn't know if that would be useful to you or not.

It could in principle be. I mean, I don't think we have thought about it in great detail, but I could imagine that you could unroll your loops a bit more if you know you have more cache, or something like that. And since we're doing this automatically, it might actually be possible to create multiple loops for the same kind of calculation which can take advantage of that.
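David mentions FFTW's trick of timing its own algorithm variants at startup to pick the fastest one. That planning-by-measurement idea can be sketched generically; the two toy "kernels" below are hypothetical stand-ins of my own (they just sum an array two ways), not FFTW internals:

```c
#include <time.h>

/* Sketch of FFTW-style planning-by-measurement: time each candidate
   kernel once on a sample input, remember the fastest, reuse it after.
   Toy kernels, not FFTW code. */
typedef double (*kernel_t)(const double *, int);

static double sum_fwd(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    return s;
}

static double sum_unrolled(const double *x, int n)
{
    double s0 = 0.0, s1 = 0.0;          /* two accumulators, 2x unrolled */
    int i = 0;
    for (; i + 1 < n; i += 2) { s0 += x[i]; s1 += x[i + 1]; }
    if (i < n) s0 += x[i];
    return s0 + s1;
}

/* "Plan": measure both candidates and return the faster one. */
static kernel_t plan_fastest(const double *sample, int n)
{
    kernel_t cand[2] = { sum_fwd, sum_unrolled };
    kernel_t best = cand[0];
    double best_t = 1e30;
    for (int k = 0; k < 2; k++) {
        volatile double sink = 0.0;     /* keep the calls from being optimized out */
        clock_t t0 = clock();
        for (int rep = 0; rep < 1000; rep++) sink += cand[k](sample, n);
        double dt = (double)(clock() - t0);
        if (dt < best_t) { best_t = dt; best = cand[k]; }
    }
    return best;
}
```

Whichever variant wins on a given machine, both must of course produce the same answer — the measurement only picks among correct implementations, which is exactly why the approach is safe to run automatically.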
So that's an interesting idea.

Okay, moving on to some other tools and complementary bits to GROMACS. What tools do you normally use with GROMACS? I mean, GROMACS is a core MD package, but it's not visualization, as far as I know. What do you use in your own workflow with GROMACS?

Okay, that's a good question. So there are a few packages, I would say, and it depends a bit on the user. We like PyMOL a lot, which is a molecular graphics program by Warren DeLano, who passed away recently, unfortunately. Other people use VMD, from the Klaus Schulten group at Illinois — this is also visualization. Then there are a couple of quantum chemistry packages which you can use in conjunction with GROMACS, and that includes actually most of the widely used packages: Gaussian, GAMESS, CPMD, and a couple more. So these can be used. Then there are a couple of more or less related programs, like WHAT IF, which you can use for some kinds of pre-processing. And then of course we create lots of simple line graphs, and people use xmgrace or gnuplot to plot these, or Excel, what have you.

So anything and everything.

No, that's actually correct, yeah.

I was reading, and there was some mention of GROMACS and Folding@home. What's the relationship between GROMACS and Folding@home?

Yeah, so for a couple of years now — I think five or six years — the Folding@home project has used GROMACS for most of their simulations. So if you're downloading the Folding@home screensaver, you might be running GROMACS under the hood. And the reason why, of course, is performance. One of the GROMACS developers was a postdoc at Stanford at the time, I think, and that's when this was initiated, and since then they have been using GROMACS a lot.

And could you give a quick rundown of what Folding@home is?
Right, so Folding@home started out as a distributed computing project. Very basically, volunteers download a screensaver for their Windows or Mac boxes, or Linux even, which runs in the background. As soon as your CPU is idle, it gets a protein structure from the server, and then it does a molecular dynamics simulation. The aim of this is to predict the structure of the protein from the simulation, and this was actually quite successful. They have been able to predict the structures — or postdict, I should say, because these were known structures, so they could actually check the results. But they have done this for quite a few proteins, and since then they have moved on into basically all kinds of biochemical problems that you can study with this kind of software.

Okay, going back just a little minute here. So Brock asked you what other tools are used; I want to ask you what other middleware and tools are used inside of GROMACS. You mentioned MPI, for example, and you also mentioned FFTW. Are there other building blocks that you build GROMACS on top of?

Actually, we're trying to avoid too many dependencies, because it can be painful. But MPI is of course absolutely necessary — all modern desktops even have two to four cores, so you want to run MPI there. Then we use FFTW for the fast Fourier transforms, but we can also use vendor libraries like Intel's MKL or other things like that. Other than that, there's actually not a whole lot. We have some analysis tools that use, for instance, the GNU Scientific Library, but that's basically it. No, hang on — we also use BLAS and LAPACK for a very small subset of problems where we do matrix stuff, yeah.

Okay. Well, let me go off on a different tangent here, then. How does the GROMACS community work? How do you add new features? How do you find out what people want and need, and so on, and actually then get that implemented inside GROMACS?
Yes, all the important communication takes place on the GROMACS developer mailing list, and this is often where you will find people who come with more detailed questions than just the normal users' "how do I do this?". So people say, "I want to implement this and that," and then we tell them, well, by all means — and if they do it and provide us a patch, then we can incorporate it. Other than that, there are a couple of groups who have developers that can actually add stuff to our source code repositories. These are based in Sweden and Germany, and a couple in the States as well. I think there are maybe 15-20 active developers right now.

That's great. So how do you find the quality of random patches that come in? We in Open MPI, you know, periodically get patches from the wild, so to speak, and some of them are great, and some of them are of "it works for me" quality and need to be hardened up before they can be put in the mainline codebase. Do you find similar, or how does that work?

I guess we have the same problem there — that not everything is great, coding-wise. But on the other hand, our own code is not so great either, because it has a history of 20 years. And we're working on it, but you know, this always somehow has lower priority than getting new features and bug fixes. So cleaning up is not high on the priority list, unfortunately.

Gotcha. So you mentioned 15 to 20 active developers. How do you maintain that as a community? Do you rely pretty much solely on email for communication? Or have you been working with these people so long that it's just easy to communicate and work with each other, and you see each other at conferences and things like that?

It's a bit of a mix. So some of the people I've known for 15 years, of course, and others have become developers only recently. So it's a mixture.
So we use, of course, software for the source code management — we use git. We actually just switched this year from CVS to git, which was a bit painful in the beginning, but it's getting there. But communication is mainly over the developer mailing list, and this is also good because the communications are basically stored, in that manner, on different servers around the world.

On a similar line, what features are coming to GROMACS? Are there features or other things in the pipeline right now, sitting in the git repo, that aren't in an official release?

Yes, there are some things we're working on: support for more force fields — so more models to use with the chemistry that you're interested in — and even better scaling. That has also been done and has to be tested before it can be released, of course. The mixed parallelism I just mentioned is in the pipeline, plus a couple of new algorithms. Most of the new stuff — some of the performance stuff — is basically what we're developing ourselves, because of course we're trying to get inspiration from others, but we have quite good performance already, so we have to come up with new stuff to make it better. And we're trying to implement other algorithms that people have published, basically, for stuff that users want, and applications.

So it occurs to me that we actually completely forgot to ask you something earlier in the interview here. What exactly is GROMACS? Is it middleware? Is it an API? How do you use this as a user in your workflow? Do you export language bindings, or are there command line tools, or how does it work?

Okay, so it's completely command line driven. So we have 80 to 100 programs that you use from the command line.

Well, then git should have been a natural mapping for you guys, right?

Yeah, that's true, but I mean, this git is not one-to-one with CVS.
So that is basically — yeah. No, but we have 80 to 100 command line tools, which can be a bit much for beginners, especially if they have never used a command line before. But in our defense, all the programs have basically the same interface — they all have the dash-h flag, which you are used to, for help — so it's not as bad as it sounds. But someday, maybe, we'll get a GUI as well. Just not yet.

I've had a number of users at my own site use some third-party packages that rely on libgmx from inside GROMACS. So I do know some people who are using a bit of the GROMACS library, you could say, outside of the command line tools that come with GROMACS.

This is new to me.

Yeah, it was to me as well — like, "I need libgmx" — like, what? It was new to me.

Okay, that's interesting. No, but there is quite a bit of stuff going on there, and not everyone is communicating everything to us either. So people publish basically variants of the GROMACS software where they've hacked in their own stuff. In some cases we actually had to tell them to please put it under the GPL, the GNU public license, which we are also using. But other than that, everyone is welcome to do whatever they like with it.

So the whole project's under the GPL?

Yes. Until now it has been under the GPL, and we are considering putting part or all of it under the LGPL, the Lesser GPL, to allow, for instance, commercial packages to link to it. But we haven't taken a formal decision yet.

So how would you make a formal decision like that in an informal community like you've got? You know, what kind of consensus do you need? Are there actually intellectual-property and legal issues involved there, or is it just kind of a straw vote?
No, I think we're basically three main guys who decide this between us, and it's as simple as that.

Okay. And this is a quick thing before we end here. There are a lot of different MD packages out there with different focuses, whether it's materials science or biology or chemical systems. What other packages do you admire, as an MD user and developer?

Yeah, to be really honest, I don't have so much experience with other packages, because basically from the beginning we decided to develop our own. However, very recently there has been a new addition — a new kid on the block, so to speak — which is the code called Desmond. It's made by a company, D. E. Shaw, who have their offices in Manhattan, and they've also built a special-purpose computer. So basically they did what we set out to do 20 years ago, and this has shown phenomenal performance, actually — it's like two to three orders of magnitude faster than anything else that you can do. So this is really an impressive thing.

So how does one get involved in GROMACS? Let's say I'm a random user and/or developer out there. How would I go about joining your community and actually contributing code and doing things like that?

Yeah, it's very low-key. So basically you write an email to the GMX developers list saying, "I would like to do this and that, but I don't know how to do it," or, "where should I start?", and then you will get some friendly advice — and that's about it. People are also welcome to drop by if they have bigger plans, and stay with us and work with us for a while, or with one of the other partners at Stanford or in Germany or in Stockholm. So we are open to all kinds of contributions, really.

You never want to turn down a free contribution. That's how open source works, right?
Well, to a point, of course. Sometimes it's more work for us than the gain, basically, so that's of course where you have to draw the line — but it's not always so easy to predict that beforehand.

Yes, it's the enthusiasm that you don't want to dampen, regardless of the quality of the work, sometimes. And that is a fine line to tread, indeed.

So, contact information, downloading GROMACS — where's all that located? Websites, mailing list addresses?

The easiest way is just to go to www.gromacs.org, and there you will find pointers to mailing lists, to downloads, and also archives of the mailing lists and such. So you can really find everything there. We also have a Bugzilla, of course — there's a link to that as well. So everything is really on that one website.

Okay. Well, thank you, David — thank you very much for taking some time out to speak with Jeff and me. This show will go out soon. Thank you; we appreciate your time. Have a good holiday season.

The same to you.