Good morning everybody. It gives me great pleasure to introduce our next speaker, Professor Martin Guest from the University of Cardiff. Martin and I go back quite some time now, so I'm not going into the details, because you're not here to see me; you can see me on Zoom and whatever. You're here to see Martin, and Martin is giving a very interesting talk, actually very similar to the one at CIUK which he couldn't give this year. He's talking about the performance of community codes on multi-core processors. Martin.

Thank you very much, and thanks for the invite. I was a bit surprised to get it, because I didn't genuinely think you'd be interested in this stuff. Hopefully within the next hour at least you won't all be falling asleep; I shall be checking at regular intervals for signs of sleep from the audience.

Right, so what are we going to try and do? As was pointed out, these presentations, and I've been doing them for a long time, are typically given at Computing Insight UK. I actually kicked off the original Machine Evaluation Workshop, the predecessor of CIUK, and that was, I was trying to remember this on the way up, in the very late 80s. So these things have been going on a long time, and one has witnessed a tremendous change in the landscape being reported on. In some ways this is a bit sad, because one has seen a huge variety of suppliers and processors, and now everything is aligning to a simple choice between AMD, possibly Arm, and Intel, and in some ways that's a wee bit depressing. But I can't waffle like this, because if I do, my reputation for always overrunning with 100 slides is going to be even more pronounced, so I need to crack on.

So what are we going to be trying to do? It would have been really straightforward if I'd just given the CIUK talk, but something has happened in the interim, namely access to Sapphire Rapids, and that has changed things pretty substantially. That's the good news. It wasn't good news for me, because it meant I had to change every damn slide, but it's probably more interesting than it would have been if one were just talking about Ice Lake. What I don't have is access to Genoa, so a fair comparison it's not: it isn't fair to compare Sapphire Rapids to Milan, because they are completely different in terms of capabilities.

So we're focusing on systems featuring processors from both Intel and AMD: Sapphire Rapids and Ice Lake, the leading SKUs from Intel, and, unfortunately, not Genoa from AMD. The baseline clusters are a couple of the older Skylake and AMD EPYC Rome nodes. There are two Intel Sapphire Rapids clusters; now, the high-bandwidth-memory one is actually really disappointing, I should say that up front. It does brilliantly on STREAM, as you would expect, actually a terabyte a second on STREAM, but on all the other applications I'm looking at here it had very little impact at all, so that was a wee bit sad. So: two Intel Sapphire Rapids clusters, then five Ice Lake clusters, and I won't go into the details; one was interested in the variations by core count, but they rather vanish into the noise once you bring Sapphire Rapids into play. And then four AMD EPYC Milan clusters. We look at seven applications in total, hence a hundred slides, and hence the impossibility of getting through that lot.
So the organisers have kindly agreed that they will wave a certain number of fingers at me with 10 minutes to go, because I'll need 10 minutes to conclude. We'll be looking at molecular simulation, so DL_POLY and Amber; materials modelling, CASTEP and VASP; an electronic structure code, GAMESS-UK; and then two ocean modelling codes, NEMO and FVCOM.

The important thing to bear in mind when you look at scalability and analyse performance is how you are doing it: by cores, or by nodes? They are very different things, as we will see, and one of the major impacts of the very large core counts per node, 112 in the case of Sapphire Rapids, is that you get a very different picture when you look at scalability by processing elements and when you look at scalability by number of nodes. We'll see that.

The one thing I'm not doing is pricing. I've been taken to task numerous times by suppliers when I try to mention price, and actually list price is meaningless; list price is a joke, as we know, and what you pay in practice depends on how the supplier wishes to engage with you. It's as simple as that, so pricing is a bit irrelevant.

Okay, methodology and approach, a quick summary. No special treatment: the codes are run as a standard user, there's no dedicated access involved, they're run in production mode, so what you would see as a user is what's of interest. The focus is on mid-range clusters, so the highest core count used for these applications is 1,000 cores; we're not talking about extreme scalability, and we're not talking about weak scaling, just the typical jobs that will run on a production cluster.

Point four is a lot of waffle, but it's quite important waffle, and at the last two Machine Evaluation Workshop, sorry, CIUK sessions, I wasted far too many hours mucking around trying to sort this out. This is about running Intel Parallel Studio on AMD hardware, where what one routinely finds is huge problems with stability. You move from one version of Parallel Studio, say 2019.5 to 2019.12, and things fail, things fall over, with no rhyme or reason, which was exceptionally frustrating. Intel oneAPI sorted a lot of that out, but you still see it in practice. So, the final bullet there: when you move with VASP and CASTEP between Intel 2018 and Intel 2020 on AMD hardware, you run into all sorts of problems; the performance goes down the toilet, right? So running Intel software on AMD hardware is not great. If you look at ARCHER2, they do not do that, and there's a very good reason for not doing it.
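Before moving on, it's worth spelling out the cores-versus-nodes arithmetic with the node sizes that appear later in the talk; a back-of-envelope sketch, not a slide from the presentation:

```latex
\[
S_{\text{node}} \;=\; \frac{N_{\text{cores}}^{\text{SPR}}}{N_{\text{cores}}^{\text{SKL}}}
\;=\; \frac{112}{40} \;=\; 2.8
\]
```

That is, if per-core performance were identical, a 112-core Sapphire Rapids node should beat a 40-core Skylake node by a factor of 2.8 node-for-node; a measured node ratio above 2.8 means the cores themselves are also faster, while one below it points at a memory or MPI limit. That factor of 2.8 reappears in the Amber discussion later.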
Okay, so a very quick run through both the Genoa line from AMD and the Sapphire Rapids one from Intel. This slide gives you the picture, and it's a real shame the pointer isn't going to work. Four columns, looking at the various generations of AMD processors, from Naples through Rome and Milan up to Genoa. You'll notice AMD got pretty carried away with how good Genoa was; of course, they changed the naming convention, going from the 7000 series up to the 9000 series, so I wonder how much marketing cash that involved. But it is an exceptionally good chip, there's no doubt about that. You've got up to 96 cores, so 192 cores on a dual-processor node, the typical currency for HPC. It has twelve memory controllers, a maximum of 6 terabytes of memory and 128 lanes of PCIe. So pretty impressive.

I did try here to go through the detail; I guess most of you will be fully aware of that stuff, but you won't be fully aware of all of it, so wind back, Martin, apologies for that. This is just trying to take you through from EPYC Rome. Naples was a bit of a warm-up, but Rome, with its nine chiplets and its central 14-nanometre IO die, was a really positive step in the right direction. A couple of the processors in this review, the 7742 and the 7502, act as part of our baseline systems. There's a pretty significant improvement going from Zen 2, i.e. AMD Rome, to Milan, Zen 3. And then some really neat stuff using stacked L3 cache, basically providing a total of 768 megabytes of L3 cache on the so-called X series of Milan; we have some numbers in this presentation which show the effect of that.

And of course the fundamental driver in all of this is instructions per clock. Genoa had a 14% IPC uplift, which is almost identical to the 15% uplift on Intel Sapphire Rapids. This chart looks at the IPC improvement for Intel: two things are being shown, the IPC improvement on the left as a function of processor family along the horizontal axis, and the cumulative IPC change on the right, captured by the black bar. So Sapphire Rapids, although personally I don't think it is as good as Genoa, is nevertheless a pretty significant improvement over Ice Lake. There are a few Ice Lake facts and figures; I don't think we need to spend any time on those, except to note that it went up to 40 cores, and that the 8380 is part of this presentation.

Right, so let's move from a summary of the capabilities of those processors to real facts and figures and real performance. Baseline systems: the Skylake Gold 6148 and the AMD Rome, both of which feature in the cluster at Cardiff. I've tried to capture, on the next four or five slides, each of the processors involved: a couple of Cascade Lake processors, the 6248 and the 8280, and a whole slew of Ice Lake clusters, which the previous CIUK presentation would really have focused on. These vary by interconnect: we're looking at Mellanox HDR on the one hand, and at Cornelis Networks' Omni-Path (OPA) fabric, the alternative to Mellanox, on the other; there are some conclusions I will draw on that. So not only are we looking at different capabilities of those Ice Lake nodes, we're also looking at the interconnect, whether HDR or OPA.

These are the two Sapphire Rapids clusters: the 8480, with a 2 GHz base clock, 105 MB of L3 cache and DDR5 memory, so significantly better credentials than the Ice Lake systems; and the Platinum 9480, which is the high-bandwidth-memory part, where, other than on STREAM, I personally didn't see much improvement at all. And then the Milan clusters, which we'll touch on as we come across them; the most interesting is probably the 7573X, because it has the extended L3 cache, and that really does make quite a difference on some applications.

So here are the applications we're looking at: DL_POLY and Amber, VASP and CASTEP, GAMESS-UK, and FVCOM and NEMO.
And that mixture probably enables us to hit the bottom-line target in the last bullet, namely to check memory bandwidth and latency, node floating-point performance, interconnect performance, particularly across those two classes of interconnect, in terms of both latency and bandwidth, and sustained IO performance. The IO performance test will come from NEMO and the particular data set being used.

I should mention this: it is an incredibly useful tool, and I've always found it really valuable. It's from Allinea, who were taken over by Arm, whose tools have since passed to Linaro, so its history has changed a wee bit, but the tool is still pretty important. It gives you a really easy picture of application performance at quite a high level, and if you're looking into any parallel code, I strongly advise trying it, because it really is pretty helpful. These are the sort of pictures you can very easily construct with access to the tool. On the bottom right you're looking at the CPU time breakdown; we're looking at scalar operations, vector operations and memory access. What you don't want to see is what you're seeing here: a breakdown dominated by memory access, with hardly any vector instructions and a load of scalar. That's a really bad makeup, and too many of the applications in here look like that. Matching successive processor IPC hikes, and, I think, the ability to capitalise on GPUs, relies on being able to exploit the sort of vector numeric operations tagged here; if you can't do that, a lot of the capability in the current and future generations of processors is something you're simply not going to be able to use.

On the left you're looking at the breakdown as you increase the core count. When I say processing elements, I mean cores, because all of the applications here run a single MPI process per core; none of these are hybrid codes, it's just straight MPI. So that's not good, and that isn't good: what you would hope to see is the MPI fraction kicking in a lot later than it actually does. But these pictures are quite valuable in the sense that they give you a pretty good view, easily obtained: you don't have to change the code, you don't have to recompile it; with the module in place you just run, right off the bat.

A bunch of pointers here about compiler and runtime options. I mentioned the issue of running code on AMD processors using Intel MKL: basically you have to do things like this, because this was the last version of Intel MKL which recognised AVX2 on non-Intel processors, and if you don't, you're not capable of exploiting AVX instructions using the Intel software. And I'm going to get very confused here about which pointers work, so, anyway, crack on.
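For reference, the MKL-on-AMD workaround being alluded to is, I believe, the widely used MKL_DEBUG_CPU_TYPE environment variable, which affected MKL versions honoured up to roughly the 2019 releases and which Intel later removed. A minimal sketch of the two usual forms; the in-process setenv variant is my own illustration:

```c
/* Sketch: persuade older Intel MKL builds to take their AVX2 code path
 * on AMD hardware. The usual form is a job-script line:
 *
 *     export MKL_DEBUG_CPU_TYPE=5
 *
 * Setting it programmatically also works, provided it happens before
 * the first MKL call (MKL inspects the environment when it first
 * dispatches). Later MKL releases ignore this variable entirely. */
#include <stdlib.h>

void enable_mkl_avx2_on_amd(void)
{
    /* "5" selects the AVX2 dispatch branch in the affected versions. */
    setenv("MKL_DEBUG_CPU_TYPE", "5", 1);
}
```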
So, the STREAM benchmark results, and they look, I guess, much as you might expect. Processor generations are shown here, the left focused on Intel, the right on AMD, and as you move through from our baseline system, the Skylake Gold, through Cascade Lake and Ice Lake to Sapphire Rapids, you see half a terabyte a second from STREAM on Sapphire Rapids; in fact you see much the same on Milan. These are very high core count nodes. And there's no difference across these Ice Lake variations, which is probably not surprising.

But that doesn't tell you much about what you're going to see in practice when you run a typical MPI code, because what you're interested in then is the effective bandwidth that each MPI process experiences. If you're running 128 MPI processes on 128 cores, you've got to divide these numbers by 128, and when you do that, things change quite dramatically. Go back to our Skylake system, at around 4.8, and it just does not improve. Come over here to the 7573X on Milan, with the additional cache, and things look a lot better. These numbers, though, are dreadful: these are the 128-core 7713 and 7763, and that is poor, to put it rather mildly, because after five years you've gone down by effectively 25% in terms of effective memory bandwidth per core. That's not good.

So that's the first of the synthetics; the second is MPI performance. This is looking at ping-pong between two cores on separate nodes, and it's a lot more sane and sensible than it used to be in the days of comparing Gigabit Ethernet with InfiniBand, when you would literally get a factor of 100 difference in latency. Many codes are latency driven; here there's only a factor of two. There is some really wobbly stuff down here on the Cornelis Networks systems which, if the code you were running depended on that sort of message length and were dominated by point-to-point communications, would give you a bit of concern. But point-to-point is probably not the real driver behind these benchmarks; the real driver is the collective operations. For example, the materials codes often involve all-to-all collectives, exchanging data between real and reciprocal space, and this sort of thing is nasty. Different picture here: basically, the lower the total time to complete the collective, the better, and you've got quite a variation. The biggest problem is here, because if you look at the typical message lengths involved in many materials codes, they are of this order, and at that point you've got a large variation. So does it matter? Well, if the end application code, in this case CASTEP, spends a lot of its time doing MPI_Alltoallv, which it does, then this is a problem, and you'd better get your processor choice right. You'll notice, more to the point, that this blue here is, boom, Sapphire Rapids. Fast-forwarding, what we will see as we go through the codes is that Sapphire Rapids does a really good job, relatively speaking, except for CASTEP, and I think this is the reason why. I don't know, I have to say, why it is the way it is, but it is not great.
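That all-to-all behaviour is easy to probe with a few lines of MPI. Here is a minimal timing sketch of my own, not the benchmark used in the talk: uniform counts per rank, so functionally an MPI_Alltoall, but issued through the MPI_Alltoallv interface the materials codes actually call.

```c
/* Minimal MPI_Alltoallv timing sketch: every rank exchanges `len`
 * doubles with every other rank, the pattern a plane-wave code
 * generates when transposing data between real and reciprocal space. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int len = (argc > 1) ? atoi(argv[1]) : 1024;      /* doubles per pair */
    double *sbuf = malloc((size_t)size * len * sizeof *sbuf);
    double *rbuf = malloc((size_t)size * len * sizeof *rbuf);
    int *counts = malloc(size * sizeof *counts);
    int *displs = malloc(size * sizeof *displs);
    for (int i = 0; i < size; i++) { counts[i] = len; displs[i] = i * len; }
    for (int i = 0; i < size * len; i++) sbuf[i] = (double)rank;

    const int reps = 100;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int r = 0; r < reps; r++)
        MPI_Alltoallv(sbuf, counts, displs, MPI_DOUBLE,
                      rbuf, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
    double t = (MPI_Wtime() - t0) / reps;

    if (rank == 0)
        printf("%d ranks, %d doubles/pair: %.3f ms per Alltoallv\n",
               size, len, 1e3 * t);

    free(sbuf); free(rbuf); free(counts); free(displs);
    MPI_Finalize();
    return 0;
}
```

Sweeping the per-pair length over the message sizes a materials code generates, across two or more nodes, reproduces the sort of variation shown on the collectives slide.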
Okay, I mentioned core-to-core and node-to-node, and this is quite important. Core-to-core compares jobs with a fixed number of cores: you take 128 cores, or 256, or 512, run that, and look at scalability across different processors and different MPI stacks. The node-to-node comparison is far more representative of what you will see in practice, and more reflective of a true workload.

So the first example is DL_POLY, and, bearing in mind the amount of time I've got left, I'll probably do three of these seven applications and then fast-forward to the conclusions, because the conclusions from the first three are effectively mirrored through the rest; we'll summarise the rest at the end.

DL_POLY is the CCP5 (Collaborative Computational Project 5) molecular dynamics code, written by Bill Smith, Tim Forester and Ilian Todorov. Like most vintage, let's call them vintage, parallel codes, it started life as a replicated-data code, and the problem with replicated data is that you're constrained in the size of system, because every node has to hold a copy of the entire data structures. Once you move to domain decomposition, you've distributed the data and you can handle much larger cases. DL_POLY's strength is the huge variation in its application areas: most of the other parallel MD codes, so Amber up the top there, CHARMM, NAMD, LAMMPS, GROMACS, well, not so much LAMMPS, but most of the others, focus on biomolecular simulation, whereas DL_POLY has much broader coverage. I've waffled on enough about domain decomposition; the Coulombic energy remains global, and through code developed, I think, by Ian Bush, the smooth particle mesh Ewald, you handle the Coulombic forces.

A couple of examples: two DL_POLY cases, and here we're just looking at the gramicidin one. There are numerous slides like this all the way through, so no surprises, but I should explain, on this first one, the key things to watch for and the structure behind it. We plot performance, not time, and we'll come to the reasons why. Up here we're normalising, typically with respect to this Skylake system; the pinks you'll find will typically be the Ice Lake systems, and the Cambridge blue, not deliberate, is the Sapphire Rapids system. Looking at scalability by processes, you'll see that at this point Sapphire Rapids is outperforming the others, the Ice Lake nodes in particular, and we've still got a couple of AMD systems shown here too. By the time you get up to here, the Ice Lake 8358, which tends to be the most performant of the Ice Lake systems, is winning. But when you move, and I'm hoping it's on the next slide, when you move to node-to-node, the picture changes, right? Here, on two nodes, four nodes, eight nodes, you see that the Sapphire Rapids system is the dominant, most performant system, and that's not really very surprising. Why does it tail off up here? Again, if you think about it, it's not that surprising: eight nodes corresponds to 896 cores, and for that particular test case you're starting to run out of parallelism at 1,000 cores, whereas the other systems are still down at several hundred cores. So the lead you see here tends to be eroded, but it's still a significant lead.
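To pin down the replicated-data constraint mentioned above: with N particles, c bytes of state per particle and P MPI processes, the memory per rank goes roughly as (my back-of-envelope sketch, not numbers from the talk):

```latex
\[
M_{\text{replicated}} \;\approx\; cN
\qquad\text{versus}\qquad
M_{\text{domain-decomposed}} \;\approx\; \frac{cN}{P} .
\]
```

Under replicated data, memory per rank is independent of the process count, so the largest system you can run is fixed by a single node's memory; that is exactly the limit domain decomposition removes.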
Okay, moving on to Amber, the second of the molecular dynamics codes: a pretty similar story, I think it's fair to say. This is a biomolecular simulation code, and it's pretty popular. There are four different benchmarks, moving from small through to the M45 case; the benchmarks comprise major urinary proteins and their ligands, and by the time you're at the M45 case you're talking about a sizeable simulation. Two versions of the code: Amber 18, not surprisingly from 2018, and Amber 22, from 2022. And there's a claim that Amber 22 provides better performance on multiple CPUs.

That claim is manifestly not true, actually, and here's the proof of the pudding, again looking at nodes. By nodes, there's complete domination by Sapphire Rapids, with its 112 cores per node. Theoretically, if the core performance were the same as Skylake, which it isn't, it's better, you would expect a factor of 2.8: you've got 40 Skylake cores on a node and 112 Sapphire Rapids cores. So on a single node it should be 2.8; if it's more than 2.8, the core performance on Sapphire Rapids is better, and if it's less than 2.8, there's probably an issue. Anyway, that's the performance you see using Amber 18, and that's the performance using Amber 22, and they're identical, within spitting distance.

There's the performance report, and again, this ain't great, right? You're dominated by memory, you're mainly scalar, and you've got no vector instructions here. But this thing will scale a lot further, as you can see, because even at 320 cores it's still dominated by CPU time rather than MPI.

These are the scalability results in terms of MPI processes, i.e. cores. Again, at small core counts Sapphire Rapids is leading, and again that gets eroded. The yellow bars are quite interesting: the two purple systems either side of them are the HDR ones, and the yellow-circled ones are the OPA systems, the Omni-Path interconnect. It's quite clear that as the number of processes increases, those systems fall behind. So in terms of availability OPA is an attractive option; in terms of scalability, maybe in some cases, less so. On the N45 benchmark you see exactly the same thing again with Amber 18, and on the M27 test case a similar sort of conclusion.

Now, I've not really put any results about GPUs in here, but here's one slide that does, and it may argue strongly against the whole rest of the presentation, but there you go. Let me try and summarise. The most important message from this slide and the next is that software is far more important than hardware, right? Look at that slide, and let me try to explain. This is showing the scalability of the Skylake nodes, normalised with respect to one node, against a single node with GPU capability, the P100 and the V100. Focusing on nodes, you can see that a single GPU node running Amber exceeds the performance of up to nine Skylake nodes. Do the same thing with Sapphire Rapids and, boom, the blues go up: eight of those nodes just about outperforms the GPU node. But the big purple number on the end is Amber 22. So Amber 22, which I was slagging off before for showing no CPU performance gain: I think that's because all of the developers' effort has been invested in the GPU implementation. Compare this with that: that's Amber 22, that's Amber 18, same test case, and the performance has gone up by effectively a factor of three, because you've got mixed-precision arithmetic and all sorts of things in Amber 22 that are not there in Amber 18. Okay. So how am I doing on time, Mr. Chair? Well, I'll do that.
And then we'll fast-forward to the conclusions, and I'll finish on time. But a hand is up; no, no, please. Was that on a single GPU? Yes, a single GPU. The bad news is, and I wasn't going to mention this, that when I try to run with two or four, performance goes down the toilet. Those numbers I got literally within the last 48 hours, but I'm sure the single-node number is correct; the single-GPU number is correct. And actually generating the code was dreadful: you've got multiple versions of different software, some work and some don't; you've got no MPI; you have CUDA, and with most versions of CUDA it just goes belly up and doesn't run. So it's a bit of a nightmare, and none of what I moaned about earlier, Milan and the Intel compilers, comes close: that's chicken feed compared to trying to get this to work.

Anyway. All of these materials codes are quite similar, although the degree of optimisation in them, I think, varies quite markedly. They're all DFT, density functional theory, calculations; they all use pseudopotentials, because if you don't use those you're in big trouble, you can't handle the core electrons; and most of them are based on plane waves. So there's not a huge amount of difference. Some are a lot more popular than others, maybe because of history, and because some are open source and some are commercial, but basically they're all built from the same building blocks.

So this is the Vienna Ab initio Simulation Package, VASP. It's the most popular package by a mile, and on ARCHER, when they used to rank usage on their systems, it was ranked, not voted, number one, repeatedly; they stopped doing that for a variety of reasons. There are different modes of parallelism involved in these codes, and we're looking at two different benchmark cases, a palladium oxide calculation and a zeolite calculation. The difference between them: the zeolite calculation is bigger, but more to the point, the palladium oxide case has 10 k-points, so it can invoke k-point parallelism, whereas the zeolite only has two.

If you look at the performance report, you'll see that the plot on the bottom right shows much better usage of vector instructions. The two molecular simulation codes we looked at were naff in the sense that they were hovering around 5% exploitation of vector instructions; this looks a lot healthier. And if we look at CASTEP, exactly the same thing holds. CASTEP's got that MPI_Alltoallv problem, and maybe I was being a bit overambitious in saying I'd finish on time, because I need to say something about the CASTEP situation; but anyway, enough excuses, let's whip through these quickly.

So again, this is looking at scalability by the number of cores, the number of MPI processes, fixed at each point, and as you can see, by the time you get up to 256 cores it's flat as a pancake. There's the baseline system, and there are Ice Lake and Sapphire Rapids, effectively almost identical in performance on the core-to-core analysis. And there's our point about dodgy use of Intel compilers on AMD hardware: the combination of compiler, MKL and MPI libraries means that as you move to later versions, from Intel 18 to 19 to 20, performance goes down, which is not good news; it should be the other way around.
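As an aside, here is what the k-point parallelism mentioned above for the palladium oxide case means in MPI terms. This is a minimal sketch of my own, not VASP's actual implementation: the world communicator is split into one group per k-point, each group works on its k-point independently, and results are combined with an occasional global reduction.

```c
/* Sketch: split MPI_COMM_WORLD into one sub-communicator per k-point.
 * Each group does its own per-k-point work (diagonalisation, FFTs),
 * and energies are combined at the end over the world communicator. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nkpts = 10;            /* e.g. the palladium oxide case */
    int my_kpt = rank % nkpts;       /* simple round-robin assignment */

    MPI_Comm kpt_comm;               /* the ranks sharing my k-point */
    MPI_Comm_split(MPI_COMM_WORLD, my_kpt, rank, &kpt_comm);

    /* ... per-k-point work over kpt_comm goes here; in a real code only
     * one rank per group would contribute its k-point energy below ... */
    double e_kpt = 0.0, e_total;
    MPI_Allreduce(&e_kpt, &e_total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total energy over %d k-points: %f\n", nkpts, e_total);

    MPI_Comm_free(&kpt_comm);
    MPI_Finalize();
    return 0;
}
```

The attraction is that the k-point groups barely need to talk to each other, which is why a case with 10 k-points parallelises so much more comfortably than one with two.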
Back to that compiler story: the decline in performance happens at every node count. And when you consider all of the Intel SKUs on here, things are just really flat; by the time you get to 256 cores, the Milan 7573X is performing in very similar fashion to both Sapphire Rapids and Ice Lake. When you move to node-to-node, then again, total domination by Sapphire Rapids, though it's not really a fair test, because I only had access to six nodes, so I can't do the eight-node number for the 7573X. We'll skip over the zeolite case, because there isn't a lot to it.

So let's have a quick look at this, because this, I think, is the ramification of that MPI_Alltoallv behaviour, and it's pretty dodgy saying things like that while being recorded, because it can come back to haunt you. It's a bit like making vague statements about pricing, and the next day the bailiffs turn up knocking on the door from Intel asking why you said this or that. So one needs to be a wee bit careful, but I'm pretty sure this is right. This is the al3x3 simulation, the 270-atom sapphire surface. Again, really good usage of CPU vector instructions, so that's pretty good; memory is still an issue, and the scalability plot over on the right tells you that the moment you move beyond 160 processing elements, MPI starts to take over.

This is the story, which is not great, about trying to use Intel compilers, MPI and MKL on AMD hardware: look at the use of Intel 20 on the 7502 and it's bad news, just flatlining compared to the performance you see on the other Intel clusters, and even on AMD Rome. I don't know why that happens, but it does happen. So here we're looking at performance against MPI processes. The bottom line on this issue is captured on this slide, and I think on the next. Remember I said node-to-node is great, that on Sapphire Rapids you win every time? Not true here: look at what's happening out at four nodes, where the performance has declined dramatically, and it's not a scalability issue, that's for sure. And when you move to this side, we've got the numbers on there, but I believe that change in the pattern of behaviour is a reflection of the issues around Alltoallv on Sapphire Rapids.

Okay, I have 10 minutes left, do I, Mr. Chair? A little bit more? Now you're talking. All right, okay: we'll leave this one out, we'll leave GAMESS-UK out, and we'll also leave out NEMO and FVCOM.

So, how do you put all these numbers together, what's the best way of doing it, and what conclusions can you draw? The easy way is just to compare system against system in this type of graph, where the codes run up the vertical axis and we're comparing against Sapphire Rapids: this is core-to-core at 128 processing elements, and you'll see variations between 1.2 times faster and 1.75 times faster, an average factor of 1.4. You can do the same thing with Ice Lake, and there the average improvement factor is, I think, about 1.3. So in terms of the performance of a core, that's the sort of improvement you're looking at. The other way of looking at it is the Kiviat diagram we're using here, where each of the spokes corresponds to a particular application and a particular data set.
And what we do is normalise the optimum performance we see across all of those systems for that case to 1, and the others are rated pro rata against it. Remember, here we're looking at core performance, and we said there's not a huge amount of difference, and sure enough, that's what we see. The innermost trace, the one living in the heart of the Kiviat diagram, is the worst performing system; the best performing system will hug the perimeter, because it always comes out with a factor of 1. At 128 cores there's not a huge amount in it, but with the exception of the VASP case and the CASTEP case, Sapphire Rapids is clearly the most performant system. There are examples here, comparing the Milan 7573X to the Ice Lake systems in particular, where the Milan is outperforming those systems. At 256 cores, similar conclusion: if you look at the perimeter of the Kiviat diagram, in most cases, with the exception of CASTEP, which we've tried to rationalise via the MPI_Alltoallv issue, Sapphire Rapids is again the most performant system.

Now go to node-to-node. I hope I've convinced you that when you do that, Sapphire Rapids node performance is significantly better; no surprise there, it should be. But when you look at the corresponding Kiviat diagram it's really obvious: there you have Sapphire Rapids, with the exception of CASTEP down there in the left-hand quadrant, way ahead of everything else. Not surprisingly, you have Skylake, and actually Cascade Lake, in the middle; beyond that AMD Rome, which has 64 cores per processor, 128 per node, compared to the 40 per node of Skylake; and most of the Ice Lake systems pretty close together, with Sapphire Rapids clearly outperforming them by a stretch. The same sort of thing at four nodes, where the CASTEP situation is even more marked: for every other application Sapphire Rapids hugs the perimeter, and on CASTEP it does not.

So the summary: core-to-core suggests that on average Intel SPR, Sapphire Rapids, outperforms all the other Intel SKUs, with the exception of CASTEP. You can draw various other conclusions from those performance charts about where Milan sits relative to Ice Lake in particular, and to Sapphire Rapids, based on the level of AVX-512 usage. You can do the same comparison node-to-node, and node-to-node Sapphire Rapids has far superior performance, again with the exception of CASTEP. But Genoa is the right comparator for Sapphire Rapids, not Milan, and hopefully one will get access and do exactly that.

A bunch of acknowledgments here, to the people who've helped, in particular with access to these systems. The bottom bullet is the help I got, during a procurement, from the guys down at Plymouth Marine Laboratory, for discussions on NEMO and FVCOM, which were really onerous to build, actually. And I guess that's it: a quick summary which effectively reflects the choice of applications and the choice of systems we've looked at. The point about processing elements, i.e. cores, versus nodes is important.
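For completeness, the normalisation behind those Kiviat diagrams is nothing more exotic than this; a minimal sketch of my own with made-up numbers:

```c
/* Kiviat-style normalisation: for each benchmark (spoke), divide every
 * system's performance by the best performance seen on that spoke, so
 * the best system scores 1.0 and the rest are rated pro rata. */
#include <stdio.h>

#define NSYS   3
#define NBENCH 4

int main(void)
{
    /* perf[s][b]: performance (higher is better) of system s on
     * benchmark b; illustrative numbers only. */
    double perf[NSYS][NBENCH] = {
        { 1.0, 2.0, 1.5, 0.9 },
        { 1.4, 2.2, 1.2, 1.1 },
        { 2.0, 1.8, 1.9, 1.0 },
    };

    for (int b = 0; b < NBENCH; b++) {
        double best = 0.0;
        for (int s = 0; s < NSYS; s++)
            if (perf[s][b] > best) best = perf[s][b];
        for (int s = 0; s < NSYS; s++)
            printf("bench %d, system %d: %.2f\n", b, s, perf[s][b] / best);
    }
    return 0;
}
```

A system that scores 1.0 on every spoke hugs the perimeter; one that is consistently beaten collapses towards the centre.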
If you genuinely have a node which, compared node-for-node, is outperforming all the nodes from other suppliers, that would be the system of choice, I think. And again, the genuinely miserable bottom line is that I can't say anything about pricing, which at the end of the day is what drives the bus, right? Anyway, thank you, and I hope you found at least some of that interesting. Thanks very much.

Questions? The man with the hand again, yes? Hold on, let's use this microphone for the remote people; it should work, and Simon will check that it's okay.

So, the one thing I'm missing, which is on this slide but not in the benchmarks you showed us, is the 9480. Because all the codes are memory bound, I would assume that the HBM would make a significant difference. Where does the 9480 lie?

Well, okay, let me try to answer and fail. The 9480 is only going to help memory-bandwidth-bound codes; it's not going to help with memory latency, it'll do nothing for that. All of the molecular simulation codes, and this is hardly a get-out-of-jail card, are dominated by memory latency, not memory bandwidth; those performance reports where you've seen 70% memory don't say anything about whether it's bandwidth or latency. Does that satisfy you? Yeah? Really? Okay, let me argue against myself. The one code in there I don't have an answer for, because it's too hard to build, which is a whiny excuse, is NEMO. I don't think any of the other codes are truly memory-bandwidth crippled at all; the one which is, is NEMO, and I've not yet managed to build it. As for all the applications I did run on the 9480: remember the 9480 has a slower clock speed, right? You're comparing 2.0 GHz against 1.9, and that actually matters quite a lot. The glory days of 3 GHz processors, and DL_POLY doing really well simply by ramping up the clock speed, boom, they're gone; you've got these miserly bloody clock speeds and hundreds of cores that you've somehow got to ramp up. So that 0.1 gigahertz difference between the 8480 and the 9480 is quite a lot. I need to run NEMO on it, because if there's no effect on NEMO, then I really don't know what is going on, because I have run STREAM on there, and that is a terabyte a second on the 9480, which is really scary.

Anyway, did you have your hand up? Yeah, I'm just asking somebody to unmute online. Hi Martin, thanks for the presentation. One moment, we just need to sort out the sound; I can hear you whispering. Hi, is it fine now? Hang on a sec, we're almost with you... no, we're not almost with you. This is a great way of getting out of answering tricky questions. Sorry, at the moment we don't have sound from you in the room, so can you ask the question and I'll repeat it?

So, Martin, you mentioned that the Intel compilers are not so great. Have you tried alternative compilers on AMD hardware? Is Intel still the best, or can you help it by using an alternative compiler suite?

To be honest, yes, I have tried. Were they any better? They didn't show the same problems, but the performance was always worse. And I think, genuinely, there are two factors in it: it's not the Intel compiler, it's the MKL library. You really see it on codes such as VASP, where you have a significant vector contribution.
And MKL certainly does not recognise AVX instructions on AMD beyond a certain release. So I think that's one whole problem. The one I ran into beyond that was MPI specific: when I moved MPI versions, I found the performance getting worse, and I don't understand that, I have to be honest. I've not made any concerted effort to switch to, and properly interrogate, the variety of other compilers which I know are out there; my attempts have been limited to GNU, and I know that's always going to be worse, since it played catch-up for a long time, and with that playing catch-up came regular optimisation issues and errors when trying to generate code. So no, I've not done a proper job on that; I'll be the first to admit it.

You should be able to hear the remote people now. So, does that answer the question? I hope it did.

If I can just follow on from what Martin said: if the issue is not the compiler but MKL, which part are you using, the BLAS or the FFT part of MKL? And if so, would switching in a different BLAS library help?

Yeah, I'm pretty sure it's MKL which is the source of one of the problems: I'm certain that it does not recognise AVX instructions on AMD beyond 2019.5, and therefore you're going to see a performance hit; I don't believe it's the compiler. But the second problem is why I have numerous issues when using the later versions of MPI with Parallel Studio, which, by the way, go away with oneAPI. oneAPI saved my bacon ahead of the previous Computing Insight, because I'd spent a month, literally, trying to get any continuity between those different compiler releases; I switched to oneAPI and, almost as if by magic, everything worked. That doesn't say anything about performance, but it removed that catalogue of woes in terms of execution violations, exceptions here, there and everywhere, and everything hanging. And when I sought advice, and I don't know if he's online, but when I sought advice, this was at the Dell benchmarking centre, as to whether somebody could help, because I was just stuck, they said: yeah, the advice we would give you is do not use Intel software on AMD processors. Great. But that wasn't really an option. So it may go back to the original question: you have to use different compilers, as indeed ARCHER2 obviously do. Three possible sources, then; I don't really understand the problem; I know at least two of the problems are real; do I have an answer to them? No. And I'm also really mindful that my presentations always come over as Intel-biased, and they're really not meant to be; it's just that the experience I have when I try to use Intel software on AMD processors is pretty nasty. And I rather worry that when I do get access to Genoa, if I follow the same path, I'm going to see exactly the same thing again.

Yeah, exactly; it might even be worse, because then you have AVX-512 too, so a new set of workarounds that you will need.

Well, hopefully I'll be gone by then; somebody else's problem.

And maybe I can add a question right away: any plans to try this work also on ARCHER2, and perhaps on LUMI? Because there's a nice difference between ARCHER2 and LUMI: I think it's Rome and Milan, and one versus two network adapters per node. Or is it just the fact that they don't have Intel that's stopping you from trying that?

Well, no, that's not the reason.
I mean, I did go and use a 7742 on a system within Atos, because at that point ARCHER2 was really flaky, it was in the very early days, and I thought: I'm not going to muck around with that; it's bad enough trying to sort this bloody problem out on the Dell systems without another set of corresponding problems. But you're completely right: the only way I can really sort this out is to walk away from using Intel and have a separate thread which says no Intel. I say I'm not biased towards Intel, and I'm not.

It's not as easy as it seems, though, because some of those applications have trouble on many other compilers. Sure. But actually, especially if it's a Fortran application like VASP, Intel copied some of the bugs from GNU Fortran; some of those codes only work well on GNU Fortran and Intel Fortran, and we all know that GNU Fortran is not that good.

Indeed. I do think, genuinely, and if people want to argue the toss against this, that's fine, that Intel did the community a huge favour for a long time. You had to pay for the honour, right? But their software for a decade or longer was, I think, first class. The last four or five years, I think it's been awful, but there was certainly a period. It was exemplified when I went to Cardiff in 2007: it was an out-and-out AMD shop, and then AMD self-destructed for a decade in rather spectacular fashion, and the evolution of Intel processors, together with the evolution of their software, meant the environment was stable, and you saw reasonable improvements moving from processor generation to processor generation. But it's tricky, right? Because you can make a pretty valid argument which says we shouldn't do any of this stuff at all, we should just be using GPUs rather than mucking around with all of this. Then again, the cost of developing and rewriting a software package to exploit GPUs efficiently is arguably, over a period of time, more than the hardware you're going to run it on. I look at the finances involved in the evolution of NWChem, and I led that group over there for a while; I mean, it's huge. Anyway, we're getting way off track down that particular bloody line.

I've just put up on the screen here, and shared on Zoom, the latest ARCHER2 statistics on software use that Andy Turner pointed to, which still has VASP leading; that's the last month. If you scroll further down the page, the link is in both the Zoom chat and on Slack, there's historical information, that's the word I was trying to think of.

Yes; actually, I should have used GROMACS for this presentation rather than Amber, because GROMACS certainly doesn't suffer from memory latency in the way the other molecular dynamics codes do. Sorry, what was the question?

Yeah, I wanted to make a little remark as well. Firstly, thank you, because your presentation that I saw a couple of years ago was very helpful for us in exploring the landscape when it was time for us to buy a new cluster and we had to choose between AMD and Intel; so thank you for your work, it's useful. Secondly, something I find useful when you run the STREAM benchmark, just for information: these chips have a performance pattern where you hit a brick wall when you use about a quarter of the cores; sometimes on the Rome chip you hit the maximum bandwidth already.
So you can have a chip which, from the memory bandwidth point of view, is essentially an eight-core chip, because the other cores you could just disable and it wouldn't make any difference. True. What I try to use for that is just a table: at what core count do you hit the brick wall? I used to do that, actually, and I'm still kind of doing it, but okay, that's good feedback; I'll try to build that in.

Okay, the last one: do you see any benefits from hyper-threading?

Historically, I've never pursued that seriously, because most HPC codes do not benefit from it; maybe my view of it is blurred. With DL_POLY I tried hyper-threading, and it just got worse, nothing improved. But it's quite hard, actually. I don't know a... sorry, I don't know a benchmark, I'm sure there is one, for memory latency; STREAM is just the popular one, but that's memory bandwidth. There is a memory latency one, and I need to get on top of that. So the answer is no; have I tested rigorously for it and tried to improve the codes? No. If they were OpenMP-enabled codes, then probably one would see a lot more benefit, but DL_POLY, for example, is not. Thank you.
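On the memory-latency point just made: the usual microbenchmark trick is pointer chasing over a randomly permuted cyclic list, so every load depends on the previous one and prefetching can't hide the latency. A minimal sketch of my own, not a tool referred to in the talk:

```c
/* Pointer-chasing latency sketch: walk a single random cycle so each
 * load's address depends on the previous load. With the working set
 * much larger than L3, time per hop approximates main-memory latency.
 * (Assumes RAND_MAX is large enough for the modulo below.) */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t n = (size_t)1 << 26;      /* 64M entries * 8 B = 512 MB */
    size_t *next = malloc(n * sizeof *next);

    /* Sattolo's algorithm: shuffle into one full-length cycle. */
    for (size_t i = 0; i < n; i++) next[i] = i;
    srand(1234);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    const size_t hops = 50u * 1000u * 1000u;
    struct timespec t0, t1;
    size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t h = 0; h < hops; h++) p = next[p];   /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per dependent load (p=%zu)\n", ns / (double)hops, p);

    free(next);
    return 0;
}
```

Printing p at the end stops the compiler from optimising the chase away, and shrinking n until the time per hop drops shows where each level of cache begins.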
Okay, any other burning questions in the room? If not, I have a final one. How much effort is it, going across all these systems, to get all these things to work? To the point where you're confident you're doing a proper job: because it's not that easy to go onto a new system, get some kind of performance, and then know that that's what you're supposed to be getting.

Right. There are ways you can approach it which are likely to minimise that time, and you need to choose an application that you know, right? I can think of a counter-argument, but I won't make it. So I always start with DL_POLY, because I know it, and I watch for anything odd during that first step; so either DL_POLY or the synthetics. That's how I picked up the all-to-all problems really early on with Sapphire Rapids. If DL_POLY doesn't work, nothing is going to work; it's as simple as that. So I think you can take steps which minimise the amount of time involved, and you get to a point, which sounds very subjective and hardly quantitative, where you actually think: I know this system now, it has no surprises for me. You can then rattle through six or seven applications in the time it took to do the first one. So I think there is a level of confidence, and, famous last words, I wouldn't put numbers up if I thought they were dodgy or there were question marks around them. The most depressing thing is when you've been through all of that and suddenly there's a loud noise which says: but what about that? And then you have to go back and do it again. So I'm fairly confident in the numbers I put up; I've never had anybody turn around and say, that's absolute crap because of this, and in fact it would probably be a good thing if somebody did. But I'm well aware that I can't do the job completely properly: I'd need to be looking at other compilers, and doing a much more rigorous check. You can minimise the risk to some extent by using community codes, because there has been a community effort to get them to the stage they're at, so they should be fairly well optimised by the time you get to them. It always slightly surprises me when I find something.

And it surprises the code authors as well, actually, when you say to them: do you know how bad this is? And they say: oh, shit, no. DL_POLY was a good example of that; I don't think they recognised the complete failure to exploit AVX instructions, and stuff like that you can pick up very quickly and very routinely.

Okay. It does scare me, because you've been doing this for 30 years and have a lot of experience in getting codes to run across a variety of systems. The flip side is that there are new people starting now, researchers who are getting onto the biggest machines in the world with basically zero experience of moving from one system to another. So there's a lot of time lost because people don't have the necessary experience, or the right approach to this.

Yeah, that's definitely true. If I look at the average user at Cardiff: they pick up somebody else's script, they run community codes, and they won't recognise, oh, k-points, what are they? We monitor the efficiency of every job that's run, and a number of them are at 5% or even less; then you have to go to that user and try to rationalise why that is, and they get pissed off, because, you know, why are you telling me how to run my jobs? That bloody barrier you have to navigate is quite hard, and it's getting worse, not better. Smartphones; I blame it on smartphones. People just flip around with no idea how anything underneath works. So yeah.

There are another two questions from Andy. Have you tried building the benchmarks using tools like EasyBuild or Spack to improve things?

No, I haven't, and I should have done; better late than never. No, I haven't, and it's a classic thing, right? You start off doing this as a kind of pastime, and you build up tarballs from here, there and bloody everywhere, and you don't keep them properly. I mean, I was the lead author of GAMESS-UK, and I do that properly; but I don't do this properly, because this is a pastime, it's not what I'm meant to be doing. I end up doing it because nobody else wants to, but it's not my job.

I guess you have enough experience in building these codes that you don't need tools like EasyBuild or Spack to help you out.

That's kind of true, but the point is that these are community codes which are well documented and everything else, so they are not the problem. If you try to take this experience into AI and machine learning, that's a totally different ball game, and I've tried to get some of the other staff at Cardiff to do that, with absolutely zero pickup; to be honest, they've just not progressed it. So I'm operating in quite a narrow space, and if I really wanted to move out beyond materials simulation and GAMESS-UK, I would have to change the way I do it. You're right.

Another question from Andy as well; you should scroll up a bit. Yeah, a similar question, I guess: how about using ReFrame to automate the running and the capturing of the performance data? ReFrame is a tool for testing HPC software, essentially.

Yeah, I could. The trouble is, he said, trying to come up with an excuse, the trouble is that this access tends to come and go very quickly, right?
And if I had everything automated and set up so that I could just move it from system to system, then sure, that would be great. But historically you'll get access to a system for literally 24 or 48 hours and then you're off, so you're trying to get the maximum out of it in that period without working 23 hours a day. It's kind of tricky.

Okay, one more question. Yes, Sam, let's go for it.

So I'm going to ask a quite ignorant question. Great, I love those. I was always thinking that the problem with memory is memory bandwidth, and I never heard about memory latency, especially for VASP; I was told that memory bandwidth was the most important problem for VASP.

Yeah, sorry: memory latency is the issue on the molecular dynamics codes. Molecular dynamics codes are typically sampling across a very broad memory space; you can optimise that process, but molecular dynamics is the classic case which really suffers from latency. VASP doesn't; yeah, VASP is memory bandwidth. Okay, thanks.

And it's also very much a function of the size of the case, of the example; that's the other problem with all of this. It's very easy to say I benchmarked GAMESS-UK, I benchmarked this, but it's really a function of the data sets. You can run, and I didn't really show you this, VASP on two different data sets and come to quite different conclusions. Community codes cover a multitude of sins: they're often million-line codes with enormous functionality. So you'll see people giving you an argument which says that NWChem, for example, is really good at using GPUs; and the reason it's really good at using GPUs is that you're using a tiny fraction of the functionality, coupled cluster with triples, which is totally dominated by matrix multiplications that absolutely fly on GPUs. But that says nothing about 99% of what the code is actually doing most of the time. Sorry, that's a bit off piste.