So, welcome back. My name is Chris Edsall, and this is "Reap the Benefits of Heterogeneous Computing with SYCL". That's how you pronounce the four-letter acronym there: "sickle". It's very fortunate that JP, in the last session, gave us an introduction to the challenges of heterogeneous computing, that is, writing programs that can run on both CPUs and GPUs.

So first off, the slide that would be the "who am I" slide if I were doing a Unix talk. On the far left is the very first computer I ran, which was a one-socket, one-core BBC Micro with an 8-bit instruction set; that's what got me into computing. The next one along you may not recognise because it's a bit zoomed in: the red and black of the inverted-L shape is the end of a Cray T3E cabinet. That was the first supercomputer I was involved with and programmed. This was a time when it wasn't entirely clear that MPI was going to win, so we were taught both MPI and PVM as the programming models for that machine. That was 144 cores, 144 sockets, again single-core DEC Alpha CPUs. The third system along is a bog-standard x86 cluster; it happens to be the one at the university where I work, the University of Cambridge. The cluster is unprosaically named CSD3, the Cambridge Service for Data Driven Discovery. That's typical of the clusters you find nowadays: it's got several sockets, the Ice Lake nodes have got 76 cores per node, and we also have GPU servers in there.
So we've got a bunch of servers with four NVIDIA A100s each, the ones with 80 GB of RAM, so the big ones. And that's how you get a power-efficient supercomputer: by using GPUs. The very last thing on there is research software engineering. So that's who I am and what I do.

But why would you care about having a programming model that's portable? Well, a few years ago you could perhaps have gotten away without one, but the landscape has changed. So here's the compulsory slide in any talk on SYCL, and this is the European version of it. These are the new, upcoming or already installed large supercomputers in Europe, and you'll see LUMI on the left: it's got AMD CPUs, and the GPUs in that system are also AMD. Whereas Leonardo has Intel CPUs, but the GPUs in that system are NVIDIA. That's all fine if you're a sysadmin and you're just running that one system. But what if you're a user, what if you're a research group, and you are applying for time on one of these large systems? If you have a program that's written in a programming model that can only run on some systems, you're cutting yourself off from the opportunity of using the others. And you might not know, say at grant-writing time, what system you're going to be running on next year, so you would be wise to have your software amenable to running on any of those architectures.

The reason this got a lot of backing behind it is that it was the same situation in the US. The new exascale machines there are Frontier, which is the 1.1 exaflop system at the top of the Top500, Aurora, which is being installed, El Capitan, and Perlmutter. If you look at the architectures they have, we've got GPUs from all three vendors. So how would you program one of those? What programming model would you choose?
Ideally you'd choose a non-proprietary one, because that might free you up to use different sorts of hardware. It would be nice if it were standardised, so it wasn't defined by a single implementation. The US national labs have got some portability layers for C++: Kokkos, and one called RAJA. These are very good pieces of software, and I wouldn't say they were bad programming models, but one thing they don't have is a standards organisation behind them, so that you could choose from different implementations of the standard. It's the same way that in ISO C++ we've got a standards organisation, ISO, which defines standard C++, and then you can buy your C++ compiler from Microsoft, or you can get your C++ compiler from GNU, or from LLVM, or from NVIDIA; there are multiple choices of implementation.

So with that in mind, this is one option. SYCL is, as I said, how you pronounce it. It's standardised by the Khronos Group, and the bullet points here are actually off the Khronos Group's website about SYCL.
You may know the Khronos Group from OpenCL and a number of other programming models that they've standardised. You could say OpenCL is a little bit C-like; you can program OpenCL from C++, but there's a sort of impedance mismatch between the abstractions. SYCL is meant to be easier to program because it's closer to the application domain, because it has abstractions which are easier to handle; it hides some of the implementation details. You don't need to know about them, whereas if you were using OpenCL you would have to.

So it enables heterogeneous code for offload processors. It does require modern-ish C++, because it uses features like lambda functions to do the kernel offload. And it provides APIs and abstractions to help you use CPUs, GPUs, FPGAs... Some people use the term "XPU", where the lowercase x stands for any letter, or any number of letters, you might want to describe your processing units with. We've seen a sort of explosion in the silicon space for AI accelerators, and this may be a way of programming those.

I'll show you some diagrams next of the implementations, so you can see the thread of this. What's on the next two diagrams used to be on one page, but there are now so many implementations, so many backends, that they've had to split it out. So this is the SYCL landscape. If you're an application developer, you are here at the top with your SYCL code, and then you can choose, in the case of this slide, any of these three implementations. If the code is written to the standard and the implementation is standards-conforming, it will just work. No guarantees there about the performance of how it works, but it will run, and it will run correctly.

One big player in the space is Intel, and their compiler is called DPC++. Obviously they're motivated to make sure that they've got a path to compile applications that will run on their GPUs.
So if you have one of the new fancy Intel Data Center GPUs, then that's a way to program it. You'll also see the arrow there that says "any CPU". The SYCL standard mandates that a default device always exists. So, as a fallback, you could run single-threaded on one core of a CPU, or, if you wanted to expose the parallelism, your implementation could talk OpenMP and use all the cores on the CPU. So, Tom's question in the last session, about whether, if he doesn't have the right GPU for the AMD compiler, he can compile for a CPU to check that things just work: that's actually part of the SYCL standard. It's guaranteed to exist, and it's a really good debugging technique.

So I'll explain the middle bit, and then I'll come back to the additional lines on the Intel one. There's a company out of Scotland called Codeplay, and they make a commercial implementation of SYCL called ComputeCpp, which targets devices via OpenCL and the intermediate language that Khronos have specified, SPIR-V. But one of the things they also did was get involved with Intel, and the contracts for the large machines in the US, and write plugins for DPC++. DPC++ has a plugin architecture, and Codeplay wrote plugins that can target NVIDIA GPUs and AMD GPUs. Given all the complexity of that, it works surprisingly well. I was surprised. I taught a workshop on SYCL about a month ago and we had a room full of NVIDIA laptops, and we just downloaded the Intel DPC++ and the Codeplay plugin, and it worked. I was up at the front of the room demoing on the GPU that's built into my laptop, which is an Intel embedded GPU, and the learners in the classroom were using the exact same source code, running on the NVIDIA GPUs, and getting the same answer. Maybe if you're in this room you're not surprised about that.
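As a sketch of that guaranteed fallback (SYCL 2020 style; this needs a SYCL compiler, for example DPC++ invoked as `icpx -fsycl`), you can construct a queue with the default selector and ask which device you actually got:

```cpp
#include <sycl/sycl.hpp>  // SYCL 2020 header; older implementations use <CL/sycl.hpp>
#include <iostream>

int main() {
    // The default selector is guaranteed to resolve to some device;
    // if no GPU is available, a CPU (or host) device is used instead.
    sycl::queue q{sycl::default_selector_v};

    std::cout << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>()
              << "\n";
}
```

On a machine with no GPU at all this still compiles and runs on the CPU device, which is exactly the debugging technique described above.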
Yeah. It works surprisingly well.

The one in red is called hipSYCL. So the first two were proprietary: Intel's one is free as in beer, but not free as in speech, and the Codeplay one I think was also free to download, but you could buy support. The point there is you could buy support from a compiler vendor, from a choice of two compiler vendors. So say you had an automotive system with some SYCL code in it, and you didn't want your car to crash, and you wanted to ensure you could sue somebody if that happened: you had two proprietary ones to choose from. Unfortunately, Intel have bought out Codeplay. They're still operating as two separate companies, but your choice of proprietary SYCL compiler vendors is now reduced down to one again.

hipSYCL comes from Heidelberg University and follows a completely different sort of route to the other two compiler vendors. It's called hipSYCL because it originally targeted AMD GPUs with HIP, but now it targets many, many more backends, many more bits of hardware, and so the name didn't really fit anymore. They decided to change it to Open SYCL, which is a better name. Unfortunately, if you look at the GitHub issue (it's not quite clear, they've been a bit cagey about who is suing them), somebody is suing them over the name Open SYCL, so they're going to have to choose a new name. It's temporarily reverted back to being hipSYCL.

So if we look at that slide: you start with your C++ source code, you have an implementation, and it targets a backend, and that pattern is repeated for all these other implementations. If you have Huawei hardware, if you have NEC Vector Engines, there's an implementation to let you target those. One thing I think is quite interesting on this slide is the Celerity one, the one in green. That lets you target clusters with the SYCL abstractions; the backend it's targeting there is MPI.
But you don't write any MPI at all. You just distribute the work as you would if it were a normal single device connected to your host, except in this case it's an entire cluster. So you can go to that URL there, sycl.tech, and see what the state of the art is currently; there are new backends and implementations being announced all the time.

What does it look like in source code? As I mentioned before, it's C++ only. Unlike OpenMP, which you heard about earlier, which you can put inside C and Fortran and C++, SYCL is C++ only. So you #include the header file, and all of the objects and types are in a namespace. There are only a limited number of things you have to learn about: you have to learn about the device, and a device selector, and a SYCL queue. And this is what the kernel launch looks like. This is a C++ lambda function; the square brackets there indicate that the things outside are captured by reference. You get a command group handler when you submit to the queue. You transfer the data in and out with these buffers and accessors: outside of the kernel, on the host, you create a buffer, and then on the inside of the kernel, which gets executed on the device, you talk to that buffer via an accessor, and you can specify the direction of the data transfer there. And then the actual compute, C equals A plus B, is done here, and that's done for each element in the iteration space, in a manner that's determined by the implementation. It might be serially, if you're just running on a single CPU, or it might be warp by warp if you're running on an NVIDIA GPU; it depends on what backend you've got. And then at the end of this, when the destructor is called at the very end of the scope of this queue submission, the data is moved back to the host, and then it's available to be written out.
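Putting those pieces together, a minimal sketch of the vector-add example just described might look like this (SYCL 2020 syntax; again this needs a SYCL compiler such as `icpx -fsycl`):

```cpp
#include <sycl/sycl.hpp>  // SYCL 2020; older implementations use <CL/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    constexpr size_t N = 1024;
    std::vector<int> a(N, 1), b(N, 2), c(N, 0);

    sycl::queue q;  // default device: a GPU if one is available, CPU fallback otherwise

    {   // Buffers take ownership of the host data for the duration of this scope
        sycl::buffer bufA{a}, bufB{b}, bufC{c};

        q.submit([&](sycl::handler &h) {  // the command group handler
            // Accessors declare how the kernel uses each buffer
            sycl::accessor A{bufA, h, sycl::read_only};
            sycl::accessor B{bufB, h, sycl::read_only};
            sycl::accessor C{bufC, h, sycl::write_only};

            // One work-item per element; how they are scheduled (serially,
            // warp by warp, ...) is up to the implementation and backend
            h.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
                C[i] = A[i] + B[i];
            });
        });
    }   // bufC's destructor waits for the kernel and copies the results back into c

    std::cout << "c[0] = " << c[0] << "\n";
}
```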
So it's advised not to do lots of IO from the device. And that's the sort of "hello world" of SYCL. It's a few lines; it's a handful of C++ objects and types that you need to deal with. It does get more complex than that, but you can get started, you can offload a kernel, with just that knowledge.

That's great if you're writing your code from scratch, but a lot of people will have code that's already written that they'll need to port, and they'll probably be coming from the NVIDIA world, so they'll have code that's written in CUDA. Before I get on to that, though, let's step back and note that all of this on the screen here was just C++ syntax. There were no triple chevrons for kernel launches, there was no .cu file naming or anything like that. It might seem like a minor point, but the point is that you can now use the whole ecosystem of tools that understand C++ syntax, and they will understand your SYCL code. That's linters, static analysis tools, anything that understands C++ source code: it won't be tripped up by something that's not in the language, because SYCL is just C++ code.

So say you have something that's not just C++ code.
It's in CUDA. You can convert it, or go a long way towards converting it, with a tool called SYCLomatic. It has been open-sourced; it used to be a tool from Intel called DPCT, the DPC++ Compatibility Tool. SYCLomatic is a much better name. It'll take your CUDA source and give you back SYCL source code. It gets about 90 to 95 percent of the way there; some simple kernels will go straight through with no problem. Sometimes there's no exact match for the CUDA API, and when that happens it puts a comment into the translated source code, which you can search for, that basically says you've got to do some work here to complete that last five or ten percent. That takes a lot of the drudgery out of it, and a lot of the error-prone, repetitious stuff, so I would advocate SYCLomatic if you're needing to port CUDA.

What do you lose by having a programming model that runs across all architectures? It's a little bit unknown, but it's not that you can't know. This is what we knew about it in 2020: some colleagues from the University of Bristol produced this paper, which was at ISC that year, and they were testing SYCL versus OpenCL across four different architectures. As you'll note, the difference in performance between the different programming models varies according to the architecture. Not shown here, it will also vary according to the implementation, and, also not shown, it will vary over time, as people fix things and improve the performance. We just had IWOCL, the International Workshop on OpenCL, and SYCLcon in Cambridge last week, and the chair of that conference said you do need to recheck if you base your assumptions on what was true two years ago.
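Going back to the porting step for a moment: assuming SYCLomatic is installed, the workflow is roughly as follows. This is a sketch; the source file name is hypothetical, `c2s` is the name of the SYCLomatic driver binary, and the output directory and diagnostic prefix are taken from the tool's documentation.

```shell
# Translate a CUDA source file; translated SYCL code lands in ./dpct_output by default
c2s vector_add.cu

# Search the translated source for the diagnostic comments mentioned above,
# which mark the places you have to finish converting by hand
grep -rn "DPCT" dpct_output/
```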
The space is moving so fast that what was true then is not necessarily the case now. So, as I said, SYCL will guarantee you correctness portability; it doesn't necessarily guarantee you performance portability, but if that's something that concerns you, you can look into it.

So who does use SYCL at the moment? It's a handful of domains, a handful of industries; it's not super widespread yet. I think Kenneth and the rest of the EasyBuild crew would be able to look through the EasyBuild sources and do some sort of survey of what is using it and what isn't, or judge it by the volume of tickets coming in. Probably the answer is "many things, but not yet". One that is, is GROMACS. They put a lot of effort into making sure that GROMACS works very well on SYCL, so that they can now target all three of the GPU vendors with the same source code base.

If you want to learn more about it: you missed out last week, as I mentioned, when IWOCL and SYCLcon were on; there's a SYCL booth and a SYCL workshop at ISC in Hamburg next month; and there is a SYCL booth, a SYCL workshop and SYCL training proposed for Supercomputing in Denver in November.

And since this is an EasyBuild conference, I thought I'd look into EasyBuild and see what was there. hipSYCL is supported software according to the documentation; it's about two minor versions behind, and I would probably hold off on renaming it to Open SYCL, because you're just going to have to rename it back to something else, so I'd leave the name as it is. If you're using the Intel toolchain with the new compilers, the oneAPI compilers: there's an option that defaults to None, and you would set that to True if you wanted to use the new compilers, which you would if you had SYCL code. The classic Intel compilers, icc, icpc, ifort and so on: the classic C++ compiler doesn't understand SYCL. You need the newest Intel C++ compiler, the one that's based on LLVM, to do that.
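A sketch of what that setting looks like in an easyconfig: the toolchain option is called `oneapi` in recent EasyBuild releases, and the toolchain version here is just an example.

```python
# Excerpt from a hypothetical easyconfig using the oneAPI compilers
toolchain = {'name': 'intel-compilers', 'version': '2023.1.0'}

# 'oneapi' defaults to None (the classic icc/icpc/ifort where available);
# set it to True to build with the LLVM-based icx/icpx/ifx,
# which is what you need for SYCL code
toolchainopts = {'oneapi': True}
```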
So you would need to flip that setting.

I've finished a bit early, but those were some of the people who produced the pictures and other things, and this is my list of handy URLs. The first one is the Khronos Group, who standardise it. If you want the definitive answer on what the current standard is, that's where to go; they just released the latest revision, revision 7, of SYCL 2020. A word on naming: the first one was SYCL 1.2, and then SYCL 1.2.1, and then they decided to change the naming scheme, and now it's SYCL 2020. They base the name on the ISO C++ standard that it's built upon, so once ISO C++ comes out with C++23, you'd expect to see a SYCL 2023 coming soon after. sycl.tech is run by the Codeplay people, I believe, and that aggregates a whole bunch of community news and useful resources, including a link to the third one I've got down there, the SYCL Academy. That's a set of online materials for learning how to write SYCL. A short plug for my sponsor, the Cambridge Open Zettascale Lab: if you want to know what we're about, that's the link to find out. And I left Twitter in November last year, so you can find me on Mastodon: HPCChris at scholar.social. And I will take questions. Do we need a microphone?

So, two questions there. The first one is: is SYCL translated to CUDA PTX if you're targeting an NVIDIA device? Yes. Exactly how it happens will be implementation-dependent, so hipSYCL will do it a completely different way to DPC++, but they're all using LLVM in the background. If you remember from JP's talk earlier, he was talking about the host code and the device code splitting out into separate streams, having separate compilers and assemblers, and then getting munged back into a fat binary at the end.
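For reference, with DPC++ plus the Codeplay plugin, targeting an NVIDIA GPU and keeping the intermediate files (including the generated PTX) for inspection might look like this. This is a sketch, with flag names as documented for recent oneAPI releases and a hypothetical source file name:

```shell
# Compile SYCL source for an NVIDIA target via the Codeplay plugin
icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda vector_add.cpp -o vector_add

# Keep the intermediate files, the generated PTX among them, so they can be inspected
icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda -save-temps -c vector_add.cpp
```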
That's the process that would be used. Somebody from Codeplay gave a really great talk, a year ago now, on how to debug SYCL code, and there's a flag you can give the SYCL compiler, each implementation has its own version of it, to get it to dump the PTX, so you can actually inspect it and see where things are going right, or all going wrong.

The second question was about where things might not go right if you're using SYCLomatic to automatically translate from CUDA. The first one we found: a group of us in the DiRAC project took a look at SYCL about a year ago, to see how it would work on some example codes that we're interested in, and the tool used some Intel-specific namespaces, some Intel-specific implementations. So it had a dpct namespace and a dpct something; I can't remember what the kernel, or the mathematical function, was, but the solution there was to rewrite your own version of that thing. It was a year ago and I've forgotten what the exact one was. But most of the basic-level stuff works really well: all of the data transfer stuff has direct one-to-one correspondences, and all of the kernel launching stuff directly translates. I'll talk to you afterwards. What I can't do, because of the way Zoom is not working for me today, is share my second tab, which has got the diagnostics that SYCLomatic puts out when it can't do something. It gives you a code and an explanation, and that list of explanations shows you some of the things that are not implemented.

Yes. So, what do I think we can do to better convince vendors to support SYCL? The specific example given was LUMI, where you want support from a commercial vendor: you don't want to have to rely on Heidelberg University. Nothing bad about Heidelberg University; Aksel, the guy who writes it, is a genius and a machine, and he's really good at support.
But yes, he's just about one person. I guess it's the normal way you get any vendor to do anything: you ask, and you say, well, you won't buy the next one unless they do it. So through whatever your support channel is, you would say: I need to run the SYCL version of GROMACS on LUMI, and it's not compiling.

Right. So the question is along the lines of: SYCL is different to Kokkos and RAJA, because although all three of them are C++ abstraction layers over different device programming models, SYCL needs a compiler implementation, whereas the others can be compiled with an existing compiler. And that's a message an awful lot of people don't understand at first. As for advice on how to get your vendors to do what you want: I think maybe that's a discussion for afterwards. I'll leave that on the screen.