All right, hi everyone. I think we could just hear the bell from outside, so we can start now. Thanks for coming to this session on Cycles on Intel GPUs, and all things around it in Blender. My name is Xavier. I've been working on the oneAPI backend in Cycles with Nikita and Stefan, who are based in Germany but are here today. On the Blender side I've worked with many people; I haven't brought the whole list, but there are a lot of people we need to thank, Sergey also. We owe a lot of this work to them too, and I'm here to present all we did. If you have questions, you can interrupt me, because there are various topics in this presentation, so you don't need to wait till the end; that way I can still remember what I've just said.

On the agenda: a brief presentation of the Intel GPUs, what they are today; why oneAPI; a short demo. Then we'll go a lot deeper into the implementation in Cycles and how it works, with tools, performance tuning, and our roadmap for Blender. I can't really talk about the roadmap for the GPUs themselves.

So, here is our current GPU lineup for desktop. We have the usual 3, 5 and 7 ranks; we don't have 9 yet. We're really aiming at the low and mid range of the market, which is what a lot of people want. Here are the details: we have varying numbers of execution units, the media stack is basically the same across all of them, and we have variations of these for laptops and also for workstations. Going a little more into detail, one thing that's really nice with these new discrete GPUs is the presence of an AV1 encoder, which was really an industry first when we released the Arc A380 earlier this year. Today you can already use it in software like FFmpeg built from master, and it's getting traction; we're really happy to see users using AV1. But AV1 is not the topic today.
The big topic is Blender, so let's talk about GPU support in Blender. Here is an overview of how it worked before, before I joined the project, with Blender 2.x. Blender 2.x already supported Intel GPUs, so supporting Intel GPUs is not really a new thing. The only issue was that it was supported through OpenCL. Correctly, sometimes; there was a fair share of bugs.

Then something happened, which is Blender 3.x. And, definitely not my words on the slide here, the sad reality is that they decided to remove OpenCL support, for many reasons, and that meant that Intel GPUs, AMD GPUs and Apple GPUs were basically not supported anymore. So what we did, and not just us but the whole industry, was to make it work again through something that's not OpenCL. The sad part is that the APIs are fragmented; the good part is that each of them is now a good API. I'll talk about ours on the slide right after. It took some time, especially for us, to come back with Intel GPU support, and we're starting with the discrete GPUs, but that's where the performance is anyway. So everybody is working on more or less its own backend. It's still the same code base in Cycles that's able to target all of these, and the per-vendor framework sits alongside it. We'll go a little deeper into these topics. In Blender you have all the shading and also all the ray intersections, and nowadays you have hardware ray tracing.
We do have hardware ray tracing on all of our Intel discrete GPUs. Right now Blender doesn't support it yet; we need Embree to finalize support for hardware ray tracing on our GPUs, and then we'll be able to integrate it.

Our solution for a backend after OpenCL has been to go with oneAPI. oneAPI is used for a lot of things at Intel; it's a bit of a marketing term, sometimes painful, but in this context it means the oneAPI DPC++ compiler, which is based on the SYCL language. We're not the only ones who provide a SYCL-compatible compiler, but I think our compiler is quite compelling; you'll see this in some of the next slides. oneAPI is really there to help people avoid vendor lock-in and to really be able to work all together on a code base that can target many different hardware vendors, and we meant it to be really open. There is now a new community forum, and there is a process to standardize everything, because I think it's quite ready now and can be used and driven by a broader audience of people who want to run things everywhere and not reinvent the wheel four times.

So the oneAPI goal is to be able to target CPUs, GPUs, FPGAs, or really whatever, because we can also find other types of compute devices. Of course, if a compute device is very specific, it may not run exactly the same code, but the ecosystem should make it easy for you to target it. In the Blender oneAPI backend we're using the oneAPI Data Parallel C++ compiler. It's our implementation of the SYCL language, which is driven by Khronos, so again an open standard, and that allows code to be reused across hardware targets and different compilers.
You can find other compilers that let you use SYCL, like ComputeCpp or triSYCL, and other funny names. I haven't tried the other compilers; our focus has been the Intel one, but not just for Intel hardware, and I'll give some more details on that. So yes, our compiler is basically able to compile C++ with the SYCL extensions, and we also have additional extensions that we thought were needed, but our goal is to make those part of the SYCL language too, if others want them.

More precisely, into the architecture of our DPC++ stack: there is the compiler at the top, and we have a runtime underneath. Under that runtime we have a plugin interface that's rather simple, and that's what really allows people to come up with plugins to support new hardware. If you download the compiler from the Intel website, you'll get it with just the Level Zero and OpenCL backends. But if you go the open-source way, grab it from GitHub and compile it yourself, you can already build backends for CUDA and HIP, which are also being heavily worked on and are in pretty good shape. If you have been downstairs checking out our demos, we're showing code that runs using all these backends at once: CUDA, HIP and Level Zero. So it's a rather clean API; the plugin interface lets you implement everything in a single backend file, and we have a plugin discovery mechanism.
So at runtime you're able to discover backends, and that's how we enable the stack. One thing you'll see here is also the fact that you can run the same code on the CPU. But right now, the only solution we have for the CPU that allows using more than one thread, which may be important, is to use the CPU as a device like you would use a GPU, and that currently goes through a specific OpenCL CPU runtime. So this part is not ideal yet. The vision is really to let you run the code as fast as possible on any GPU but also any CPU; right now I would say the CPU backend in Cycles is pretty safe from that competition. We're not really ready to unify this part; there are already a lot of good solutions for CPU, and that stack is already quite good, so on our end we need to make improvements to the CPU device. It's all being worked on in the open. This whole compiler is on GitHub, and you can check it out by yourself as well.

Key differences between CUDA, HIP and oneAPI from a developer's point of view: oneAPI is really C++ oriented; everything is C++ in there. If you don't like templates, maybe it's going to be complicated at first, but it's not that bad. I'm not a big C++ fan sometimes, but still, it's good, and it allows things to be clean and better defined in terms of type safety and things like that. That's the good part, rather than passing void pointers everywhere and then wondering what that thing is. Sure, a typo can get you three pages of templated errors, but at least it points you somewhere. So everything is defined as classes and methods.
You have exceptions that can be thrown from anything you call from SYCL, and you can also have your own async exceptions, all using C++ exceptions.

On the memory side, you have two choices. You can deal with memory like you would in OpenCL, or you can go pointer based, which is the solution we went with for Blender. Pointer-based allocation is what we call unified shared memory: you have one address space, and you can just call malloc. There is malloc_host to allocate on the host, malloc_device for the target device, but also malloc_shared if you don't really know which device to pick and want to let the runtime decide. Blender doesn't do quite what the other backends are doing; we've been using specific allocations, targeting device and host explicitly, but malloc_shared is also a potential solution. And it's the same free function for all of them. The only drawback of using this is that right now there is no support for hardware texture sampling, so you need to re-implement all the texture interpolation in software. It was already done for the CPU, so it was no big deal to do, and the performance impact is not that big. Still, we could gain some, and hardware sampling support will come in the future; I don't know when, but it needs to be standardized first, and then the whole chain behind it implemented to make it work.

The OpenCL-like way, which I'm not going to talk much about because we're not using it, requires you to define buffers and accessors. It's nice in a way, but the huge limitation for Blender is that you need to know everything at compile time, including how many images, well, textures, you will need, which in the case of Blender you don't. You can't tell users that beyond some number of textures they need to recompile; that's not going to work.

So that was a primer on oneAPI and how it works. Right now the oneAPI backend in Blender supports the Arc discrete GPUs. I have some tricks up my sleeve for those who really want to make it run on Intel integrated graphics, older GPUs and so on. We don't really test that.
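To make the software-filtering point concrete, here is a minimal plain C++ sketch of bilinear texture filtering over a USM-style flat pixel array. This is my simplified illustration (one channel, clamp-to-edge addressing), not Cycles' actual interpolation code.

```cpp
#include <cmath>
#include <cstddef>

// Clamp an integer texel coordinate to the valid range.
static inline int clamp_int(int v, int lo, int hi)
{
  return v < lo ? lo : (v > hi ? hi : v);
}

// Bilinear sample of a single-channel image stored as a flat array.
// u, v are continuous texel coordinates in [0, width] x [0, height].
float sample_bilinear(const float *pixels, int width, int height,
                      float u, float v)
{
  // Shift by half a texel so integer coordinates hit texel centers.
  float x = u - 0.5f, y = v - 0.5f;
  int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
  float tx = x - (float)x0, ty = y - (float)y0;
  int x1 = clamp_int(x0 + 1, 0, width - 1);
  int y1 = clamp_int(y0 + 1, 0, height - 1);
  x0 = clamp_int(x0, 0, width - 1);
  y0 = clamp_int(y0, 0, height - 1);
  // Fetch the four neighbors and blend with the fractional weights.
  float p00 = pixels[y0 * width + x0], p10 = pixels[y0 * width + x1];
  float p01 = pixels[y1 * width + x0], p11 = pixels[y1 * width + x1];
  return (1.0f - ty) * ((1.0f - tx) * p00 + tx * p10) +
         ty * ((1.0f - tx) * p01 + tx * p11);
}
```

The real code additionally handles multiple channels, wrap modes and 3D lookups, but the structure is the same: integer texel fetches plus weighted blending, done by the kernel instead of the texture unit.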
Because we don't test those, we don't really want to release them; also, there is the performance. I mean, do you really want to run all of this on an iGPU? If you have one, yes, an iGPU may bring a little more performance than your CPU, but not by a huge margin. Still, with an iGPU you can theoretically use a lot of your system memory; DDR4 may be easier to have in huge volume than memory on a discrete GPU. So if you really want to render a scene that needs 40 gigs of memory, you can use the iGPU. I'm not sure how fast it is, but you can try it and let us know if it's interesting to you.

In terms of implementation status, we have all features up and running. At least I haven't seen a lot of bug reports yet. The cards haven't been out for long, but people have been using it and have been pretty happy with it, on both Linux and Windows as far as I know. And everything in it is open: SYCL is an open standard, oneAPI has an open forum being created now, the compiler itself is open source, the GitHub backends are open source, the GPU binaries compiler is open source, and the runtime itself is open source. So if there is a bug, you should be able to track it down, and we'll help you, no problem.

So, I said I would show how it works; I hope you've already seen a lot of it down there in the lunch area, where we show it live on an Ubuntu machine with a bit more than just Intel GPU support. But here is how it looks in 3.3.
So, 3.3 stable, and a scene from a colleague called Bob Duffy, who worked a lot with us. He is a resident artist who is much better than me at making stuff crash, because when I try, everything works all the time. We need people like him, and it's really thanks to him that everything was quite stable at launch, with what I would say is pretty good performance. I have some slides on the plans for performance, of course, and you can check by yourself downstairs.

Now into the details: first the build system, then the code itself. Here's the overall look of how it works when you build the oneAPI backend. The first part would be Embree on GPU; it's work in progress, and it's going to use oneAPI as well, when it's released, to target Intel GPUs. Then you have the oneAPI DPC++ compiler, then the graphics compiler, because we want to pre-compile all the GPU binaries to save time for users; otherwise it would take 15 or 20 minutes on the user's machine, and you don't really want to wait that long before your first render. So we pre-compile. Then you have the SYCL runtime, which ships as part of the compiler stack, and the Level Zero backend, and we package both the runtime and the backend inside the application as shared libraries. At runtime, the backend expects to find the Level Zero loader, because Level Zero is a bit like OpenCL: it's an API that should be cross-vendor, and anybody should be able to make their own Level Zero implementation. So the loader is not particularly tied to DPC++; it's really for Level Zero, and it's separate from the Intel-specific runtime. On Windows it's part of the graphics drivers; on Linux the install guides should tell you about it, but if you're missing it, it's a separate package. And then at runtime you can have the graphics compiler being used.
It's the same compiler, but shipped as part of the driver, and it will be used if you just don't have the binaries, because you're using an unsupported GPU, for example. If you try to use Blender 3.3 with a GPU that releases in two years, it will still work, hopefully. And then you have all the rest of the kernel; I will not talk much about this.

So, practically, where can you get the compiler? When working with Blender, it's easy. We do have our own pre-built packages on GitHub, but what we've done for Blender, mainly because of the VFX platform requirements and compatibility reasons that we don't honor with the pre-release packages, is to recompile everything, and we store these pre-builds on SVN along with all the other pre-builds. So you don't need to download anything or any extra packages when building Blender with the oneAPI backend. You check out the sources, call make release, and you have it included; no downloading anything from a website, no accepting any end-user license agreement. I think it's quite a smooth process. Even down to the GPU binaries: the compiler that generates the GPU binaries is also compiled on Blender infrastructure, and you'll find it on SVN if you're targeting Linux. If you're targeting Windows, it doesn't compile from sources at the moment, so you download the pre-built package from a specific page on the Intel website. That's one extra step on Windows if you want to pre-compile the binaries, but during development JIT is usually convenient anyway.

It's integrated into CMake: we have WITH_CYCLES_DEVICE_ONEAPI and WITH_CYCLES_ONEAPI_BINARIES as our two options, like for the other backends. Internally, the way it works is that it calls the DPC++ compiler, named clang++ here, with -fsycl, so that it compiles the SYCL code, and it handles everything.
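To give an idea of what that looks like in practice, the pieces fit together roughly as below. The CMake option names are the ones from the talk; the compiler flags are my illustration of the ahead-of-time flow and should be checked against the compiler documentation for your version.

```shell
# Enable the oneAPI device and pre-built GPU binaries in Blender's CMake:
cmake .. -DWITH_CYCLES_DEVICE_ONEAPI=ON -DWITH_CYCLES_ONEAPI_BINARIES=ON

# Under the hood, the kernels are compiled with the DPC++ compiler, roughly:
#   clang++ -fsycl <kernel sources> -shared -o <cycles oneAPI kernel library>
# and ahead-of-time compilation for Intel GPUs adds something like:
#   -fsycl-targets=spir64_gen -Xs "-device <gpu>"
```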
We call this using add_custom_command with the proper environment variables. Another cool part, and that's also what we demonstrate downstairs, is that since it's oneAPI, we want it to be able to run on all platforms, not just Intel. We didn't spend a lot of effort making that work; Stefan made a first try at it, and it worked. We found it cool, and that's why we're showing it. Without many changes, you can just take latest master and add a few specific SYCL targets to compile for the AMD and NVIDIA (PTX) platforms. One thing to note if you target AMD stacks: since they don't have an intermediate representation for the compiler yet, you need to specifically ask for the target architecture on AMD. You can also set the architecture for NVIDIA, but if you don't, the kernels will just be recompiled from PTX.

Not everybody has all the devices the oneAPI backend can expose, so we deal with that with two environment variables: SYCL_DEVICE_FILTER and CYCLES_ONEAPI_ALL_DEVICES. For example, if you want to target Intel iGPUs, you just need to set CYCLES_ONEAPI_ALL_DEVICES, and they should show up, compile and run. It's been a long time since I've made it crash. And that's really all you need to have a build of Blender with the oneAPI backend that runs on all platforms; that's what we were showing downstairs, and performance is actually quite good. We are not running at 10% of the native thing, otherwise we wouldn't show it. I can't disclose performance numbers, but you can check yourself downstairs; we don't hide the timers or anything, and we're quite proud of the performance there. Our colleagues working on oneAPI are really aiming at just matching native performance, so if it's not at 90 or 95 percent,
we're not happy, and you can report a bug. The only thing you're going to miss by going through this backend is hardware ray tracing, because right now AMD and NVIDIA don't really open up how their hardware ray tracing works if you want to target it from oneAPI rather than from HIP or CUDA. If they open it up a little, then maybe in the future we'll be able to run with hardware ray tracing from here too.

With these settings, practically, if you want to run Blender on something other than Intel, you just pass the new SYCL targets and the specific options for each of these targets. And if you really want to reproduce at home the demo we did downstairs, you just need this slide. It gives a way to recompile the DPC++ compiler so that it fits with the other Blender dependencies, which are compatible with glibc 2.17 and the old C++ ABI. Then you just specify that you want to use this compiler instead of the one from the pre-builds. For AMD we still have two little fixes to make to keep the compiler happy. Not a big deal, but they let us really match native HIP performance, which is nice, so don't try it without them; you could be disappointed.

If anything doesn't work, we have a slide just on troubleshooting. I will not spend a lot of time on it, but we have environment variables that give you tracing at all levels, SYCL and Level Zero, and if you have an unsupported device that you don't see, remember that you still need CYCLES_ONEAPI_ALL_DEVICES.

That was it for the high-level part; well, the build system is not super high level, but it's still higher level than the code, so let's dig into that part now. Cycles itself: the kernels are written in C++ headers meant to compile for all the backends at once, so there are very small differences between backends.
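To picture that single-source model, here is a toy kernel-style function: written once in a header as plain C++, it can be compiled by the CPU toolchain and fed unchanged to the GPU backends, with only qualifier macros differing per backend. The macro and function names imitate Cycles' style but are my own simplified stand-ins, not the real ones.

```cpp
#include <cstddef>

// Per-backend glue would define this differently (for example adding a
// device qualifier under CUDA); for plain C++ it reduces to an ordinary
// inline function.
#define ccl_device_inline static inline

// A tiny "kernel header" function shared by all backends.
ccl_device_inline float film_apply_exposure(float value, float exposure)
{
  // Identical math on every backend; only the launch glue differs.
  return value * exposure;
}
```

The real compat headers also map thread indexing, address-space qualifiers and atomics, which is exactly where a CUDA-to-SYCL dictionary becomes useful.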
I think Brecht would give us a dark eye if we put #ifdefs everywhere for targeting the various backends. That's not the goal, and it's rather clean at the moment: all the backend-specific code is contained in a compact compatibility header, plus the code that does all the memory transfers, launches kernels and so on. These were written with CUDA in mind first, so here's a small dictionary to find your way between CUDA and SYCL terminology. Yes, basically nothing has the same name; SYCL is more like OpenCL in terms of definitions, so subgroups are the equivalent of warps. One thing about subgroups on Intel platforms: we can have different SIMD sizes, so we can execute SIMD8, SIMD16 or SIMD32 on the Intel discrete GPUs, which is not really common coming from CUDA, where I think warps are always 32 wide. So that's something to watch for if you want to synchronize a lot of things. The good thing is that in Cycles' compat header you can still use the CUDA terminology; you don't need to switch that much.
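For reference, here is roughly what a launch looks like on the SYCL side once the macros are expanded away. This is a hedged sketch that needs a DPC++/SYCL toolchain to build; the function and kernel body are made up for illustration and are not the actual Cycles code.

```cpp
#include <sycl/sycl.hpp>

// Illustrative only: USM allocation, a 1D nd_range launch (global/local
// sizes map to CUDA's grid/block), and the single free() for all
// allocation kinds mentioned earlier.
void launch_example(sycl::queue &q, size_t global_size, size_t local_size)
{
  float *state = sycl::malloc_device<float>(global_size, q);

  q.parallel_for(
      sycl::nd_range<1>(sycl::range<1>(global_size),
                        sycl::range<1>(local_size)),
      [=](sycl::nd_item<1> item) {
        const size_t i = item.get_global_linear_id();
        state[i] = 0.0f;  // stand-in for real integrator work
      });
  q.wait_and_throw();  // asynchronous SYCL errors surface as C++ exceptions

  sycl::free(state, q);
}
```

When code assumes a particular SIMD width, the subgroup size can additionally be pinned with an attribute such as [[sycl::reqd_sub_group_size(16)]] on the kernel.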
The way we are launching kernels is still simple: we have basically a big switch case with all the kernels and all their arguments. I'm definitely not going to show the macros and templates, because they wouldn't fit on any slide, but the end goal of these macros and templates is to launch something like what you see in the second half of the slide: basically calling parallel_for on a one-dimensional range, using the specific kernel you're calling, with the global and local sizes.

Now, when it comes to performance tuning, the things to watch for are the same as on any GPU. It's often that we spill registers: some kernels are big, with a lot of variables. That's hard to avoid, but it's important to monitor, because it can ruin performance in many cases. Sometimes you can get good results if you play a little with loop unrolling and with inlining thresholds to make the compiler happy. And if you make a change and all of a sudden the compiler takes two hours instead of 15 minutes, well, maybe something happened; it's good to check. You can educate the compiler yourself if you know what you're doing. I should have put the Blender equivalents here, but basically: ccl_device_inline and ccl_device_noinline. On Intel GPUs there is also what our internal terminology calls large GRF mode, which gives you twice the number of registers, and you can opt in for it; we do this on the CMake side. So all these are things to look for, and we have tools to help navigate them. The first one is the compiler output, which is really too small on this slide, but it gives you the number of registers allocated, the spills, and the SIMD size used. So if you make a change and these numbers change for the worse, like spilling more registers, it's good to check and maybe revise the change, and it gives you an understanding of why it regressed. And in terms of tools, we have our own tools.
It's just one environment variable to get full debug output, and we also have an option to output all the assembly inline with the code on Linux. Then we have tools like VTune and oneprof. VTune right now is not fully ready in public for the Intel discrete GPUs: you can do the overall analysis, but it's not as good as it could be yet. I hope that soon you will also be able, like on our other GPUs, to dig into the exact source code lines and see all the hardware counters from the GPU and all the memory transfers. It's super powerful, but right now these GPUs are a bit too new for it to handle, and the software is not completely ready for the Intel discrete GPUs. You can also run it on the other Intel GPUs, but then you're not really running the code on the target, so it's a bit more painful. If you want all the hardware counters today, you can already access them through oneprof; the link is in the presentation. It's just a profiler that lets you dig into all the counters that are available.

If you know Godbolt, the Compiler Explorer, it lets you run a quick snippet of code through a compiler and then see the LLVM IR and even the GPU binaries from the browser, which is super cool if you want to check how a little snippet compiles for GPUs. It is integrated with the DPC++ compiler, and even with the DPC++ compiler with CUDA support, which is fun. Also, if you don't have an Intel discrete GPU yet, because they're often out of stock, you can already run the whole oneAPI backend on NVIDIA and use tools like NVIDIA Nsight. It just works, and you get the proper names for everything.

That was it for the code part. Now, what are our plans? First of all, for Embree: we're waiting for the next major release of Intel Embree. It should be in the coming months, hopefully by the end of this year, and then we'll be able to integrate it in Blender. Internally, we have started that work.
We have internal builds up and running, so it shouldn't take that long to go from the official release of Embree for GPUs to publicly available source code and builds of Blender using it. We'll see how it materializes and which Blender version we can target, but everything will happen publicly once Intel Embree is released. Right now we foresee just small code changes, so it will be part of the oneAPI backend for sure. We'll have to split the Embree intersection filter functions to work better with GPUs, and also use 8-bit ray masks, because that's what we need on the GPU hardware side.

Beyond Embree we have a lot of other plans. OpenPGL: you have already seen the talk by Sebastian and the demo downstairs; right now it's CPU only, but our plan is to make it run on our GPUs as well. Open Image Denoise is already integrated and gives nice results, but again it's not running on the GPU yet; hopefully we'll progress there. OSL is maybe a more long-term plan, but we see a possibility to also make it work on Intel GPUs in the future; no dates, it's really just a plan for now. And we are still working hard on performance. I think performance is quite good with this backend, and it will get better with the use of hardware ray tracing, but even beyond that I think our hardware is capable of more than what we are doing today, even if it's already quite good. We have plans to make it even better on these first generations of hardware, so yes, expect improvements; we keep working on it.

And with that, that was the whole tour. If you have other projects that could use oneAPI, here are some resources. If you have CUDA code, it's sad, but we have the SYCLomatic project that allows you to convert CUDA code to SYCL. Maybe you'll never look back; that's kind of what we would hope, because you're still able to run on NVIDIA GPUs after going to SYCL, and if you get good performance, just stick with it.
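For the CUDA-to-SYCL migration just mentioned, the workflow with SYCLomatic looks roughly like this; the tool name c2s and its flags are to the best of my recollection, so check the SYCLomatic documentation before relying on them.

```shell
# Migrate a CUDA source tree to SYCL, then build the result with DPC++:
c2s --in-root=./src --out-root=./src-sycl ./src/kernel.cu
clang++ -fsycl ./src-sycl/kernel.dp.cpp -o kernel
```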
If you don't get good performance, you can also report bugs on the various projects presented here. They are all open source, open to issues and so on, and you could even contribute. Although, depending on the project, I don't think many people will contribute to the Intel GPU binaries compiler; but still, if you spot something, you can. And we have a thread on DevTalk if you want to give feedback on how it ran, report some issues, or get some troubleshooting. I know setting everything up on Linux right now is easy if you have Ubuntu 22.04, and maybe a little more nightmarish if you're using some fancier setup, but it should work, and we can help.

We also have a website that summarizes everything we've been showing at the Blender Conference; you can give it a look. And we have a small contest to win an Arc GPU, or even a full PC with an Arc GPU, a NUC. With that, I think it was a lot of information; I hope it was useful. Maybe some questions now? It's your turn.

[Question from the audience] Let me repeat the question for the recording: you tried to compile the oneAPI backend on your laptop and you ran out of memory; do we have any plans to make that situation better? Yes. Right now, if you compile, you need maybe 8, 10 or 12 gigs of RAM, which is huge, and that's also why we pre-build, so users don't compile on their laptops. We have plans to make it better; I think it's especially at the post-link stage that it gets really bad. We're trying to improve this side as well; it's not perfect.

[Question from the audience] The question is about OSL support on GPU, whether one could maybe even contribute to make it happen,
and whether we have a public roadmap or things like that. I don't think we have a public roadmap on this yet. I don't know all of the OSL process, and I'm not working with them directly; I just know that what we are doing there is public open source, so you can check it out. People contribute, and there is a bunch of open pull requests; that's where the status is right now in OSL. It's hard to give a date, these things take time, but that's where you can look, and you can contact me; I can put you in touch with the right people if you really want precise next steps.

[Question from the audience] Let me summarize, because it's a very precise question, for the recording: what's the status of the performance of running oneAPI on top of the other backends versus directly using those backends, so here precisely oneAPI on HIP versus HIP? We have a super strict process internally to make performance claims, and I haven't had the time to follow it at all, so I can't say anything. But you can find your answer using our demo downstairs: you can open a scene and do the runs yourself if you're interested, and you'll see. If we show it here, it's because it's good.

[Question from the audience] You've shown a tool to port CUDA code to oneAPI. Do you have the same for OpenCL? Because I have a lot of OpenCL code which I'd really prefer to keep, and I've been busy porting it to HIP. So, the question is: we have an open-source porting tool for CUDA code to SYCL; would we have the same for OpenCL? That's a good question. I would say we don't see a big threat from OpenCL as vendor lock-in, so we haven't put as much effort there. But porting itself shouldn't be that hard, considering the difference between OpenCL and SYCL: in your kernel code, all the notions of work-groups, subgroups and ND-ranges carry over, so you'll find everything again. I would say a manual process shouldn't really be awful here.
It should be quite natural, but yes, it could be handy to have a tool for that. There was a second question, on whether we have documentation to make it easy to migrate from OpenCL to SYCL. Maybe; Google it, I hope it exists. And if not, well, make a request, because yes, that would be nice, I agree. It's not the first time I hear about this need.

[Question from the audience] We're thinking of migrating host code in Blender that we'd like to be able to run on the GPU without requiring CUDA; we obviously want to be able to do that on all GPUs, and this is looking really interesting in that respect. Do you think that's a reasonable goal, something that SYCL and all this stuff will be able to support in the long term? Very good question, on whether SYCL could be used in other parts of Blender than Cycles to target all these GPUs, to migrate C and C++ code to GPUs. I would say yes. And it's easy now; not in 3.3, because there we had everything contained in a specific shared library, but in 3.4, in master, you can just include the SYCL header and it will work everywhere, which is kind of cool. So at least it gives you the option to try. Then there are always things like device enumeration, fallbacks and queuing kernels; all of this needs to be coded as well, but the infrastructure is there already. You would also need efficient CPU execution of the same code, and right now the OpenCL CPU runtime is not where it should be yet, so you'd need to wait. And I wouldn't recommend migrating all Blender code to this infrastructure; it's not meant for every kind of code, it's really for data-parallel code. Right now you can have a good CPU path that really targets the CPU with the usual tools, TBB and everything.

I think we're out of questions. Thanks a lot, and enjoy the rest of the conference. I'll be down at the demo if you want to talk further about specific projects and everything at Intel. Have a good day.