Good morning, ladies and gentlemen. Allow me to start this talk. I am Lubomír Říha and I'm going to be talking about our work on parallel rendering using Blender Cycles. The work has mostly been done by Milan Jaroš and Petr Strakoš, both of whom are sitting right there, so if you have a really technical question, they are here to answer it. And without further ado, I'm going to start.

Today I'm going to be talking about our rendering client for HPC clusters, about our new service called rendering as a service, and, probably the most interesting part for you, the rendering of large scenes on a multi-GPU system with shared memory, where the scene data are actually distributed among multiple GPUs. So that is the overview, and I'm going to start.

I'll start with our rendering client, which is called CyclesPhi. CyclesPhi is essentially the Cycles kernel that we took out of Blender and modified. It is designed to run on HPC clusters, or in general on any parallel machine that has some interconnect between the compute nodes or servers. Since it was designed with HPC in mind, it uses MPI for the distributed rendering, which is the standard for HPC, but it can run on any network, even over the internet. It uses OpenMP to parallelize over the multiple cores of your CPU (a minimal sketch of this MPI-plus-OpenMP structure follows after this paragraph). It also supports architectures that are quite exotic for a general audience, like the Intel Xeon Phi, both the first and the second generation. That is the accelerator we have at our center, and we used it a lot, for instance for rendering the Spring movie. We have also newly added support for multi-GPU systems, which I'm going to talk about in a few minutes, and we are porting the client to run on the ARM architecture as well.

This is an overview of what CyclesPhi does. As you can see in these pictures, it is essentially a rendering client for Blender: you run Blender on your workstation, you have a set of rendering clients executed on the rendering nodes, and they communicate with your workstation over sockets. That is the case if you have a plain CPU cluster; if you have an accelerated cluster, this figure shows a Xeon Phi accelerator, but it could be any server with any number of GPUs and CPUs, and if there is any new successful accelerator, I'm pretty sure we're going to support it. So this is the architecture: again, you have your client, which could be your Blender instance running on your workstation, and if you're interested in interactive rendering, you get the rendered data back in real time, directly into your client.

One of the good features of the rendering client is that, for just performing the ray tracing, it needs a significantly smaller amount of memory on the hardware that actually does the rendering. This example shows three scenes, where the largest one, the scene from Spring, requires up to 48 gigabytes of memory if you run full Blender, prepare the scene, and start the rendering. On the other hand, if you take only the data needed for the rendering itself and run the rendering using our client, you need just 12 gigabytes.
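To make that MPI-plus-OpenMP structure concrete, here is a minimal, hedged sketch of how such a distributed render client can be organized: MPI ranks take disjoint tiles of the image, OpenMP threads render the tiles of each rank in parallel, and rank 0 assembles the result. This is an illustrative skeleton only, not the actual CyclesPhi code; the tile layout and the `render_tile` stub are assumptions made for the example.

```cpp
// Hedged sketch: MPI ranks own disjoint tiles of the image, OpenMP threads render
// the tiles of each rank in parallel, and rank 0 gathers the final framebuffer.
// render_tile() is a trivial stand-in for the actual path-tracing kernel.
#include <mpi.h>
#include <omp.h>
#include <vector>

constexpr int WIDTH = 1920, HEIGHT = 1080, TILE = 120;   // 16 x 9 = 144 tiles
constexpr int TILES_X = WIDTH / TILE, TILES_Y = HEIGHT / TILE;

// Stand-in renderer: fills one tile of the RGBA framebuffer.
void render_tile(int t, std::vector<float>& fb) {
    int x0 = (t % TILES_X) * TILE, y0 = (t / TILES_X) * TILE;
    for (int y = y0; y < y0 + TILE; ++y)
        for (int x = x0; x < x0 + TILE; ++x)
            for (int c = 0; c < 4; ++c)
                fb[4 * (y * WIDTH + x) + c] = 1.0f;       // dummy shading
}

int main(int argc, char** argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<float> fb(WIDTH * HEIGHT * 4, 0.0f);

    // Round-robin tiles over MPI ranks; OpenMP parallelizes the tiles of one rank.
    #pragma omp parallel for schedule(dynamic)
    for (int t = rank; t < TILES_X * TILES_Y; t += size)
        render_tile(t, fb);

    // Tiles are disjoint, so summing the per-rank framebuffers assembles the image.
    std::vector<float> image(fb.size(), 0.0f);
    MPI_Reduce(fb.data(), image.data(), (int)fb.size(), MPI_FLOAT, MPI_SUM,
               0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

The same skeleton works over plain Ethernet as well, since MPI only assumes that there is some network between the processes.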
This is crucial, because the Xeon Phi accelerators that we have at our center have only 16 gigabytes of memory, so this reduced footprint is what allows us to use them for the rendering instead of using CPU-only nodes with 64 gigabytes of memory per node. This is an example of how we actually rendered the Spring movie on the Xeon Phi accelerators. They need to be treated slightly differently than a general render farm with CPUs only, so we had to implement a different kind of load balancing that fits this architecture better. The Xeon Phi is a 60-core CPU; all cores share the memory, and the cores themselves are not as powerful as regular CPU cores. The basic core is roughly a Pentium-class core, but it has vector extensions that deliver the real performance, so you need to treat it differently. For instance, for the load balancing among the cores, instead of using tiles we had to implement load balancing over pixels, and we successfully used up to 70 Xeon Phis to render a single scene. Essentially, each accelerator rendered the entire image, but only a subset of the samples for each pixel, and then we combined the partial renders, the partial samples, into the final render buffer, which gave us the final product (a minimal sketch of this sample-splitting idea follows after this paragraph). This figure shows the performance difference between the different kinds of load balancing, and you can see that we were able to get from, let's say, 1000 seconds per frame down to about 400 seconds per frame when using 50 of these accelerators, so quite a significant optimization.

So that is CyclesPhi, our rendering back end. This is the client that runs on the cluster and, as I said, it can potentially run on any farm with an Ethernet network; it doesn't have to be a special HPC network. It is the backbone of what we call rendering as a service. Rendering as a service allows users to use our cluster in a very user-friendly way. It is based on CyclesPhi, which I've already introduced, and on what we call HEAppE. HEAppE is a middleware developed by our colleagues at the center that allows people to use a supercomputer in a user-friendly way. If a customer comes, we prepare a website for them that they can use to submit their work to the cluster, and we take care of everything that happens in the background. So for the user it is a very easy-to-use HPC infrastructure, which by default is quite difficult to use. HEAppE is the middleware that translates the requests from the client to the cluster, but for Blender we developed BHP, an add-on that implements the interface to the cluster directly in the Blender GUI. This is an example of the BHP add-on. Since this is for our cluster, you can choose from two kinds of resources: MIC, if you want to use the Xeon Phi accelerators for the rendering, or MPP, if you want to run a massively parallel job on the CPUs. Then you just select how many accelerators, or how many compute nodes, you want to use.
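Going back to the per-pixel sample splitting used on the Xeon Phi nodes: the sketch below shows the idea in its simplest form. Every rank renders the whole image with its own disjoint subset of the samples, the partial buffers are summed with an MPI reduction, and the result is normalized by the total sample count. The `render_one_sample` call is a hypothetical placeholder for the path-tracing kernel, not CyclesPhi code.

```cpp
// Hedged sketch of sample splitting: every rank renders all pixels with a
// disjoint subset of the samples; the partial buffers are summed and then
// normalized by the total sample count to form the final render buffer.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int width = 1920, height = 1080, total_samples = 256;
    std::vector<float> partial(width * height * 4, 0.0f);

    // Interleave samples across ranks: rank r renders samples r, r+size, r+2*size, ...
    for (int s = rank; s < total_samples; s += size) {
        (void)s;  // render_one_sample(s, partial) would accumulate one sample here (hypothetical)
    }

    // Sum all partial accumulations into the final buffer on rank 0.
    std::vector<float> final_buffer(partial.size(), 0.0f);
    MPI_Reduce(partial.data(), final_buffer.data(), (int)partial.size(),
               MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (float& v : final_buffer) v /= total_samples;   // average over all samples

    MPI_Finalize();
    return 0;
}
```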
And then, just by clicking Submit job in the add-on, you send the data and the rendering request to the cluster. The job gets scheduled on the cluster, and then you have a list of the jobs you have submitted; you can click Refresh to check whether your jobs have finished. When a job is finished, you just click Download results and the results are downloaded back into Blender. So it's very straightforward and fully integrated into the Blender GUI. You can also set the typical parameters. This one is specific to HPC: you have to tell the cluster the longest time the rendering is allowed to run, then the resolution and the number of samples, and that's pretty much it. So it's quite straightforward, and it's designed to be as easy to use as possible. I've already shown this one.

Now I want to talk about our latest work, which deals with this machine. It's quite a beast: the NVIDIA DGX-2 system. It's a server, a quite expensive server, that has 16 GPUs in it, the high-end Volta GPUs. I guess for your community it will be slightly disappointing, because it's not Turing and it doesn't have the RT cores; it's the generation before that, developed specifically for machine learning and high-performance computing. The server has 16 Tesla V100 GPUs. Each GPU has 32 gigabytes of HBM memory, which is high-bandwidth memory, and the throughput to that memory is close to one terabyte per second for each GPU. What is special about this machine is how the GPUs are connected to each other: the GPUs are fully connected, so any GPU can reach any other GPU with just one hop through one switch. That gives you very fast data transfers between the GPU memories, way faster than over PCI Express; I have the numbers on the next slide. The other thing is that each GPU has six NVLink links, and if you want to transfer data from one GPU to another, you can simply use all six links, so you get the full bandwidth, the full performance, for data transfers between the GPUs.

One of the key advantages of such a system is that you can combine all 16 GPUs, each with 32 gigabytes of memory, into 512 gigabytes of shared memory. Each GPU then has access to the memory of any other GPU, and it's fully transparent: it's hidden from the user and done by the hardware. So it feels like programming a multi-processor system; where a dual-socket machine has two CPU sockets, this is a 16-socket GPU system. To the programmer it feels like programming a multi-core machine with 16 CPUs, where the hardware does the memory coherence for you. You don't have to worry about transferring the data between the GPUs; it's done by the hardware, and since the underlying hardware infrastructure is very strong and efficient, it works fairly well.

This is what we call the NUMA matrix. It tells you, if GPU zero accesses the memory of any other GPU, what the latency is. If I access my own GPU's memory, the latency is about 5 microseconds. If I go over PCI Express, the latency to get one byte from another GPU is about 25 microseconds. If I go over NVLink, going to my own memory is of course the same, but going to the memory of another GPU is just 10 microseconds (a sketch of how such transfer rates can be probed follows after this paragraph).
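Since concrete latency and bandwidth numbers for PCI Express versus NVLink are quoted here, below is a minimal, hedged sketch of how GPU-to-GPU transfer rates can be probed on such a machine with plain CUDA runtime calls. It is a generic micro-benchmark idea, not the tool that produced the numbers on the slides.

```cpp
// Hedged sketch: timing direct GPU-to-GPU copies to estimate the transfer rate
// between device pairs. Without peer access enabled, cudaMemcpyPeer may stage
// the copy through host memory; enabling peer access (see the placement sketch
// later in this talk) lets transfers take the direct NVLink/PCIe path.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    const size_t bytes = 256u << 20;            // 256 MiB test buffer
    int n = 0;
    cudaGetDeviceCount(&n);

    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            void *s = nullptr, *d = nullptr;
            cudaSetDevice(src); cudaMalloc(&s, bytes);
            cudaSetDevice(dst); cudaMalloc(&d, bytes);

            auto t0 = std::chrono::steady_clock::now();
            cudaMemcpyPeer(d, dst, s, src, bytes);   // device-to-device copy
            cudaDeviceSynchronize();
            auto t1 = std::chrono::steady_clock::now();

            double sec = std::chrono::duration<double>(t1 - t0).count();
            std::printf("GPU %d -> GPU %d: %6.1f GB/s\n", src, dst, bytes / sec / 1e9);

            cudaSetDevice(src); cudaFree(s);
            cudaSetDevice(dst); cudaFree(d);
        }
    }
    return 0;
}
```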
So over NVLink it's only about two times slower to get data from another GPU than from my own memory. In terms of bandwidth, I read from my own memory at about 850 gigabytes per second; over PCI Express I can transfer data from another GPU at about 12 gigabytes per second; over NVLink I have an order of magnitude better bandwidth and I get close to 130 gigabytes per second. That is, for instance, faster memory access than your CPU has to its own memory. So this is the platform that we used.

We ported CyclesPhi to this platform and tested it. As a baseline, we use the GPUs the way they are generally used: you duplicate the data in the memory of every GPU, and each GPU renders a part of the image. That is the baseline. This figure shows that we implemented dynamic load balancing for interactive rendering. It shows how many frames per second we get if we render one sample per frame; if we don't move the camera, we just keep accumulating samples. The bars show how the load balancing works, how it dynamically adjusts to keep all the GPUs equally busy (a small sketch of this rebalancing idea follows after this paragraph). You can see that some parts of the image are less complicated and some are more; in some scenes the workload is distributed almost uniformly. And this is the performance if you render one sample per frame while moving the camera. This is just a video that shows how the load balancing works, because it's going to be used on the next slides.

This is the scalability of the rendering, meaning that when I add an extra GPU, I expect half the time going from one GPU to two, and with 16 GPUs I want one sixteenth of the time. The dotted line is the ideal, the dashed line is with the load balancing, and the full line is without the load balancing. This slide shows two benchmark scenes, the Agent 327 and the Classroom; I guess everybody here knows what these scenes are, they're standard benchmarks. This is the time to render a single frame with one sample; if you add more samples, it simply takes proportionally longer, depending on how many samples you use. These are two different scenes, the BMW and the Pabellon benchmark. For the Pabellon, you can see that our load balancing improves the scalability quite a bit; it helps a lot. The BMW scene is quite small, so here you are getting down to around 16 milliseconds per frame, and the data transfers and the synchronization overhead kick in and kind of kill the scalability at that point. You would simply solve that by rendering multiple samples per frame: you only need to get to 25 frames per second, so instead of rendering one sample at a time you would render several.

So then, what if the scene doesn't fit into a single GPU, so I cannot duplicate the data? The case where every GPU has all the data in its own memory and has the fastest possible access to it would be the ideal, and it serves as the baseline for the next experiment that I'm going to show. So this is case one: I have all the data duplicated. Let's suppose this is the data, the internal data structures that Cycles needs to render the scene; this is the input data for Cycles.
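Coming back to the dynamic load balancing shown a moment ago, here is a minimal, hedged sketch of one way to implement it: the image is split into one horizontal band per GPU, and after every frame the band heights are resized in proportion to each GPU's measured rendering speed, so that all GPUs stay roughly equally busy. The proportional heuristic and the band layout are assumptions for illustration, not the actual CyclesPhi scheduler.

```cpp
// Hedged sketch: each GPU renders one horizontal band of the image; after every
// frame the band heights are resized in proportion to the measured per-GPU speed,
// so faster (or less loaded) GPUs receive more scanlines for the next frame.
#include <algorithm>
#include <numeric>
#include <vector>

void rebalance(std::vector<int>& rows, const std::vector<double>& times, int height) {
    std::vector<double> speed(rows.size());
    for (size_t i = 0; i < rows.size(); ++i)
        speed[i] = rows[i] / std::max(times[i], 1e-6);          // scanlines per second

    const double total = std::accumulate(speed.begin(), speed.end(), 0.0);

    int assigned = 0;
    for (size_t i = 0; i + 1 < rows.size(); ++i) {
        rows[i] = std::max(1, static_cast<int>(height * speed[i] / total));
        assigned += rows[i];
    }
    rows.back() = std::max(1, height - assigned);               // remainder to last GPU
}

int main() {
    const int height = 1080, gpus = 16;
    std::vector<int> rows(gpus, height / gpus);                 // start with an even split
    // In a real renderer, times[i] would be measured around each GPU's render call;
    // here they are dummy values just to make the sketch self-contained.
    std::vector<double> times(gpus, 0.01);
    rebalance(rows, times, height);
    return 0;
}
```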
So, for the data itself: I can duplicate the data, or, on the DGX-2, the GPU shared-memory system, I can simply distribute the data. I can fully distribute every data structure that Cycles has, and I can distribute it evenly: GPU zero gets the first chunk of each array, the number of GPUs defines how big the chunks are, and I do that for every array, so I have fully distributed data. I can also use slightly different distributions, for instance the first array starts on GPU zero, the second array starts on GPU one, and so on; so I simply have different kinds of data distributions. And the last case is this: some data are very heavily used, with an enormous number of accesses, and if I can afford to keep them in every GPU's memory, I just duplicate them to get the best performance; the large data that are accessed sparsely, or where reading from another GPU's memory doesn't have such an impact, I simply distribute (a sketch of this placement scheme follows after this paragraph).

Then we ran a benchmark where we moved the Cycles data structures one after another from local to distributed memory, and we identified that these three arrays have the highest impact. This is the baseline with duplicated data, and as I start moving these three arrays I get a 50 percent, then another 33 percent, then another 20 percent performance penalty, so my rendering time goes from 100 percent to roughly 200 percent. So we identified that, because we have enough memory, we can keep these data duplicated and eliminate the most significant bottlenecks. We also looked at the size of each of the arrays. This is for the Pabellon benchmark: you can see that the amount of memory used by these three arrays is quite small, so we can easily afford to duplicate them and have the best possible access to them. The textures, on the other hand, take about 70 percent of the memory but get only one percent of the accesses, so they have a very small impact and we can easily keep them distributed and save memory on each GPU.

So the last case is that we keep these three data structures duplicated on each GPU, and all the other data structures, which take most of the memory, are distributed. You can then keep adding data to the duplicated group depending on how large your scene is; you always want to have your memory full, because the more data you have locally, the better performance you get. But for this case we used just the three arrays with the highest performance impact. So here, this is the baseline, again the time to render one frame, this is for two GPUs: this is the case where you duplicate all the data, and this is the case from the picture, where these three arrays are duplicated and all the others are distributed.
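A minimal, hedged sketch of the placement scheme just described is below: peer access is enabled between all GPU pairs (which is what the NVLink/NVSwitch fabric makes cheap on the DGX-2), small heavily used arrays are duplicated on every GPU, and large, sparsely accessed arrays are split into per-GPU chunks that any GPU can still read through peer access. The helper names and the even chunking are illustrative assumptions, not the actual CyclesPhi data layout.

```cpp
// Hedged sketch: enable peer access between all GPUs, then either duplicate an
// array on every GPU (hot, heavily accessed data) or split it into per-GPU
// chunks (cold, sparsely accessed data). With peer access enabled, a kernel on
// one GPU can dereference a pointer into another GPU's memory; the fabric
// handles the transfer.
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

void enable_all_peer_access(int num_gpus) {
    for (int i = 0; i < num_gpus; ++i) {
        cudaSetDevice(i);
        for (int j = 0; j < num_gpus; ++j) {
            int can = 0;
            if (i != j && cudaDeviceCanAccessPeer(&can, i, j) == cudaSuccess && can)
                cudaDeviceEnablePeerAccess(j, 0);   // device i may now read device j
        }
    }
}

// Hot data: one full copy per GPU (fastest reads, costs memory on every GPU).
std::vector<float*> duplicate_array(const float* host, size_t n, int num_gpus) {
    std::vector<float*> copies(num_gpus, nullptr);
    for (int g = 0; g < num_gpus; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&copies[g], n * sizeof(float));
        cudaMemcpy(copies[g], host, n * sizeof(float), cudaMemcpyHostToDevice);
    }
    return copies;
}

// Cold data: even chunks, one per GPU; remote chunks are read over NVLink
// through peer access instead of being replicated.
std::vector<float*> distribute_array(const float* host, size_t n, int num_gpus) {
    std::vector<float*> chunks(num_gpus, nullptr);
    const size_t chunk = (n + num_gpus - 1) / num_gpus;
    for (int g = 0; g < num_gpus; ++g) {
        const size_t begin = g * chunk;
        const size_t count = begin < n ? std::min(chunk, n - begin) : 0;
        if (count == 0) continue;
        cudaSetDevice(g);
        cudaMalloc(&chunks[g], count * sizeof(float));
        cudaMemcpy(chunks[g], host + begin, count * sizeof(float), cudaMemcpyHostToDevice);
    }
    return chunks;
}
```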
Back to the results: you can see that the impact is quite small. For almost all the scenes it is a little bit slower, but it still performs almost as well as when all the data are in local memory. That remains true going to four GPUs, eight GPUs, and up to 16 GPUs; you can see that the effect gets slightly stronger with more GPUs, but we still have, let's say, a 15 to 20 percent performance penalty, while we are able to render scenes several times larger on such a system. This just shows the total memory footprint of these benchmarks and how large the three key arrays that have been duplicated are; they are still not the major part of the data.

The final test was on the Moana Island scene. We started importing that scene because we really wanted to have a huge scene in Blender. The video shows the DGX-2, which really has 16 GPUs, and this shows how much memory is utilized on every GPU. This is the last case, where three of the arrays are duplicated and the rest is distributed, and this is the interactive rendering of the scene. The video shows it running at 11 frames per second, but I'm going to show you another set of results in a moment. To be fair, we still don't have the full scene; you don't see the textures, this is still work in progress, so this is just working with the geometry of the scene, but it's still large enough and it actually gives us the scene that we need for this kind of experiment and development.

These are just the statistics, the facts about the Moana scene. Per GPU, as I said, we have 32 gigabytes of memory; the runtime and the libraries that we use take up to 3 gigabytes of that, so we have about 29 gigabytes available for our user data. The size of the entire scene was 31 gigabytes. Out of that, 18 gigabytes have been distributed, in the sense that each GPU only holds about 1.15 gigabytes of it, and the arrays that we duplicate have a total size of 12 gigabytes. So in total we use roughly 12 plus 1.15, a bit over 13 gigabytes of data per GPU, which means we still have some memory free, so we can duplicate more data, and that is an experiment that is going to happen.

Then the rendering times: for a single frame with one sample at HD resolution, with fully distributed data, meaning every array has been distributed, which is kind of the worst case, it was 151 milliseconds. If we apply the load balancing and duplicate the data for the three key arrays, we get to 27 milliseconds, which translates to approximately 37 frames per second at one sample, or about four frames per second if you render 10 samples at a time and then send the data. If anyone is interested, we are publishing the add-on that is designed to read the Moana scene into Blender; you can find it on this website. It's still work in progress, but Milan is constantly working on it, and as he updates it and is able to read more and more data from the scene, he keeps pushing new
versions to the Git repository. If you are interested in more information about CyclesPhi and whatever else we do with Blender at IT4Innovations, blender.it4i.cz is the website where you can find it. And with that I would like to thank you, and if you have a question, please ask.

Yes, please. CyclesPhi? Yes, we just took Cycles out of Blender; it's a separate small client, it lives outside of Blender, and again, it can be downloaded from the Git repository here.

Well, that really depends on the gentlemen who run the show here, right? They have to agree. Yes, we would like it to become part of Blender, but it still needs to be somehow aligned with the general approach that the Blender community has. The HPC environment is natural for us, but for most Blender users it's still kind of exotic.