Hello. OK, let's start. My name is Milan Jaroš, and I work at IT4Innovations, the Czech national supercomputing center at VSB – Technical University of Ostrava. Today I would like to talk about our extension for rendering massive scenes, in Blender of course. First I will describe what a massive scene means for us. Then I will show very briefly how our rendering service works, describe two state-of-the-art multi-GPU systems, and introduce CyclesPhi, which is our extension of Cycles for rendering on HPC systems. Then I will describe two methods that support rendering massive scenes, show how they work on multi-GPU nodes, and at the end show a use case, how we use all of this in our visualization lab. Let's start.

What is a massive scene for us? Here are two examples. The first is the Moana Island Scene, which is a production scene; several years ago Walt Disney Animation Studios released the Moana dataset for free, for researchers and developers to test with, and we developed an add-on for this scene. The second case is loading and rendering scientific data. Here you can see a rendering in Cycles where the data comes from OpenFOAM, a water simulation. For OpenFOAM we developed two add-ons, one for COVISE and one for Vistle, which were developed by our colleagues from the HLRS computing center.

Now I would like to show very briefly how importing the Moana Island Scene into Blender works. We developed a very small, very simple add-on that supports loading the Moana Island Scene. Disney released several scene descriptions, for example USD, PBRT and JSON; we use the JSON scene description with OBJ files, which contain the geometry, and Ptex textures. Unfortunately, Blender doesn't support the Ptex format, so we had to solve that: we developed an algorithm that converts Ptex to OpenEXR textures and UV maps. In this example you can see how we import a palm tree. On the right you can see what an exported texture looks like, and here you can see the UV map exported from the Ptex data.

Now I would like to show how we allocate resources for the rendering. Because I work at a supercomputing center, we developed an HPC add-on, similar in spirit to, for example, Flamenco. You can use one button to submit the job and send the blend file to the cluster, and after it finishes, a second button to download the result. Using these buttons you can utilize the whole cluster very easily. The add-on now supports two modes: HEAppE mode, which uses the HEAppE Middleware developed by our colleagues at IT4Innovations, and direct-access mode, which is a wrapper around remote SSH commands.

Now I would like to show our state-of-the-art multi-GPU systems. We have 70 HGX A100 nodes; each node contains 8 GPUs. The uniqueness of this system lies in the interconnect, in this case NVSwitch and NVLink from NVIDIA. This is a very fast interconnect with very high bandwidth and very low latency. From a rendering perspective, you can use all the GPU memories in one address space using unified memory; that means you can use over 300 GB for rendering a scene. The second cluster we have is a DGX-2. It has the same kind of architecture, but it contains 16 GPU cards, and you can use over 500 GB of GPU memory together.
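To make the single-address-space idea concrete, here is a minimal CUDA sketch of one managed allocation spanning the GPUs of such a node; the allocation size and the commented-out kernel launch are illustrative assumptions, not CyclesPhi code.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);

    // A single managed allocation lives in one virtual address space that
    // every GPU in the node can dereference; it may be larger than the
    // memory of any single GPU (the size here is illustrative).
    size_t bytes = 100ull << 30;  // 100 GiB, more than one 40 GB A100
    float *scene = nullptr;
    if (cudaMallocManaged(&scene, bytes) != cudaSuccess) {
        std::fprintf(stderr, "managed allocation failed\n");
        return 1;
    }

    // Each GPU can launch kernels that read the same `scene` pointer;
    // pages migrate, or are accessed remotely over NVLink/NVSwitch.
    for (int d = 0; d < ngpu; ++d) {
        cudaSetDevice(d);
        // renderKernel<<<grid, block>>>(scene, ...);  // per-GPU launch
    }

    cudaFree(scene);
    return 0;
}
```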
Now I would like to show the two modes of our extension for rendering on HPC. On the left you can see how interactive remote rendering works. Blender runs on the user's computer and CyclesPhi runs on the cluster; Blender communicates with CyclesPhi over TCP, and CyclesPhi can run on several nodes, communicating between processes using MPI or NCCL. On the right you can see the client-only mode. In this mode we export a binary file from Blender and save it to disk; CyclesPhi then reads this data and continues with the rendering. The advantage of this mode is that you can run the preprocessing on a different machine, for example one with large memory.

Now I would like to show two methods for rendering massive scenes. There are two main approaches to solving low memory on the GPU: out-of-core methods, which usually use CPU memory, and distributed (parallel) rendering. Within distributed rendering there is data-parallel rendering and image-parallel rendering. Our approach is in the second category, image-parallel rendering, which means the rays remain fixed on their GPUs.

As I said, we use unified memory. Here you can see how CUDA unified memory behaves. With fully replicated data, from the unified-memory perspective a single array occupies four times more memory, one copy per GPU. On the right you can see an example on our Barbora GPU node, which contains four GPUs; each GPU has its own color in this picture. If we distribute the data instead, for example with a contiguous distribution, we can divide an array (the scene consists of several such data structures, for example the BVH nodes) into large chunks, in this example four, and place each part on a different GPU; that way we save four times the memory. Or we can divide the array into smaller chunks and distribute them, for example, in round-robin fashion, so that each small part is spread over all GPUs. Of course, we can combine both methods, giving a partially replicated and partially distributed array. In CUDA we can use the cudaMemAdvise function for this, with specific flags: cudaMemAdviseSetReadMostly for replicated arrays, and cudaMemAdviseSetPreferredLocation with a specific GPU for a distributed chunk (see the sketch below).
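Here is a minimal CUDA sketch of the two placement hints just described, replication via cudaMemAdviseSetReadMostly and round-robin chunk distribution via cudaMemAdviseSetPreferredLocation; the helper names and the chunk-size parameter are assumptions for illustration, not CyclesPhi's actual code.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Replicate a read-only array: each GPU keeps its own local copy of the
// pages it touches (the device argument is ignored for SetReadMostly).
void replicate(const void *arr, size_t bytes) {
    cudaMemAdvise(arr, bytes, cudaMemAdviseSetReadMostly, 0);
}

// Distribute an array round-robin in fixed-size chunks: chunk i gets a
// preferred location on GPU i mod N, so the array is spread over all GPUs.
void distributeRoundRobin(const char *arr, size_t bytes,
                          size_t chunk, int ngpu) {
    size_t i = 0;
    for (size_t off = 0; off < bytes; off += chunk, ++i) {
        size_t len = (off + chunk <= bytes) ? chunk : bytes - off;
        int dev = static_cast<int>(i % ngpu);
        cudaMemAdvise(arr + off, len,
                      cudaMemAdviseSetPreferredLocation, dev);
        // Optionally migrate the pages now instead of on first touch:
        cudaMemPrefetchAsync(arr + off, len, dev);
    }
}
```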
Here are the four scenes we use for development and testing. The first is the Moana Island Scene; after importing we get over 160 GB, and this scene has specific properties: large geometry and a large number of textures. At the top right you can see the museum scene, which has large geometry and large textures. At the bottom you can see familiar pictures from the Blender open movies, Agent 327 on the left and Spring on the right; we increased their size a bit, from several GB to 160 GB, so we could test with them. Both scenes have complex shaders. Blender Cycles contains over 40 data structures, arrays describing the scene, plus further arrays for the texture images. In this table you can see the four key data structures we identified.

In the case of Moana, you can see that the BVH nodes array, which contains the BVH tree, receives a little over 80% of all accesses across all arrays, yet it is only 8% of the whole scene's size. In the case of Agent and Spring, the SVM array, which holds the shaders, receives almost 20% of the accesses. We used this behavior in our next research. We developed two methods: basic distribution of entire data structures, and advanced distribution based on memory access patterns and statistics.

First, how the basic distribution of entire data structures works. Here you can see the rendering time for Moana; in this case Moana is only 27 GB, so that the fully replicated version, the green line, fits in memory. On 8 GPUs, the timings on the right, the fully replicated rendering takes just over 100 milliseconds, but the scene is copied to every GPU, so it occupies eight times more memory in unified memory. If we use the round-robin fashion instead, as in the example at the top, dividing each array into small parts and distributing them over all GPUs, the rendering takes over 200 milliseconds, about two times slower, but we save eight times the memory, which I think is very nice. And if we fully replicate just the four key structures, in this case the BVH, triangles, vertices and shaders, we get a much better time; on 8 GPUs we are very close to the fully replicated scene. This is the advantage of the very fast interconnect.

How does the second method work? First, we distribute all data structures across all GPUs using the round-robin fashion. Then we render only one sample per pixel to measure the access statistics. Then we download the statistics to the CPU, and our algorithm decides where each chunk will be placed. Then we call the cudaMemAdvise function to migrate the data, and run the original, unmodified path-tracing kernel. Very briefly, this is the algorithm that decides where a chunk will be placed. The important thing here is that we don't care what type of data structure a chunk belongs to: we put all the measurements, all the statistics, into one array, sort them by the number of accesses from the highest to the lowest, and then decide where each chunk will be placed.
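A minimal sketch of that placement decision, assuming the per-chunk access counters from the one-sample pass are already on the CPU; the Chunk struct, the budget parameter, and the function name are hypothetical, showing the sort-and-replicate idea rather than the exact CyclesPhi implementation.

```cpp
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdint>
#include <vector>

// One entry per chunk, regardless of which scene array the chunk belongs
// to; `accesses` is the counter gathered during the one-sample-per-pixel
// measurement pass.
struct Chunk {
    const void *ptr;
    size_t      bytes;
    uint64_t    accesses;
};

// Decide placement: replicate the most-accessed chunks until the
// replication budget (e.g. 5% of the scene) is spent, distribute the rest.
void place(std::vector<Chunk> &chunks, size_t replicationBudget, int ngpu) {
    std::sort(chunks.begin(), chunks.end(),
              [](const Chunk &a, const Chunk &b) {
                  return a.accesses > b.accesses;  // hottest first
              });

    size_t replicated = 0, i = 0;
    for (const Chunk &c : chunks) {
        if (replicated + c.bytes <= replicationBudget) {
            // Hot chunk: every GPU keeps a read-only copy.
            cudaMemAdvise(c.ptr, c.bytes, cudaMemAdviseSetReadMostly, 0);
            replicated += c.bytes;
        } else {
            // Cold chunk: pin to a single GPU, round-robin.
            cudaMemAdvise(c.ptr, c.bytes,
                          cudaMemAdviseSetPreferredLocation,
                          static_cast<int>(i++ % ngpu));
        }
    }
}
```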
What is the difference between the advanced and the basic method? For the Moana scene, with a 5% replication ratio, the basic method covers only 20% of all accesses; the advanced method, using the statistics, covers over 94% of all accesses to the arrays with the same 5% replication ratio, which in the end gives a very nice rendering time.

Here is a visualization example of how a distributed array looks, in this case the vertices, and where they are placed. On the left you can see the fully distributed case, where each color corresponds to a GPU, so you can see where the data ends up. On the right you can see the rendered image; in this image the marked parts are replicated, and the others, with their specific colors, are distributed.

And here are the rendering times on the HGX multi-GPU system and the DGX-2 with 16 cards. The orange color visualizes the rendering time for each replication ratio. At 0%, which means fully distributed, we get over 500 milliseconds on the DGX-2. If we use the advanced method and replicate 7%, we get a very nice time of around 200 milliseconds. That means the advanced method works very well on a system with 16 GPUs.

Now, very briefly, how the scalability over multiple nodes looks. This test ran on over 200 NVIDIA GPUs, and across nodes we split the samples; that means each node rendered only part of the samples (a minimal sketch of this appears at the end of this transcript). You can see very nice scalability.

At the end, I would like to show how it works in our visualization lab. This is how the visualization lab looks; it's not a theater, it's my second office. You can see interactive remote volume rendering in 4K and 3D in our visualization lab, in Cycles of course. Blender has supported stereoscopy for a long time, basically in Workbench and EEVEE, but Cycles supported it only for offline rendering. We added rendering of 3D images to CyclesPhi; it runs side by side, so that, for example, 4 GPUs of one node render the left image and the next 4 GPUs render the right image. This is how the rendering looks in our visualization lab. For this we created a VR client, which uses quad-buffered OpenGL; it receives the images and shows them in the cinema. We can also use GPUJPEG compression, which is very fast on NVIDIA GPUs.

The last video is an example of interactive rendering in our visualization lab; you can see the samples as it renders. This is the example in 4K and 3D. And finally, we are now developing a method for interactive visualization of scientific data. This is an example of what we are able to render, in this case over 40 GB of VDBs, in almost real time.

OK, thank you for your attention. We have two minutes for questions, if somebody has one.

Question: Can you use the same technology to render big scenes on small hardware, for example a single GPU? Answer: We are working on a method that can run, for example, on one or two GPUs, with a similar algorithm that saves the rest to CPU memory, which in that case is an out-of-core method.

Question: The add-on you developed to open the Moana scene, is it in the public domain? Answer: Yes, it's public; I can put it on GitHub if someone would like that. In that example I used Blender 2.83; I also have it for the latest Blender 3.3, and now I'm working on using geometry nodes for it. I used particle hair, where I had to use my own patch for instancing each object, because the scene has over 10 million instances; I used hair particles and baked them to the right places. It's a very small patch, but with geometry nodes we won't need any patch for that.
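As referenced above, here is a minimal MPI sketch of the sample splitting used for the multi-node scaling test; renderSamples() is a hypothetical stand-in for the real path-tracing entry point, and the resolution and sample counts are illustrative assumptions.

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical stand-in for the real path tracer: accumulate `count`
// samples starting at sample index `first` into `accum`.
static void renderSamples(int first, int count, std::vector<float> &accum) {
    (void)first; (void)count; (void)accum;  // the real renderer goes here
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nodes = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);

    const int width = 3840, height = 2160;     // 4K, as in the demo
    const int totalSamples = 1024;             // illustrative
    const int perNode = totalSamples / nodes;  // each node renders a share

    // Each rank renders a disjoint range of samples for the whole image.
    std::vector<float> local(size_t(width) * height * 4, 0.0f);
    renderSamples(rank * perNode, perNode, local);

    // Sum the partial images on rank 0; dividing by totalSamples there
    // would give the final averaged image.
    std::vector<float> image(rank == 0 ? local.size() : 0);
    MPI_Reduce(local.data(), rank == 0 ? image.data() : nullptr,
               static_cast<int>(local.size()), MPI_FLOAT, MPI_SUM,
               0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```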