Hello. Thank you, everyone, for coming along. I understand this is getting on towards the end of the summit, so you're probably all a little tired, but I appreciate you coming along. My name is Jeremy Phillips, and my better-dressed colleague here is Luca Cervini. We're from Pawsey Supercomputing, here to talk to you about GPUs on OpenStack for world science. Before we start, please be aware that although our presentation brief mentioned we would be talking about our results using RDMA, those results are unfortunately not available at this point, so as long as that's not a deal-breaker, hopefully you'll still be able to get something out of this.

First off, a little bit about ourselves. Pawsey Supercomputing is based in Perth, Western Australia. We are an unincorporated joint venture between CSIRO, the federal government research organization, and four of the Western Australian universities: Curtin, Edith Cowan, Murdoch, and UWA. We are also funded partly federally through NCRIS, the National Collaborative Research Infrastructure Strategy, as well as through the state government of Western Australia. Originally established in June 2000, we started primarily with HPC and data storage, and only in recent years have we moved into cloud computing. We provide free computing resources and training to students, industry personnel, researchers, academics, and scientists, primarily based within Australia, although we do have some users from outside Australia as well.

On the HPC side, we have two Cray supercomputers, an XC30 and an XC40. We also have about 10 petabytes of live storage and 40 petabytes of tape. Quite a bit of that is being used by ASKAP, the Australian Square Kilometre Array Pathfinder, and on top of that, of course, we have our cloud computing cluster.

So, hello, everybody. Our cloud facility is called Nimbus. It's not mentioned here, but it's an OpenStack installation based on OpenStack Pike, with some services already on Queens; I think we have Keystone and Cinder on Queens already. We have 46 compute nodes, 39 storage nodes, and 12 service nodes, plus the new entries, these six GPU servers with dual NVIDIA V100 GPU cards. In total, we have around 3,000 cores and one petabyte of raw storage with Ceph. The cluster runs Ubuntu 16.04 for the moment, and we use Puppet and MAAS for bare-metal provisioning. We are a small team of three people: me, Jeremy, and Gregory.

So, just to take you through, briefly, a couple of the projects that are currently using our GPU nodes, to give you an idea of the kinds of fields being covered. In agriculture, there's processing of multi-spectral imagery from remote sensing; that project is investigating land systems for water repellence in cereal production and revegetation in low-rainfall farming systems. In psychology, a project is using TensorFlow to speed up sampling of large and complex Bayesian models; they're testing cognitive psychological models using statistical software to extend mathematical models of self-regulation to complex tasks involving rapid decision-making, for example air defense. In biology, molecular dynamics simulations are being used to assess the interaction of glycans with their receptor proteins.
There, NMR spectroscopy and molecular modeling techniques are used to find out more about the structure, function, and dynamics of carbohydrates and glycans, which play a pivotal role in cell-to-cell communication, cancer, and human-pathogen interactions. We also have another interesting use case, which recently gave its own presentation, on classification of shallow-water fish, which is this one just here. Basically, it's a deep learning model for real-time object detection and classification from underwater images. They use baited underwater stations to attract the fish in, and normally trained oceanographers do the manual classification, so instead they've started training this system to do it. At the moment they're getting over 90% accuracy on the validation data set, and that was run off the GPUs. That's just to give you a bit of an idea of that work, which is cool. Well, we think it's cool anyway.

Anyway, you probably want to hear a bit more about the actual GPU nodes themselves. Yeah, we can. The GPU nodes are HPE servers, specifically DL380s, each with dual Intel Xeon Gold 6132 CPUs with 14 cores each, 384 GB of RAM, the two Tesla V100 cards, and dual 100-gigabit Ethernet links.

Just to mention, these servers have a different affinity between each CPU and GPU: the first GPU has affinity with the first CPU and the second GPU with the second CPU. So, in theory, we run two GPU VMs on each of these servers, plus other VMs that don't have a GPU. One of the GPU VMs runs off the first CPU and GPU, and the other off the second CPU and GPU. To achieve that, following the OpenStack documentation, we use CPU isolation for the CPUs that we want Nova to run with the GPUs. Essentially, we enable IOMMU in GRUB, and we use the isolcpus kernel parameter to isolate the CPUs we want to reserve for Nova. Then, in nova.conf, we set the vcpu_pin_set parameter to the CPUs we just mentioned, and we enable the NUMATopologyFilter so the scheduler places instances of the flavor on the correct NUMA node. We also have hyper-threading disabled, because, following the advice of colleagues from other centers, for our workloads it is better to keep hyper-threading disabled to maximize performance on certain kinds of workloads.

So, Jeremy. Yeah, the other part of that, obviously, is the PCI passthrough for the GPUs themselves. The documentation that OpenStack provides is actually pretty good. Very basically, we run lspci so that we can identify the vendor and product IDs for the two GPU cards, and then in nova.conf on the controller side we create an alias, in this case we've called it V100, that identifies those particular cards. Of course, we have to enable the PciPassthroughFilter for the scheduler as well. Then, on the compute side, we set up a whitelist for that vendor and product ID too, so that the physical compute node itself knows to pass those devices through.
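Just to make that concrete, the host-side configuration looks roughly like this. The core ranges, the PCI product ID, and the filter list below are illustrative placeholders rather than our exact production values, so take the IDs from your own lspci -nn output and adjust the core lists to your topology:

  # /etc/default/grub on the GPU compute node, then run update-grub and reboot
  GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt isolcpus=1-7,15-21"

  # /etc/nova/nova.conf on the compute node
  [DEFAULT]
  vcpu_pin_set = 1-7,15-21

  [pci]
  # vendor_id 10de is NVIDIA; the product_id here is a placeholder, check lspci -nn
  passthrough_whitelist = { "vendor_id": "10de", "product_id": "1db4" }

  # /etc/nova/nova.conf on the controller side
  [pci]
  alias = { "vendor_id": "10de", "product_id": "1db4", "device_type": "type-PCI", "name": "V100" }

  [filter_scheduler]
  # the Pike defaults plus the three filters mentioned above
  enabled_filters = RetryFilter,AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,NUMATopologyFilter,PciPassthroughFilter,AggregateInstanceExtraSpecsFilter

The V100 alias name defined here is what the flavors refer to later on.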
Then we create the flavor itself. We keep it to seven cores; as Luca mentioned, on these GPU nodes we still use some of the resources for running regular instances as well, so for the GPU-specific flavors that's usually sufficient. We give it 90 GB of RAM; again, the RAM is basically split between the two CPUs, so we have to use the NUMA-node-adjacent RAM for each of those flavors, plus 40 GB of disk for the root file system on Ceph. In the flavor configuration we set three particular properties, as you can see there. The first is the aggregate instance extra specs, which relies on another scheduler filter, the AggregateInstanceExtraSpecsFilter, that I forgot to mention and should be on the earlier slide; that sets pinned equals true. The CPU policy is set to dedicated, and for the PCI passthrough we refer to that V100 alias we defined earlier in nova.conf, with just the one device allocated to the passthrough. When we create the host aggregate, we just make sure pinned equals true is set as a property on that host aggregate, so the scheduler knows the flavor can run within it.

As a slight aside, if you wanted to fire up an instance that used both GPUs at once, there's some additional OpenStack documentation on CPU topologies that sheds a little light on that. The main difference, as you can see in that openstack flavor create statement, is that the PCI passthrough alias still uses the V100 alias but now asks for two devices instead of one, and the other line, highlighted in red, is hw:numa_nodes=2. Note, of course, that because we've given it double the cores and double the RAM, that is so it can pull the same amount of cores and RAM that a single-GPU instance would have from each side of the compute node itself.
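To put that flavor and host-aggregate configuration into concrete commands, it would look something like the following with the openstack CLI. The flavor and aggregate names are just placeholders for illustration; the vCPU, RAM, and disk figures are the ones we just described:

  # Single-GPU flavor: 7 pinned cores, ~90 GB RAM, 40 GB root disk
  openstack flavor create --vcpus 7 --ram 92160 --disk 40 g1.v100x1
  openstack flavor set g1.v100x1 \
      --property hw:cpu_policy=dedicated \
      --property pci_passthrough:alias=V100:1 \
      --property aggregate_instance_extra_specs:pinned=true

  # Dual-GPU variant: double the cores and RAM, one NUMA node per GPU
  openstack flavor create --vcpus 14 --ram 184320 --disk 40 g1.v100x2
  openstack flavor set g1.v100x2 \
      --property hw:cpu_policy=dedicated \
      --property pci_passthrough:alias=V100:2 \
      --property hw:numa_nodes=2 \
      --property aggregate_instance_extra_specs:pinned=true

  # Host aggregate that the GPU flavors are tied to
  openstack aggregate create --property pinned=true gpu-nodes
  openstack aggregate add host gpu-nodes <gpu-compute-hostname>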
I don't know if you mentioned it, but we would like this to be an open discussion afterwards, so you can comment on our results with us. Going into a bit more detail, these are the NUMA nodes of our compute node. We allocated the first instance to NUMA node zero and the second instance to NUMA node three, and we used as much memory as we could within the same NUMA node, to decrease latency and have the VM perform as well as possible. Down there you can see the GPU affinity, which you can retrieve with nvidia-smi topo: GPU zero has affinity with the first NUMA node only, and the second GPU with the third NUMA node. Here you can see an instance running with vCPU pinning, and you can see that its CPU affinity is only on the cores from zero to six. If you had another instance running without vCPU pinning, you would see the little 'y' on all the cores, meaning the instance's virtual CPUs can use any of the cores available on the hypervisor, but not in this case.

So, in our tests we tried to compare our virtual machine with GPU passthrough to a bare-metal node. Obviously the bare-metal node has multiple CPUs and GPUs available, so its performance had to be tuned down to be comparable to our GPU instance flavor. As already mentioned, the VM is configured to use a single NUMA node, and we tried to reproduce that on the bare-metal node, which was exactly the same machine we run the VMs from as a hypervisor, but provisioned directly as bare metal. To do so, we removed the GPUs from the PCI bus using the command shown there, and we also took offline all the cores except the first six on the first NUMA node. The only difference we couldn't remove was that the local storage on the bare-metal machine was an SSD, while the virtual machine still had a 40 GB volume on Ceph; we have quite an old cluster, so the 40 GB Ceph volume certainly doesn't have the same IOPS as local SSD storage. It's an old SSD anyway.

The first benchmark we ran is High-Performance Linpack. LINPACK is a well-known software library used for floating-point benchmarking, and as you can see, bare metal is on average about 4% faster than the virtual machine, with around 5,500 gigaflops against 5,200 gigaflops. LINPACK is very GPU intensive but not ultra CPU intensive; I'll show you later why I'm mentioning this. The result on this slide looks good, but the other slides show more mixed results.

This is a TensorFlow benchmark. I think everyone knows about TensorFlow, but it's an open-source software library for high-performance numerical computation, and this particular benchmark uses the ResNet-50 image classification model. We see that the bare-metal machine is only 0.6% faster than the VM, which seems quite a good result; you can see at the bottom the number of runs, and the performance is very similar to bare metal.

The last benchmark we ran is NAMD. We ran this one only six times, because we actually finished these results, I think, the day before we arrived in Berlin. Essentially, the bare-metal node in this case is way faster than the VM, and the difference in this test is that the CPU is much more heavily utilized than the GPU. NAMD is a parallel molecular dynamics code for high-performance simulation of large molecular systems, and we've seen over many runs that the VM is consistently slower than the bare-metal node. We still don't understand why, and that's actually why we want to open the discussion. We suspect there's some issue with our Nova configuration or the CPU pinning, such that the CPU isn't being used as well as we think, because as soon as the CPU is heavily loaded, performance decreases compared to bare metal.

So, for the people who want to stay, we would like to have a small open discussion, for people who already have a GPU deployment or even for people who don't have one yet and would like to. These results are fine for us at the moment, but only because we have a small cluster; we recently received a new capital refresh, so we have to expand our cluster, and an 11% difference in performance is not acceptable, so we would like to tune our configuration to get better results. If you have some comments already, I'm happy to take some questions or discussion.

These are the known issues so far. Each of the NICs, because of how the PCI risers inside the machines are configured, has affinity with a different CPU, and the two NICs are bonded with LACP. That configuration creates issues with the vCPU pinning we did before: the instances cannot access both NICs directly, they have to go through the other NUMA node and the other CPU, and this increases latency and slows down performance, though we don't think this is the reason for the NAMD results.
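If you want to check this kind of device locality on your own hosts, the affinity information is straightforward to pull out; the interface name and PCI address below are placeholders, not our actual devices:

  # GPU-to-CPU and GPU-to-NIC affinity matrix
  nvidia-smi topo -m

  # NUMA node a given network interface hangs off (-1 means no affinity reported)
  cat /sys/class/net/ens1f0/device/numa_node

  # Same thing for an arbitrary PCI device, for example one of the GPUs
  cat /sys/bus/pci/devices/0000:3b:00.0/numa_node

  # Overall CPU and memory layout of the node
  numactl --hardware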
There are possible solutions for this, and this is something other computing centers are already trying to address. Some people have already tried pinning half of the CPUs to the first CPU and NUMA node zero and half to the second CPU and NUMA node three, so that both the GPUs and the network can be reached from the correct NUMA memory. We could also think about removing LACP and just using a different network configuration, but as you know, the network guys will not be very happy to change the whole network configuration at the center just because we want to perform a bit better.

Going ahead to the wishlist: we really would like to test NVIDIA GRID and vGPUs on Ubuntu. The packages and binaries are already available for Red Hat and libvirt, but they're still not available for Ubuntu. I actually asked the Ubuntu guys earlier if there was any news about it, and they didn't know anything; they told me the kernel already supports vGPUs from 4.10 or 4.12, but with no binaries I don't know if this is going to be possible. Once that is done, we really would like to try the vGPU support on Queens and Rocky, and after that we want to do testing of RDMA GPU-to-GPU through the network. Do you want to add anything? No, that's pretty much it.

Cool, so that's pretty much it for our presentation, but did anyone have any questions?

Yes, my question is: when you look purely at the performance of the GPU, bare metal versus in the virtual machine, I saw that the benchmarks you presented were always CPU and GPU combined, is that correct? Yeah, except for TensorFlow, which is mostly GPU. Right, so did you see a difference there as well? Actually, when it's purely GPU bound there's a minimal difference, even less than that, around zero percent. Okay, so it's mainly the CPU? Yes, it's mainly the CPU, but we're still not sure whether it's the NUMA node configuration, the vCPU pinning configuration, or how the GPU and CPU are joined together that causes the performance decrease, because 11 percent is quite a substantial difference. We don't see this difference on our CPU instances running without a GPU.

Thanks. So, I'm kind of looking around the room hoping that there's somebody here who knows this stuff way better than I do, but I don't know whether you glossed over it or whether it's actually absent: there were a few pinning configuration options whose absence I noticed. Which ones are you talking about? Config options, and again I don't know them off the top of my head, but there's a way to specifically pin the CPUs and memory to the same NUMA node. So, you can see that this one is the first NUMA node, node zero, and the NUMA node has the first cores, from zero to six, so seven cores, and the adjacent memory. Right, I get it. Yeah, sorry, here, I think this is what you were talking about: this is the instance that is pinned to that first NUMA node, and the cores are pinned and stay within the NUMA node. If you look at where the 'y' is, it's a bit of an unreadable graph, but virsh gives it to you this way; you can see that cores zero to six are pinned. That is the instance information, and these are the physical cores each virtual core lies on.
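For reference, the output being described there comes from virsh on the hypervisor. Something along these lines shows the same pinning and NUMA placement information; the domain name is just a placeholder for whichever instance you are inspecting:

  # List the libvirt domains running on this hypervisor
  virsh list

  # Which physical CPU each virtual CPU of the instance is pinned to
  virsh vcpupin instance-0000002a

  # Per-vCPU affinity map, the row of 'y' characters shown on the slide
  virsh vcpuinfo instance-0000002a

  # Full CPU tuning and NUMA memory placement from the domain XML
  virsh dumpxml instance-0000002a | grep -A 20 '<cputune>'
  virsh dumpxml instance-0000002a | grep -A 5 '<numatune>'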
Yeah, so that's why I assumed you just hadn't shown that particular configuration option, because this looks like it's working the way you want it to, in which case my hypothesis is that you're fighting against overhead: you're not getting exclusive access to these CPUs when you're running virtual. Yeah, we thought so as well, but everything is already passed through, and the hypervisor we ran the test on was completely empty. Right, but you're still fighting against the compute process. That was actually my question, about limiting the overhead so it doesn't land on the vCPUs; I tried to do it but it didn't actually work, so I don't know if that's something you've looked into as far as overhead goes. Well, an overhead of 11% is a bit too high; it can't really just be considered overhead. For the first test, at around 4%, it probably could be, but 11% is not, and we still don't understand where the issue is coming from.

Have you used huge pages at all, or did you set that anywhere in your configuration? Not yet on this cluster; we had them on the other cluster, but these are really early results that we achieved right before coming to the summit. Huge pages could help, but I don't think they would improve the performance that much. But yeah, thank you.

So, a couple of things, just brainstorming with my colleagues over here: you should be able to pin the hypervisor processes to CPUs that aren't these CPUs, if you are in fact fighting with the hypervisor process. Oh, did you already do that? Yeah, we already did that with the isolcpus parameter: essentially, when you put that flag in GRUB, it means that none of the CPUs listed are used by the operating system, so they are completely isolated and only Nova can use them. Right, so it's not hypervisor noise either. Did you shut down the nova-compute process before you started your benchmarks? No, we didn't. You could do that, as long as you don't need to deploy more VMs. No, yeah, that's a good idea, that's a good idea. Thank you. Yeah, that's what I was talking about: I don't think isolcpus actually works the way it's supposed to, I don't think it actually isolates them, so I don't know if that's a kernel version issue or something. Okay, thank you.

Does anyone have any other questions, or anything else you want to talk about on this topic? It would be great to know. If not, thank you very much. Thank you.
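For anyone who wants to try the two concrete suggestions from that discussion, they would look roughly like this on an Ubuntu-based deployment; the flavor name is the same illustrative one used earlier:

  # Stop the compute agent while benchmarking so nothing competes with the pinned VM
  # (no new instances can be scheduled to this node while it is stopped)
  systemctl stop nova-compute
  # ... run the benchmarks ...
  systemctl start nova-compute

  # Back the guest memory with huge pages via the flavor
  # (huge pages have to be reserved on the host beforehand)
  openstack flavor set g1.v100x1 --property hw:mem_page_size=large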