Here we go. OK, ready, here you go. Hey, good afternoon. Welcome to our presentation on HPC use cases. I'm fully conscious that we're standing between you and the party. So we've got about 50 slides to get through, so I hope you're not planning on getting to the party anytime soon. Just kidding. So just to introduce ourselves, my name is Glenn Bowden. I'm with Hewlett Packard. I'm the chief technologist for the EMEA Professional Services practice for Cloud and OpenStack. I've been working with HPC and cloud for around 15 to 20 years now, so good experience there. And my colleague is Eric. Eric. Hey, guys. Let's turn that down a little bit. OK. My name's Eric LeJoy. I work with HP as well. Glenn here is a high performance computing guy. I'm an NFV guy, so kind of similar roles, but I'm more focused on NFV. They're very similar, but there are some differences there. And between the two of us, we'll be covering the topics today and hopefully getting some good questions from you and covering some interesting topics. Thank you, Glenn. So we want to get a feel for who's in the room, essentially, and what your backgrounds are. So if you can see the screens, and I apologize, they're a little bit small. Who has an idea of what architecture this represents? This is a high-level architecture. Who has an idea of what software it represents? Anybody? I know it's late. Come on. Yes. OpenStack. Does everyone agree with him? Hands up, who agrees? Anyone have any other suggestions? Who doesn't agree? Yeah. Who doesn't care and just wants to go to the party? OK. So yes, it does represent OpenStack. You can see all of the components that fit in there. It represents something else as well, though. Does anyone have any other ideas of what this might represent? The clue's in the topic. So it's also the same architecture as the Moab HPC suite. And all the components are in there as well. So by show of hands, who considers themselves a cloud person here to learn about HPC? One. One person. How about NFV? OK, who's HPC and wants to learn about cloud and OpenStack? More. Good. Who's in the right room? Just checking. So when we talk to a lot of people around HPC and cloud, I get a lot of resistance from the HPC people, because they say, no, no, we don't virtualize anything. We don't want to go on cloud. We have a perfectly good scheduling system. We have a cluster that runs perfectly fine. Why do we want to introduce cloud to this? And for them and for others, HPC and cloud are opposing forces. Cloud is: you share everything. So it's about having the resources available to all of your tenants and all your customers. It's based around very generic workloads. There's no tuning involved. You want to get the best for probably 80% of your use cases, which means 20% are going to suffer. It's very loosely coupled in terms of the way the architecture fits together, which isn't necessarily true with HPC. And you're usually running lots of very small workloads. You'll have lots and lots of different applications there, each with their own unique requirements, each with their own demands on the environment, all running together on the same environment. When you look at HPC, the traditional HPC cluster (we'll go on to define HPC in a minute, because I know it means different things to different people, and we don't have time to have the debate, unfortunately) is generally a share-nothing architecture. You're looking to have the whole cluster dedicated to a single task.
You're generally dealing with specific niche workloads. So there's something you're trying to solve, and you want all of the compute power going toward solving that one problem. They can be very tightly coupled. A process can span multiple nodes and be tightly coupled across those nodes, using something like RDMA to access memory on different nodes so that the process has it available for all of that workload. And there are very few workloads. It's usually one workload distributed across many nodes. So HPC and cloud, there's pretty much no alignment, right? But actually, there is. There's quite a lot of overlap between the two architectures. We saw that from the diagrams at the beginning. Cloud is highly distributed with large storage pools. Resource management is the key for cloud. That's how you keep all the customers happy. And you want performance management. You want it to be performant. Exactly the same things are true of HPC. So there's enough overlap there that things start to get interesting. So first of all, let's just define what we mean by HPC. And I know there should be people in the audience that disagree with me. If there aren't, I'll be disappointed. So there are generally two types of HPC that we talk about. One is analytics, which has been dominated by conversations around Hadoop, predominantly, for the last few years. But what we mean by analytics is we're looking at big data sets. So that's large pools of data that we want to process down, aggregate down, and get some results out of. The operations we perform against this are usually fairly simple operations, but they're performed many times over across the whole data set. And what we're looking for at the end is an aggregation of those results. So we're looking to get a set of results that summarizes the large data set that we have. The other type of HPC that we look at is computationally intensive stuff. This tends to involve smaller data sets. So you're looking at a much smaller data pool to start with. This gets pushed through very complicated algorithms, which tend to be in a pipeline. You're doing some processing, then you take the results and you process again, until you get to the end result. It's sequential processing. It's often what we call embarrassingly parallel as well, where you can break it up and run it across many nodes. And for those grid computing-type operations, it's often license-sensitive. You want high performance networking. You want high performance compute, obviously. And you also want high performance storage at the back end of it. You're usually quite sensitive to any latency injected at any of those layers that can slow jobs down. When we look at computational HPC, which was the second one we talked about there, and predominantly what a lot of the use cases we're looking at focus on, we can break that down into two types as well. So one is batch processing, which is a bit like analytics. Analytics is very batch-orientated as well, but there are some little differences with computational. So we're looking at batch processing. We take a large job, we push it in, and then it executes at one point in time. Again, that's very loosely coupled stuff. So those processes that are embarrassingly parallel, we break them up and run them across all nodes. A failure on one of those nodes doesn't mean the end of the job, it just means you resubmit it to another node. And the schedulers we have today are very good at doing this.
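To make "embarrassingly parallel" concrete, here is a minimal sketch (an illustration, not something shown in the session): each chunk of work is completely independent, so a failed chunk can simply be resubmitted somewhere else, which is exactly the property the batch schedulers exploit at cluster scale.

    # Illustrative only: an embarrassingly parallel job, chunked and farmed out.
    # Each chunk is independent, so a failed chunk can be re-run on another
    # worker without affecting the rest of the job.
    import random
    from multiprocessing import Pool

    def estimate_pi(samples):
        """Monte Carlo estimate over one independent chunk of work."""
        return sum(1 for _ in range(samples)
                   if random.random() ** 2 + random.random() ** 2 <= 1.0)

    if __name__ == "__main__":
        chunks = [100_000] * 32          # 32 independent work units
        with Pool() as pool:             # stand-in for a cluster scheduler
            hits = pool.map(estimate_pi, chunks)
        print("pi is roughly", 4 * sum(hits) / sum(chunks))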
It also has limited or no shared resources at all. So the jobs are very much independent. A node is pretty much stateless in terms of the other jobs that are going on. The other type is the real-time or grid computing, which I touched on in a previous slide. This is the very tightly coupled stuff. This is the stuff where the jobs you're running on a single processor will reach out and communicate with other nodes, and need access to those nodes to get scratch-space access, to get at the results of some of those jobs, and to pass information between the nodes in order to be successful at the end. That requires the high-performance networking, for remote direct memory access, or RDMA, for example. It usually needs a high-performance shared file system in the back end as well, because a lot of these nodes, because they need to communicate with their neighbors, are writing things out to scratch spaces that need to be accessible by other nodes within the cluster. And the CPU and memory architecture become much more critical than with batch. Batch tends to be able to run on generic cloud-like architectures, if you like. When you start looking at computationally intensive stuff in real time, then you're looking at the way you architect the CPU and memory, and we'll touch on some of the options for that in a bit. So why are we looking at cloud at all? I've already said the schedulers do a good job. I've already said that there's a religious argument pretty much about not marrying the two together. So what are the drivers for doing this? And one of the primary drivers that I've seen when I've been talking to customers regarding this is the multi-tenancy challenge. And this originally came from universities, but it's coming more and more from other organizations that are beginning to push this out. And the original use case I had for this, when I first started working with customers getting HPC onto cloud, was around genomics, human genomics particularly. And the reason being is, particularly in Europe where I'm based, but I think through most of the world over, there's protection that goes around human data. So if you're analyzing human DNA, that DNA is protected under most privacy laws, because it gives you an identifiable means of getting to who that person was, which means if a project has some DNA in it, you can't share that data with any other projects. And there are lots of projects that go on in universities where they're trying different algorithms and different methods for analyzing DNA to get to different results for different challenges. So they wanted a way of sharing the cluster systems that they had and the HPC systems that they had without having the data bleed between projects. And on most HPC clusters, if you're just using the standard scheduling engines and you're just approaching it with the standard HPC methodologies, security isn't at the forefront. If anyone's worked in university environments where they have a large HPC cluster, you'll probably know that you can fairly easily get access to whatever else is running as projects and jobs on those systems. So how do you deal with multi-tenancy in that environment? And cloud, and OpenStack particularly, gives us a lot of options in that space. So what we're looking for in multi-tenancy as well is that we can then have a set of shared resources that all projects have access to. So these can be things like the golden images of the operating system.
So if you're using high-performance Linux, for example, you can have a single set of libraries for that and a single set of binaries for that. The actual binaries for doing the processing themselves, so not the stuff the scientists are compiling and running, but the binaries for the projects and the software that you're using to process, can all be common. You don't have to duplicate them, and you can make sure that the same versions are tested on the hardware that you have. The drivers are all correct for the virtual machines or the bare metal that you're spinning up, and you share them between the projects. It makes supporting this thing a lot easier, because otherwise it can become a bit of chaos. What tends to happen in these environments is the students who are using it, or the scientists who are using it, find out that they've got a home directory and they fill that up with their binaries and never clear it out. So you end up with scratch spaces and home directories that just grow and grow and grow and become bigger than some of the data pools you're trying to process. So having this shared mechanism is a way of protecting against that and actually using that mechanism to reduce the storage demands. From a networking perspective as well, if you're using cloud, particularly if you're using OpenStack, you can use things like tenant-based storage VLANs. You can have a storage device that exports its storage over a particular VLAN, a particular network, which is only accessible to certain tenants. Then you can tie it down a bit more tightly so that not everyone has access to the same NFS endpoints. You also have better scheduling options that allow tenants not to compete with each other. So you can use things like availability zones, we can use things like regions, we can use things like cells, and we'll touch a bit on those in a minute, to make sure that resources for a particular customer don't bounce off of resources for another customer. Also, we're beginning to look at federation. Now, there are a lot of projects around where a single place is gathering all the data, so they own the sensors. I don't know if Tim is in the room, but CERN is a good example of this. Not everyone has a Large Hadron Collider in their back garden that they can use to gather this stuff. So if you do, you generate huge amounts of data, usually too much to actually process yourself. So you have federation with other universities, other scientific research centers, other organizations, and you need them to be able to process the data in a meaningful way, where you understand what the architecture looks like that they're using to process it. That way you can ensure that the results they get will marry up with the results that you get if you do the same type of testing. And it's a bit of a challenge to make sure that, rather than just dictate the hardware and software that people have, you can abstract that in some way and just dictate the logic and the APIs that they need to use to get access to it. If you're dealing with universities, for example, their budgets and funds tend to be limited. So it's difficult for them to go out and buy specific hardware for a specific task. You need to abstract that away. In order to get the federation working, then, you need to have a shared authentication system or a trust system in place. And we've seen, with Kilo, the launch of federation within Keystone. So there are some options there for how we actually tie clouds together.
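To make the earlier point about availability zones and keeping tenants apart a little more concrete, here is a rough sketch using python-novaclient. It assumes a working Keystone endpoint and admin credentials; the host, aggregate, zone, flavor and image names are all made up for the illustration.

    # Sketch only: dedicate a group of hosts to one project by putting them in
    # a host aggregate exposed as an availability zone, then boot into it.
    from keystoneauth1.identity import v3
    from keystoneauth1 import session
    from novaclient import client

    auth = v3.Password(auth_url="http://controller:5000/v3",
                       username="admin", password="secret",
                       project_name="admin",
                       user_domain_id="default", project_domain_id="default")
    nova = client.Client("2.1", session=session.Session(auth=auth))

    # Group the hosts reserved for the genomics tenant into their own zone.
    agg = nova.aggregates.create("genomics-hosts", "genomics-az")
    for host in ("compute-101", "compute-102", "compute-103"):
        nova.aggregates.add_host(agg, host)

    # That tenant then targets the zone when booting, so its instances never
    # land on hosts shared with other projects.
    nova.servers.create(name="hpc-worker-01",
                        image="<golden-image-uuid>",
                        flavor="<hpc-flavor-id>",
                        availability_zone="genomics-az")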
So we want to focus a bit on compute now. We're breaking this down into three sections: compute, storage, and networking. So we'll focus a bit on the compute side now. I talked about multi-tenancy, so we'll just drill into a couple of options you have in that space. One of which is regions. And what regions are, for those that don't know, and I expect most people here do actually, is a completely separate set of endpoints. So you have a whole new set of services for each of the regions, in terms of OpenStack services. So you have your own set of Nova, you have your own set of Neutron, you have your own set of Swift, all of those things. What that does is it reduces the impact of one tenant's activities on another tenant. So if you have a noisy tenant, in the sense that they're standing up lots of VMs, for example, so they're dynamically creating maybe 500 or 600 VMs in an hour, then that's a lot of metadata that's going across the backplane for OpenStack, and you can separate that out; you can dedicate control nodes to a particular tenant if they're that busy. This comes down to design; you can also fold them together so that those APIs exist on the same servers with different endpoints. It means you can independently scale tenants. So if you have a particularly busy tenant, if you separate them into regions, you can make sure the control plane and the API load balancing for that tenant go across more servers than the other tenants, for example. So if you're selling HPC as a service, it gives you a way of pricing a premium service, in a way. You can have a more scalable version by having five nodes of your control plane instead of three, perhaps. It does require more hardware, or very careful management, though, so you do need to buy those extra servers to have those extra API endpoints, or manage them very well so they don't impact each other if they're on the same servers. The other option is cells. Now cells are single API endpoints in terms of Nova, so cells are very much Nova focused, but they have a single API endpoint, so there's a parent API at the top, one Nova API process. Underneath that you have multiple Nova processes, so Nova compute processes living underneath, and then it's up to the API to distribute the request to the correct cell. It's basically a hierarchical tree of cells. You can have a cell that's a parent of other cells as well, but each cell runs its own set of Nova services, so you can scale the compute side of things, but the other services are more of a challenge. So we talked about grid computing. One of the things we said about the requirements for that was low latency, and consistently low latency. It needs to be trustworthy, it needs to be supportable, so that you can depend on having low latency. One of the reasons is that, as I said, the processes are using RDMA, and you might use something like InfiniBand or high bandwidth Ethernet in the back end to achieve that. Nodes will make remote procedure calls between each other, so they'll call processes from a different processor or from a different memory space, and you share the data set amongst many nodes as well, so you need low latency to be able to dip into that data set and not have that be the bottleneck.
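For contrast with the batch case, a tightly coupled job looks more like the sketch below. mpi4py is just my stand-in for the illustration (the talk doesn't name a framework): every rank owns part of the data set and nothing can finish until all of them have exchanged results, which is why consistent low latency between nodes matters so much here.

    # Sketch of a tightly coupled job: every rank exchanges data with all the
    # others at each step, so inter-node latency gates the whole job.
    # Run with, for example:  mpirun -np 4 python tightly_coupled.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Each rank owns a slice of the problem (its local piece of the data set).
    local = np.random.rand(1_000_000)

    # A global reduction: no rank can proceed until it has heard from every peer.
    total = comm.allreduce(local.sum(), op=MPI.SUM)

    if rank == 0:
        print("global sum across", size, "ranks:", total)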
We're going to look a bit at what non-uniform memory access, or NUMA, gives us in terms of speeding some of those things up and improving the latency of jobs, in terms of cache efficiency and memory localization to CPU, as well as virtual CPU to physical CPU pinning, and Eric's going to cover some of this. We'll also cover a bit about CPU affinity and how that can actually improve the virtual machine's performance as well. I've said a lot about what virtual machines are doing in this; we do cover a little bit about the bare metal side as well. And then the DPDK framework, which is from the networking side, which Eric will talk about. Excellent. And just so you know, we're not reading mail or email or anything like that up here; it's just our presentation tool. So before I get into the sections on DPDK, is there anyone in here that's actually using DPDK? Okay, I got one. Any others? Any that don't know if they're using DPDK? Okay, two. How about 6WIND or Wind River or Nuage or NSX? Any kind of accelerated switching? Okay. All right, so this is going to be fairly new, so we'll make sure we don't skip over anything. One of the issues we have in OpenStack is the amount of CPU you use when you're doing packet processing. So I'm going to take this from an NFV angle or a telco angle, where we have, say, a virtual machine that uses a lot of packet processing resources from a compute node. Before we go into that though, how many people in here are in the telco industry or doing NFV? Okay, so we get the right side of the room, left side, okay. All right, so let me carry on with this then. So the idea with DPDK is you have probably 50% of your CPU cycles being used up for packet processing on a VM. So how do you offload that, and what's causing that? So you analyze that, and you realize you're doing a lot of memory copies. Let me pause there for a second. I'm not getting much feedback from you guys, so ask questions if I say something you don't understand. I want to make sure that I don't kind of go over, above, or beyond here. So if you look down here, we're saying DPDK is for low latency, right? And then if you look over here, we're saying CPU offload, and we're using DPDK as well. So there are two things going on here, and we'll get into them in the later slides. But I'll take a pause there on DPDK and we'll get into it in a later slide. Does anyone know what NVMe is? Or AHCI? Okay, we got two guys in front. So one of the things we're starting to see is that NVMe is, let's say, a simpler command set for driving your hard disk or your storage in the system at a really high rate. What we're starting to see is that by using NVMe, you're actually reducing the CPU load because of the simplicity of the command set. So that's one thing we're starting to look at. And even this month alone, Samsung's got this 950 Pro device that's coming out, which you can insert into an M.2 slot in a server or even in your home PC. And you can get 400% faster throughput in reads and writes, and even in your IOPS. So we're looking at that as a potential. So you're lowering your CPU utilization by changing your storage technology. Glenn's going to talk later about the CPU offloading by using PCI passthrough and using GPUs in the VMs for the high performance computing stuff. And then SR-IOV at the end. So we're also going to talk about how you can use SR-IOV and also some of the impact. So we're seeing a lot of customers in our space in OpenStack that say they want SR-IOV.
And then they're saying, you know what? I want to put a transparent firewall in my cloud. Or I want to run MPLS out of a virtual router. And then they say, okay, I can't do that over SR-IOV because of the way SR-IOV works as a technology. So there are a lot of stumbling blocks, and hopefully we'll share those with you as we go through the slides here. Before we can get into DPDK though, you have to have a basic understanding. So I'm going to go through this at a high level. And like I said, this is interactive, so ask any questions if something doesn't make sense. So we said NUMA, and I think we covered huge pages, or we will, and some of the other pieces that we have in DPDK. What we're showing here is, if you look, and make sure I can see. You guys seeing this as well? A little dot on the screen. Okay. All right, so we have two pictures here. You have a picture on the left and a picture on the right. The picture on the left we could call kind of not using the resources in your server well. So how many people in this room know what NUMA is? Okay, good. How many know the efficient way, or can tell me what is going on in these two pictures here? Okay, so this is going to be very useful, I hope, for you. So what you're seeing here is a NUMA node topology. Let's say you have a dual-socketed server. You're aware that each socket would be a NUMA node. In an Intel architecture, each socket's going to have its own memory DIMMs with the memory it uses, and then it's also going to have its own PCI Express lanes, or its PCI Express bus. Ideally, you should have the network cards that are doing all of your high throughput going to the NUMA node where you're going to have the VMs running that are using that high throughput. And let's say, when you turn on something like NUMA awareness: if you look here you have four cores, and then on the other example you have four cores over here. This is the same server, we're just showing it in two pictures. So on the left side, you're seeing a VM on two cores, but its network traffic's coming in on the other NUMA node. It has to cross the QPI bus. And when that happens, you're getting a very big performance degradation from what you would have had if you kept them on the same NUMA node. So the picture on the right, what it's showing is that you have the VM on two cores, let's say NUMA node zero. Your network cards are also plugged into PCI Express buses on NUMA node zero. And in this case, you have the most efficient setup with your hardware. Now, a lot of people think OpenStack is an abstraction layer above the hardware. But when you're doing high performance, you really need to be aware of where your hardware is placed and what kind of server architecture you're sitting on with the hypervisor. And what we're showing here is that with an understanding of the NUMA node placement of your hardware, and then using the NUMA feature in OpenStack, you can actually pin VMs to the right cores and then also get your OVS bridge, which we'll talk about a little bit, working on the right cores as well. One good data point to know about: how many people here use OVS as their default OpenStack bridge? Okay, so with OVS 2.4, it starts actually using cores from each NUMA node. And depending on where the thread's actually running, so where your VM is running, it'll actually use the thread over in that NUMA node.
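In OpenStack terms, the NUMA awareness Eric is describing is requested per flavor. A minimal sketch follows, assuming nova is an already-authenticated python-novaclient Client as in the earlier example; the flavor name and sizes are made up, but hw:numa_nodes is the standard Nova extra spec for this.

    # Sketch: ask Nova to confine a guest's vCPUs and RAM to one NUMA node,
    # so its memory and (ideally) the NIC it uses sit on the same socket.
    flavor = nova.flavors.create(name="hpc.numa-aware",
                                 ram=16384, vcpus=8, disk=40)
    flavor.set_keys({"hw:numa_nodes": "1"})

Setting it to "2" instead would tell Nova to split the guest's vCPUs and memory evenly across two NUMA nodes rather than keeping the whole guest on one.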
So we're getting more intelligence in OVS, so that kind of abstracts this away from you, but you still have to know where your NIC card is and which NUMA node it's attached to, or you're going to have to cross the QPI bus. So the key takeaway from this slide: avoid the QPI bus as much as you can in the way your data flows through your server. Any questions so far? Okay, Glenn here is going to carry on with the CPU affinity and then we'll hop over on the next slide. Cool. So when you start up a VM, you create virtual CPUs for that virtual machine. What you don't want to do is have contention between those virtual CPUs within that virtual machine. If you're standing up one VM per server, which some compute clusters do, then this isn't so troubling, but if you're sharing resources, as we talked about earlier, then you want those vCPUs to be as performant as possible. If you look at this diagram, you can see, assuming my red dot works, there we go. Up in the top here, we have these two vCPUs and they're using the same core. So vCPU one is CPU zero and vCPU two is CPU one, but they're on the same core, which means they're competing. So although there are two logical CPUs there, they're competing for the same core. This one is a little better, but it depends on the cache footprint. It depends on the footprint of the instructions that you're running, because you're having to move between CPUs at this point. So you're going across that bus that Eric has talked about. And down here, this is probably the best way of doing it. You're staying on the same die, but you're using different cores within that die. So you have one vCPU on one core and another vCPU on a different core. This is where you ideally want to go. Now, we can do this now. I forget the exact name of the extension, but there's a filter now, essentially a CPU affinity filter, in Nova, which allows the scheduler to assign processes to CPUs in this manner. There are also ways of forcing a vCPU to bind to a particular CPU. I wouldn't recommend it unless you absolutely need to micromanage your services. If you're in a shared environment, this can be a bad thing, particularly if you're doing something like an evacuation of a node because of an issue, or a migration of some sort, a live migration, because it limits the number of other servers that can actually take on this process. So let the scheduler decide which CPUs to bind the vCPUs to; try not to specify them yourself, but use the scheduler to place them properly (there's a sketch of the relevant flavor properties just below). So back to Eric for the DPDK. We can keep bouncing around, so. Take a look at this and then absorb what's here and see if you have any questions before I jump through this. I hope that I can get my audio to be consistent here. Does anyone see something familiar that they've run into before? Okay. I'll give you some real-life cases of what we ran into last week, and finally solved this week, at one of our customers as well, maybe as a kind of data point that'll save you time. And just to be clear, the stuff that we've done here, we're HP, so we did this on Helion OpenStack, and no, that's not the canceled product; that was the public cloud. This is a private cloud distribution, like Red Hat or Mirantis or whatever. You can do this on any of your distros. All the issues we're talking about today will be on every distro, unless they're doing something custom or proprietary. So I saw three hands that said they knew DPDK.
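Picking up the note above about flavor properties: CPU pinning and hugepage backing (which the DPDK and vhost-user material coming up also relies on) are requested the same way. The extra spec keys here are the standard ones; the flavor itself is hypothetical, and on the host side this assumes the compute nodes reserve cores for guests (the vcpu_pin_set option in nova.conf) and that the NUMA topology scheduler filter is enabled.

    # Sketch: a flavor that pins each vCPU to a dedicated host core and backs
    # the guest with huge pages, keeping everything on a single NUMA node.
    pinned = nova.flavors.create(name="hpc.pinned", ram=16384, vcpus=8, disk=40)
    pinned.set_keys({
        "hw:cpu_policy": "dedicated",    # one guest vCPU per dedicated host core
        "hw:mem_page_size": "large",     # back guest memory with huge pages
        "hw:numa_nodes": "1",            # keep the whole guest on one NUMA node
    })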
Out of those three hands, or maybe there are some more that didn't show, how many of you have done DPDK 2.0 or higher, or 2.1? Okay, perfect, one person. So this should be very valuable to all of you if you're planning on getting high performance packet processing out of your OpenStack cloud. With 2.0, when you install it, you have to have these things. You have to have OVS 2.4, especially if you're going to use a user space OVS. You can use kernel space or user space, but if you want to get performance, you're going to go to user space. You have to have huge page support, and you have to have two different sizes. So for us, we made a default of two meg, and then we had some one gig huge pages later on for OVS. A real-time kernel's not required, but if you can get the preemption extensions, it's perfect. One thing we ran into: we were running kernel 4.2, and DPDK 2.0 doesn't support it. We had to roll back all the way to 3.19 or older for 2.0. So if your OpenStack is running a kernel of 4.0, you're going to want to look at the requirements and say, okay, I'm going to have to roll back my kernel on these high-performance compute nodes. The other thing that's going to be in there is you have to have NUMA, right? So if you don't have NUMA support in your stack, you're going to have to recompile QEMU to support NUMA, you're going to have to get libnuma installed, and you're also going to have to do some libvirt recompiles to support it as well. I'm going to quickly go through the rest just so we stay on track, but if any of these things are important to you and you want to ask questions, yeah, sorry. That's a good question. So the question is, are we talking about Kilo or Liberty, which release of OpenStack? We are talking about Kilo or later, just because that's where NUMA came into the picture with the right level of support. Great question. Okay, so you've got NUMA, you've got QEMU with NUMA, with libnuma, and I mentioned libvirt. How many people have turned on the IOMMU in your BIOS before, on your PC or your servers? Okay, how many of you have turned on VT-d in your servers or your computers? Okay, if you're running Fusion on your Mac or Workstation on your PC, you've most likely turned this feature on. This is an Intel feature, so it's called something different under AMD; I don't actually remember what it's called right now. But that last checkbox, VFIO, if you want to run that, you have to have the IOMMU turned on. The problem is some of the servers break because of the NICs in them or the PCI Express devices, because the IOMMU is turned on and there's not a good handoff between the BIOS and the operating systems. So you have to be very careful and make sure that that's supported on your server hardware. The other issue is VFIO. So we have, sorry, this audio is getting a little different. We have an example of a VM that's running a kernel, in the VM, of 2.6.26. This is an old kernel, and it had its own customized version of virtio, because it was right around the time that virtio was getting developed. I think it was a pre-1.0 release of virtio. And I guess we'll go into this detail. Let me know if we need to speed up. So how many people know what MSI or MSI-X is? Okay, we got lots of smiles in the front. Yep, okay, good. And we got some in the back. So we had this VM running. MSI is a way of interrupting; it's a PCI or PCI Express interrupt mechanism. So your system is running, let's say virtio is the driver, and you've got your VM sitting there.
And with DPDK, we're extending the receive and transmit buffers up to the VM. I know I'm trying to be abstract here. I don't have this drawn, but let me describe the problem statement that happens. You're using VFIO and DPDK, and you get a packet coming into a physical NIC. And with DPDK, what you're really doing is you're trying to copy, you have memory inside the NIC, which is for the ring buffer for the packets. The packet comes in and gets stored in memory. The main purpose of DPDK is to not have to copy the packet to kernel memory and then copy it up to user space memory into the VM. So you're trying to reduce the CPU cycles that are going on there. Now that being said, let's say a packet comes in, comes into your ring buffer on the hardware, gets copied up to the user space ring buffer for that VM. And at that point, the VM should get an interrupt saying, hey, your ring buffer's full, or you have a packet in the ring buffer. And if your libvirt version has some bugs in it, and there are lots of different bugs in virtio and the MSI-X interrupt, let's say, protocol, then you're going to run into issues where your VM just sits there and doesn't process packets in the receive buffer. And then you get lots of packets thrown on the floor and you see error rates on your vNIC interface. So we actually hit that. We went into the VM and we turned off this MSI-X interrupt, which means it falls back to another method, and we actually started getting really high packet throughput, and a lot of other things improved from that issue. The key takeaway from here is: don't assume all your VMs are going to work when you're using DPDK. Put them in the environment, get Spirent or whatever your packet generator is going to be, test it out, see what you get for packet drops and throughput and all of those kinds of metrics that are your key KPIs. I'm going to stop there. There's a lot more detail we can go into. You can grab Glenn and me after the session to go through it. But we'll also get into SR-IOV in the later slides. So the other option for processing, we've talked about CPUs and we talked about NUMA, but the other option for processing large workloads is using GPUs, or graphics processing units. Now this was something that was pioneered by NVIDIA back in 2007. The race at that point was more about gaming and graphics cards, and it was about how many polygons they could render and how much physics processing you could do within that graphics card in order to get the game to flow smoothly, to have the realistic effects, to simulate gravity and things like that. So that's what kind of drove it originally. It was a consumer gaming market. But the architecture led to increasing the cores with dedicated GPU memory. So you end up with a graphics card in your PC that had dedicated memory allocated to it, which would hold the visual buffers and perform the calculations against those buffers. The idea was then that we could use this same technology and this same architecture in high performance computing. Now GPU and CPU are entirely different architectures in the way you look at them. You can see that by the pictures that we have here. A GPU has thousands of cores on it, but they all run a very simplistic instruction set. So it's very simple what you can do on each of those cores, but you can do it in very, very parallel ways, distributing those jobs across all of those cores. Whereas with the CPU, you have multiple cores, but those cores have more complicated instruction sets, which adds more latency to what they're doing.
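To give a feel for the many-simple-cores model Glenn is describing, here is a tiny sketch of offloading an embarrassingly parallel loop to a GPU. Numba's CUDA support is just my choice for the illustration (the talk itself only names NVIDIA's CUDA architecture), and it assumes an NVIDIA GPU and the CUDA toolkit are present.

    # Sketch: one very simple operation applied across a large array, with
    # each GPU thread handling one element; the pattern GPUs are built for.
    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(out, data, factor):
        i = cuda.grid(1)                  # this thread's global index
        if i < data.size:
            out[i] = data[i] * factor     # trivial per-element work, massively parallel

    data = np.arange(10_000_000, dtype=np.float32)
    out = np.empty_like(data)
    threads_per_block = 256
    blocks = (data.size + threads_per_block - 1) // threads_per_block
    scale[blocks, threads_per_block](out, data, 2.0)   # CPU stays free for IO, scheduling, etc.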
So what you can do with GPUs, we talked about the embarrassingly parallel jobs, well, these GPUs are perfect for that sort of thing. We see them cropping up in things like physics simulations and calculations, where you can split the job into thousands of pieces. NVIDIA created an architecture called the Compute Unified Device Architecture, CUDA, and what that is, essentially, is an interface to program to: an API that allows you to take advantage of those GPU cores and take standard compute jobs and distribute them to the GPU instead of the CPU. And what this means is that you can then share the workloads between both of those processors. You can have the GPU doing the actual calculations and the physics processing, while you use the CPU to do all the other bits and pieces, like the IO shifting and the memory management and the other parent processes that go around the actual job itself, the scheduling and the metadata, while the GPU is really the workhorse here. How many people are running a GPU compute cluster? One, two, cool, so there we go. We've talked about virtual machines quite a lot, but there are other options too, and the other option is running it on bare metal. Now, being from HP, we've had a lot of experience with Ironic, because the deployer for our Helion OpenStack leverages Ironic to actually deploy the OpenStack images. But its primary use case is to manage bare metal servers. So you can have bare metal hardware and you use Ironic to stand up images onto that bare metal hardware. This gives you dedicated resources, which is what a lot of HPC users are actually looking for. It doesn't become a shared resource, because that bare metal server then becomes dedicated to the tenant that stood it up in the first place. It was integrated as a project as part of Kilo, and you can see I've used the new fancy dashboard we have on the OpenStack project page now to see how mature it is. Adoption is only 9%. It was introduced and integrated in Kilo. It's been around for a little bit longer than that, but it really became usable in Kilo. It's had some new features, or more stability I guess, added in Liberty as well, and there's work going on in Mitaka as well; I think the design sessions for it were today and possibly Friday. We've already said it allows bare metal servers to be provisioned by Nova, so Nova is in control of what happens. What you're essentially doing is taking a Glance image, DHCP-booting your node, and then applying that Glance image to that node. So it's not entirely that you're putting an operating system directly onto the host. There's still some abstraction in there, because you need to have that image virtualized in some way. The image doesn't necessarily contain all of the direct drivers for that hardware. So you need some abstraction, and there's an overlay that does that abstraction. It means it's image boot as well, so if you need to update that server at any point in time using Ironic, you tend to have to rebuild the image, so you redeploy the image to the server. You make the modifications in the image and do a redeploy. It's not the only way, but it's usually the best way, because otherwise, when you redeploy that image again, if you have a power-down or something, then the server will look different. So patching can be a bit of an issue for this one as well, and there are some network considerations around DHCP and lease times and things like that that are confusing when you're using Ironic.
So it's not the default go-to yet, but it is in there as an option. It does work. I've run Sahara clusters on this before, so we've had Hadoop as a service, essentially, based on using compute on bare metal with Ironic. So what have we got in Liberty that advances HPC compute? Well, the first one is virtio queue scaling, and there are a number more than these, but these are just the four I wanted to pick out for the time we had. What virtio queue scaling does is provide enhanced network performance for guests that have more than one virtual CPU, which means it allows multiple CPUs to actually push through the network traffic. There's a flag you need to set, which is hw_vif_multiqueue_enabled; I'm reading that off the slide, I don't have that good a memory. When you enable this in Nova, it allows the virtio queue scaling on your virtual NIC. We also have support for InfiniBand SR-IOV for libvirt now, and this is new in Liberty. What that does is give you the same hardware-to-virtual-NIC translation that SR-IOV did for Ethernet network interfaces, but on InfiniBand interfaces, which means we can now get InfiniBand directly into the virtual machine. So if you're using InfiniBand for your interconnects, for your storage, for any of those reasons, your VM now has direct access to that, which removes the latency of having to go through that virtual abstraction. The different-cell scheduler filter was inserted into Nova; this is an anti-affinity scheduler, which means you want the processes that you're spinning up to run in a different cell than the process they're paired with. So we've had affinity before; now we have anti-affinity, which is different cells. You need to enable this in Nova as well. And there's now a Neutron quality of service API as well. So you can define quality of service on networking using the Neutron API now. This is also new in Liberty. You can define it on a per-port basis as well in Liberty. Any questions on the compute side of things? I think we're about to step into storage. Yes? We do, we can share it separately with the white paper; it's not in the slides, but if we have a chat afterwards we can make sure we share the numbers. We have some numbers, but they don't compare, say, bare metal and NUMA, things like that. So we have some numbers to share, but there's a white paper that we can get for you, okay? Okay. So I'm going to cover storage a bit now and go through the different storage options that we have in OpenStack and how they map to use cases within an HPC environment. So we'll start off talking about ephemeral storage. Now everyone knows what ephemeral storage is, I hope: it's the bit that gets created when you create your VM, and it persists only as long as the VM. Don't store your data in ephemeral storage, please. I've had customers that did that and then thought terminate would just stop their VM, not destroy it completely, and lost all the data. So don't store data in ephemeral storage. It's usually used for scratch space that you don't care about, non-persistent data, and operating systems that are going to get rebuilt when you re-spin up the node anyway. It's usually located on the compute server itself, so it's usually local to the VM, so it's quite fast in terms of access speed. So if you're using it for scratch space, then it's a good one to use, but don't put anything in it that you need to be able to get back. User scratch space is a good candidate for this.
You can have a second disk of ephemeral storage attached to your VM and put your user scratch space and your work scratch space in that; it's also used for the operating environment. Block storage. So this is the bit that's delivered by Cinder currently. It's persistent and it's non-shared, currently. Liberty has some options now where we're able to share some of that block storage with other nodes, but at the moment it's persistent and non-shared. Also, the size of the block storage is independent of the size of the VM. With ephemeral storage you get what the flavor dictates, so when you stand up a flavor you get the number of CPUs, the amount of memory, and the amount of ephemeral storage. With block storage you can have independent sizing, so you can create whatever you need. It also comes from many different sources: SAN arrays, local disk, whatever you need. In terms of use cases, this is normally where you would see the database. So if you're using a database like Oracle or MySQL or SQL Server or anything like that, this is typically where you would put the data for the tables and the indexes. It's usually high performance because it's SAN-based; there's usually a big, powerful array sitting behind it if you're externalizing it, and at that point you then have the option to fail that LUN over and move it around to different clusters as well. It can be used for high performance scratch space. It depends on how much scratch space you need. I've seen external flash arrays, for example, used as scratch space attached in this manner. As I said, the project is Cinder; you usually have a large disk array attached by Fibre Channel SAN or iSCSI to all of the compute nodes. The benefit of that as well is, if you are moving your VM between nodes, either in an evacuation or in a migration, because the data is external to that node, it's much faster for that migration to take place, because it can access the data that's on the external array from either of the places, so there's no need to do any data moving. The guest is responsible for creating a file system on the LUN, normally. So if you have a LUN, it'll come in as a raw block device, and then you need to put some kind of file system on it; any Linux file system will do. It's also, at the moment, not shared with other guests. As I said, there are advances in Liberty that make some sharing possible, and Mitaka is going to take it a little bit further, but currently you need to consider that it's not shared. It's dedicated to one guest only. Object storage. So object storage is kind of the darling of cloud in terms of storage, I guess. It's persistent and it's scalable, so a bit like Cinder in that sense. But you access it using a REST API. You don't access it using a POSIX file system or anything like that. So you have a REST endpoint that you're firing basic HTTP verbs at: get, put, delete. This makes it very developer friendly, but it doesn't necessarily make it very user friendly when you're looking at sort of standard file systems. I think we're also running out of time, so I'll speed up a lot. It's not bound to an individual guest. This is something that exists independently of everything else. In terms of HPC use cases, these are great candidates for centralized data lakes. There's a lot of abstraction you can put between that object pool and the compute processes that are going on.
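Those HTTP verbs map straight onto the client libraries. A minimal sketch with python-swiftclient follows; the endpoint, credentials and container names are made up, and a real Keystone v3 setup may need domain options as well.

    # Sketch: object storage is just PUT / GET / DELETE against a REST endpoint.
    import swiftclient

    conn = swiftclient.Connection(authurl="http://controller:5000/v3",
                                  user="scientist", key="secret",
                                  tenant_name="genomics", auth_version="3")

    conn.put_container("reference-data")              # create a container
    with open("sample01.cram", "rb") as f:
        conn.put_object("reference-data", "genomes/sample01.cram", contents=f)
    headers, body = conn.get_object("reference-data", "genomes/sample01.cram")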
One of the use cases I've used this for: we've had a huge data lake that needs to go through a Hadoop process first, and then the same data set also gets processed by a different processing engine on the grid compute side. And rather than having to pick up this data lake and move it around, which is petabytes in size, or can be, we actually use it so that the same data is accessible by both of the tools. The project is Swift. There's also stuff available that does both block and object if you're looking for it. It uses metadata. There's a whole other topic around the metadata, which I could go into, but we don't have time for it. But come and talk to me afterwards if you're interested in object storage and metadata, particularly using open standards such as CDMI. File storage: the project is Manila, which we'll come onto in a minute. It's shared persistent storage, with a POSIX file system usually, so something like CIFS or NFS or Samba. HPC use cases are normally home directories, but more and more frequently now, shared project data, and we'll whiz quickly through one of the examples of what that shared project data looks like. But it's a scale-out file system. So, as I said, it's Manila. Manila manages the creation of pools on the provider services. It creates the shares, applies the permissions, maps them up to the VM, gives the VM access to the share. It also manages which guests have access to it. It's plugin driven. Each of the storage vendors will have their own plugin to create the shares that they have. So, Lustre is a high performance file system that we see in a lot of high performance clusters. Anyone here running Lustre? Yes. Mostly the people with gray hair, I'm guessing. I've run Lustre; my hair's going gray. So one of the use cases, and this was from a human genomics processing pattern: we were standing up a shared KVM compute environment, but they needed access to shared storage and shared data. So we had a shared pool that was coming from a NAS device initially. We were using Manila to create file services on that shared device. And it was being delivered to the guests over their own dedicated VLANs. So when I talked earlier about the tenant-based VLAN, this is what we were using, so that each guest couldn't access other storage over the same network. It was segregated out. So this was the kind of high level architecture, if you like. Let's quickly get into what Lustre is before I tell you how we solved some of the issues that we were seeing with this. It's a massively parallel file system made up of several components. There's a management server and a target. In Lustre, a server is the bit that runs the processes; a target is the actual pool of storage that's being used, so the end disk, if you like. So there's a management server, there's a metadata server, and there's an object storage server. One file can be broken up and spread over 2,000 different objects. With ldiskfs, which was the default file system for Lustre up until a release ago, each of these objects can be up to 16 terabytes in size. So that gives you a maximum of 31.25 petabytes for a single file. If you break it up 2,000 times and each of those things is 16 terabytes, that's 31.25 petabytes per file, which is pretty big. And you can have four billion of those per metadata target. So that's fairly big. And you can also have up to 4,096 metadata targets. I haven't seen one this big, but it's got the capability. So this is the layering of services we have for the genome sequencing. There's a surprise line in here.
I hope people are looking at it and going, what's that doing in there? I did. So we start off with the object servers at the top. We then have CRAM. CRAM is a compression format that we see a lot in human genomics, which gives us some very good compression. We have Cinder at the bottom managing the storage. And we have zed-F-S, or ZFS if you're American. So what's ZFS doing in there? How did that get into the high-performance compute bit? Well, Lustre has very limited data protection. Anyone that's used Lustre, or is running one, knows that you have to do the data protection yourself, because it basically does a RAID-0 stripe across all the object servers. So if you lose one, there's no protection at all. You rely on the physical infrastructure to protect your data. It's scale-out. It's really easy to scale out Lustre. But it's not so easy to scale it up; adding disks to an existing Lustre file system, you go through all sorts of metadata and layout issues. It's difficult to scale things up. ZFS has self-healing. It also has snapshots. I wouldn't necessarily recommend turning snapshots on for what we're talking about here, but it has them. And it is really, really good at scaling up. So if you have a ZFS file system that you're offering up to your Lustre file system, it gives you the capability to go up and out. ZFS also has cache pooling. So you can dedicate SSD drives to do caching, read and write, for the ZFS file system, which means you can use that for the metadata parts, which really accelerates any queries into that file system. It gives it massive potential and scale. Now, we already talked about scale. These are the numbers we mentioned: 16 terabytes per object, 31.25 petabytes per file, four billion files. Wouldn't need any more, right? Couldn't get any bigger than that. Add ZFS into the picture. You end up with an object size of 256 petabytes per object. You can now have eight exabytes in a single file. And believe it or not, there are applications that need large files such as this. If you look at oil and gas, for example, when you do a field survey, at the moment we have to break the results up into multiple files. With ZFS and Lustre, we haven't had to. Still four billion files for the MDT, and still 4,096 MDTs. So this is huge stuff. This is a file system that you're never going to run out of space on, assuming you've got the disks at the back end, obviously. So there's some work in progress that we're doing with some of the customers; I'm conscious of time. Lustre is for high bandwidth and low latency. Low latency is challenging in a virtual environment. We do have some advances, such as the SR-IOV for InfiniBand I talked about. High bandwidth is easier, but still not simple. We can use OpenStack to provision those Lustre components, particularly if we're using Ironic. We can have bare metal components for the metadata servers and the object servers. You can also build small-scale segregated clusters just for testing and for multi-tenancy. So you can use Nova and Cinder to actually stand up that architecture. You can export via NFS with Manila; we're in the middle of writing a Manila driver for Lustre, 90% of the way there. And you can include ZFS and compression on the actual OSS host as part of the operation. We also use Cinder block storage for the shared storage element. Networking is Eric's, and we'll fly through this one really quickly as well. Yep, so we've got 10 minutes before the end, so we'll just wrap this up quickly. So there are going to be three networking slides here.
The first one's called fat tree. This is one of the topologies dedicated to HPC, where you're basically keeping your link density linear on the way up, so you're never doing any oversubscription. So you get line rate all the way up to your core. Sorry, go ahead, take your photos. If anyone wants these slides after we're done, just come up here with a USB stick as well. Okay, now we get the three-dimensional wrap-around for the torus. So this is really where you have basically six links coming out of every node, and you basically get this kind of full mesh, as I'd call it; Glenn here in the HPC world would call it a three-dimensional torus. So this gives you complete connectivity to your neighbors. For what type of workloads? So this is generally for grid computing. If you're working with nearest neighbors, so if you're doing RDMA access to the nodes next to you, then this is the type of topology you would generally have, to reduce the number of links between yourself and the neighbors that you're accessing via RPC or RDMA. Couldn't have said it better. Thank you, Glenn. Okay, back to my world. NFV-type workloads will typically have provider VLANs. So this one I'll just take one minute on. Typically you're going to run an 802.1Q trunk up to your host. So you're providing all the VLANs into the host, and the host is going to segregate them out to your VMs. We don't have trunking into the VMs yet, but it's being worked on. There are some flags in Neutron, and there are just no ML2 plugins that actually support it yet. And I think they're reworking it somehow. OVS 2.4 we talked about earlier. Then we have the issues that we already discussed with MSI-X, and then, yeah, there's one more issue here. So SR-IOV: if you're sending traffic east to west, and this is within the same compute node, between two VMs that are on the same network and they're using SR-IOV from the same NIC, then you have a nice scenario, because there's actually a little eSwitch on those NIC cards. So you can stay within the compute host when you go east to west. If these VMs are using separate NICs, physical actual NICs, not on the same two-port NIC or four-port NIC, they've actually got to go out of the host, to the top of rack, and then come back in. So just keep that in mind if you're going to use it and do designs. Then north and south is always the same. So if you're going to another VM on a different network, even if it's on the same NIC, on the same eSwitch, they have to go out to the top of rack and come back down. So that's a good thing to keep in mind when you're doing designs with SR-IOV. This is going to be a really interesting slide. So this is comparing what we had with DPDK enabled and DPDK not enabled. So we're talking about OpenStack with pure OVS 2.4, and OpenStack with OVS 2.4 plus DPDK with the poll mode driver, because we're using Intel NICs with their poll mode driver, and the interface type we're using was vhost-user. So this is really the bleeding-edge way, with DPDK, to get high performance. You can see here we had one compute node, and we were seeing close to 25 VMs on a single compute node where we weren't getting, let's say, our CPU saturated with utilization. Without DPDK, we could get a maximum of 10 VMs on the compute host before the whole system started basically failing. The next thing you're seeing here is packets per second. Without DPDK, and this is just to give you a good scenario here, this is a single compute node at this point.
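For context on the packets-per-second figures that follow, the theoretical ceiling for a single 10 GbE port at minimum frame size works out as below; this is standard Ethernet arithmetic, not a number from the talk, but it shows how much headroom is left even with DPDK on a one-core VM.

    # Theoretical maximum packet rate for 10 GbE with 64-byte frames.
    # Each frame also carries 8 bytes of preamble and 12 bytes of inter-frame
    # gap on the wire, so 84 bytes per packet.
    line_rate_bps = 10e9
    bits_per_frame = (64 + 8 + 12) * 8
    print(line_rate_bps / bits_per_frame / 1e6, "Mpps")   # roughly 14.88 Mpps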
With one VM, which is acting like a router, it has one CPU core and one gig of memory and three network interfaces. We were seeing that we were getting about a million packets per second without DPDK, and we got up to 1.8 million packets per second with DPDK, for the single VM. Now it could probably go up a lot higher, and system-wide it would be a lot higher, but we only did one VM. We're actually still testing this week to see what we get with multiple VMs. This is also impressive. So if you're running OVS, you'll see that sometimes there's about three gigabits per second, if you're talking about a single 10 gig interface, through OVS without DPDK. With DPDK we're actually getting closer to eight today. We're seeing eight gigabits per second, and this is an OVS bridge that's using one core. I mean, it's using two, we talked about the affinity, or using the cores from both NUMA nodes, but in this case it's using one core because the VM is only running on one NUMA node, and we're getting close to eight gigabits per second, so you add another core to that and you can go even higher. The important thing to take away here, though, is that if you look inside the VM that's using DPDK: when we're seeing three gigabits per second on a VM without DPDK, the CPU of the VM is at 100%. It's completely spiked. When we're seeing seven to eight gigabits per second with DPDK, the CPU's running at 50%. So that gives the VM the freedom to do all these other add-on services: routing, IPsec, security, firewall-type features inside the VM. Okay, I think we're almost out of time here, so I'm going to quickly go through these. Zero frame drops with DPDK; we had frame drops without DPDK. Latency: you can see the green being the DPDK. We had consistent, very low latency, around four to six microseconds, versus quite high without DPDK. And this is all about those memory copies, which we can get into. If anyone here is doing DPDK or plans to do it, we're looking, or specifically me, I'm looking for people to work with in the industry in a neutral fashion, to kind of troubleshoot these things together and look at other use cases that we can solve together. Okay, so, summary. So we'll whiz through the summary; we are aware the party's starting. Thank you for staying, by the way. So the initial interest in HPC on OpenStack was driven by multi-tenancy requirements, but it's also about resource management and sharing those resources. It's about managing flexible HPC resources, which has been tricky up to now, particularly with wrapping security around it. But OpenStack is making it much easier to produce high-performance computing as a service. There are many areas that need to work well together for success, and the OpenStack community is addressing all of those areas and communicating well together in order to get that momentum going, which I've not seen in another project that's addressing this up until this point. So HPC on OpenStack is a reality. There are lots of projects that are doing it; I'm involved in a number of those. They are pushing the boundary, and we want to share what we hit with all of these projects, so if you are running HPC on OpenStack, then please do communicate with us. If you use the OpenStack operators mailing list, there's an HPC tag you can use to follow what's going on in the HPC world. We're trying to commit everything we do back to the community; we want to share all this with the community.
If you have questions, if you have comments, if you are doing this sort of work and think you can do things better than some of the stuff that we're doing, please do share it with us; we want to collaborate on all of these things. Thanks for staying with us. If any of you do have any questions, then either shout them out now so we can get them on camera, or come and see us at the end, or we'll probably be at the party later. Email addresses. Yep, those are our email addresses and our contact details, so you can get hold of us directly. We think that says thank you very much in Japanese; we hope it does. Google was our friend. Does anybody have any questions? Sure. Yeah, and if you want a copy of the slides, just come up with a USB stick. Yep, we're happy to share the slides. Yes. Yes, so RDMA is easier primarily, sorry, so the question was: is it easier to support RDMA on Liberty with OpenStack? So yes it is, and one of the reasons is the SR-IOV for InfiniBand particularly, so you get the low latency networking and you get the two talking. Everything else about RDMA is network based and guest based, so it's a configuration you have within the guest. Okay. Yes. So the question is, how do we compare the HPC schedulers with the Nova schedulers, is that it? Yes. So the question is, can we integrate the scheduler of an HPC environment with the orchestration in OpenStack? And yes, I'm working with a lot of genomics projects, for example, that are doing exactly that. So you use the scheduler, and you can use the scheduler's knowledge of the queue depth, so the depth of jobs that are coming into the queue, and using some QoS algorithms as well, so this is largely custom scripted, you can have the scheduler responsible for standing up Nova compute nodes. So you use the queue depth to decide how many nodes you'll have to run the job in a particular time. You can also use that same information to prioritize other people's jobs. So if you have a priority queue, then you can use that information to see who gets which nodes in Nova. The genomics team that I'm working with, the particular use case I mentioned, are using those priorities to decide who gets the bare metal nodes and who gets the virtual nodes. So yeah, there's a ton of integration work that can be done there. At the moment, a lot of it is hand crafting, but there's a lot of work going on in that space, and we're looking for places to share all of that. Anybody else? Okay, so thank you very much for staying to the end. If you do need us, then come and talk to us or see us later at the party. Thank you. Yeah, thank you for your time.