I work in the Cambridge University HPC Systems Group, and I'm speaking on behalf of that group today. Our group is tasked with delivering HPC resources for faculties and research groups from across Cambridge University, which entails quite diverse use cases and very different requirements. Today I'd like to cover a little of what we have been achieving in scientific compute with the help of OpenStack and our partners, not least Canonical. In particular, I'd like to discuss our view of future areas for development and the direction we would like to take with Cambridge University and OpenStack.

So what exactly are we doing with OpenStack and HPC at Cambridge? There are many diverse use cases. This photo is the Wilkes GPU cluster at Cambridge. It was designed to optimise compute performance per joule of energy, and when it first came out it was number two on the Green500 list of the most energy-efficient supercomputers in the world, second only to a system here in Japan. The Wilkes system is now central to our OpenStack development plans, and as part of those plans I can tell you that, for the first time ever, parts of the system are going to be running Ubuntu.

But there are many other use cases we must provide for as well, so let's take a quick look at a couple of those. First and foremost, the Square Kilometre Array is a radio telescope. But more than that, it is also a vast IT infrastructure project that becomes increasingly bespoke and specialised the closer you get to the radio dishes themselves. Further downstream, the use case becomes more generic: signal processing and data analytics on the unprecedented amounts of data that will be generated by the telescope each year. Groups within Cambridge are playing a prominent role in the global consortium that is defining the architecture for sections of this hugely ambitious flagship project. My colleague Peter Braam will be up next and will cover the SKA telescope in a little more detail, so I will move on.

The HPC Systems Group is also closely involved in biomedical informatics projects with a consortium of Cambridge University, local hospitals and other bioinformatics research groups around the city. This is a project of three phases. Phase one was a pilot scheme to put a toe in the water with OpenStack, and the water was found to be warm: phase one has already been used very successfully by early adopters for research into the genetic causes of type 1 diabetes, among other applications. Phase two is where we are now, and it concentrates on adding HPC technologies into the mix. For this project we have selected a Mellanox network fabric, and we are exploiting the hardware offload, SR-IOV and RDMA capabilities of the Mellanox NICs; a sketch of how an instance picks up an SR-IOV port follows below. We are also working on HPC file systems and data access methods. The biomedical informatics system is a Canonical OpenStack deployment, the management plane of which has been deployed using MAAS and Juju. This is where we are today, and we have been working with our partners at Canonical on phases one and two. Phase three will be a production deployment of all the technologies that we have evaluated successfully in the earlier phases.
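To make the SR-IOV part concrete, here is a minimal sketch of how an instance picks up a virtual function on this kind of fabric: create a Neutron port with vnic_type=direct and boot with it attached. This is an illustration only; the network, image, flavor and server names are placeholders, not the production configuration.

```python
# Hedged sketch: request an SR-IOV (direct) port and attach it to a new instance.
# All names below are placeholders for illustration.
import json, subprocess

def cli(*args):
    # Run an openstack CLI command and return its JSON output.
    out = subprocess.run(["openstack", *args, "-f", "json"],
                         check=True, capture_output=True, text=True)
    return json.loads(out.stdout)

port = cli("port", "create", "--network", "mlx-data-net",
           "--vnic-type", "direct", "sriov-port-0")
server = cli("server", "create", "--image", "ubuntu-ga",
             "--flavor", "hpc.small", "--nic", f"port-id={port['id']}",
             "--wait", "bio-node-0")
print(server["status"])
```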
But what potential is there for doing scientific compute workloads in an OpenStack environment? Well, it turns out there is quite a bit of interest in OpenStack for science. OpenStack enables, for us, the phenomenon of private cloud compute.

In scientific research, private cloud compute unlocks the possibility of pooling resources between projects, giving scientists access to a greater shared resource on elastic demand. One of the really telling statistics here, which I find staggering, is the survey of OpenStack users attending the Vancouver summit: 41% of them said they were planning to deploy OpenStack for science and engineering workloads, which is incredible. They were doing other things too, but that was a real eye-opener for me. It says that private cloud is a very successful model for science. For our group in Cambridge, private cloud washes away a hundred small clusters, each born of an independent research project and probably not implemented perfectly, and replaces them with a consistent, available, managed resource: general purpose, but tailored to the requirements of specific scientific compute workloads, and more capable and more efficient thanks to economies of scale and the expertise of the HPC cluster management specialists in my group.

When scientists evaluate the cloud model, why stop at private cloud? Indeed, there are many science projects very successfully using public cloud services such as AWS, and there are specialist players in the public cloud space who are niche HPC operators. However, there are particular needs and requirements that make the private cloud use case compelling, and partly that relates to the demands of high performance computing and the flexibility of OpenStack in being able to deliver on those demands. So let's look at some of those requirements.

If we turn the comparison around and look at it from the HPC perspective, at the Supercomputing conference 26% of the respondents surveyed said they are looking at cloud compute for part of their most intensive analytics workloads. So the interest is reciprocated; it's mutual between the two communities. HPC itself is undeniably a big tent. There are various common themes but no single unifying trait; if you were to pick one, it would be whether the workload is a tightly coupled compute problem. CERN, for example, is an exception to that: you could without doubt say their use case is scientific compute, but it is strongly throughput-dominated, a bit like cloud compute. In other areas of HPC the theme is tight coupling: sometimes tight coupling of data with instances, and sometimes tight coupling of instances working together on a parallel workload. This is typically what we call a bulk synchronous parallel model (a minimal sketch of it follows below). Standard practice in cloud compute simply does not meet the requirements of this programming model today.

Through our work with Canonical, Mellanox and other partners, we have been addressing some of the requirements for tightly coupled compute on private cloud. Canonical have been working to enhance their support for Mellanox InfiniBand NICs, and in particular the SR-IOV and RDMA capabilities that we so badly need. We will shortly hear about Peter's team and their HPC Stack project. We have also been working with Canonical on Juju charms for the deployment and setup of scientific compute libraries and services for Ubuntu instances in an Ubuntu OpenStack hosted environment.
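For readers unfamiliar with the bulk synchronous parallel model mentioned above, here is a minimal sketch using mpi4py, assuming an MPI installation and mpi4py are available (launch with something like `mpirun -n 4 python bsp.py`). Each rank computes locally, then all ranks communicate and synchronise before the next superstep; it is this repeated global synchronisation that makes the model so sensitive to network latency and jitter.

```python
# Minimal bulk-synchronous-parallel sketch with mpi4py (illustrative only).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local = np.random.rand(1_000_000)   # each rank holds its own shard of data

for step in range(10):
    local *= 0.99                                       # 1) local compute phase
    total = comm.allreduce(local.sum(), op=MPI.SUM)     # 2) global communication phase
    comm.Barrier()                                      # 3) end of the superstep
    if rank == 0:
        print(f"superstep {step}: global sum = {total:.3e}")
```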
There is progress, and across the world there is a good deal of momentum, but from our research we still see gaps in the functionality. Stepping back and evaluating what is needed to plug those gaps, we feel we are still in the foothills of where we need to be, not at the summit. So let's look at some of those gaps and where the future projects lie.

Inevitably, the first has to be the question of whether you can have the benefits of private cloud infrastructure without paying any of the performance overheads. The established practices of OpenStack configuration are a well-travelled path now: you would perhaps choose KVM for your hypervisors, you would probably choose Open vSwitch and VXLAN for your networks, and you would probably choose Ceph for your storage, and you would be in very good company doing so. Recent studies have found that KVM is within 1–2% of bare metal performance when your workload is CPU- or memory-intensive. But when disk I/O or network I/O becomes a dominant part of your workload, studies have shown the overhead can rise to as much as 40% relative to bare metal. With VXLAN network tunnels, the impact on network performance can be so great that you are unlikely to achieve full bandwidth on any but the most basic of modern network fabrics. Software vSwitches can dramatically increase network latency and reduce the determinism of your infrastructure.

So what can we do about those things? In each case we are seeking a better path to greater performance. Many people are working to study and address the overhead of virtualisation. If you haven't seen the series of blog posts that CERN made over the summer on tuning KVM performance, I recommend checking those out. The CERN team point out that the overheads are not fundamental to virtualisation: in their comparison tests, Hyper-V soundly outperformed KVM, so there are performance gains still to be had in KVM's implementation.

If virtualisation does not yield enough, we have other options. We have heard today about LXD, Canonical's OS-level container hypervisor. It gives you close-to-the-metal performance, but the real beauty is that there is no material change in the abstraction you get when interacting with Nova (see the sketch below), so that is something we will be evaluating with interest in our group at Cambridge. Go a little further for a little more gain, and you can create a bare metal platform with OpenStack and then manage it with an external container orchestration system such as Mesos or Kubernetes. For the HPC community there is also the Shifter project, which came out of some of the US labs and integrates a workload manager, Slurm in this case, with Dockerised workloads. Because we are in scientific compute, we have the luxury of a slightly more restricted context in what users want to do, and because of that we can do some more specialised tricks: with substantially more effort, we can make an even greater performance gain.
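To illustrate the point about the Nova abstraction being unchanged, here is a minimal sketch using the openstacksdk. The cloud, image, network and flavor names are placeholders; the hypothetical "hpc.lxd.large" flavor simply stands in for whatever flavor or host aggregate happens to be backed by LXD rather than KVM. The API call is identical either way.

```python
# Hedged sketch: the same Nova request works whether the host behind the chosen
# flavor runs KVM or LXD. All names are placeholders.
import openstack

conn = openstack.connect(cloud="cambridge")          # named entry in clouds.yaml
image = conn.compute.find_image("ubuntu-lts")        # placeholder image
flavor = conn.compute.find_flavor("hpc.lxd.large")   # hypothetical LXD-backed flavor
net = conn.network.find_network("tenant-net")        # placeholder tenant network

server = conn.compute.create_server(
    name="bench-0", image_id=image.id, flavor_id=flavor.id,
    networks=[{"uuid": net.id}])
server = conn.compute.wait_for_server(server)
print(server.status)
```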
Here are a couple of examples of the kind of work going on. Unikernels are an intriguing alternative strategy: the Linux kernel and operating system are replaced entirely by a single-application, single-address-space, single-process static image that takes the place of the kernel, so the application runs as though it were the operating system, with no paging and nothing but the hypervisor between your application and the processor assigned to it. Intel's Exascale Research Initiative goes even further: they have recently published work on a variant they call a lightweight kernel, in which one or two cores run Linux in an effectively housekeeping role while the other cores in the system are dedicated entirely to running unikernel images, so there isn't even a hypervisor in that case. There is plenty of interesting work there, and much of it is in scope for a scientific OpenStack private cloud.

With regard to the network architecture: in some cases an HPC system is defined as much by its network architecture as by its processors. Network types that make use of software bridges can be a bit of a problem; they get a good bandwidth boost from Intel's DPDK toolkit, at the expense of a CPU core or two, but even so, software vSwitches add overhead. Cambridge is deploying an alternative strategy: on our Ubuntu OpenStack cloud we are making use of Mellanox's hardware offload capabilities, SR-IOV, which gives us a low-latency, high-bandwidth network punched straight through into the Nova instances. Many HPC systems also use a non-traditional network fabric: in the HPC space InfiniBand is dominant, but Intel are now positioning Omni-Path (OPA), their proprietary evolution of it, and even Ethernet in an HPC system has its own HPC-centric, non-IP protocols. How does this fit into a cloud infrastructure built on so many assumptions about layer 3 IP networking?

Perhaps equally important: how do we analyse the performance of a virtualised, software-defined infrastructure? In HPC, performance is vital; the clue is in the name. We must be able to monitor our HPC systems for performance issues in order to analyse, optimise and repeat. How can we do this in an OpenStack system in which there are seemingly impenetrable barriers of abstraction preventing knowledge of the physical hardware below from percolating into the virtual world above? From an HPC viewpoint, a software-defined cloud has to be able to expose the details of the hardware it is defined upon. A tunnel heat map is no use if it can only identify network tunnels that are underperforming without providing any clue as to why, or what you can do about it. This is a challenge common to all large and complex high-performance compute systems. The key concept for a solution is to split the HPC workload into its layers of abstraction and gather data in each domain: the application-level knowledge at the top, the virtual network topology it is working on next, and the physical network infrastructure below that. Being able to correlate performance telemetry from one domain in the context of another is the key to unlocking complex application performance issues. With this capability we can, for example, view the traffic from different workloads as they contend on their way across the underlying network fabric, and view how application communication interacts with converged storage traffic patterns.
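As a purely illustrative sketch of that correlation idea (all names and numbers below are invented), the essential step is a join between the virtual domain, such as per-instance traffic counters, and the physical domain, such as which fabric port each hypervisor hangs off:

```python
# Illustrative only: attribute per-instance network traffic to the physical
# fabric ports it must traverse, so fabric-level congestion can be explained.
from collections import defaultdict

# virtual domain: instance -> (hypervisor, bytes sent on the tenant network)
instance_tx = {"vm-a": ("host-01", 9.1e9),
               "vm-b": ("host-01", 1.2e9),
               "vm-c": ("host-02", 7.8e9)}

# physical domain: hypervisor -> fabric switch port it is cabled to
host_to_port = {"host-01": "swp12", "host-02": "swp13"}

per_port = defaultdict(list)
for vm, (host, tx_bytes) in instance_tx.items():
    per_port[host_to_port[host]].append((vm, tx_bytes))

for port, flows in sorted(per_port.items()):
    total = sum(b for _, b in flows)
    names = [vm for vm, _ in flows]
    print(f"{port}: {total / 1e9:.1f} GB from {names}")
```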
That brings us to our next gap. Cloud file systems offer distributed, highly concurrent file I/O, but they are not so good where a single instance has high-bandwidth streaming or parallel file I/O requirements. When used as a file system, Ceph stripes file data across its data servers, but it does not yet support zero-copy RDMA networking for transfers between servers and clients. High-performance file access has long been an expected requirement for an HPC use case; it is really table stakes, and the dominant HPC file systems today, Lustre and GPFS, meet this requirement well. There was some discussion of Lustre at the Vancouver summit, and there is some progress going on in the community today.

When approaching the question of Lustre integration, there are really two use cases that people envisage. The first is the Manila use case, in which a Lustre volume is provisioned dynamically from OpenStack servers and storage. The volume is dedicated to a group of instances within a single tenant, operates as a shared scratch space for temporary file data, and is then torn down. This Lustre integration in the Manila project is a work in progress today; Glen Bowden from HP, whose talk yesterday you may have seen, touched on this use case. The second use case is one in which an external production Lustre file system, a site file system, must be mounted inside an OpenStack cloud by the compute instances (a sketch of the client-side mount appears below). Such a file system comes packed with petabytes of data that we do not want to copy around, so we need efficient, high-bandwidth, parallel file I/O to this external volume, which is likely to present challenges for any Neutron gateway caught in between. The exciting news for Ceph is that an RDMA solution is coming along, built on the Accelio library; I understand it is well advanced, and we will be following it with interest.

Finally, I think this one is the big one. Even if you are not trying to be too clever about it, OpenStack has a very high bar to entry; it is the mother of all learning curves. Scientific research institutions have a good deal of expertise, but they do not typically have the in-house skill set necessary for OpenStack administration and operation, and there is no established forum for sharing experiences, know-how and support. OpenStack maintained a mailing list for HPC, but its volume of traffic was so low that it was folded in as a sub-topic of the operators mailing list instead. At Cambridge University, the HPC Systems Group wants to be part of reinforcing and reinvigorating that community. We want to contribute what we find on the gaps we have identified, and we want to work with others across the scientific community to build a common strategy for optimising OpenStack for scientific compute requirements. We will be looking to scientific institutions, companies and OpenStack vendors, like our partners, for leadership in this field and for participation in a scientific OpenStack forum. We will look to support others making the same journey as Cambridge, and we aim to build a critical mass in an ecosystem that supports a scientific OpenStack. Thank you.
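As an aside on the external site filesystem use case above: once an instance has a data-plane path to the Lustre servers (for example via SR-IOV), the client side is an ordinary Lustre mount. A minimal sketch follows; the MGS address and filesystem name are placeholders, and the guest image is assumed to carry the Lustre client modules.

```python
# Hedged sketch: mount an external Lustre filesystem inside a guest.
# Server NID and filesystem name are placeholders.
import os
import subprocess

mgs = "10.42.0.1@o2ib"   # hypothetical MGS NID; o2ib = InfiniBand LNet network
fsname = "biocloud"      # hypothetical site filesystem name

os.makedirs("/mnt/lustre", exist_ok=True)
subprocess.run(["mount", "-t", "lustre", f"{mgs}:/{fsname}", "/mnt/lustre"],
               check=True)   # requires lustre-client modules in the guest image
```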
So, I was just saying: Cambridge has been instrumental in looking at the gaps in using HPC and OpenStack together in a real-world production environment — the things we know are HPC challenges, such as bringing their large Lustre cluster into their OpenStack and looking at low-latency, high-performance interconnect. Peter Braam from Cambridge has also been cooperating with us on practical work to address some of those gaps. I'll leave him to introduce himself, though he needs no introduction, and he will talk about the practical work we have been doing together and how far we have got.

I'll tell you about an effort that is quite closely related to Stig's, but with a different focus, different goals and a different time trajectory. I work primarily for the Square Kilometre Array group in Cambridge. The Square Kilometre Array is a huge new radio telescope that is being built. It will have on the order of a million antennas, spread over two continents: one installation in South Africa and another in Australia. One of the first questions you always get is why we go to those countries. First of all, there is much more to see from the Southern Hemisphere — you can point at the centre of the galaxy, for example — so you don't want to be in the north. But just as important, there are very few humans around, because cell phones and radio telescopes don't go together. The population density, particularly in Australia, is astonishingly low, around 0.05 humans per square kilometre. That is the kind of place where it is good to build a radio telescope, but not very good to build a supercomputer, because a supercomputer needs a lot of power, and so we have enormous wide area network problems.

Here is a picture of the whole data path. You see the little antennas on the left; these are a new kind of replacement for the dishes of the past, where electronics take the place of the carefully shaped parabolic dish. This family of antennas will produce 20 exabytes a day, which is a little much for a computer to digest, so it first goes into electronics. This subsystem is called the Central Signal Processor, which I find a strange name because it isn't very central; it sits almost at the beginning of the digital signal processing. It does a big reduction of the data in some ways, but it also works on every pair of telescopes, and because we have so many pairs it also increases the data a great deal. This electronics sits in the desert, relatively close to the antennas, except that the antennas themselves span hundreds of miles. Then we need to go to cities where there is enough power, so we go to Cape Town and Perth, where enormous supercomputers will be built to turn these pre-processed signals into images — images and other scientific data, such as rotating pulsar data.

That is not the end of the travelling data. A single image created by these supercomputers is going to be about 100 terabytes in size, a little more than what you get on your iPhone at the moment, and it then needs to go to probably dozens to hundreds of tier-one data centres across the world. How many compute systems have I now covered? Quite a lot. These are going to be enormous computers — I'll tell you in a minute how big they really are — but there are these central computers and then the computers worldwide. So the question we encountered was really: what are the facts around this deployment? Because this cannot fail; it is a big system on which a lot of money is being spent.
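To put the data path just described into perspective, here is a quick back-of-envelope calculation using the figures quoted above; the 100 Gb/s long-haul link rate is an assumption added purely for illustration.

```python
# Back-of-envelope rates for the SKA data path described above.

# Aggregate antenna output quoted above: ~20 exabytes per day.
antenna_bytes_per_day = 20e18
per_second = antenna_bytes_per_day / 86_400           # ~2.3e14 bytes/s
print(f"antenna output ≈ {per_second / 1e12:.0f} TB/s sustained")

# A single science image product quoted above: ~100 TB, shipped to tier-one centres.
image_bytes = 100e12
link_bps = 100e9                                      # assumed 100 Gb/s long-haul link
hours = image_bytes * 8 / link_bps / 3600
print(f"one 100 TB image over a 100 Gb/s link ≈ {hours:.1f} hours")
```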
First of all, the first deployment is still a long way out, around 2020 or so, and it may be upgraded before the instrument goes into full production around 2022. Secondly, we have this federation across all the centres around the world that will be doing data analytics for the scientists. To make things even worse, every year we have a different opinion about what the right hardware solution might be and what the right software solution might be, so many, many things are undecided. So we put our heads together — and "we" in this case means Cambridge, the astronomy group there and the HPC service, together with Canonical and the Centre for High Performance Computing in South Africa, which is a major player in this. We had some discussions about what we really need and how we can get there, found a little bit of money to develop a prototype, and concluded relatively quickly that OpenStack with a few HPC enhancements is the way to go. I am going to tell you how this is progressing, because it is looking promising. Of course, at the same time, the work Stig described was happening in the compute services; that is focused on things happening sooner, but it is much more ambitious in some ways — they want to solve the general problem of HPC, whereas we have a somewhat limited focus for our applications.

So what is the SKA actually computing? Images of the sky. That may sound boring, but it is actually quite complicated, because there are lots of effects from clouds and sunshine and all kinds of trouble — calibration effects, they are called — and we are looking for very faint signals from very early in the universe: very small objects, very weak signals, very refined algorithms needed, but only a few algorithms. It is probably a set of four or five programs that will be running. So the computers designed for this are not going to be the biggest computers in the world at that time, but they will have some very special qualities. In particular, they will be very data intensive: typically the I/O ratios we see will be much higher than in other supercomputing installations, and this is related to the sensor nature of the problem.
Sensors create masses of data with a lot of noise in it, and as a result there is more churn of data and perhaps somewhat less computing. You can see the budgets here, including the power. One of the very interesting things is that the ingest is going to go into a hundred petabytes of very, very fast storage, which then needs to be read back roughly ten times every six hours. Here is a picture of the data flow through that system. We start with input data on the order of 5 to 20 terabytes per second, which is something a future file system can maybe keep up with. Then we do image reconstruction, which is a lot of computation, and we read that data back repeatedly. This is very unusual in HPC: HPC applications normally only write, but in data analytics applications as you find them in the cloud it is quite common to analyse and analyse again. So here you begin to see what is sometimes called high performance data analytics, where the two disciplines converge. The output is not a big problem for us — it is only an exabyte of data, and that is a five-year archive, so it does not have to be written very fast — but an exabyte is going to be one of the bigger file systems in the world at that point in time. The science I have already talked about; distributing it to a hundred places is a problem in itself, but it is probably not much worse than YouTube — it might even be much less than YouTube — though we are also less cash-rich than Google.

So this brings you to the question of what is really different between the SKA and perhaps some other future HPC projects. We have both computation and analytics. We have very advanced software, because we are thinking about a new application that is years out: MPI is one choice for us, whereas it is the standard for most other people; we look at very modern dataflow packages, and at very advanced compilers to deal with multiple architectures and optimisations, and that sort of thing. It is highly specialised, with only a few applications, so it is a different sort of system. Traditional HPC centres run many programs, so they have to be very flexible; they are more focused at the moment on simulation, although that is changing — the trend is towards more analytics and somewhat less simulation — and at the moment there is typically one evolving large system image with system libraries, which is always out of date.

That led us to what we have called HPC Stack; I think it was Kiko's idea to use the name. We wanted to make it as simple as possible, so we said, let's look at what Canonical has built. We use MAAS for the deployment of the bare metal hardware. MAAS is capable of handling many architectures, which is very important for us because we don't know what architecture we will be using; but also this telescope will be 50 years old or so over its lifetime — a telescope is not something you throw away like a laptop after a year — so it is going to go through generations of hardware. Provisioning with Juju and its charms looks really good because we can add HPC software to this, things like MPI packages, InfiniBand and dataflow software, and container models will be very nice for us to use: people can build their own development environment, do some testing on small systems, and migrate that very easily to a large HPC cluster. They are no longer tied to the fixed image available at the compute centres; people can use their own environments using a container model.
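A hedged sketch of what that MAAS-plus-Juju flow can look like when driven from a script; the cloud name and the charm paths are placeholders rather than published charms, and the commands use current Juju CLI syntax.

```python
# Hedged sketch: drive Juju from Python to stand up a small HPC Stack prototype.
# Cloud name and charm paths are placeholders, not real published charms.
import subprocess

def juju(*args):
    subprocess.run(["juju", *args], check=True)

juju("bootstrap", "maas-cloud", "hpc-controller")   # MAAS supplies the bare metal
juju("deploy", "./charms/slurm-controller")         # hypothetical local charms
juju("deploy", "-n", "8", "./charms/slurm-compute")
juju("deploy", "./charms/openmpi")
juju("add-relation", "slurm-controller", "slurm-compute")
juju("add-relation", "slurm-compute", "openmpi")
```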
We also want to leverage OpenStack, particularly for monitoring and for authentication and identity management. That is very important in these federated situations, because they typically have distributed user databases that are not identical in every place, and the interaction is tricky and still being worked out to some degree. OpenStack also allows many different storage systems to be integrated, and we are thinking about both current and future systems; evolution is very important, and Ubuntu and OpenStack will stick around.

The philosophy is to get to a working prototype fast. We don't want to wait years and years; we want an effort that is manageable even if only the SKA ends up using HPC Stack. We want to get there in a few months and be happy that we have something usable that can perhaps be supported by Canonical, or that we can handle ourselves. We don't want a project that is too large for our needs, but we want to do it in a way that can attract a larger following, so we have made the choices very carefully: if people think it is a good idea to do this for HPC in general, for example by integrating many of the patches from the efforts Stig talked about, they can.

So what is the status at the moment? We have worked on implementing this for a month, and we actually got further than I thought we would. At the moment we can deploy clusters and run MPI programs on them; we have built a few charms for Slurm and OpenMPI, and we have made fixes to the Keystone database to deal with user IDs and groups, which are not needed in cloud software but on which much of the HPC software depends. The next goals are to leverage the fast networking efforts we just heard about, Docker containers, and some more storage systems, and then the SKA prototypes that are being developed need to start running. The goal is that in another two or three months we will have this done, and then we will make a little bit of noise about the project: we can say that in four months we got this far, and you have something that goes end to end from hardware deployment up to advanced applications. Maybe this is a good model to follow; it is a wait-and-see situation, but I think it is reasonably promising.

That is almost all I want to tell you. Please use it and contribute to it. It is built with the package system from Ubuntu, which allows you to adapt it very easily to something new and roll it back, so everybody can contribute. You can also just follow it and take another look from time to time to see where we are getting to. Thank you very much.
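For a sense of what "deploy and run MPI programs" means from the user's side once the Slurm and OpenMPI pieces are in place, here is a minimal job-submission sketch; the partition, node counts and binary name are placeholders.

```python
# Hedged sketch: submit an MPI job through Slurm. Partition, sizes and the
# benchmark binary are placeholders for illustration.
import subprocess

batch_script = """#!/bin/bash
#SBATCH --job-name=imaging-test
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=16
#SBATCH --partition=ska-proto
srun ./gridding_benchmark
"""

with open("imaging-test.sbatch", "w") as f:
    f.write(batch_script)

subprocess.run(["sbatch", "imaging-test.sbatch"], check=True)
```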
We have a few minutes for questions. I wanted to take a step back and ask: are others who are considering OpenStack for HPC workloads interested? Would you like to ask questions and know a bit more about what we are doing? I have a set of questions as well.

Thank you. You have to invest a lot of time and effort to optimise even a small piece of code for a given platform, but when you run this code on a cloud infrastructure there are many more degrees of freedom. How can you optimise something with so many variables?

I think that's a very good question. One of the curious problems is that we want to create the illusion of a virtualised infrastructure, but you really don't want to lose any of the benefits of knowing the physical placement of things, and that can be a problem. But I think with a lot of the peripheral projects going on in OpenStack for enabling better access to the hardware and so on, you can actually recreate performance equivalent to a bare metal system that you hand-create yourself. I think the best outcome is that you don't sacrifice anything, but you gain a lot: support for things like multi-tenant access, and you also get to use the repository of images — there is no supercomputer out there where you can choose what image you want to boot — and that is a great strength brought by the flexibility of software-defined cloud infrastructure.

Could I supplement your answer with a few other things? When you take a supercomputing application, you typically follow three steps: first it is written — you have to build it, and you may test it on one node or so; then you get a small cluster test case, where they give you a few nodes to see if it works; and then you go to a big cluster. With containers this transition becomes a lot smoother. While you are developing, you can put together what you want, and then you can take that same container to your small cluster and to your big cluster. It doesn't really change the problem of scaling — that problem you have anyway — and it won't change the network topology problem: it is totally true that a tightly coupled application is very dependent on the topology, and it will remain so with an OpenStack-deployed cluster; that doesn't change. It is the development cycle that changes, your interaction with the system administration, your ability to use new software. So what you are saying is that not everything changes, and that is completely true, but some parts do change in a favourable way. Fair enough?

Yes, that is what container systems do: instead of scheduling a program, you basically schedule an entire environment in which your program runs. It is smaller than a whole virtual machine, but bigger than the program.

I think you can't be totally certain of that, but you can't be at the moment either — people can obfuscate programs extremely easily. What you still can see is the network traffic; it remains transparent, and with cloud infrastructure you have a lot of opportunities to do quite sophisticated log analysis. You get to bring it all to one place and process it in different and better ways than on usual HPC systems, to decide what to do with that data. But isn't mining bitcoins one of the best things HPC centres can do to pay for their expenses? Yes, absolutely.

It's a good question. I wanted to ask how you see exposing additional hardware acceleration into the system. What is nice in HPC systems today is that, because you are running on bare metal, you have GPUs and other accelerators there, and you can just write your program and use them. There is a very interesting example of that, which is the Chinese computer Tianhe-2 running Ubuntu; I'm not sure how they manage it, but I think the strategy for doing that is exposing the capabilities as flavor attributes. So what work is to be done there around exposing those capabilities in a way that a generic, container-aware application can leverage? Yes — and by the way, there will also be some obstacles here. For example, you could expose a GPU relatively easily to a single application that runs within one container, or something tightly coordinated, but you probably can't expose it to an arbitrary number of containers; at the moment nobody is really doing that either. So if you run a GPU job, you probably get the whole GPU as part of your resource allocation from the scheduler.
That is actually the point: the scheduler needs to understand that it has to make sure it is an exclusive job on that node, because otherwise — well, the HPC schedulers already understand these things very well. Slurm has extremely good management of which resources you get on a particular machine and how exclusive they are. Whether the MAAS attributes can bleed through into Slurm is one question: will it be able to inspect from MAAS what the node has available? Yes, and probably the easiest way to do that is to let Slurm schedule the containers rather than run inside a container itself, so that it has good knowledge of the machine.

Right, any other questions from the audience? I can speak from my own experience: we were running jobs inside a network namespace, so effectively inside the networking portion of a container, and what I found was that the performance was more or less equivalent. There were some knotty problems to do with the discovery process, because if you use a container and isolate it from the control plane of OpenStack, you are probably also severing the connection back to the Slurm controller daemons, which it needs in order to find the jobs and do job start-up. So there is a discovery phase which, in the last investigation I did, was not working very well, but the performance itself was pretty good. That was on 40 gigabit Ethernet. No, I don't think it was to do with memory performance — I don't think with a container you would see any visible impact on memory performance; it was just the network visibility inside containers that we encountered, and by opening up some visibility on the networks, the jobs ran normally across a number of containers. Actually it was remarkably easy; there was nothing special going on in the end.

All right, well, thanks for attending, and thanks very much, Stig and Peter.