And the next presentation will be given by John Garbutt. It's on open software in astronomy. John, if you please.

Thank you very much. Good afternoon everybody, my name is John Garbutt and I just want to quickly introduce myself. I'm a principal engineer currently working at StackHPC. I've been working on and off on OpenStack since about December 2010, and for about the last 18 months I've been with StackHPC, looking particularly at high performance computing and OpenStack and how to get those working together. One of the major projects we've been working on at StackHPC is the Square Kilometre Array and various things around that area, so that's where I'm going to start today.

So the first place to start is: what is the SKA? Here I've got two artist's impressions. Hopefully they're vaguely visible; the left-hand one is a night-time scene, so it's quite dark, but that's fine. To start with, it's a radio telescope: looking at the sky, trying to take better pictures of the sky. It's currently coming towards the end of the design phase, hoping to start construction in April 2020, I believe. The first thing to note is that there are two locations. One is mostly to do with mid frequency; that's the left-hand side. This is called SKA1-mid, and it's planned to be in South Africa and the surrounding area. There are about 200 of those dishes planned, roughly speaking, and they all get joined together to form one radio telescope in South Africa. And there's a second instrument, SKA1-low, which is on your right-hand side. That's the thing that looks like a field full of metal Christmas trees; they're about my height-ish, and there are lots of them spread across a field. This is the low-frequency aperture array.

So why the Square Kilometre Array? Really it's about taking better pictures. These are some infographics from the SKA project. I should say I'm talking on behalf of StackHPC; I don't actually work for the SKA project itself. I work as a subcontractor for Cambridge University on technologies related to it, and that's the angle I'm coming from. So really it's about taking better pictures of the sky. How do we define "better" in terms of these pictures? Well, I had no idea, so I asked the scientists. The infographic says it's the resolution, the sensitivity and the speed of the pictures they can take. Effectively the Square Kilometre Array, in both the mid frequencies and the low frequencies, is making a huge step forward on all of these.

Now to the computing challenge of this. Those of you who have been ignoring what I've been saying and reading the rest of the slide will notice that that bigger picture, that better view of the sky, basically means a crazy amount of data going around the place. These are data rates that are pretty damn scary, particularly when you start thinking about the idea of 24/7 operation, trying to get the most out of these telescopes by always looking at bits of the sky. This is a radio telescope, so it can see virtually from horizon to horizon, which means you've got very long observations happening. So the main problem I want to talk about is: how do we actually do some science with all of this data that's flowing around?

I'm going to talk about three things. The first thing I want to look at is the Science Data Processor architecture. This is the part of the Square Kilometre Array that's actually doing the number crunching.
So each of these telescopes, SKA1-mid and SKA1-low, will have its own dedicated supercomputer to process the data coming from the telescope. The basic idea is that the radio telescope sits in an area of radio silence, and that would be a rather awkward place to build a supercomputer. So there's a very long cable that goes to the data centre where all the computing happens. What's planned is effectively lots of 100 gigabit Ethernet links carrying UDP into the data centre, and then we need to do something with all of that inside the Science Data Processor.

So step one is the data coming in to the switches: lots of UDP. The overall approach here is to use a very large parallel file system as a buffer, to try and disconnect the ingest of the data from the processing of the data. The rates at which those two things need to happen are different, so the best approach is to buffer between them. The first part is really to ingest all this data and get it into the buffer, so that we can later take the data out of the buffer and reduce it down to the data products, which effectively means doing Fourier transforms and averaging to get pictures of the sky. So you've got the UDP coming in, you get it to disk, you create the sky pictures, and you deliver those to the scientists across the globe who process them and find interesting things in them. That's the general plan.

So let's zoom down a little bit further into the detail of the problem. What does this look like at each of the sites? For each site you've got the data, as I said, coming in from the telescope. Certainly for SKA1-low there's a central signal processor, and effectively that takes the voltages and things from the telescopes. The data rate coming in to the Science Data Processor is about one terabyte a second, roughly speaking. That's a lot of UDP not to drop on the floor, to be fair. The next step, as we've said, is to get that ingested. If we zoom a little further down, we've got the data coming in, and we've got these sets of real-time processes to ingest the UDP. They're writing into this tiered buffer system at roughly one terabyte a second, and the batch processing is reading back out at, order of magnitude, between four and ten terabytes a second. The output, the data products, is an order of magnitude smaller again. These are all estimates based on what we think the processing will be; these science workflows are not yet decided, and things may change between now and when the telescope actually comes online. So these are just estimates about the data processing. From the batch processing to the data products, it's expected to be about 10 gigabytes a second-ish, such that you can push it all out to the regional data centres using a 100 gigabit link. And again, the challenge is to keep all of this running 24/7.
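Just to make that ingest step concrete, here is a minimal sketch, in Go, of the shape of one of those real-time ingest processes: receive UDP packets and append them to a file on the buffer. This is purely illustrative, not SDP code; the port, the path and the buffer sizes are made up, and a real ingest pipeline would be far more careful about packet loss, framing and parallelism.

```go
package main

import (
	"log"
	"net"
	"os"
)

func main() {
	// Listen for the incoming UDP stream; the port is made up, and the
	// real SDP would run many such receivers behind 100GbE links.
	addr, err := net.ResolveUDPAddr("udp", ":9000")
	if err != nil {
		log.Fatal(err)
	}
	conn, err := net.ListenUDP("udp", addr)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// A large socket buffer: at these packet rates, any scheduling
	// hiccup otherwise means packets dropped on the floor.
	if err := conn.SetReadBuffer(64 << 20); err != nil {
		log.Fatal(err)
	}

	// The "buffer" here is just a file on the parallel file system;
	// decoupling ingest (writes) from batch processing (reads) is the
	// whole point of it.
	out, err := os.Create("/buffer/ingest.dat")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	pkt := make([]byte, 9000) // jumbo-frame sized receive buffer
	for {
		n, _, err := conn.ReadFromUDP(pkt)
		if err != nil {
			log.Fatal(err)
		}
		if _, err := out.Write(pkt[:n]); err != nil {
			log.Fatal(err)
		}
	}
}
```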
OK, so this is a massive eye chart of the architecture, and I'm not really going to go through it, apart from to say: I've been talking mainly about the visibilities flowing through the system, as the scientists would call them, the main data, but really the buffer is trying to be a general purpose storage component within the system. There are lots of different places that have to use it. When you read out of long-term storage before you actually deliver the product to the regional centre, you're using parts of the buffer for that as well. Which is just to say the sizes of these buffers are going to be quite variable.

So the buffer component, what it's actually trying to present to the system, is this abstraction of a data island. As you'd imagine, a data island just defines certain characteristics, such as: do I need to be able to read this data very fast? Do I need it to have a particular level of resiliency? Do I need replication on this, or do I just need it to be fast right now? The idea is that different components define their needs as a data island, and then that can be scheduled to decide exactly which resources within the system you're going to consume.
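As a purely hypothetical sketch of what a data island request might look like as a data structure, in Go; the type, the field names and the scheduling rule are my own illustration, not anything from the SDP design documents.

```go
package main

import "fmt"

// DataIsland is a hypothetical model of the buffer's "data island"
// abstraction: a workflow states its needs, and a scheduler decides
// which concrete buffer resources and tiers will back it.
type DataIsland struct {
	Name          string
	SizeBytes     uint64
	ReadBytesPerS uint64 // sustained read rate the workflow needs
	Resilient     bool   // must survive hardware failure
	Replicated    bool   // keep extra copies, or just be fast right now
}

// Schedule is a placeholder for that placement decision.
func Schedule(d DataIsland) string {
	tier := "hot" // fast NVMe tier by default
	if d.Resilient || d.Replicated {
		tier = "cold" // cheaper, more durable tier
	}
	return fmt.Sprintf("island %q: %d bytes on the %s tier", d.Name, d.SizeBytes, tier)
}

func main() {
	obs := DataIsland{
		Name:          "observation-42",
		SizeBytes:     500 << 40, // 500 TiB, an illustrative observation
		ReadBytesPerS: 4 << 40,   // 4 TiB/s, matching the batch read rates above
	}
	fmt.Println(Schedule(obs))
}
```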
OK, so let's quickly run through the key challenges here. The main one is the massive data rates: how do we get a parallel file system that can actually sustain these data rates? How can we estimate its cost so that we can start to do budgeting? The second thing is that there's a lot of control plane activity beyond the data rates: lots of different-sized buffers are needed at different times, and data needs to be copied between the tiers of the buffer. The budget basically means you can't just throw the most expensive storage and network hardware at this problem; you need to make much more efficient use of resources. So we need a tiered buffer, and we need to deal with modelling that. Going back to keeping the budgets right, the idea is to size the system for average load. What I mean by that is, going back to the observations: depending on which bit of the sky you're looking at, your observations will last for different lengths of time, just because you physically can't see that bit of the sky for 24 hours, because the Earth moves. Well, you know what I mean. So different observations will need different sizes of buffer, because the longer you're looking, the more data comes in. We need to slice and dice the system in a way that makes that work, while still keeping that sustained data rate.

OK, so the next two parts of this talk are about how we actually go about prototyping the system and exploring these problems better. The first part of this is what I call the software-defined supercomputer. What we've built is a system we've called Alaska, a la SKA, which is a performance prototyping platform. It's two racks of hardware hosted by Cambridge University, and this is where a lot of the StackHPC work has been focused. We've turned this into a bare metal OpenStack cloud, and that's allowed us to slice and dice the system to actually explore some of these problems; let me try and dive into how this has helped. The hardware itself is trying to model a small fraction of what the final system could be. The core part of it is the 25-gig Ethernet: we're trying to mirror the idea that this is the network the UDP comes in on. The telescopes would be sending to the top-of-rack switches at 100 gigabits a second, and then you get the UDP packets coming into the nodes.

So right now, the simulators for the telescopes run on one rack, and they generate all the packets that go across to the other rack, which simulates the ingest nodes, and those ingest nodes use high speed networks such as InfiniBand to connect disparate storage systems to disparate compute systems, to try and deal with some of this dynamism. So when I say software-defined supercomputer, what do I actually mean? What I'm trying to say is: how do we build up reusable pieces that people can share when building up their workflows? These workflows will evolve over time, and a lot of this work has been about working with the scientists to see how well existing systems could be packaged up in this way, so we can let the scientists do their thing while the network engineers and the compute engineers optimising other bits and pieces can get on with their thing, and grease the wheels of innovation. In an OpenStack world none of this is particularly new; this is just applying it to the science world: packaging up applications using containers and Ansible, making pieces that we can slot together.

One example I wanted to pick out was some work we did with the Kubernetes Cloud Provider OpenStack. I spoke about this abstraction of a data island, and we were looking at whether we can use the Kubernetes abstraction of volumes to map to this data island concept, how it could all glue together and what it actually means. So this is a big diagram with lots of boxes that have millions of moving parts inside; let me try and describe what's going on. If you'd like to see a recorded live demo of this, I'll put a link in the presentation to something I did at the OpenStack Summit in Berlin, where we demo this working. What we've got is OpenStack Magnum, which we're using to create the Kubernetes cluster. Magnum can resize the cluster depending on the size required: you state how many physical nodes you need, and it stamps out the Kubernetes cluster. It's actually using Ironic underneath, so you get whole bare metal machines; there's no hypervisor in between. That gives you a set of machines with raw access to the compute, and you use Kubernetes in the usual way to create the containers you need and connect them across your machines. But these particular containers' volumes are backed by Cloud Provider OpenStack's Manila CephFS native plugin. Let me go through that for a second. What happens when you create the volume in Kubernetes is that it talks to the Cloud Provider OpenStack code, which creates a share of the appropriate size within Manila. In this particular case we're using Manila's CephFS back-end, so what it was actually doing was creating share quotas and namespaces within CephFS and then exposing them appropriately into Kubernetes. What this has given us is a way in which the scientists working on the workflows today can start using this model, saying this is the size of the volume I need and this is where it needs to be connected. And because it's using Ceph as a back-end, it can be attached in multiple places in multiple-writer, multiple-reader mode, which is exactly the kind of thing you need here.
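As a rough illustration of that flow, here is a minimal client-go sketch that creates such a claim. The storage class name manila-cephfs, the claim name and the size are assumptions about how a deployment might be wired up, and the client-go signatures shown here have shifted a little between Kubernetes versions.

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Standard out-of-cluster client setup; the kubeconfig path is illustrative.
	config, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/admin.conf")
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// "manila-cephfs" is an assumed StorageClass wired up to the
	// cloud-provider-openstack Manila provisioner; the claim then maps
	// onto a CephFS-backed share of the requested size.
	storageClass := "manila-cephfs"
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "frequency-band-01"},
		Spec: corev1.PersistentVolumeClaimSpec{
			// CephFS allows many writers and readers at once, which is
			// what the ingest/processing split needs.
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteMany},
			StorageClassName: &storageClass,
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("10Ti"),
				},
			},
		},
	}

	_, err = client.CoreV1().
		PersistentVolumeClaims("default").
		Create(context.TODO(), pvc, metav1.CreateOptions{})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("claim created; Manila will provision the backing CephFS share")
}
```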
One particular aspect of this to look at is named shares. When you create your volume, you can say: I want to use this particular share in Manila, so you reference a particular named share. That share in Manila may hold a particular frequency band of the information you need, and if you need to attach to several different frequency bands, the idea is that you can create a set of these abstractions saying I need to connect all these different frequency bands together on one machine. And this is one of the placement problems we have to deal with in the system: while you can ensure, to an extent, that the data is local to your compute, at some point in these algorithms, many of them at least, you need to join up all the different parts of the sky if you've divided the sky up, or, more likely, you need to go across all the frequency channels. In those cases you need to go between racks. Let me try and describe that better: in some cases you can't just put all the compute local to the storage; for one particular workflow you sometimes need to see across the whole buffer.

So if you go back to this data island picture, most of this prototyping work has really been: can we join all these pieces together? Can we do the dynamic provisioning, the slicing and dicing of this piece? Does the set of abstractions work for us? And working through those issues. The big risk we haven't addressed so far is: how do we deal with these data rates? (Would you look at that, my laptop is locking up. One second. Well, it's happy now.)

OK, so the third part, the second piece of prototyping work I want to talk about, is the Cambridge Data Accelerator project. While this isn't directly targeting the SKA, it's a resource for testing these very high data rate parallel file systems and for exploring some of this slicing and dicing. The basis of this project is Cambridge University's cluster, the supercomputer they're currently calling the Cumulus UK Science Cloud. There are two key parts to it: the Cumulus part, which is a mix of Intel Xeon nodes and KNL nodes, and a second side, which is the GPU side. For this particular project we've been focusing on the Omni-Path part of the network, which is on the Xeon and KNL side. But the main thing is the storage offerings. This is quite traditional, in the sense that they've got a global Lustre system servicing most of their users. But they've got more and more users whose workloads are I/O bound: the Lustre file system works great for general purpose needs, but for the users with big simulation runs and loads of small files, the metadata servers are creaking, and when you're really pushing lots of data through, you want something more dedicated. So the Data Accelerator project is about how we can address those users' needs better. Just to highlight the size of the system: if you look at the Top500 list of supercomputers, because of the two networks the system has two entries, one for each half. Number 87 on the Top500 in November 2018 was the Cumulus side, which is about, sorry, 1,200 machines right now, and the GPU clusters are the other entry. That's just an aside.
So how did the Data Accelerator project go? It's actually got two entries on the IO-500 as well. We used the IO-500 benchmark to see what we could get out of all of the hardware, and with the Data Accelerator running Lustre we were able to get number 3 on the IO-500 list, and number 7 using BeeGFS. So we tried two different file systems on the same hardware; it's not quite the same client count, but it's pretty impressive what we were able to do, and I'll go through that in more detail. I should say at this point that the IO-500 doesn't currently have 500 entries, so it's a little bit unfair to say that number 3 is third in the world; there aren't that many more entries if you scroll down the list. But it's a good benchmark that's starting to gain traction.

So what is the system? It's currently about half a petabyte, spread across 24 machines, and each of the 24 machines has two Intel Omni-Path adapters. I'll come back to what interface we give the end user of this system. Looking back at the original data rates: if you do the marketing trick of changing bytes to bits, it looks like it meets the requirements. It's the best bit of marketing I've done, but I'm an engineer, so I'm telling you about it. So yes, it's close, and it's certainly good enough to do one data island of what the SKA is wanting.

I want to go through a few interesting stories about the benchmarking; I've had some conversations about this and people found it really interesting. Benchmarking is very tricky. A few things we hit while benchmarking this system: we were getting very unpredictable results between different runs. We were actually using the production cluster, so it was picking different compute nodes at random and running on them, while we were trying to work out how to max out a single node. It turns out that with this setup, on the IOPS tests, whether running Lustre or BeeGFS, the network very quickly turned out to be the bottleneck, even with the two Omni-Path adapters in each box. What we found was that this Omni-Path network is a fat-tree network: effectively you've got lots of switches with lots of 100-gig links, and if you look at a top-of-rack switch, there's a 2:1 oversubscription, so for every two cables going to your compute nodes there's one cable going to the core switches. This core switching area was where we were seeing saturation, because the routing algorithms weren't adaptive enough for the network load, so you were actually able to saturate some of the network cables. We ended up testing the network routing much more than we were anticipating; there's a lot of stress on the network.

Another thing we found was with these particular SSDs, the P4600s. They remind me a lot of my son learning to walk towards me; bear with me on this analogy. They're very optimistic to start with: you're like, wow, these are the most amazing things ever, they're way better than the spec sheet. And then they sort of trip over and fall, then get going again, and sort of crawl towards you, and then keep going all the way round the house and run loops around you. What we found is that, effectively, as the drive fills, the accounting for the way your blocks are stored seems to hit extra levels in its data structures, and it just gets that little bit slower. So you have to be very careful about getting the system into a steady state before you can say that that's what it will do in real production. Doing a secure erase of the disk gets it back to its very happy, peppy, wanting-to-run-about-all-over-the-place self, which is really interesting. At least it meant we could actually repeat the experiments and see repeatable results. But yeah, it was interesting.
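Going back to that bytes-versus-bits trick for a second, here is the back-of-the-envelope version, using the node counts from above (24 machines, two 100 Gbit/s Omni-Path adapters each); the 1 TB/s target is the SDP ingest rate from earlier, and all the numbers are rough.

```go
package main

import "fmt"

func main() {
	const (
		nodes       = 24
		nicsPerNode = 2
		gbitPerNIC  = 100.0 // each Omni-Path adapter is 100 Gbit/s
	)

	aggregateGbit := nodes * nicsPerNode * gbitPerNIC // gigabits per second
	aggregateGBs := aggregateGbit / 8                 // gigabytes per second

	fmt.Printf("aggregate network: %.0f Gbit/s = %.0f GB/s\n", aggregateGbit, aggregateGBs)

	// The SKA ingest target is roughly 1 TB/s, i.e. 8000 Gbit/s, so we
	// fall short of that. Read the target in bits instead (1 Tbit/s is
	// only 125 GB/s) and suddenly the system "meets the requirement":
	// that is the marketing trick.
	fmt.Printf("target: 1 TB/s = 8000 Gbit/s; read as 1 Tbit/s it is only 125 GB/s\n")
}
```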
OK, so how are we actually exposing this to users? For this particular Cumulus system, right now we're taking a subset of the system and trying to get OpenStack on top of it, so we can have stuff other than Slurm on the system; we're trying to do the same bare metal plus hypervisor trick from the P3 system. But right now the majority of users access the system via Slurm. Slurm has a concept of a burst buffer within it, and it's already in production at several sites. The only working driver for the Slurm burst buffer right now is the Cray DataWarp driver, and because of that, that's the one we decided to integrate with. What that actually does is talk to a CLI tool: Slurm just shells out to a CLI tool, telling it "this is what the user has requested, here you go", and later "by the way, now I know these are the compute nodes, please mount it on these compute nodes". So you've got a contract with Slurm. What we've done here is a little bit of integration work with some Go code, which basically takes those commands from Slurm, updates a data model, and then distributes the work between the burst buffer nodes, where it just shells out to Ansible to go and create the correct file systems.

Let me talk about the data flow between all the different bits of this. (That's going to really annoy me if it keeps doing that; maybe it's just my keyboard. Anyway.) In terms of informing the control flow for the SKA work, this has exactly the same properties as the hot tier/cold tier buffer system we're trying to sort out. You've clearly got a setup phase, and a tear-down phase at the end where you release the resources; if you read both ends together, the system is easier to understand. Besides setup and tear-down, you've also got the data staging in and out. Users can basically say: please create me a burst buffer of this particular size, and please get this data ready for me within the burst buffer for when my job starts. That means, for a particular user, you're not wasting valuable CPU time copying data about; the data is already in the burst buffer by the time the compute nodes are chosen and assigned to that job. Right now, if people copy that data themselves, you're actually charging them CPU time while they do the copy, which may be a significant amount of time. This system instead uses the burst buffer nodes themselves to copy from the existing file systems to the high-speed one, so it's already there for the job.
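The integration repo wasn't public at the time of this talk, so here is only a sketch of the shape of it rather than the real interface: Slurm's DataWarp plugin shells out to a CLI tool with a function name and arguments, and the Go tool dispatches on that and eventually shells out to ansible-playbook. All the names here, the subcommands, the playbooks, the inventory path and the variables, are illustrative assumptions.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
)

// runPlaybook applies an Ansible play across the burst buffer nodes.
// Same inventory and variables whichever file system is in use; you
// just flip the playbook (lustre.yml vs beegfs.yml, names illustrative).
func runPlaybook(playbook, extraVars string) error {
	cmd := exec.Command("ansible-playbook",
		"-i", "inventory/dac", playbook, "-e", extraVars)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: dac-cli <setup|data_in|data_out|teardown>")
	}
	// Slurm shells out with what the user requested (sizes, persistent or
	// per-job, stage-in/out paths) and, later, with the chosen compute
	// nodes so the file system can be mounted there.
	switch os.Args[1] {
	case "setup":
		if err := runPlaybook("lustre.yml", "buffer_state=present"); err != nil {
			log.Fatal(err)
		}
	case "data_in", "data_out":
		// Staging runs on the burst buffer nodes themselves, so users
		// are not burning allocated CPU time copying data about.
		fmt.Println("staging data between the home file system and the buffer")
	case "teardown":
		if err := runPlaybook("lustre.yml", "buffer_state=absent"); err != nil {
			log.Fatal(err)
		}
	default:
		log.Fatalf("unknown function %q", os.Args[1])
	}
}
```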
What the user actually gets, in terms of their environment in Slurm, is a set of environment variables telling them where the burst buffer has appeared. If you've requested to attach to several persistent burst buffers, or to one per-job burst buffer, you get different environment variables telling you what's happening. So what does it actually look like from the user's perspective? If we think of a user requesting a per-job burst buffer, asking for both the global namespace and the private namespace, they might have their code running on host one and host two, and there'll be a set of environment variables pointing to those two locations: one is basically a symlink to a shared space in the file system, and each of the hosts has its own private space, which is really just to avoid contention.

Another thing this can actually be used for is to simulate high-memory nodes. One of the things we want to try, particularly for the SKA work, is how well some of the codes that need the whole sky model in memory at any one time would run if we used disaggregated storage in this way as swap, and whether that's actually fast enough for those needs. For some codes it may well be. The testing is currently a bit inconclusive; we haven't really got actual results on that yet, but we're trying to see if we can use this for swap as well.

OK, so I just wanted to review where we've got to and what we're trying to do here. The SKA buffer is trying to fix these problems. Really it's about having a flexible architecture, and by flexibility I mean allowing for different sizes and different jobs by slicing and dicing the buffer into different sizes, and having the ability to say either that this data and this compute are close together, or, for the one case where we need it, that we can see the data no matter where it happens to be: expressing those requirements appropriately so that you can do the right optimisations. We looked at some of the control flow prototypes, the Kubernetes work and how it ties into Manila, so you can use those abstractions to provide the storage parcelled up in the different ways it's needed, and we looked at the prototypes for getting the data rate as close as we can to what the SKA is needing.

So, how to get involved. A lot of this work we've been talking about in the Scientific OpenStack SIG; it's quite an interesting group of folks. At a lot of the OpenStack Summits there will be a group of different talks and interesting conversations from people focusing on OpenStack usage in HPC and those kinds of environments. Also, if you want an overview of how OpenStack can be used in these kinds of areas, there's a sort of miniature book PDF you can read through on what's going on, and the SIG is a great way of saying hello to people doing similar work. For that Go integration code, I was hoping to be able to give you the repo, the GitLab, well, you know, the Git repository; we haven't quite got that sorted yet, which is annoying, but it's close, and I'm hoping it will be really soon. We've been having conversations about this project within the SIG, and that will certainly be one way of communicating what's happening.

So yeah, thank you very much for your time, and thank you to all the people supporting the project. This couldn't be done without loads of help from folks across the SKA project; different people at Cambridge, Dell and Intel have certainly helped with a lot of the work on the data accelerator. Thank you all very much.

Now we've got time for questions, I hope, if people have questions; I'm hopeful a microphone might be able to get to you. Any questions? Oh, sorry, there's a question here. Or maybe shout and I'll repeat your question.
Yeah, you showed this connection between Slurm and DataWarp from Cray, and the DAC actually uses Ansible to provision the hardware. Why do you need that? I mean, you already have Kubernetes, and you already have the hardware there, so why do you need Ansible to deliver the hardware? I didn't understand that.

So, let me paraphrase: you're asking why use Ansible. In this particular case, there is no Kubernetes running here; this is on the Slurm system. Basically, the Ansible here is running on the storage nodes, the Data Accelerator nodes: when you're doing the provisioning, it runs the provisioning across those machines, and the tear-down in the same way. My hope was that I would use pre-existing Ansible playbooks to do that; in the end it's actually home-cooked, but the idea is that it's just an easier way of extending the system. Rather than trying to re-implement what Ansible does, we just shell out to Ansible to do the appropriate thing for the appropriate file system. So BeeGFS versus Lustre: it's the same inventory and the same set of variables, and you just flip the play, basically.

Cool. Well, thank you very much. I'll be down here if people want to have a chat.

Great, thank you.