Hello everyone, thanks for joining us on this exciting new episode of OpenInfra Live, the open infrastructure weekly live show airing every Thursday at 14:00 UTC. We've been doing weekly episodes for the last four months now with production case studies, open source demos and industry conversations. My name is Thierry Carrez and I'll be introducing today's episode. Some of the most exciting shows we've been doing are the ones around large-scale OpenStack, organized by the OpenStack Large Scale SIG. We invite operators of large-scale deployments, get them to present how they solve a given operations challenge, and have them discuss their different approaches live between themselves. Today's episode of the large-scale OpenStack series is a bit different, as it tackles a specific facet of scale. When we talk about large scale, it usually refers to scaling out to hundreds of compute nodes, but there is another dimension to large-scale OpenStack: using OpenStack to build software-defined supercomputers for research and HPC, or to run alongside supercomputers. So we have invited key practitioners in the high performance computing space to discuss the place of OpenStack in this very specific use case. Our distinguished guests today are coming from all over the world: Sadaf Alam, CTO of CSCS, the Swiss National Supercomputing Centre; Jonathan Mills, Cloud Computing Technical Lead at the NASA Center for Climate Simulation at NASA Goddard Space Flight Center in the US; Happy Sithole, Director of the CHPC, the Centre for High Performance Computing in South Africa; Steve Quenette, Deputy Director of the Monash eResearch Centre in Australia; and last but not least, Stig Telfer, CTO at StackHPC, hailing from Cambridge, UK, who will host this discussion. As I mentioned, this is a live show and we will be saving time at the end of the episode for Q&A, so feel free to drop questions into the comments section throughout the show and we will answer as many as we can. Well, it's time for us to start, so passing on to you, Stig, take it away.

Thanks, Thierry, and I am just delighted to be here and to have such a panel of great guests from amazing OpenStack-using high performance computing institutions with me as well. So I'm not going to talk too much about us, but I wanted to set the scene a little bit with what this research computing, scientific computing and supercomputing context is all about. And I guess the first thing you might say is that scientific computing has been on an incredible journey in the last however many years. It's a journey from conventional scientific computing infrastructure management to the way that cloud has transformed this space, and it's a huge leap. It really takes effect in several ways, so I just wanted to lay out some of the context about how those changes have been appearing and manifesting. So I guess we have been looking at research institutions who are moving from having expertise in HPC to having expertise in a cloud-native kind of HPC. Scientific institutions are inherently resourceful places. They survive, they thrive, by the research funding that comes to them. It doesn't always result in a coherent collection of hardware platforms. So sometimes we have this problem of silos of compute infrastructure sitting alongside each other in a machine room, but not necessarily connected in a coherent, consolidated way. And if we cast our minds back a couple of years to some of the OpenStack and Open Infrastructure Summits,
Jonathan Bryce talked a lot about data center diversity, and research computing is a great embodiment of that. We may have these incredible machines dedicated to particular purposes in data analytics or AI and machine learning, but how do we get them to work together in an interlinked way? And OpenStack really plays a foundational part in consolidating those platforms into one common high performance computing stack. The second point here is really about, well, we've got the software, we've got the hardware under one control, but what about the platforms on top of that? Quite often in research computing, the people providing the service, the operators of the service, may well be the ones deciding how the hardware that is available gets used. And this is a limitation: you can see how, in modern computing environments, when we're looking to create innovation, we quite often need to start at the software level. So we need this idea of being able to self-service, and this is again where cloud comes in. So this is another example of how OpenStack is transforming the scientific computing landscape. Think about things like GPUs, for example: they could be used in so many different compute platforms and environments, in containerization, or in high performance computing and modeling. The third point here is about the lower-level, service-level transformation, and this is really about the move to DevOps and site reliability engineering. That doesn't just affect the usual users of OpenStack out there in the cloud space, but it can really have an impact on the way that scientific computing institutions are operated, how the machines are run and so on. And I guess, oh, next slide please. Then we take a look at this massively broad landscape of high performance computing. At its pinnacle we have this niche domain of supercomputing, and this is the idea which we're going to be talking a little bit about today, or quite a lot perhaps. In this environment there are systems among the biggest and most powerful supercomputers in the world that are today running OpenStack. But the way that they work is simply at odds with, in complete defiance of, every assumption about the way that we usually think cloud works. And so I think we're going to cover a little bit today about how that is achieved, how we can build these huge machines and use OpenStack in order to run them. So I think that's probably enough in terms of seeding some context for the discussion today. I'd like to go around our guests and have a little round of introductions and hear about where everyone is from, what their service is and how they use OpenStack today. So Sadaf, can we start with you at CSCS?

Sure. Thank you. Thanks a lot. Thanks, Thierry. Thanks for the opportunity. Greetings all. I think this is pretty exciting. So if we can go to the slide I prepared, to show where our organization stands in terms of what we are known for, I can start there and then talk through where OpenStack fits. Okay, we can go to the slide. So what I did was say, okay, just keep it brief, and I captured a screenshot of the top part of the CSCS website as it is today. What we do is provide supercomputing and large-scale data processing services, largely to Swiss research and academic communities, but also beyond that. The big system you see on the bottom left is our flagship supercomputing system.
It's been used through various different programs; it's a Cray machine. And it, as Stig said, breaks all the assumptions cloud makes about provisioning, operating and running applications in a tightly integrated environment. In the middle you see a weather and climate operational system; it works through a completely different operational, 24/7 model. It's the weather forecasting system for Switzerland, so it operates in a completely different mode than a research supercomputing system, but then again they have very strict workflow requirements and even stricter high performance computing requirements. This system now uses a partly virtualized stack, but largely speaking it is still provisioned with a pretty standard high performance computing management software and operating stack. The new system we are building, we are calling it Alps, and there we are largely leveraging a vendor cloud-native supercomputing architecture and environment, and this infrastructure will contain these elements. And the question is, why are we doing it? Because increasingly we are seeing that some of our users have workflows that go from web domains to HPC domains in a very tightly integrated manner, and we would like to create this isolation and security. Now, 30 seconds on our OpenStack experiences: we do have an OpenStack system running; it provides infrastructure as a service for our customers. Compared to the systems I have shown it's minuscule in terms of hardware resources, but it does provide some valuable services and resources, and a type of privilege and access control users can't get on these other high-end systems. We did have an experience with running an HPC stack on top of an OpenStack-provisioned system, but what we concluded after running the project for almost two years was that you need so much expert knowledge to turn around a very mature stack like OpenStack to leverage high performance computing technologies. The degrees of freedom are so huge. And from a performance point of view, we just realized that you need such a level of vendor support that it makes it really difficult to run a large-scale system with an OpenStack-provisioned environment. But I'm really looking forward to these discussions with experts like Stig and others. So that's all from me.

Thanks, Sadaf. We have a group from around the world joining us today, and I think next we'd like to speak to Happy Sithole from South Africa, from the CHPC.

Thanks, Stig, and welcome everyone. I am from South Africa and my organization is the National Integrated Cyberinfrastructure System. So basically that combines compute, networking and also data-intensive research. If you look at the next slide, you'll see that we have deployed HPC systems mainly to provide infrastructure for our researchers in the country. And these researchers come mainly from your science and engineering type of applications. For this we have got mainly your traditional HPC systems, which are x86 based, and we also have some accelerated systems. We are also looking at putting containers on the systems, so some of our users already use implementations like Singularity on those systems. But we have also got some users that come from your VM type of users. These will be the type of users who need some scale-out; they need to do large computing, but not really on HPC systems. So these are the users that drove us to offer infrastructure as a service and also platform as a service.
We are looking, in future, I think like the previous speaker, at converting our HPC systems to run on bare metal, so our users will be able to do what they like with the systems and spin out, say, an HPC system out of this. That's the future, that's where we're going, and the way to do that will ultimately be OpenStack. We are mainly driven by this whole community of users. But I think one of the things also is that we believe OpenStack will help us federate various users across the world and have a much bigger supercomputer; we can talk about some of those use cases. Currently, this is just the configuration, just to show you what sort of implementation is there. So basically we start off with what is known, what is stable: we run a production OpenStack, but at the same time we are also doing some research so that we can get onto our trajectory to get to bare metal and run HPC systems on OpenStack. I'd like to also share with you, on the next slide, the various applications. What is very interesting is that when we started our OpenStack system, it was just three days before the COVID-19 lockdown in South Africa. And at that time our government needed a platform where they could provide the stats and collate all the information. So this system today runs all of those; you will see that a number of things here are applications that came because of COVID-19. Like, for example, mapping out connectivity across the country to understand, when the students are not at universities but at home, what type of connectivity they have. So there was a lot of processing that needed to be done, and that type of workload was much better suited to OpenStack. So three days after provisioning our OpenStack, it was hard at work saving lives in the country. But you will also see that there are some very interesting applications that run there, ranging from things like livestock tracing to other sciences. The list is quite long, but I thought this one should be of most interest. Anything else there, I think that's almost it. I'm looking forward to sharing ideas with you on our roadmap towards a fully OpenStack HPC system.

That's tremendous. Thank you, Happy. And next I think we go to Monash University in Australia and to Steve Quenette. Good evening, Steve.

Thank you, Stig, again, and thank you, Thierry. Next slide, please. Thank you. So I thought I'd take a moment to talk about our operational challenge. If you look at research at Monash today, we're one of the largest universities in Australia. We have some amazing stats that we have to think about and are responsible for responding to: over the last five years we've had six and a half thousand researchers go through 20,000 projects, if you like, apps running somewhere doing something. And all those people are connecting with 65,000 other people in various ways around the world, and that presents a challenge for the traditional way we think about HPC. It's very normal from a cloud perspective; it's very difficult from an HPC or supercomputing perspective. So this is the challenge that we're faced with. And if we can move on to the next slide, the model by which we try and tackle this is we sort of recognize that traditionally HPC is a lot of money spent on very few people at this very peaky end.
And what happens is people with their laptops inside the institution are stuck with Windows and whatever apps you can get through normal IT services, and that was this pile of dead bodies at the end over there. What we try to do is really focus on that sort of short tail end: how do we get more communities digitized, using advanced computing, becoming cloud native, and building their ecosystems up to the successes where they are actually asking for more cycles. So that's the way we think about it. If we can move on to the next slide, that gives a bit of context to how we then software-define this world. I guess the way I explain it is that we're really only interested in applying web-scale IT principles to all of research computing. So the very bottom of our ecosystem is basically OpenStack and Ceph, and they power all of our HPC and our cloud-presented usage. And on top of those tiers sit various facilities. MASSIVE is a specialist facility: it's data-centric, it's got heaps of GPUs, and it's really born out of trying to make imaging and image processing better and faster, and all those things, as well as AI. But on the cloud side, we're also part of a national federation, the Nectar Research Cloud, which I think many people who have been involved with OpenStack for many years would have heard of. And so it all works on this one ecosystem. And the thing is, the cloud end is very much the end where we're trying to entice new communities to go through that digitization journey. It's very much about them designing their own software and their own applications, but also just choosing and influencing what each compute brick looks like, you know, and those sorts of things. At the middle tier, if you like, where MASSIVE and the HPC tier are, there's less choice in the design, it becomes more HPC, but, you know, our fabric is data-centric as opposed to MPI-centric. And what we do is push the MPI, you know, the heavily traditional simulation science, off somewhere else. So that's our ecosystem and the way we think about HPC and how it relates to OpenStack. And just on that note of connections, you know, in this increasingly cyber-threatened world, there are two things we're really doing that speak to the tail end and the peak end and how we do security. One is we leverage crowdsourcing to find vulnerabilities, as well as crowdsourcing to help us fix the vulnerabilities. And the other thing we're doing is we've partnered with NVIDIA, exploring how we use DPUs to offload, you know, virus scanning and various other firewall-type activities onto the DPUs, so that we can move them into the HPC facilities and know that we can firewall and secure applications as opposed to the facility. So that's really our background and how we think about supercomputing or high performance computing, and I look forward to the rest of the chat.

Great. Thanks, Steve. And I think that takes us to our final panel member, who is John Mills from NASA Goddard. John, take it away.

Hi, Stig. Thank you for having me. I work for the NASA Center for Climate Simulation. And as the name implies, we do very large climate modeling here, or at least we provide the computational infrastructure to do it. Our largest customer is the Global Modeling and Assimilation Office, the GMAO at NASA. So when you see these really beautiful weather maps of aerosols spinning across the world, they're responsible for that sort of data and creating those models.
And they run that on our supercomputer. So next slide, please. So we have a large OpenStack cloud, between 200 and 300 nodes, which is called Explore. We have been working on building that since 2019, and then COVID happened, and supply chain problems, et cetera. But it is the upgrade from two previous OpenStack clusters, one that was focused on infrastructure as a service, the other focused on platform as a service. Explore is going to be able to host both of them. Now, Discover is the name of our traditional HPC machine. It's your ultra-static machine. It's six petaflops running Slurm. And as Steve Quenette pointed out, this is the peak, and Explore is the tail. So the GMAO, they run a big MPI code called GEOS-5, and they use most of Discover. But we also have 13,000-some NASA scientists at Goddard who aren't doing big MPI work, and they need resources beyond their laptop, too. And they can come to us and run in Explore and get much more flexible resources, a much more dynamic environment. So this is why we say Explore complements or augments Discover. It solves all the problems that are hard to solve in a very static HPC cluster. So really anything that isn't a perfect fit for Discover, we try to run in Explore, and vice versa, anything that isn't a good fit in Explore, we try to push to Discover. So when we find people running MPI jobs in Explore, you know, sometimes we'll hint, you know, you could do this better in Discover. Or if somebody's running really inherently parallel work or data analytics work in Discover, maybe that would work better in Explore. Next bullet. So Keystone federation is something that we're playing with and making progress with, so that we have a very, very nice tie-in with NASA agency identity policies and security. I mentioned that Explore supports both IaaS and PaaS, and the difference is really in a lot of the networking. Can we see the networking bullets? Thank you. So for our PaaS tenants, really the thing that makes them unique is they have these provider networks, VLANs that are our data center VLANs, and they can access very large pools of data, GPFS or NFS. That's what makes them useful. Whereas our IaaS tenants get a private VLAN and a floating IP that they can use to route within NASA semi-public networks behind a firewall. A lot of Explore's compute nodes are actually repurposed compute nodes from Discover, HPC nodes, so they're high-end nodes. But, you know, every three or four years we replace the old nodes and bring in new nodes in Discover, and those nodes are still great for a cloud. So it's this nice life cycle. And we fully expect in the next three to five years, once everyone is more comfortable with Explore, we're probably going to take these cloud-native tools and slowly start to use them to actually manage and redeploy parts of Discover and sort of bring it all together. But one of the beautiful things about this, what we're really proud of, is we have this centralized storage system that is based on GPFS Cluster Export Services, so it's NFS. These big, you know, tens of petabytes of data can be accessed from both our HPC cluster and our cloud, and that makes it terribly convenient to test things in the cloud and then scale them on the HPC machine. So there you have it. That's what we're trying to do. And you see some of the science that we support.
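To make the networking split John describes a little more concrete, here is a minimal openstacksdk sketch of the two styles of tenant network. It is an illustrative sketch only: the cloud name, physnet label, VLAN ID, network names and address ranges are hypothetical placeholders, not NASA's actual configuration.

```python
import openstack

# Connect using a clouds.yaml entry; "explore" here is a hypothetical cloud name.
conn = openstack.connect(cloud="explore")

# PaaS-style tenancy: a provider network mapped onto an existing data-centre VLAN,
# so instances sit directly on the routed segment that can reach shared GPFS/NFS.
provider_net = conn.network.create_network(
    name="paas-datacenter-vlan",
    provider_network_type="vlan",
    provider_physical_network="physnet1",  # assumed physnet label
    provider_segmentation_id=1234,         # assumed VLAN ID
    is_shared=False,
)
conn.network.create_subnet(
    network_id=provider_net.id,
    name="paas-datacenter-subnet",
    ip_version=4,
    cidr="10.10.0.0/24",                   # hypothetical data-centre range
)

# IaaS-style tenancy: a private project network plus a floating IP allocated
# from an external network, for routed access from semi-public networks.
tenant_net = conn.network.create_network(name="iaas-private")
conn.network.create_subnet(
    network_id=tenant_net.id,
    name="iaas-private-subnet",
    ip_version=4,
    cidr="192.168.10.0/24",
)
ext_net = conn.network.find_network("external")  # hypothetical external network name
fip = conn.network.create_ip(floating_network_id=ext_net.id)
print("allocated floating IP:", fip.floating_ip_address)
```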
It's all, I wish I could tell you more about it, but, you know, I maybe could pass as a computer scientist, but not as one of these kinds of NASA scientists. But it's all fascinating stuff.

Seeing the shellfish is making me hungry, John. Okay. Thank you very much, John, and thanks to everyone else. I liked Steve's comment about the short tail of HPC in comparison to the long tail. And you know what really struck me about what both Sadaf and Happy were saying is that it sounds like, you know, OpenStack owns the short tail space, but there's an idea that we can be working in a complementary role to a high performance computing, sort of supercomputing, system. But it sounds like there are plans to get onto that peak as well, so we're also within fighting distance of taking on the main HPC system. The whole concept of the software-defined supercomputer, or the cloud-native supercomputer, is very much in people's minds at the moment. The other thing that's changing there, and I think we had a question on the chat about this, is the nature of the kind of compute node that we can be bringing into an OpenStack system, where we can get these really beefy environments with AMD processors and GPU hardware, and also quite powerful networking infrastructure and hardware offloads in networking as well, which really changes the balance in terms of what we can aim for. I was wondering if we could open up with another question, then, which is: what is it that's so difficult about these peak systems? What has been stopping OpenStack from being a credible contender up to this point for tackling these massive systems? I know that, Sadaf, you were talking about running an HPC stack at CSCS for a couple of years, and you gave us some good insights about the complexity of maintaining that environment. What else is it about the applications and the workloads in that space that make it really tough? Steve, did you want to start with that one?

Sure. I mean, I think the whole HPC peak thing sometimes conflates a few things, Stig, and one is the way people access it, which is through Slurm. You have to be happy with the idea of making queues, putting things onto queues, breaking your jobs up. And that's actually quite different to, you know, pervasive computing, which is basically cloud or desktop; it does feel very different. And I think the other part that it conflates, and the part you're getting at, is that, you know, 20 years ago the peak was very heavily correlated, and still is today, with simulation and modeling. You're doing finite element modeling, you've got sparse array algorithms, basically, in what you're trying to do. And so you need something that runs MPI or, you know, communications with zero jitter and all those sorts of things that scale out to hundreds of processes talking at once. I think today's peak users, it's starting to become those who need lots of cycles but are processing data, and it's the thickness of the pipe to the data source that's probably more important than communicating with other processes. So I kind of view that peak conversation as a little bit about how much money you're spending on the capacity. The real challenge is how you get more people to create more software and more tools for themselves so they can process data, and the HPC framework is not the most friendly way to do that.

Yeah, I see what you mean.
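As a rough illustration of the access-model contrast Steve is drawing, the sketch below queues a batch job through Slurm's sbatch and then self-services an on-demand instance through openstacksdk. Everything named here, the job command, resource counts, cloud entry, image, flavor and network, is a made-up placeholder under the assumption of a Slurm cluster and an OpenStack cloud being available side by side.

```python
import subprocess
import openstack

# HPC-style access: wrap the work in a batch job and wait your turn in the queue.
# Assumes sbatch is on PATH; the resources and MPI executable are placeholders.
subprocess.run(
    ["sbatch", "--nodes=4", "--ntasks-per-node=32", "--time=02:00:00",
     "--wrap", "srun ./my_mpi_model"],
    check=True,
)

# Cloud-style access: boot an instance right now and start working interactively.
conn = openstack.connect(cloud="research-cloud")      # hypothetical cloud entry
image = conn.image.find_image("ubuntu-22.04")         # hypothetical image name
flavor = conn.compute.find_flavor("m1.xlarge")        # hypothetical flavor name
net = conn.network.find_network("iaas-private")       # hypothetical project network
server = conn.compute.create_server(
    name="data-analysis-node",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": net.id}],
)
server = conn.compute.wait_for_server(server)
print("instance ready:", server.name, server.status)
```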
So, Sadaf, did you want to talk a little bit about, I mean, for many years you've been custodian of one of the biggest computers on the European continent. It's not OpenStack. Why couldn't it be?

Okay, I was looking at the time, we have only 29 minutes left, do we have 29 minutes for that? No, I'll just say one thing, but then I'll turn it around, because we did a retrospective on our experiences, but we have had, for a long time, the long tail running on these tens-of-petaflops-scale systems; we even run high-throughput workloads from the WLCG community, the high energy physics community, who have teeny tiny jobs. So it can be done. To me it's more of a policy and type-of-services question. The reason we can't do it, I think, is human resources: we just do not have the knowledge, the know-how and the competence in our HPC team, the admins and the engineers who can leverage and bring in those concepts all by themselves. I can give you the number of engineers we have operating our high-end systems and OpenStack systems, and we just don't have enough, you know, brainpower and people on the screen to go and debug and troubleshoot all the issues we run into. And then we go to these high-end vendors, who do not give us a turnkey solution, you know, for the optimized network, the stack, the integration that we need. So to me it's not a technical question. It's more a question of human resources, training people and applying them in our environment, because we can't retrain people while still operating these high-end resources. And if you just look at the investment we have in terms of, you know, the metal on the machine room floor, the OpenStack part is like a very small fraction of it. So I would view it like that: it can definitely be done, I can't think of a single reason why technically it couldn't be. But it comes down to the priorities and the investment you make in bringing in, solving and troubleshooting these things at that scale, and you can't keep the machine idle while you do that.

I think you're right. So the learning curve of these cloud-native methodologies is very different from conventional high performance computing infrastructure management. You could almost say it's the mother of all learning curves in some ways, but you could also perhaps say that the view from the top makes it worthwhile. I'm not sure. I think you're right: there is a tremendous sort of skills barrier to these things, as well as the technical nature of the workloads. Happy, I wanted to ask you about something that's obviously a big thing in South Africa, which is the Square Kilometre Array, the radio telescope that is going to be half hosted in South Africa. Did you want to talk about how OpenStack is going to play a role in this incredibly prestigious large-scale system?

Yeah, I think from our side, that has been one of the drivers for us to say we should be dragged into OpenStack, kicking, because of the massive computing that will be required for the Square Kilometre Array and done in different parts of the world. So the whole thinking here was, how will we be able to build such a computer? No one nation will be able to do that. And we might need some diversity of architectures, so that scientists can move across those architectures, and that's where the whole idea of OpenStack came in: if we get something as a layer that can be used across the world,
then a user sees just one big computer and navigates through that. That's the big idea that we have. And we believe OpenStack will be able to deliver on this, and hence we started doing this across various countries, so that we build the skills base that Sadaf talked about. And after that, we can have this one big supercomputer sitting in different parts of the world.

Yeah, that brings up some very interesting thoughts around federation and the managing of these sort of multi-site, multinational projects, where HPC systems can exist in many countries but the users can, with some degree of fluidity, move from one place to another. And that touches on a whole other dimension where OpenStack can play an interesting role. But before we go on to that, I was kind of interested in what John was saying about the data. You mentioned having large shared NFS file systems between the different clusters, enabling easy data movement between an OpenStack system and the HPC environment next to it. Would anyone else like to pick up on particular data challenges they've got working with these sorts of systems paired together in a supporting role, where they have a high performance computing or supercomputer environment and an OpenStack system to process its data, prepare new models, or process the outputs from it? Steve, I noticed you're nodding. Why don't you start.

I'm nodding because we actually don't. That's the one thing we don't do. We keep Lustre on the HPC side, even though the nodes are all OpenStack based. And I guess, you know, our HPC nodes we do run virtualized, so it's not bad at all, and there are some networking and connectivity reasons why we did that many years ago. But one of the things is, a shared file system is a shared zone of authority, and so it's very difficult to let someone have root on a VM and be able to access a shared file system. So you end up with, it's a business problem, not a technical problem, if you like. And the reason I don't think it's ever really worried us is that if you're on the cloud, quite often you want to use object storage or something that's cloud native anyway. And so yes, you might be copying the data across, but you are going through a very conscious data management action as you do that. So yeah, maybe I'm not the best example, because we don't try to move those file systems across those modalities ourselves.

Sure. Thanks, Steve. John, did you have anything to add about how it works for you and what your pain points are?

Yeah, so I totally empathize with what Steve was saying about the incompatibility of having root in a VM that has a shared file system. We certainly had a zillion conversations about that. And the way we ultimately solved it was this rather rigid divide between IaaS and PaaS, Platform as a Service. Our Platform as a Service users are essentially equivalent to our HPC users; they can be the same people. And those people do not get root in their VMs. They are LDAP-based logins, and they get the benefits of the shared file system. Our Infrastructure as a Service customers explicitly do not get access to the shared file system, unless it's a read-only exported NFS mount, which we have talked about doing from our centralized storage system, but we're not quite there yet.
But we certainly see the benefits of not having to copy the data around. And our centralized storage system, it's not so much, you know, a scratch space, it's actually more for very large curated data sets, because what we definitely don't want to do is be pushing around petabyte-size data sets like Landsat data or something. And we can export that pretty safely read-only, because people are mostly reading it and computing off of it, right, whether it's in the cloud or on the HPC machine. So we definitely see a lot of efficiencies with that model. And we have, you know, a brilliant team of IT security people who run these very high speed 100-gigabit firewalls for us that attempt to handle Steve's problem of the different security zones or trust zones, right. So the firewalls will, you know, take one segment, say from the cloud, make sure it's safe to pass through, and then dump it into another segment somewhere else.

Yeah, so I completely agree. We do basically the same thing, and we're just about to establish a new file system only for that reference data, because we can actually do that a bit cheaper than all of Lustre and things like that. But one thing that we do is use the concept of cells in OpenStack to create that. So we do that segregation completely through OpenStack, so we don't necessarily use a separate isolated file system. We still have the same effect, but we just use what OpenStack gives us to do that. It's been interesting, because we do a lot of patient-identifiable things on our systems across all the modalities, and we've had many investigations as to whether this thing is really safe, and, you know, whatever, and every time it passes. So it really works well.

I think we've already covered a couple of major cohorts of use cases for OpenStack in this domain, and we're kind of uncovering a third big one as well here. So I'd say we've talked a little bit about the peak, the supercomputer environments, and those are perhaps the next territory, but the winning or majority use case at the moment is around supporting a data analytics environment that provides flexibility, the freedom to innovate, and a place to feed data into one of these supercomputer systems and to process and visualize the data that's coming back. But I think, Steve and John, you've just talked about a third really interesting use case where OpenStack is a compelling choice, and that is around these ideas of sensitive data. You were talking a lot about OpenStack for secure research compute platforms, these kinds of trusted research environments. And I wonder, I know, Happy, you talked a little bit about using your OpenStack environment for COVID-19 response activities. Would anyone else like to talk about pieces of what they've been doing in a research computing environment where there is a strong requirement for security around data, patient-identifiable data or other pieces? Go ahead, Sadaf.

Okay, so one of the platform-as-a-service communities for our OpenStack, in fact the first user and the driver for our OpenStack, is the Human Brain Project. And that's where I could plug another project which we are doing for federation, but in the interest of time I won't.
And there, I'm not going to claim we have a solution, because they also want to access the HPC resources, where we have no level of isolation per se. But within the OpenStack environment, they can virtualize and isolate all the domains at every level, so this is kind of the starting point for them. And so I do agree that, sorry for the background noise, from the qualitative point of view, in terms of resiliency, isolation and security, HPC doesn't provide some of those elements, and for some of these communities, before they access HPC, these environments provide a really good ramp-up phase towards accessing larger-scale resources. On the HPC side we are still figuring it out. But on the OpenStack side, some of their requirements can be easily met, in terms of the GDPR requirements, the PII requirements, just because they run a platform but they run it on top of an IaaS, which can give them that level of access control and isolation on shared hardware in a private cloud we operate.

Thanks. Is anyone else, apart from Happy, providing infrastructure for work against COVID-19? What is the OpenStack community's response to that, I wonder? I mean, I imagine that everyone's doing COVID-related work on all of our infrastructures, in some ways. It might have been a slow one.

I'm not aware of any COVID work that is happening on our systems; however, NASA has a disasters team that is running some very important work on our OpenStack cluster related to the recent disaster in Haiti.

So these are environment projects that, you know, require a rapid response, and so having the flexibility to set something up quickly is as important as getting answers back from a powerful computing resource as well, I'm speculating there.

I think it answers some of your question, I think, Stig, as well. I mean, there was an NSF-related thing that happened in terms of whether the HPC sector was able to respond fast enough to COVID and things like that. I don't want to be misquoted, as I can't remember the exact article, but I thought it was really interesting, because once again the article was less about a technical problem and more about, you know, can we change our policies on allocation quickly enough, and things like that. And so maybe that speaks once again to this, you know, rigidity thing. I am quite aware that on the national federation, a lot of the modeling that the various state and federal governments have used for their, you know, modeling around COVID and their next moves has definitely been done on the research cloud.

Nice. Thank you. So that actually got me thinking about something else, about what OpenStack brings to this scientific computing domain, which has so long been the playground of conventional HPC environments. That's around this problem of flexibility. These machines have conventionally been built for all-out performance at huge scale, and they deliver that; unquestionably, they deliver these incredibly powerful, consistent, low-jitter environments with high performance network fabrics. But the flip side is that the environment for programmers and developers and scientists to work in is kind of constrained. And I think, Steve, it would be interesting;
I was reminded earlier about when we came to Sydney for the OpenStack Summit and you gave that awesome keynote with your guest, who talked about the data analysis environment with the possums and the endangered species, and how, with a very intuitive virtual lab environment, he was able to do the data analysis on some data sets. It just looked completely alien to a conventional HPC environment. Do you think that OpenStack is going to solve this problem for us? How do we generate these innovative solutions for new ways of getting to the answers from our data?

Yeah, so thanks for reminding me about that. That's Brendan Mackey out in Queensland, and he's been leading an initiative. Now, the common tool that's spun out of that thing is something called EcoCommons, which is like best practice and tooling for many communities to use, basically, Jupyter notebooks. It's how you go and take your problem, feed data management through, and have these certain styles of pipelines work through. And so that community has sort of led that charge and brought other communities on. They're using the same infrastructure in other domains, like some bio ones, and there are some other environmental groups that are using that same code. It's basically a very cloud style of way of working. And I think, you don't really see it that way necessarily, but it's kind of a modern-day version of what used to happen with just sharing codes on HPC, you know, in some ways. I think one of the things is, it's very clear that those people writing Jupyter notebooks don't want to have to deal with Slurm queues and sort of getting onto HPC. But the other part is that a lot of the research they were doing relied on the community providing data. So if you're counting endangered species, you're relying on people out there doing some of the observations for you. And so that sort of accessibility you need to connect is kind of why it was natural to build the digital framework in a cloud sense.

Yeah. And it kind of doesn't really fit easily onto a batch-queued, conventional HPC resource, in the way that Jupyter notebooks, you know, they think differently, don't they? They're sort of orthogonal to that.

It's very, I mean, you're running the code as you press enter, in some ways, you know, and you stop and change it; but in a typical HPC, and this is overstating the problem, right, you would write some code, you would submit it into the queue, you would wait. So that interactivity is the bit where I think, you know, in 20 years' time everyone's going to be doing Jupyter notebooks or, you know, Excel of some sort, and it's going to be really interactive, yet it's going to be processing lots of stuff in the background. And yeah, so I think it's that sort of thing: these communities emerge, and we didn't design for it, we didn't know years ago that we needed a computer that needed to run Jupyter notebooks all the time, but we've managed to buy, every year, the right shape for them; we just readjust, and you can do that in that kind of cloud sense. But when I say that, you know, we still run lots of HPC, we just build it on top of that same ecosystem. It's a growing HPC; it just runs a queue.

Yeah. Okay, I guess a lot of people are providing services around Jupyter. I wanted to turn it around, actually.
So that maybe is one of the pain points of conventional environments where OpenStack is clearly part of the solution. What is wrong with OpenStack for these scientific computing use cases? We had a question from Hamid, and he was asking, if people don't use OpenStack, what are they using instead? So I guess that's two questions in one. But John, did you want to pick that one up first?

Yes, our Discover supercomputer is deployed with IBM xCAT, and that's been a pretty good open source tool for us; we've been using it for over a decade. And I think about this all the time: you know, if I were to go to the other sysadmins on my HPC team and say, I want to redeploy Discover with OpenStack, first, I would be laughed at, but I would also have to answer the question, what does a composable infrastructure bring to the table when the specific goal is a static architecture? And, you know, to me, one of the beautiful things about OpenStack is it can provide a fabric, a framework, a methodology for how to unite your data center around one kind of common technology. And if all your sysadmins learn that technology, then you've increased your manpower and your know-how. We can deploy static nodes from OpenStack, but we can also take Discover's test and development systems and make those far more flexible for testing different kinds of MPI problems or something. Right now, we manage our HPC compute image in a completely different way from our cloud image. Why? That could be the same tool set. It could be the same CI/CD kind of stuff. You know, I just think that we're not quite there yet, but I really want to see us get to the point where we can replace something like xCAT with the whole OpenStack tool chain.

I'd like to strengthen that. I think that, you know, that rigor of how we do tests and CI/CD, and pushing that into the HPC world, where they traditionally didn't do that in the same way, the same automated way. You know, having Gerrit involved, having all the bits and pieces, having code reviews, sharing the skills around code reviews, transforming the skills in your admin team. You know, I think we don't have that many system administrators, a really, really small number. It's really, really small across both cloud and HPC, you know, and that's because they speak the same language, the same way.

That's a good point. So there's this question about one skill set to rule them all, and cloud is kind of interesting because it sort of eats the others, doesn't it? And we do projects with Ansible as a common language and a common, you know, sort of data format, a common means of doing stuff, and it can relate to everything from the lowest-level details of the rack config, you know, to things like the BMCs and the switches, the operating systems, the BIOSes, all the way through to the high-level stuff to do with TensorFlow or the applications that the end users and scientists are actually running. And I don't think that's happened before, that there is a single skill set to learn, as in a single language like Ansible, which can relate to every single level of a high performance computing environment. It does mean learning a new skill set for people who don't have the cloud-native methods, but it does then bring this great consolidation of skills as well, perhaps. So I guess that xCAT is a familiar choice for other environments.
I guess Mike was also pointing out Bright Cluster Manager as being another popular one. Quite a few of the tier-one vendors in this space produce their own kind of homegrown infrastructure management software as well. I'm wondering if we can go around and have a pick list of what are the things that are missing from an OpenStack system, that you need the most in order to develop your plans, and then maybe we can talk a little bit about those future plans after that. Happy, sorry to put you on the spot, why don't you start first?

There we go, if I can find the mute button. Yeah. I think at the end of the day, Stig, to answer the first question of why we're not doing OpenStack: I think it's a question of patience, we just need to get things to mature. All this software stack has to mature and we need to learn it. We did the same with Linux to get into HPC, and we were able to get there. So it's going to be a journey. What is very important is for us to put in an effort to make sure that we develop the skills. The technology has grown quite quickly now. The challenge before was more on the interconnect side; now we can much more easily move the HPC system onto the cloud. So it's more about the skills and putting all the tools together. That's the main thing for me.

Great. What about you, Sadaf? What do you miss the most? What do you need from OpenStack?

For me, as a person who is trained as a performance engineer for HPC, it's the fabric part: how can I do without the whole fabric stack, from provisioning to the middleware, MPI and all of this? But if I look at the experience our sysadmins have had in sorting some of these things out over the past three to four years, I would say some of the tooling that is closer to the HPC sysadmins, because one thing to note is that in an HPC system, the infrastructure is the platform, or the platform is the infrastructure, which in an OpenStack world is discrete. And if you talk to those communities, the users are there. Trust me, our users, they don't want to change their compiler for like five or ten years or something like that, and they just want to run their codes. So if I'm focusing on adoption of common tooling and technology for provisioning, I would look for tooling which somehow respects or acknowledges these two levels. And I think, from some of the slides I saw shown earlier and from other sites, people are largely moving the infrastructure provisioning servers to OpenStack initially, slowly and gradually. I think that's where it will happen: the first barrier will be either broken or merged, depending on the perspective, and I think that's what we'll do there. So first of all, fabric, because if the network doesn't work and I don't see my microsecond latency, I'm out, without having to spend lots of resources. But if we are serious about having, in a data center like ours, a common environment, knowledge and ecosystem, I would invest in some of that tooling.

Great. How about you, John? Sorry, what was the question? What are you missing? What is the piece of functionality you miss the most? Yeah, honestly, right now it's doing well. We need to get better with containers; that's a weakness for us. And I expect that at some point we will get around to hosting some Kubernetes on OpenStack, or something of that nature. But we're really not there yet, and we're constrained by man hours, quite honestly.
I'd say, aside from those issues, my single biggest problem with OpenStack right now is just that it's still too complex. And if you are really getting down into the weeds to fine-tune this, that and the other thing, it'll become a time suck, even if you have some elaborate framework, probably. And of course, we write our own CM and deploy semi-manually, but we don't use TripleO or anything like that. Maybe that's part of our problem, but we kind of fear those sorts of tools. That's neither here nor there. Anyway.

Any thoughts to add, Steve?

Thanks, John. No, I might just echo the Kubernetes statement, I think, and I just don't know if that's an OpenStack problem or a communities-of-practice problem. And that's something we're entertaining. We've got a sort of parallel national community that's looking at Kubernetes and how we make it easier to back onto our OpenStack deployments, but also everywhere else. And sometimes it's more about, well, what's the application anyway? So it's an interesting problem. Maybe the other spot I would add is some of the things we need to do around how we communicate responsibilities to the users. And so I think some of the Horizon workflows are going to be one of the areas where we would see some change, or we'll do things outside of it. But yeah.

Great. Thanks, Steve. I think we're a little bit over time, and I apologize for that, with all the enthusiasm for the subject matter with you all. But I think we ought to finish with a final sort of forward-looking question here, and we've got one from Mike about how AMD EPYCs and NVIDIA GPUs are changing the game. I'm interested in hearing from each of you: what do you think is going to be the effect of new technologies, new hardware technologies, new software technologies, coming into the software-defined supercomputer environment, this sort of OpenStack-in-research-computing context? What do you think the effect of these kinds of new hardware technologies, which are very virtualization-friendly, is going to be? What are your predictions for new technology and new capabilities in the years to come? Steve, why don't you start there? We'll go around the other way.

You know, the reality is, if a vendor gives me a new piece of tech I can put into a motherboard or something, I'm going to get it into a researcher's hands faster through OpenStack than I am through an HPC system. That's just the reality of it. And so those who can take advantage of that tech can make some impact out of it. If I can accelerate that, then that's what I'm going to do. And I think that's probably the answer to the question.

Great. What about you, John?

So, you know, I know some people are interested in, like, virtualizing and slicing up GPUs. And personally, I think that's kind of diminishing returns; people want the full power of the GPU, and I want to give that to them and time-slice it through something like Slurm. We do have a dedicated machine learning cluster called Prism, and it's about 22 nodes plus a DGX A100. The 22 other nodes have Volta V100s, and that's Slurm-managed. But all of the ancillary services around that, including the deployment framework for it, are actually in OpenStack. The Slurm controller is in OpenStack, the login nodes are in OpenStack, the slurmdbd and the database are in OpenStack. So all of that stuff is pretty flexible to reconfigure and then give access to. And we run JupyterHubs in OpenStack that can access the Prism machine learning cluster.
So that's how we're doing that.

Nice. What about you, Sadaf? What do you think is going to change? How do you think these new hardware technologies are going to affect things?

I think I'll kind of take up where John started. I mean, the reason we have AMD EPYC, this generation of GPUs, and NVIDIA doing this thing, is this huge explosion in these machine learning and data science frameworks. And they are, from the ground up, very cloud-friendly in terms of their communication and their storage patterns. I'm not suggesting we chuck out all our simulation users, but I think there is a huge opportunity from some of this, I would call it cross-fertilization. Historically, without these types of users and frameworks, it was much, much harder to bring in or justify the complexity of a cloud-native software stack. It is still there, because the users from two or three decades ago are still there. But it is relatively, I won't call it easier, but you can make a case for why you would like to introduce a cloud-native software stack, because it would benefit the frameworks and the workflows coming to us in a tightly integrated manner for HPC. So I am really very optimistic about this part, these very dense systems, which reduce some of the fabric concerns I have with OpenStack, because now you are really relying on the stack provided by these vendors. So I would say the change is not only coming from the vendors and the software stack, it is also coming from the scientific user communities and the usage of their frameworks, which for the most part are optimized by these vendors.

You know, I'm glad you brought up AI and machine learning again there, because that's such a teaser; we've skirted around it in this session simply because it's a whole other hour all by itself, really, as a platform on OpenStack infrastructure. Happy, any final thoughts from you?

I think my colleagues wrapped it up quite well. With the new technologies, we're looking at new users that can come into the space of HPC. But also, one of the things that I see here is other complications that come with some of the software that tries to mitigate these new technologies, like your Intel oneAPI: how is that going to affect our development of OpenStack, is it going to make things easier? So it's quite interesting for me. I hope we can just be able to deploy the bare metal, throw the metal in there and let the users do what they like to do. Hopefully it's going to be much simpler.

Very much so. Thanks very much, Happy.

Yeah, thanks. Thanks, Happy. Thanks, Stig. We're over time already; it was really a great discussion, so thank you, Stig, for leading it. Thanks to all our esteemed guests for taking time out of their busy agendas to join this episode today. It was really a great discussion and I learned a lot about this specific OpenStack use case. Next week, we have another great episode lined up about the different facets of open infrastructure training, because talent trained on open infrastructure like OpenStack or Kubernetes is in high demand in this market, and we heard during this episode that skill sets and recruitment are actually one of the major blockers to further adoption of open infrastructure. So in next week's OpenInfra Live, several organizations will present their training programs around open infrastructure and how they can accelerate your open source career. Also, remember that if you have a good idea for a future episode, we want to hear from you: submit your ideas at ideas.openinfra.live. So mark your calendars.
Hope you're all able to join us next Thursday at 14:00 UTC. Thanks again to all of our speakers who joined us today, and special thanks to Stig Telfer for leading it. See you all on the next OpenInfra Live.