Hello and welcome to Standalone at Scale: Ironic's take on high-performance computing, with Tim Randles, Jacob Anders, and myself, Julia Kreger.

One question one might ask is: why Ironic? Ultimately, it's because physical hardware in data centers is never going to go away. You're always going to have physical hardware in the racks inside data centers, and really it's not just because you can make these racks look really cool. In order to build anything you need a foundation, you need infrastructure to build upon, you need a substrate, and in technology the hardware is always going to be the foundation that you build upon.

There are also some really solid reasons for having your own bare metal hardware. The first one is latency. If you look at how far data travels in one millisecond, it's about 124 miles, or 200 kilometers, inside a piece of fiber or a piece of copper, and that excludes intermediate hardware like transmitters, switches, and routers. So if you were exchanging one kilobyte of data every millisecond over a hundred seconds, you'd have a hundred megabytes of data transferred, whereas if you were doing the same with 30 milliseconds of latency, you'd only have about 3.3 megabytes.

And when we talk about having your own data centers, there are usually really solid reasons for a business or organization. One is data sovereignty, which is becoming a major topic due to newer regulations. Another is required security measures, and these measures can come from anywhere: contracts, legal obligations, local or regional laws, even court orders. Often these are driven by industry or government security needs in order to keep data safe.

When you start looking at the bandwidth of interconnects, we've seen a massive explosion since the first local area network technology in the 1980s, and we can see that local Ethernet has always been faster than wide-area networking, which is always kind of chasing it. But these are also exponential growth curves, largely because the telecom industry eventually leverages the same technologies for its interconnects, so we do eventually get that same level of connectivity over the wide area. But when you have it locally, on the same piece of hardware, you obviously have a huge advantage. And again, when you're building these data centers, there's some infrastructure that's required: lots of construction, lots of work goes into building these facilities, for very good reasons.

So when we talk about why Ironic: it provides a RESTful API interface, it encodes common practices in managing physical hardware fleets, there's integration with OpenStack and other tools like Ansible and Terraform, and it's scalable and production proven. Ultimately, we turn weeks into minutes and hours. So now, for Ironic for HPC, I'll hand it over to Jacob.
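As a rough illustration of what that RESTful API looks like when Ironic runs standalone (without Keystone), here is a minimal sketch using the openstacksdk Python library; the endpoint URL is a placeholder, and a real deployment would normally keep these settings in clouds.yaml.

    import openstack

    # Connect to a hypothetical standalone Ironic API with no authentication.
    conn = openstack.connect(
        auth_type="none",
        baremetal_endpoint_override="http://ironic.example.com:6385",
    )

    # List the enrolled bare metal nodes and their current states.
    for node in conn.baremetal.nodes():
        print(node.name, node.power_state, node.provision_state)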
Thank you, Julia. The typical high-performance computing system is large. It consists of hundreds or thousands of nodes, sometimes tens of thousands. It is also highly optimized, so that classic HPC workloads, which is to say batch queue jobs, run very well. The price to pay for this degree of optimization is limited flexibility: it might be hard to run non-standard workloads on an HPC system. For example, let's look at cybersecurity research, or research involving sensitive clinical data. It would be rather difficult to carve out a set of nodes, put them in an isolated network, and do some independent work on them. Due to this, I see a great degree of potential for improvement in terms of the flexibility of high-performance computing.

Just as a comparison, let's have a look at enterprise computing, which has changed very significantly over the last years with the adoption of cloud and DevOps. HPC had evolution in this period, but definitely not a revolution. Looking further into the adoption of cloud in high-performance computing, we see there has been some of it, but not as much as some have anticipated. Cloud is used for cloud bursting, to increase scale beyond what's available on-site. It's also been used as a supplementary capability to support those non-standard workloads I referenced earlier. However, many clouds are still running on virtualized platforms, and that introduces performance overheads, which are undesirable in HPC. It also adds some complexity around fast access to storage, which Julia has mentioned. Lastly, cloud computing systems are complex, and so is OpenStack, and when we add bare metal components and software-defined networking for bare metal, the complexity grows even more, making it not trivial to build and maintain such a system.

Let's look further into some of the difficulties that HPC centers encounter while trying to adopt the cloud computing approach to HPC. First of all, as I mentioned earlier, HPC systems tend to be very large. Importantly, they also tend to run at very high loads. All of this together means that problems are more likely to happen, and when they happen, they're more complex to solve. This in turn means that an HPC facility would need a sizable and very skilled team to troubleshoot those issues. Now, while having such capability in terms of expertise is something natural for cloud hosting companies, because running clouds, for example OpenStack clouds, is their core business, that doesn't necessarily apply to HPC facilities, because their core business is something different: it's procuring, building, and optimizing very large batch queue systems, and they can't necessarily afford to divert a high number of their staff to cloud computing work.

But the good news is there is a potential solution to this. We believe that Ironic standalone can help solve those problems and provide a resilient platform to build upon, allowing HPC facilities to implement infrastructure as code on bare metal without introducing a great degree of complexity. Because of taking the standalone approach, we might occasionally encounter challenges along the lines of features that aren't fully present or implemented. However, the good news is that those features can essentially just move higher up the stack and be implemented using configuration management tools, leveraging the bare metal cloud built with Ironic standalone. All in all, this means that the entry-level requirement in terms of complexity is significantly reduced, and HPC centers now have a tool that's much more likely to succeed in the circumstances they operate in.
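To give a feel for what infrastructure as code on bare metal can look like, here is a rough sketch of deploying an already-enrolled node through a standalone Ironic API with openstacksdk; the node name, image URL, and checksum are invented for the example, and a real workflow would add error handling before handing the booted node over to configuration management.

    import openstack

    conn = openstack.connect(
        auth_type="none",
        baremetal_endpoint_override="http://ironic.example.com:6385",
    )

    node = conn.baremetal.find_node("compute-0001")

    # Tell Ironic which image to write to the node's disk.
    conn.baremetal.update_node(
        node,
        instance_info={
            "image_source": "http://images.example.com/compute.qcow2",
            "image_checksum": "0123456789abcdef0123456789abcdef",
        },
    )

    # Request a deployment and wait until the node is active, at which point
    # configuration management (for example Ansible) can take over.
    conn.baremetal.set_node_provision_state(node, "active")
    conn.baremetal.wait_for_nodes_provision_state([node], "active")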
Lastly, I would like to highlight the key features of Ironic standalone that make it particularly suitable for HPC workloads. First of all, obviously, performance: because of its nature, running on bare metal, Ironic-provisioned workloads will run as well as they run on classic batch queue systems, and the same optimization methodologies can apply. However, performance is also something we can look at in terms of provisioning speed, and this is where Ironic excels as well. We can use disk-based provisioning, which is quite common in high-performance computing, and we can also highly optimize that provisioning using features such as fast track, which means that we've got nodes on standby, already booted into the deploy ramdisk, waiting for provisioning requests to come through.

The second very important feature of Ironic is scalability. Ironic uses a concept of conductor groups, so essentially, as the environment grows, we can add instances of the conductor, and each of them can manage a subset of nodes, letting the environment grow horizontally, potentially to a very large scale.
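As a rough sketch of how those two features fit together, assuming the option and field names from recent Ironic releases, fast track is enabled in ironic.conf while each node carries a conductor group tag; the group and endpoint names below are made up for the example.

    import openstack

    conn = openstack.connect(
        auth_type="none",
        baremetal_endpoint_override="http://ironic.example.com:6385",
    )

    # Each conductor can be pinned to a group in its own ironic.conf:
    #   [conductor] conductor_group = rack-a1
    #   [deploy]    fast_track = true   # keep nodes waiting in the deploy ramdisk
    # Nodes tagged with the same group are then managed only by those conductors,
    # which is how the environment scales out horizontally.
    for node in conn.baremetal.nodes(details=True):
        if node.conductor_group == "":
            conn.baremetal.update_node(node, conductor_group="rack-a1")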
And with this, I would like to hand over to Tim, who will give us a user's perspective on what Ironic standalone can do at a national laboratory.

Thank you, Jacob. I'm Tim Randles from Los Alamos National Laboratory; I work in the HPC division. So how are we thinking about using Ironic at Los Alamos? Firstly, the lab's been around for a long time. It started as Project Y in 1943, as part of the Manhattan Project, but today it's an all-encompassing science and engineering laboratory supporting our main mission, which is stockpile stewardship. We also work in a number of other scientific realms, such as physics, biology, materials science, energy, and health, and high-performance computing is integral to LANL's mission today. Our simulation-based stockpile stewardship is how things evolved starting in the early 90s, when we stopped actually testing our nuclear weapons in the United States, but we also do a lot of climate modeling, vaccine discovery, materials science, and other things that require the capabilities of high-performance computing.

We have a long history of computing at LANL. Initially, during the Manhattan Project, our computers were actually people, and we went through the vacuum tube and switchboard programming eras, and the Control Data machines of the late 60s and early 70s. We had the first Cray supercomputer, the Cray-1 serial number one, and then we entered the massively parallel era at the lab. This is when things really started to change for us, because the scale of our operations completely blew up: we weren't just running a few small machines, we were running an integrated whole of thousands of nodes. And today we're in the hybrid era, as exemplified by Roadrunner and Trinity. Here we have a picture of Trinity and Roadrunner; each one of those machines is quite large. LANL has a number of HPC clusters, more than ten, ranging in size from 32 nodes to more than 20,000 nodes in the case of Trinity. The bottom picture there is Roadrunner; to give you a sense of scale, the footprint of Roadrunner is about a third of an acre. We also have to run a lot of other services for these systems, so we have tens of petabytes of file systems and a whole plethora of other support services that we need. So, in my own words, LANL's HPC mission is to provide the services and compute cycles that our users need, and to ensure that those services are maximally available; if the users aren't able to run on our systems, what good are our systems?

But the complexity of our environment and the desire to maximize uptime mean that we have very conservative system management processes and practices. My manager at the lab is fond of quipping that the way we run our machines was five years ahead of its time fifteen years ago. We're still using software such as CFEngine and Perceus for configuration management and provisioning. If you go look these things up, CFEngine is still alive; Perceus is long dead. In fact, go look at the original Perceus website to find out just how dead. And over time these practices, while robust, have become inefficient: it takes us a lot of time to do some fairly simple things. So right now we have a project underway to modernize our operations, our next-generation systems management project, creatively named.

We're using Ansible to replace CFEngine for our configuration management. Ansible is well understood, it's grown up in the DevOps era, and there's quite a lot of mindshare out there; it's a Red Hat product now. We don't quite yet know what we're going to do to replace our Perceus provisioning system, but we do have a set of requirements: that it be API-driven, that it be highly scalable, that it have a modular architecture, and that it be able to provision a wide range of systems, from diskful systems to diskless systems. Like I said, we're using Ansible, so it'd be nice if it integrated well with Ansible. It needs to support a large range of hardware, not only in scale and capabilities, but also in age, so interfaces for system management such as IPMI that are still in use, as well as new things like Redfish that are coming along. And it needs to be built with the modern world in mind and actively supported. And I think that Ironic fits the bill for all of our requirements.

So what are some of the benefits of Ironic for us? One: it could be one service to rule them all. Today, since we have a lot of different systems with different characteristics, we have just as many different ways of provisioning and managing those systems, and because of that we'd like to get one tool that can handle everything that we've got. And I think Jacob did a good job of showing that we can do diskful and diskless, we can deploy to a RAM disk, and it can scale out. We also need a modern approach to image management and deployment, and Ironic provides this: it can deploy over a simple HTTP interface, and it's easy to just keep dropping new files into an HTTP server and then be able to deploy them to our systems. It has a very active and supportive development community, and there are many different ways to interact with that community; I'm fond of talking with Julia and Jacob all the time on the IRC channel they use. And on standalone services: the fact that Ironic can be used standalone, and doesn't require OpenStack, is really attractive to us, because as Jacob said earlier, we're not really in a position to take on managing all of the cloud stuff that may go along with something like Ironic. But the good thing is that Ironic is well integrated with OpenStack, so as we, or if we, go down the road of adopting more cloud technology inside of our data centers, Ironic's already there, ready to provision that cloud system just as well as it does our HPC systems.

So I really think that Ironic is going to enable LANL HPC to be responsive and innovative with our system management, providing our users with world-class computing in support of their world-class science achievements. Thank you.
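To make the kind of requirements Tim lists a bit more concrete, here is a rough sketch, not taken from LANL's actual tooling, of enrolling a Redfish-managed node in standalone Ironic and deploying an image served from a plain HTTP server; every address, credential, and checksum below is a placeholder.

    import openstack

    conn = openstack.connect(
        auth_type="none",
        baremetal_endpoint_override="http://ironic.example.com:6385",
    )

    # Enroll a node managed over Redfish (placeholder BMC details).
    node = conn.baremetal.create_node(
        name="hpc-node-0001",
        driver="redfish",
        driver_info={
            "redfish_address": "https://bmc.example.com",
            "redfish_username": "admin",
            "redfish_password": "secret",
        },
    )

    # Walk the node through manageable and available, waiting at each step.
    conn.baremetal.set_node_provision_state(node, "manage")
    conn.baremetal.wait_for_nodes_provision_state([node], "manageable")
    conn.baremetal.set_node_provision_state(node, "provide")
    conn.baremetal.wait_for_nodes_provision_state([node], "available")

    # Point the node at an image on a plain HTTP server and deploy it.
    conn.baremetal.update_node(
        node,
        instance_info={
            "image_source": "http://images.example.com/hpc-compute.qcow2",
            "image_checksum": "0123456789abcdef0123456789abcdef",
        },
    )
    conn.baremetal.set_node_provision_state(node, "active")
    conn.baremetal.wait_for_nodes_provision_state([node], "active")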