Live from Boston, Massachusetts, it's theCUBE. Covering Red Hat Summit 2019, brought to you by Red Hat. And welcome back to theCUBE as our coverage continues here at Red Hat Summit 2019. We are live at Red Hat Summit 2019, along with Stu Miniman. I'm John Walls. We're now joined by Robin Goldstone, who is the HPC solution architect at the Lawrence Livermore National Laboratory. Hello, Robin, how are you? Good, good to see you guys. So, yeah, on the keynote stage this morning, fascinating presentation, I thought. First off, for the viewers at home who might not be too familiar with the laboratory, if you could please just give us that 30,000-foot view of what kind of national security work you're involved with. Sure, so yes indeed, we are a national security lab. And first and foremost, our mission is ensuring the safety, security, and reliability of our nuclear weapons stockpile. And there's a lot to that mission, but we also have a broader national security mission. We work on counterterrorism and non-proliferation, a lot of cyber security kinds of things. But even just general science, we're doing things with precision medicine and just all sorts of interesting technology. Fascinating, for sure. Yeah, so Robin, so much in IT, the buzzword of the last bunch of years has been scale. And we talk about what the public cloud people are doing. Labs like yours have been challenged with scale in many different ways, and performance especially is usually at the forefront of where things are. You talked about, in the keynote this morning, Sierra, the latest generation supercomputer, the number two supercomputer. So, I don't know how many people understand the petaflops, 125 petaflops and the like, but tell us a little bit about the why and the what of that. Right, so Sierra is a supercomputer, and what's unique about these systems is what we're solving. There are lots of systems networked together that maybe have a bigger number of servers than us, but we're doing scientific simulation, and that kind of computing requires a level of parallelism and is very tightly coupled. So all the servers are running a piece of the problem, and they all have to sort of operate together. If any one of them is running slow, it makes the whole thing go slow. So it's really this tightly coupled nature of supercomputers that makes things really challenging. We talked about performance; if one server's just running slow for some reason, everything else is going to be affected by that. So we really do care about performance, and we really do care about just every little piece of the hardware performing as it should.
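To make that "one slow server slows everything" point concrete, here is a toy sketch, not code from the lab, of a bulk-synchronous job in which every step ends with a synchronization, so each step runs at the pace of the slowest server. All of the numbers are invented for illustration.

```python
import random

# Toy model of a tightly coupled, bulk-synchronous job: every server computes
# its piece of the problem, then all of them synchronize before the next step.
# Each step therefore takes as long as the slowest server, so a single laggard
# drags the whole machine down.

NUM_SERVERS = 1000
NUM_STEPS = 100

def total_runtime(slow_fraction=0.0, slowdown=2.0):
    total = 0.0
    for _ in range(NUM_STEPS):
        # Healthy servers take roughly 1 time unit, with a little jitter.
        times = [1.0 + random.uniform(0.0, 0.05) for _ in range(NUM_SERVERS)]
        # A small fraction of servers run slow on this step.
        for i in range(int(NUM_SERVERS * slow_fraction)):
            times[i] *= slowdown
        total += max(times)  # everyone waits for the slowest server
    return total

print("all servers healthy:  ", round(total_runtime(0.0), 1))
print("0.1% of servers slow: ", round(total_runtime(0.001), 1))
```

In this toy model, even one laggard out of a thousand servers is enough to roughly double the runtime of the whole job, which is why every little piece of hardware has to perform as it should.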
So, I think with national security, nuclear stockpiles, I mean, there is nothing more important, obviously, than the safety and security of the American people. You're at the center of that, and you're open source. You know, how does that work? Because as much trust and faith and confidence as we have in the open source community, this is an extremely important responsibility that's being consigned, more or less, to that open source community. Sure, you know, at first people do have that feeling that we should be running some secret sauce. I mean, our applications themselves are secret, but when it comes to the system software and all the software around the applications, open source makes perfect sense. I mean, we started out running really closed source solutions. In some cases, the hardware itself was really proprietary, and of course the vendors who made that proprietary hardware wanted their software to be proprietary too. But I think this will resonate with most people: you buy a piece of software, and the vendor tells you it's great, it's gonna do everything you need it to do, and trust us, right? Okay, but at our scale, it often doesn't work the way it's supposed to work. They've never tested it at our scale. And when it breaks, now they have to fix it. They're the only ones that can fix it. And in some cases we found the vendors deciding, you know what, no one else has one quite like yours, and it's a lot of work to make it work for you, so we're just not gonna fix it, right? You can't wait, right? Right, and so open source is just the opposite of that, right? I mean, we have all that visibility into that software. If it doesn't work for our needs, we can make it work for our needs. And then we can give it back to the community, because even though people aren't doing things at the scale that we are today, a lot of the things that we're doing really do trickle down and can be used by a lot of other people. Yeah, but there's something really important there, because as you said, it used to be, okay, the crazy supercomputer is what we know, so let's use proprietary interfaces, I need the highest speed, and therefore it's not the general-purpose stuff. You moved to x86; Linux is something that's been in supercomputers a while, but it's a finely tuned version there. Let's get the duct tape and baling wire and don't breathe on it once we get it running. You're running RHEL today. Talk a little bit about the journey with RHEL and, you know, now on the supercomputers. Right, so again, there's always been this sort of proprietary, really high-end supercomputing, but in the late 1990s, early 2000s, that's when we started building these commodity clusters. You know, at the time I think Beowulf was the terminology for that, but basically we were looking at how we could take these basic off-the-shelf servers and make them work for our applications, and trying to take advantage of as much commodity technology as we could, because we didn't want to reinvent anything. We wanted to use as much as possible, and so we've really ridden that curve. Initially it was just Red Hat Linux, there was no RHEL at the time, but then when we started getting into the newer architectures, going from x86 to x86-64 and Itanium, you know, the support just wasn't there in basic Red Hat. And again, even though it's open source and we could do everything ourselves, we don't want to do everything ourselves. I mean, having this enterprise edition of Red Hat, having a company stand behind it, the software is still open source, we can look at the source code, we can modify it if we want, but you know what, at the end of the day we're happy to hand over some of our challenges to Red Hat and let them do what they do best. They have great reach into the kernel community, they can get things done that we can't necessarily get done, so it's a great relationship. Yeah, so that last mile, getting it on Sierra, is that the first time on kind of the big showcase supercomputer? Sure, and part of the reason for that is because those big computers themselves are basically now mostly commodity. I mean again, you talked about a Cray, some really exotic architecture. I mean, Sierra is a collection of Linux servers.
Now in this case, they're running the POWER architecture instead of x86, so Red Hat did a lot of work with IBM to make sure that POWER was fully supported in the RHEL stack. But again, the servers themselves are somewhat commodity. We're running NVIDIA GPUs; those are widely used everywhere, obviously, a big deal for machine learning and stuff. The biggest proprietary component we're still dealing with is the interconnect. So I mentioned these clusters have to be really tightly coupled, the performance has to be really superior and, most importantly, the latency, right? They have to be super low latency, and Ethernet just doesn't cut it. Yeah, so you're running InfiniBand today, I'm assuming. We're running InfiniBand, Mellanox InfiniBand, on Sierra. On some of our commodity clusters we run Mellanox, on other ones we run Intel Omni-Path, which is just another flavor of InfiniBand. You know, if we could use Ethernet we would, because again, we would get all the benefit and the leverage of what everybody else is doing, but it just hasn't quite been able to meet our needs in that area. Now, if I recall the history lesson we got a bit of from you this morning, the laboratory's been around since the early 50s, born of the Cold War, so obviously open source was a long way off. What about your evolution to open source? I mean, as this has taken hold now, there had to be a tipping point at some point that converted and made the laboratory believers. But if you can, can you go back to that process? Was it a big moment, a big bang, or was it just a kind of steady migration over? Well, it's interesting, if you go way back, we actually wrote the operating systems for those early Cray computers. We wrote those operating systems in-house because there really was no operating system that would work for us. So we've been software developers for a long time, we've been system software developers, but at that time it was all proprietary and closed source, so we knew how to do that stuff. I think really what happened was, when these commodity clusters came along and we showed that we could build a cluster that could perform well for our applications on that commodity hardware, we started with Red Hat, but we had to add some things on top. We had to add the software that made a bunch of individual servers function as a cluster. So all the system management stuff, the resource manager, the thing that lets us schedule jobs, batch jobs, we wrote that software; the parallel file system, those things did not exist in open source, and we helped to write those things, and those things took on lives of their own. So Lustre is a parallel file system that we helped develop; Slurm, anyone outside of HPC probably hasn't heard of it, but it's a resource manager that, again, is very widely popular. So the lab really saw that we got a lot of visibility by contributing this stuff to the community, and I think everybody has embraced it, and we develop open source software at all different layers of the software stack.
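Circling back to the interconnect point above, a rough, hedged bit of arithmetic shows why per-message latency, rather than raw bandwidth, is what makes Ethernet a hard sell for tightly coupled jobs. The message counts and latency figures below are ballpark assumptions chosen for illustration, not measurements from Sierra or any particular fabric.

```python
# Rough arithmetic: in a tightly coupled job, servers exchange many small
# synchronizing messages every timestep, so per-message latency adds up on
# the critical path. All figures below are assumptions, not measurements.

messages_per_step = 1000     # small messages on the critical path per timestep (assumed)
steps = 1_000_000            # timesteps in a long-running simulation (assumed)

for fabric, latency_us in [("low-latency fabric, ~1 us per message", 1.0),
                           ("commodity Ethernet, ~30 us per message", 30.0)]:
    wait_hours = messages_per_step * steps * latency_us / 1e6 / 3600
    print(f"{fabric}: about {wait_hours:.1f} hours spent just waiting on latency")
```

Under these assumed counts, the higher-latency fabric turns minutes of waiting into hours over the life of a run, which is the kind of gap a tightly coupled application cannot hide.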
Robin, I'm curious how you look at public cloud. So I look at the public cloud, they do a lot with government agencies, they've got GovCloud, and I've talked to companies that said, I could have built a supercomputer, and here's how long it would take, but I could spin it up in minutes and use it for what I need. Is that a possibility for something of yours? I understand maybe not the super high performance, but where does it fit in? Sure, and yeah, I mean, certainly for a company that has no experience or no infrastructure, cloud makes sense, but we have invested a huge amount in our data center, and we have a ton of power and cooling and floor space. We have already made that investment, so trying to outsource that to the cloud doesn't make sense. There are definitely things cloud is great for. We are using GovCloud for things like prototyping, or someone wants a server with some architecture that we don't have and we want the ability to just spin it up; if we had to go and buy it, it would take six months, because we are the government, but we're able to just spin that stuff up. We use it for open source build and test, we use it at conferences when we want to run a tutorial and spin up a bunch of instances of Linux. But the biggest thing is, at the end of the day, our most important workloads are in a classified environment, and we don't have the ability to run those workloads in the cloud. And so to do it on the open side and not be able to leverage it on the closed side really takes away some of the value of it, because we really want to make the two environments look as similar as possible, leverage our staff and everything like that. So that's where cloud just doesn't quite fit in for us. You were talking about the speed of Sierra and then also mentioning El Capitan, which is the next iteration, your next unbelievably fast supercomputer, to the extent of 10X of what your current speed is, within the next four to five years. Right, that's the goal. I mean, what, throw some numbers at us there, because you put a pretty impressive array up there. Right, so Sierra is about 125 petaflops, and the big holy grail for high performance computing is exascale, an exaflop of performance, and so El Capitan is targeted to be 1.2, maybe 1.5 exaflops or even more. Again, that's peak performance; it doesn't necessarily translate into what our applications can get out of the platform. Sometimes I think, isn't it enough, isn't 125 petaflops enough? But it's never enough, because any time we get another platform, people figure out how to do things with it that they've never done before. Either they're solving problems faster than they could before, and so now they're able to explore a solution space much faster, or, these are simulations of three-dimensional space, they want to be able to look at it at a more fine-grained level. So again, every computer we get, we can either push a workload through 10 times faster or we can look at a simulation that's 10 times more resolved than the one that we could do before.
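For anyone doing the math at home, those peak numbers line up with the 10X figure mentioned above; here is a quick back-of-the-envelope check using only the figures quoted in the conversation (peak performance only, not application performance).

```python
# Back-of-the-envelope check of the peak-performance jump described above.
# Peak figures only; as noted, they don't translate directly into what
# applications actually achieve.
sierra_peak_petaflops = 125
el_capitan_targets_exaflops = (1.2, 1.5)

for target in el_capitan_targets_exaflops:
    speedup = target * 1000 / sierra_peak_petaflops  # 1 exaflop = 1000 petaflops
    print(f"{target} exaflops is roughly {speedup:.0f}x Sierra's peak")
```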
So do this for me and for the folks at home: take the work that you do and translate why that exponential increase in speed will make you better at what you do, in terms of decision making and processing of information. Right, so the thing is, these nuclear weapons systems are very complicated. There's multi-physics, there are lots of different interactions going on, and we have to really understand them at the lowest level. One of the reasons that's so important now is that we're maintaining a stockpile that is well beyond the lifespan it was designed for. These nuclear weapons, some of them were built in the 50s, the 60s, the 70s. They weren't designed to last this long, right? And so now they're sort of out of their design regime, and we really have to understand their behavior and their properties as they age. So it opens up a whole other area that we have to be able to explore, and some of that physics has never been explored before. So the problems get more challenging the farther we get away from the design basis of these weapons, but also we're really starting to do new things like AI and machine learning, things that weren't part of our workflow before. We're starting to incorporate machine learning in with simulation, again, to help explore a very large problem space and be able to find interesting areas within a simulation to focus in on. And so that's a really exciting area, and that is also an area where GPUs and stuff have just exploded the performance levels that people are seeing on these machines. Well, we thank you for your work. It is critically important, as we all realize, and wonderfully fascinating at the same time. So thanks for the insights here and for your time. We appreciate that. All right, thanks for having me. Thank you, Robin Goldstone, for joining us. Back with more here on theCUBE; you're watching our coverage live from Boston, at Red Hat Summit 2019.