Good morning, welcome back to theCUBE's coverage at Supercomputing Conference 2022. Live here in Dallas, I'm Dave Nicholson with my co-host, Paul Gillin. So far so good, Paul. It's been a fascinating morning. Three days in and a fascinating guest, Ian from AWS. Welcome. Thanks, Dave. What are we going to talk about? Batch computing, HPC. We've got a lot. Let's get started, let's dive right in. Yeah, we've got a lot to talk about. I mean, the first thing is we recently announced our batch support for EKS. EKS is our managed Kubernetes offering at AWS. And batch computing is still a large portion of HPC workloads. While the interactive component is growing, the vast majority of systems are just kind of fire and forget, and we want to run thousands and thousands of nodes in parallel. We want to scale out those workloads. And what's unique about our AWS Batch offering is that we can dynamically scale based upon the queue depth. And so customers can go from seemingly nothing up to thousands of nodes, and while they're executing their work, they're only paying for the instances while they're working. And then as the queue depth starts to drop and the number of jobs waiting in the queue starts to drop, we start to dynamically scale down those resources. And so it's extremely powerful. We see lots of distributed machine learning, autonomous vehicle simulation, and traditional HPC workloads taking advantage of AWS Batch. So when you have a Kubernetes cluster, does it have to be located in the same region as the HPC cluster that's going to be doing the batch processing? Or does the nature of batch processing mean, in theory, you can move something from here to somewhere relatively far away to do the batch processing? How does that work? Because look, we're walking around here and people are talking about lengths of cables in order to improve performance.
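The queue-depth-driven scaling Ian describes can be pictured with a minimal sketch. Everything here is illustrative: the function name, the jobs-per-node ratio, and the fleet limits are invented for the example and are not the actual AWS Batch service logic.

```python
# Hypothetical sketch of scaling a fleet in proportion to queue depth,
# the way AWS Batch is described as doing. All names and numbers are
# made up for illustration; this is not the Batch service's algorithm.

def desired_nodes(queue_depth: int, jobs_per_node: int = 4,
                  min_nodes: int = 0, max_nodes: int = 1000) -> int:
    """Target fleet size: enough nodes for the waiting jobs, within limits."""
    needed = -(-queue_depth // jobs_per_node)  # ceiling division
    return max(min_nodes, min(needed, max_nodes))

# As the queue drains, the target fleet shrinks with it, so instances
# (and cost) exist only while there is work to run.
for depth in (4000, 400, 12, 0):
    print(f"queue depth {depth:>5} -> {desired_nodes(depth)} nodes")
```

The point of the sketch is the shape of the behavior: scale-out is bounded by a fleet maximum, and scale-in follows the queue back down to zero.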
What does that look like when you peel back the covers and look at it physically, not just logically? AWS is everywhere, but physically, what does that look like? Well, physically, for us it depends on what the customer's looking for. We have workflows that are entirely within a single region, where they could have a portion of, say, the traditional HPC workflow within that region, as well as the batch work, and they're saving off the results to a shared storage file system like our Amazon FSx for Lustre, or maybe aging that back to S3 object storage for a lower-cost storage solution. Or you can have customers that have a multi-region orchestration layer, where they say, you know what, I've got a portion of my workflow that occurs over on the other side of the country, and I replicate my data between the East Coast and the West Coast just based upon business needs, and I want to have that available to customers over there. And so I'll do a portion of it on the East Coast and a portion of it on the West Coast. Or you can think of that even globally. It really depends upon the customer's architecture. So is the intersection of Kubernetes with HPC relatively new? I know you're saying you're announcing it. It really is. I think we've seen a growing perspective. I mean, Kubernetes has spent a long time kind of eating everything in the enterprise space, and now a lot of CIOs in the industrial space are saying, why am I using one orchestration layer to manage my HPC infrastructure and another one to manage my enterprise infrastructure? And so there's a growing appreciation that, you know what, why don't we just consolidate on one? And so that's where we've seen a growth of Kubernetes infrastructure and our own managed Kubernetes, EKS, on AWS. Last month, you announced the general availability of Trainium, a chip that's optimized for AI training.
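The multi-region orchestration pattern described above, where each stage of a workflow runs where its data lives, can be sketched with a simple routing table. The region names are real AWS regions, but the table and the stage names are a hypothetical example, not an AWS service or API.

```python
# Illustrative multi-region routing: each workflow stage runs in the
# region holding its data. The stage names and mapping are invented
# for this sketch; this is not an AWS orchestration API.

STAGE_REGION = {
    "preprocess":  "us-east-1",  # data replicated to the East Coast
    "solve":       "us-west-2",  # HPC solve runs on the West Coast
    "postprocess": "us-east-1",  # results land back on the East Coast
}

def region_for(stage: str) -> str:
    """Look up which region a given workflow stage should run in."""
    return STAGE_REGION[stage]

for stage in ("preprocess", "solve", "postprocess"):
    print(f"{stage}: {region_for(stage)}")
```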
Talk about what's special about that chip, or what's customized for training workloads? Yeah, what's unique about Trainium is you'll see 40% better price performance than any other GPU available in the AWS cloud. And so we've really geared it to be the most price-performant option for our customers. And that's what we love about the silicon team that was part of that Annapurna acquisition: it has really enabled us to have this differentiation and to not just be innovating at the software level, but across the entire stack. That Annapurna Labs team develops our network cards, they develop our Arm chips, they develop this training chip. And so that silicon innovation has become a core part of our differentiation from other vendors. And what Trainium allows you to do is perform similar workloads, just at a better price performance. And you also have a chip that's several years older, called Inferentia, which is for inferencing. What is the difference between, I mean, when would a customer use one versus the other? How do they move the workload? What we've seen is that for the inference portion of the workload, customers traditionally have looked for a certain class of machine, more of a compute type, that's not as accelerated or as heavy as what you'd need for training on Trainium. So when they do that training, they want the really beefy machines that can grind through a lot of data. But when you're doing the inference, it's a little lighter weight. And so it's a different class of machine. And so that's why we've got those two different product lines, with Inferentia being there to support the inference portions of the workflow and Trainium to do that kind of heavy-duty training work. And then you advise them on how to migrate their workloads from one to the other. And once the model is trained, would they switch to an Inferentia-based instance? Definitely, we help them work through what the design of that workflow looks like.
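The two-product-line split described here, heavy accelerated instances for training, a lighter class for inference, can be sketched as a simple phase-to-family mapping. The family names (trn1 for Trainium, inf1 for Inferentia) are real AWS instance families, but the selection function itself is an invented illustration, not an AWS API.

```python
# Hypothetical sketch of routing workflow phases to instance families.
# trn1 and inf1 are real AWS families, but this helper is made up
# for illustration; AWS does not expose a function like this.

def pick_instance_family(phase: str) -> str:
    """Map a workflow phase to the instance family described for it."""
    if phase == "training":
        return "trn1"  # beefy Trainium instances to grind through data
    if phase == "inference":
        return "inf1"  # lighter-weight Inferentia class for serving
    raise ValueError(f"unknown phase: {phase}")

print(pick_instance_family("training"))
print(pick_instance_family("inference"))
```

Once the model is trained on the heavy class, the serving workload moves to the lighter class, which is the migration Dave and Ian discuss.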
And some customers are very comfortable doing self-service and just kind of building on their own. Other customers look for a more professional-services engagement, to say, hey, can you come in and help me work through how I might modify my workflow to take full advantage of these resources? The HPC world has been somewhat slower than commercial computing to migrate to the cloud because... You're very polite. Latency issues. They want to control the workload. There are even issues with moving large amounts of data back and forth. What do you say to them? I mean, what's the argument for ditching the on-prem supercomputer and going all in on AWS? Well, to be fair, I started at AWS five years ago. And I can tell you, when I showed up at Supercomputing, even though I'd been part of this community for many years, they said, what is AWS doing at Supercomputing? Wait, it's Amazon Web Services. You care about the web? Can you actually handle supercomputing workloads? Now, the thing that very few people appreciated is that, yes, we could. Even at that time, in 2017, we had customers that were performing HPC workloads. That being said, there were some real limitations on what we could perform. And over those past five years, as we've grown as a company, we've started to really eliminate those frictions for customers migrating their HPC workloads to the AWS cloud. When I started in 2017, we didn't have our Elastic Fabric Adapter, our low-latency interconnect. So customers were stuck with standard TCP/IP, and for their highly demanding Open MPI workloads, we just didn't have the latencies to support them, so the jobs didn't run as efficiently as they could. We didn't have Amazon FSx for Lustre, our managed Lustre offering for a high-performance, POSIX-compliant file system, which is key to a large portion of HPC workloads: you have to have a high-performance file system.
We didn't even, I mean, we had about 25 gigabits of networking when I started. Now, with our accelerated instances, we've got 400 gigabits of networking. So we've really continued to grow across that spectrum to eliminate a lot of those frictions to adoption. One of the key ones: we had an open-source toolkit, jointly developed by Intel and AWS, called CfnCluster, that customers were using to instantiate their clusters. And now we've migrated that all the way to a fully supported service at AWS called AWS ParallelCluster. And so over those past five years, we've had to develop, we've had to grow, we've had to earn the trust of these customers and say, come run your workloads on us and we will demonstrate that we can meet your demanding requirements. And at the same time, there's been, I'd say, more of a cultural acceptance. People have gone from, again, five years ago, what are you doing walking around the show? To, okay, I'm not sure I get it, I need to look at it. To, okay, now it needs to be a part of my architecture, but with the standard questions: is it secure? What's the price performance? How does it compare to my on-prem? And really, culturally, a lot of it is just getting IT administrators used to the idea that we're not eliminating a whole field, right? We're just upskilling the people that used to rack and stack actual hardware, so that now you're learning AWS services and how to operate within that environment. And it's still key to have those people that are really supporting these infrastructures. So I'd say it's a combination of a cultural shift over the past five years, to see that cloud is a super important part of HPC workloads, and part of it has been us meeting the market segment where we needed to by innovating at both the hardware level and the software level, which we're going to continue to do.
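A quick back-of-the-envelope calculation shows what the jump from 25 to 400 gigabits means in practice for moving HPC data. This is an idealized line-rate sketch, ignoring protocol overhead and real-world throughput limits.

```python
# Illustrative math only: ideal transfer time for a dataset at a given
# line rate, ignoring protocol overhead. Not a benchmark of any AWS instance.

def transfer_seconds(size_bytes: float, gigabits_per_sec: float) -> float:
    """Seconds to move size_bytes at the given line rate (ideal case)."""
    return (size_bytes * 8) / (gigabits_per_sec * 1e9)

one_tb = 1e12  # 1 TB dataset
print(f"25 Gb/s:  {transfer_seconds(one_tb, 25):.0f} s")   # ~5 minutes
print(f"400 Gb/s: {transfer_seconds(one_tb, 400):.0f} s")  # ~20 seconds
```

The 16x bandwidth increase cuts the ideal transfer time by the same factor, which is part of why the "moving large amounts of data" objection has weakened over those five years.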
You do have an on-prem story, though. I mean, you have Outposts. We don't hear a lot of talk about Outposts lately, but these innovations, like Inferentia, like Trainium, like the networking innovation you're talking about, are these going to make their way into Outposts as well? Will that essentially become the supercomputing solution for customers who want to stay on-prem? Well, we'll see what the future holds, but we believe that we've got, as you noted, the hardware, the network, the storage; all those put together give you a high-performance computer, right? And whether you want it resident in your local data center or you want it accessible via APIs from the AWS cloud, we want to provide that service to you. So to be clear, that's not available now, but that is something that could be made available. Outposts are available right now with the services that you need. All these capabilities. Often the impetus behind a move to cloud comes from the highest levels in an organization. They're looking at the difference between OPEX and CAPEX. CAPEX for a large HPC environment can be very, very high. Are these HPC clusters consumed as an operational expense? Are you essentially renting time? And then a fundamental question: are these multi-tenant environments? Or when you're referring to batches being run on HPC, are these dedicated HPC environments for the customers running batches against them? When you think about batches, there are times when batches are being run and there are times when they're not, and that would sort of conjure up, in the imagination, multi-tenancy. What does that look like? Definitely, and let me start with your second part first. That's been a core tenet within AWS: we don't say, okay, we're going to carve out this supercomputer and then allocate it to you.
We are going to dynamically allocate multi-tenant resources to you to perform the workloads you need, especially with the batch environment. We're going to spin up containers on those resources, and then as the workload completes, we're going to turn those resources over so they can be utilized by other customers. And so that's where the batch computing component really is powerful, because, as you say, as you're releasing resources from workloads you're done with, I can use those for another portion of the workflow, for other work. Okay, so it makes a huge difference, yeah. You mentioned that five years ago, people couldn't quite believe that AWS was at this conference. Now you've got a booth right out in the center of the action. What kind of questions are you getting? What are people telling you? Well, I love being on the show floor. This is my favorite part, talking to customers and hearing: one, what do they love? What do they want more of? Two, what do they wish we were doing that we're not currently doing? And three, what are the friction points that still exist? How can I make their lives easier? And what we're hearing is, can you help me migrate my workloads to the cloud? Can you give me the information that I need, both on price performance and on an operational support model, and really help me be an internal advocate within my environment to explain how my resources can be operated proficiently within the AWS cloud? And a lot of times it's, let's just take your applications, or a subset of your applications, and benchmark them. And really, at AWS, one of the key things is we are a data-driven environment. And so you take that data and you can help a customer say, let's not just look at hypothetical, synthetic benchmarks. Let's take the actual LS-DYNA code that you're running, perhaps.
Let's take the OpenFOAM code that you're currently running in your on-premises workloads, let's run it on the AWS cloud, and let's see how it performs. And then we can take that back to the decision makers and say, okay, here's the price performance on AWS, here's what we're currently doing on-premises, how do we think about that? And that also ties into your earlier question about CAPEX versus OPEX. We have models where you can actually capitalize a longer-term purchase at AWS, so it doesn't have to be OPEX. I mean, it depends on the accounting models you want to use, but we do have a majority of customers that stay with that OPEX model, and they like the flexibility of saying, okay, spend as you go. We need to have true-ups and make sure that they have insight into what they're doing. I think one of the boogeymen is, oh, I'm going to spend all my money and I'm not going to know what's available. So we want to provide the cost visibility and the cost controls so that, as an HPC administrator, you feel you have insight into what your customers are doing and you have control over that. And once you take away some of those fears and give them the information they need, what you start to see too is, you know what, we really didn't have a lot of that cost visibility and those controls with our on-premises hardware. And we've had some customers tell us, we had one portion of the workload where a work center was spending thousands of dollars a day, and when we went back to them and started to show them what they were spending on-premises, they went, oh, I didn't realize that. And so I think that's part of a cultural thing. In HPC, the question was, well, on-premises is free; how do you compete with free? And so we need to really change that, culturally, to where people see there is no free lunch. You're paying for the resources, whether it's on-premises or in the cloud.
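The benchmark-driven comparison described here boils down to cost per completed job on each platform. A minimal sketch, with entirely invented runtimes and rates (the real exercise would use measured runtimes of the customer's actual LS-DYNA or OpenFOAM case and their real amortized and on-demand costs):

```python
# Hedged sketch of a price-performance comparison. All numbers below are
# invented placeholders, not measured AWS or on-premises figures.

def cost_per_job(runtime_hours: float, hourly_rate: float) -> float:
    """Cost to complete one job at a given effective hourly rate."""
    return runtime_hours * hourly_rate

# Hypothetical measured runtimes for the same application case:
onprem = cost_per_job(runtime_hours=10.0, hourly_rate=6.00)  # amortized node cost
cloud  = cost_per_job(runtime_hours=8.0,  hourly_rate=6.50)  # on-demand instance

print(f"on-prem: ${onprem:.2f}/job, cloud: ${cloud:.2f}/job")
```

The same arithmetic is what exposes the "on-premises is free" fallacy: once the amortized hourly rate of owned hardware is made visible, both sides of the comparison have a real number attached.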
Data scientists don't worry about budgets. Wait, wait, on-premises is free? Paul mentioned something that reminded me. You said you were here in 2017, and people said, AWS, web? What are you even doing here? Now, in 2022, you're talking in terms of migrating to cloud. Paul mentioned Outposts. Let's say a customer says, hey, I'd like you to put a thousand-node cluster in this data center that I happen to own. But from my perspective, I want to interact with it just like it's in your data center. In other words, the location doesn't matter. My experience is identical to interacting with AWS in an AWS data center, or in a colo that works with AWS, but instead it's my physical data center. When we're tracking the percentage of IT that is on-prem versus off-prem, what is that? What I just described, is that cloud? And in five years, are you no longer going to be talking about migrating to cloud, because people will go, what do you mean, migrating to cloud? What are you even talking about? What difference does it make? It's either something that AWS is offering or it's something that someone else is offering. Do you think we'll be at that point in five years? Where, in this world of virtualization and abstraction, you talked about Kubernetes, we should be there already, thinking in terms of, it doesn't matter, as long as it meets latency and sovereignty requirements. So, your prediction. We're all about insights, it's Supercomputing. In five years, will you still be talking about migrating to cloud? Or will that be something from the past? In five years, I still think there will be a component. I think the majority assumption will be that things are cloud native, that you start in the cloud, and that perhaps an aspect of that will be interacting with some sort of an edge device or some sort of an on-premises device. And we hear more and more customers saying, okay, I can see the future. I can see that I'm shrinking my footprint.
And you can see them still saying, I'm not sure how small that beachhead will be, but right now I want to at least say that I'm going to operate in that hybrid environment. And so I'd say, again, given the pace of this community, in five years we're still going to be talking about migrations, but I'd say the vast majority will be a cloud-native, cloud-first environment. And how do you classify that Outpost sitting in someone's data center? I'll leave that up to the analysts, but I think it would probably come down as cloud spend. Great place to end. Ian, you and I now officially have a bet. In five years, we're going to come back. My contention is that we're not going to be talking about it anymore, and kids in college are going to be like, what do you mean, cloud? It's all IT, it's all IT. And they won't remember this whole phase of moving to cloud and back and forth. With that, join us in five years to see the result of this mega bet between Ian and Dave. I'm Dave Nicholson with theCUBE, here at Supercomputing Conference 2022, day three of our coverage, with my co-host Paul Gillin. Thanks again for joining us. Stay tuned after this short break. We'll be back with more action.