Welcome back to SuperCloud 2. This is Dave Vellante, and we're here with Jack Greenfield. He's the vice president of enterprise architecture and the chief architect for the global technology platform at Walmart. Jack, I want to thank you for coming on the program. Really appreciate your time.

Glad to be here, Dave. Thanks for inviting me, and I appreciate the opportunity to chat with you.

Yeah, it's our pleasure. Now, we call what you've built a supercloud. That's our term, not yours. But how would you describe the Walmart Cloud Native Platform?

So WCNP, as the acronym goes, is essentially an implementation of Kubernetes for the Walmart ecosystem. What that means is that we've taken Kubernetes off the shelf as open source, and we've integrated it with a number of foundational services that provide other aspects of our computational environment. Kubernetes off the shelf doesn't do everything. It does a lot, in particular the orchestration of containers, but it delegates a lot of key functions through APIs: secret management and traffic management, for example. There's a need for telemetry and observability at a scale beyond what you get from raw Kubernetes, that is to say, harvesting the metrics coming out of Kubernetes, processing them, storing them in time-series databases, dashboarding them, and so on. There's also an angle to Kubernetes that gets a lot of attention in the daily DevOps routine but isn't really part of the open source deliverable itself, and that's the CI/CD pipeline-oriented lifecycle. That's something else we've added and integrated nicely. And then one more piece of this picture: within a Kubernetes cluster there's a function, provided by the concept of a service mesh, that's critical to allowing services to discover each other and integrate with each other securely and with proper configuration. Istio and Linkerd are examples of service mesh technologies. There are more than those two, but those are the two we've integrated with Kubernetes.

So the net effect is that when a developer within Walmart goes to build an application, they don't have to think about all those other capabilities, where they come from, or how they're provided. Those are already present, and the way the CI/CD pipelines are set up, it's already in the picture. There are configuration points they can take advantage of in the primary YAML, and a couple of other pieces of config that we supply, where they can tune it. But at the end of the day it offloads an awful lot of work from them: standing up and operating those services, failing them over properly, making them robust. All of that's provided for them.
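To make that concrete, here's a minimal sketch, in Python, of the kind of deployment manifest a developer on a platform like WCNP might touch. The platform wiring (service mesh injection, metrics scraping, secret references) shows up as a handful of configuration points rather than services the team has to build. The names and values are illustrative assumptions, not Walmart's actual configuration; only the Kubernetes fields and the Istio and Prometheus annotation keys are standard.

```python
import json

# Hypothetical sketch: the "primary YAML" a platform developer tunes,
# modeled here as a Python dict. Kubernetes fields and the
# Istio/Prometheus annotation keys are standard; names are illustrative.
def app_manifest(name: str, image: str, replicas: int = 3) -> dict:
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {
                    "labels": {"app": name},
                    "annotations": {
                        # Platform-provided concerns surface as config points:
                        "sidecar.istio.io/inject": "true",  # service mesh
                        "prometheus.io/scrape": "true",     # telemetry
                    },
                },
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "env": [{
                            # Secrets come from the platform's secret
                            # management, not from the app repo.
                            "name": "DB_PASSWORD",
                            "valueFrom": {"secretKeyRef": {
                                "name": f"{name}-secrets",
                                "key": "db-password",
                            }},
                        }],
                    }],
                },
            },
        },
    }

print(json.dumps(app_manifest("checkout", "registry.example.com/checkout:1.4"), indent=2))
```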
Yeah, you know, developers often complain they spend too much time wrangling infrastructure and doing things that aren't productive. So I wonder if you could talk about the high-level business goals of the initiative in terms of hard benefits. Was the real impetus to tap into best-of-breed cloud services? Were you trying to cut costs, maybe gain negotiating leverage with the cloud guys? Resiliency, I know, was a major theme. Maybe you could give us a sense of the anatomy of the decision-making process.

Sure, and in the course of answering your question I'm going to introduce the concept of our triplet architecture, which we haven't yet touched on in this interview.

First off, just to wrap up the motivation for WCNP itself, which is orthogonal to the triplet architecture: it can exist with or without it, and it currently does exist with it, which is key, and I'll get to that in a moment. The key business drivers for WCNP were, first, developer productivity, by offloading the kinds of concerns we've just discussed. Number two, improving resiliency, that is to say, reducing the opportunity for human error. One of the challenges you tend to run into in a large enterprise is what we call snowflakes: lots of gratuitously different workloads, projects, and configurations. By developing and using WCNP, and continuing to evolve it as we have, we end up with cookie-cutter consistency across our workloads, which is super valuable when it comes to building tools or services to automate operations that would otherwise be manual. When everything is done pretty much the same way, that becomes much simpler.

Another key motivation for WCNP was the ability to abstract from the underlying cloud provider, and this leads into a discussion of our triplet architecture. At the end of the day, when one works directly with an underlying cloud provider, one ends up taking a lot of dependencies on that particular cloud provider. Those dependencies can be valuable. For example, there are best-of-breed services, like Cloud Spanner from Google or Cosmos DB from Microsoft, that one wants to use, and one is willing to take the dependency on the cloud provider to get that functionality because it's unique and valuable. On the other hand, one doesn't want to take dependencies on a cloud provider that don't add a lot of value. With Kubernetes, and this is a large part of how Kubernetes was designed and why it is the way it is, we have the opportunity to abstract from the underlying cloud provider for stateless compute workloads. What this lets us do is build container-based applications that can run without change on different cloud providers' infrastructure. So the same applications can run on WCNP over Azure, WCNP over GCP, or WCNP over the Walmart private cloud. And we do have a private cloud. It's OpenStack-based, and it gives us significant cost advantages as well as control advantages.

So to your point about business motivation, there's a key cost driver here, which is that we can use our own private cloud when it's advantageous and then use the public cloud providers' capabilities when we need to. A key place where this comes into play is elasticity. While the private cloud is much more cost-effective for us to run and use, it isn't as elastic as what the cloud providers offer. We don't have essentially unlimited scale. We have large scale, but the public cloud providers are elastic in the extreme, which is a very powerful capability. So what we're able to do is burst, and we use this term, bursting workloads into the public cloud from the private cloud to take advantage of the elasticity they offer, and then fall back into the private cloud when the traffic load diminishes to the point where we don't need that elastic capacity. And this is a very important paradigm that I think is going to become commonplace as the industry evolves: the private cloud is easier to operate and less expensive, and yet the public cloud providers' capabilities are difficult to match.
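Here's a toy illustration of that bursting decision: run on the cheaper private cloud while it has headroom, spill into a public provider when utilization climbs, and fall back once load subsides. The thresholds, and the hysteresis that prevents flapping between placements, are assumptions for the sketch, not Walmart's actual policy.

```python
from dataclasses import dataclass

@dataclass
class Capacity:
    used: int   # e.g. vCPUs currently allocated
    total: int  # vCPUs available in the private cloud

# Illustrative assumptions, not a real policy:
BURST_THRESHOLD = 0.85   # burst out when private utilization passes this
RETURN_THRESHOLD = 0.60  # fall back only once utilization drops below this

def place_workload(private: Capacity, currently_bursting: bool) -> str:
    utilization = private.used / private.total
    if utilization >= BURST_THRESHOLD:
        return "public"   # elastic capacity, higher unit cost
    if currently_bursting and utilization >= RETURN_THRESHOLD:
        return "public"   # hysteresis: avoid flapping between clouds
    return "private"      # cheaper to run when capacity allows

print(place_workload(Capacity(used=880, total=1000), currently_bursting=False))  # public
print(place_workload(Capacity(used=500, total=1000), currently_bursting=True))   # private
```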
And the triplet, the "tri," is your on-prem private cloud and the two public clouds that you mentioned. Is that right?

That is correct. And we actually have an architecture in which we operate all three of those cloud platforms in close proximity to one another in three major regions in the US. We have East, West, and Central, and in each of those regions we have all three cloud providers. The way it's configured, those data centers are within 10 milliseconds of each other, meaning the cost of interacting between them is negligible. This allows us to be fairly agnostic to where a particular workload is running.

Does a human make that decision, Jack, or is there some intelligence in the system that determines that?

It's a really great question, Dave, because we're at the cusp of that transition. Currently, humans make that decision. Humans choose to deploy workloads into a particular region and a particular provider within that region. That said, we're actively developing patterns and practices that will allow us to automate the placement of workloads against a variety of criteria. For example, if in a particular region a particular provider is heavily overloaded and unable to provide the level of service expected through our SLAs, we could choose to fail workloads over from that cloud provider to a different one within the same region. But that's manual today. We do that, but people do it. We'd like to get to where that happens automatically. In the same way, we'd like to be able to automate the failovers, both for high availability and for the heavier disaster recovery model: within a region between providers, within a provider between its availability zones, and also between regions for disaster recovery or maintenance-driven realignment of workload placement. Today, that's all manual. We have people moving workloads from region A to region B, or data center A to data center B. It's clean because of the abstraction. The workloads don't have to know or care, but there are latency considerations that come into play, and the humans have to be cognizant of those. Automating that can help ensure we get the best performance and the best reliability.

But you're developing the data set to actually, I would imagine, be able to make those decisions in an automated fashion over time anyway. Is that a fair assumption?

It is, and that's what we're actively developing right now. If you were to look at us today, we have these nice abstractions and APIs in place, but people run that machine, if you will. We're moving toward a world where that machine is fully automated.

What exactly are you abstracting? Is it the deployment model, or are you able to abstract, I'm just making this up, Azure Functions and GCP functions, so that you can run them with a consistent experience? What exactly are you abstracting, and how difficult was it to achieve that objective technically?

It's a good question. What we're abstracting is the Kubernetes node construct. A cluster of Kubernetes nodes, which are typically VMs, although they can run on bare metal in certain contexts, is something that typically requires knowledge of the underlying cloud provider to stand up. So for example, with GCP you would use GKE to set up a Kubernetes cluster, and in Azure you'd use AKS. We're abstracting that aspect of things so that developers standing up applications don't have to know what the underlying cluster management provider is. They don't have to know if it's GKE, AKS, or our own Walmart private cloud.
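A minimal sketch of what abstracting that cluster-management layer might look like: application teams code against a provider-neutral interface, and only the platform knows whether GKE, AKS, or OpenStack sits underneath. All class and method names here are hypothetical; a real implementation would call the respective provider APIs.

```python
from abc import ABC, abstractmethod

# Hypothetical provider-neutral interface; names are illustrative.
class ClusterProvider(ABC):
    @abstractmethod
    def create_cluster(self, name: str, nodes: int, region: str) -> str:
        """Provision a Kubernetes cluster and return its endpoint."""

class GKEProvider(ClusterProvider):
    def create_cluster(self, name, nodes, region):
        return f"gke://{region}/{name} ({nodes} nodes)"        # would call GCP APIs

class AKSProvider(ClusterProvider):
    def create_cluster(self, name, nodes, region):
        return f"aks://{region}/{name} ({nodes} nodes)"        # would call Azure APIs

class PrivateCloudProvider(ClusterProvider):
    def create_cluster(self, name, nodes, region):
        return f"openstack://{region}/{name} ({nodes} nodes)"  # would call OpenStack APIs

def provision(provider: ClusterProvider, name: str) -> str:
    # Callers depend only on the abstraction, never on GKE or AKS directly.
    return provider.create_cluster(name, nodes=5, region="us-central")

for p in (GKEProvider(), AKSProvider(), PrivateCloudProvider()):
    print(provision(p, "orders"))
```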
Now, in terms of functions like the Azure Functions you mentioned, we haven't done that yet. That's another piece we have on our radar screen that we'd like to get to: a serverless approach, with the Knative work from Google and Azure Functions. Those are things we see good opportunity to use for a whole variety of use cases, but right now we're not doing much with that. We're strictly container-based right now, and we do have some VMs running in more of a traditional model. Our stateful workloads are primarily VM-based. But serverless is an opportunity for us to take some of these stateless workloads and turn them into cloud functions.

Well, and that's another cost lever you can pull down the road that's going to drop right to the bottom line. Do you see a day, or maybe you're doing it today, though I'd be surprised, where you'll build applications that actually span multiple clouds? Or, in your view, is there always going to be a direct one-to-one mapping between where an application runs and the specific cloud platform?

That's a really great question. Well, yes and no. Today, application development teams choose a cloud provider to deploy to and a location to deploy to, and they have to get involved in moving an application, like we talked about. That said, the bursting capability I mentioned previously is a step in the direction of automatic migration, that is to say, migrating workloads to different locations automatically. Currently, the prototypes we've been developing, which we think will eventually make their way into production, leverage Istio to assess the load incoming on a particular cluster and start shedding that load into a different location. Right now the configuration of that is still manual, but there's another opportunity for automation there. And I think a key piece of this is that down the road, that's a small step in the direction of an application being multi-provider. We expect to see an abstraction of the fact that there's a triplet at all, so that workloads move around according to whatever the control plane decides is necessary, based on a whole variety of inputs. And at that point you'll have true multi-cloud applications: applications distributed across the different providers, in a way that application developers don't have to think about.
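As a rough illustration of that Istio-based load shedding, the sketch below builds a standard Istio VirtualService that splits traffic between a local cluster and a burst location by weight, and recomputes the split from observed load. The resource fields are standard Istio; the host names and the reweighting rule are assumptions for illustration, not the production prototype.

```python
# Sketch of weighted traffic shedding via an Istio VirtualService.
# Host names are hypothetical; the resource fields are standard Istio.
def virtual_service(service: str, local_weight: int) -> dict:
    assert 0 <= local_weight <= 100
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": service},
        "spec": {
            "hosts": [service],
            "http": [{
                "route": [
                    # Keep local_weight percent of traffic on this cluster...
                    {"destination": {"host": f"{service}.local.svc.cluster.local"},
                     "weight": local_weight},
                    # ...and shed the remainder to the burst location.
                    {"destination": {"host": f"{service}.burst.example.com"},
                     "weight": 100 - local_weight},
                ],
            }],
        },
    }

def reweight(current_load: float, capacity: float) -> int:
    # Keep what the local cluster can serve; shed the overflow elsewhere.
    return min(100, int(100 * capacity / max(current_load, capacity)))

# Cluster rated for 900 req/s sees 1200 req/s: keep 75%, shed 25%.
print(virtual_service("checkout", reweight(current_load=1200.0, capacity=900.0)))
```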
Walmart's been a leader, Jack, in using data for competitive advantage for decades. It's kind of been the poster child for that. You've got a mountain of IP in the form of data, tools, applications, and best practices that, until the cloud came out, was all on-prem. But I'm really interested in this idea of building a Walmart ecosystem, which obviously you have. Do you see a day, or maybe you're even doing it today, where you take what we call the Walmart supercloud, WCNP in your words, and point it toward an external world, your ecosystem, supporting those partners or customers, and drive new revenue streams directly from the platform?

Great question, Dave. So there are really two things to say here. The first is that, with respect to data, our data workloads are primarily VM-based, as I mentioned before: some VMware, some of them straight OpenStack. The key here is that WCNP and Kubernetes are very powerful for stateless workloads, but for stateful workloads they tend to be still climbing a bit of a growth curve in the industry. So our data workloads are not primarily based on WCNP; they're VM-based. Now, that said, there is opportunity to make some progress there, and we are looking at ways to move things into containers that are currently running in VMs and are stateful.

The other question you asked is related to how we expose data, and also functionality, to third parties. Right now, we do have in-house, for our own use, a very robust data architecture. We've followed the domain-oriented data architecture guidance published on Martin Fowler's site, and we have data lakes in which we collect data from all the transactional systems, which we can then use, and do use, to build models that are then used in our applications. But right now we're not exposing the data directly to customers as a product. That's an interesting direction that's been talked about, and it may happen at some point, but right now that's internal. What we are exposing to customers is applications. So we're offering our global integrated fulfillment capabilities, our order picking and curbside pickup capabilities, and our cloud-powered checkout capabilities to third parties. And this means we're standing up our own internal applications as externally facing SaaS applications, which can serve our partners' customers.

Yeah, and of course it was on Martin Fowler's site that Zhamak Dehghani first introduced the world to the data mesh concept and this whole idea of data products and domain-oriented thinking. Zhamak Dehghani, by the way, is a speaker at our event as well. Last question I had is edge, and how you think about the edge. The stores are an edge. Are you putting resources there that mirror this triplet model, or is it better to consolidate things in the cloud? I know there are trade-offs in terms of latency. How are you thinking about that?

All really good questions. It's a challenging area, as you can imagine, because edges are subject to disconnection, right? Or reduced connection. We do place the same architecture at the edge. WCNP runs at the edge, and an application designed to run on WCNP can run at the edge. That said, there are a number of very specific considerations that come up when running at the edge, such as the possibility of disconnection or degraded connectivity. And so one of the challenges we've faced, and grappled with and done a good job of, I think, is dealing with the fact that applications go offline, come back online, and have to reconnect and resynchronize. That online/offline capability is something that can be quite challenging. We have a couple of application architectures that form the two core sets of patterns we use. One is an offline/online synchronization architecture, where we discover that we've come back online, understand the differences between the online dataset and the offline dataset, and then reconcile them. The other is a message-based architecture. Here, in our health and wellness domain, we've developed applications that are queue-based: essentially business processes consisting of multiple steps, where each step has its own queue. What that allows us to do is devote whatever bandwidth we do have to the pieces of the process that are most latency-sensitive, and allow the queue lengths to increase in the parts of the process that are not latency-sensitive, knowing they'll eventually catch up when the bandwidth is restored.
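A toy sketch of that queue-per-step pattern: each step in the business process gets its own queue, and under constrained bandwidth the latency-sensitive queues are drained first while the others simply grow and catch up later. The step names and the bandwidth budget are made-up illustrations, not Walmart's actual processes.

```python
from collections import deque

# Hypothetical queue-per-step business process; names are illustrative.
class Step:
    def __init__(self, name: str, latency_sensitive: bool):
        self.name = name
        self.latency_sensitive = latency_sensitive
        self.queue: deque = deque()

steps = [
    Step("verify-prescription", latency_sensitive=True),
    Step("notify-customer", latency_sensitive=True),
    Step("sync-inventory", latency_sensitive=False),
    Step("upload-analytics", latency_sensitive=False),
]

def drain(budget: int) -> None:
    """Spend a limited bandwidth budget, latency-sensitive queues first.
    Non-sensitive queues simply grow and catch up when bandwidth returns."""
    for step in sorted(steps, key=lambda s: not s.latency_sensitive):
        while budget > 0 and step.queue:
            step.queue.popleft()  # send one message over the constrained link
            budget -= 1

# Simulate degraded connectivity: plenty of backlog, tiny budget.
for step in steps:
    step.queue.extend(range(10))
drain(budget=12)
for step in steps:
    print(step.name, "backlog:", len(step.queue))
```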
And to put that in a bit of context, we have fiber links to all of our locations, and we have, I'll just use a round number, ten-ish thousand locations. It's larger than that, but that's the ballpark. We have fiber to all of them, but when the fiber is disconnected, when the disconnection happens, we're able to fall back to 5G and to Starlink. Starlink, the higher-bandwidth option, is preferred, with 5G if that fails. But in each of those cases the bandwidth drops significantly, so the applications have to be intelligent about throttling back the traffic that isn't essential, so they can push the essential traffic through in those lower-bandwidth scenarios.

So much technology to support this amazing business, which started in the early 1960s. Jack, unfortunately we're out of time. I would love to have you back, or some members of your team, and drill into how you're using open source. But really, thank you so much for explaining the approach you've taken and for participating in SuperCloud 2.

You're very welcome, Dave, and we're happy to come back and talk about other aspects of what we do. For example, we could talk more about the data lakes and the data mesh that we have in place, or about the directions we might go with serverless. So please look us up again. Happy to chat.

I'm going to take you up on that, Jack. All right, this is Dave Vellante for John Furrier and the CUBE community. Keep it right there for more action from SuperCloud 2.