All right. Hello. Thank you, everybody. Welcome to Detroit. This is kind of weird because I'm only looking at a few people and everybody else is over there and over there, but I'll make the best of it. I hope you're enjoying KubeCon this year. Let's get started. My name is John Belamaric, and I'm from Google Cloud.

I want to set the stage a little bit. This is an image of many of Google Cloud's points of presence of various sorts. It's not so important that it's Google Cloud; what matters is that it's quite an array of sites. Imagine yourself needing to deploy a set of applications across a geographically distributed set of sites like this. This isn't something we typically talk about here at KubeCon, at least not over the last few years that we've been doing KubeCon, but it's becoming more and more common. For one, it's exactly the type of scenario faced by telcos trying to roll out 5G, which is often built on Kubernetes. But there are other use cases as well: retail edge, where you have tens of thousands of stores, and factory automation. In fact, there's enough demand for this that the cloud providers and others have been coming up with new technologies they call multi-access edge compute, which is basically a rack you can drop into any sort of point of presence: your own site, your own data center, your own location, or one of the provider's sites. You wire it up to the cloud, and you can use your cloud APIs to manage that compute resource out at the edge.

This is really cool, because one of the great revolutions of cloud is the separation between capacity provisioning, the hardware provisioning of that capacity, and its consumption. We have API-driven, on-demand consumption, which means other people, like cloud providers, can build that capacity, and you can just rent it and consume it as you need it. Taking that same consumption model beyond the data center to the edge, spread out across the world, creates enormous opportunities for cloud providers, for our customers, and for many others. But what it also does, and people don't talk about this much, is create an enormous and painful headache for managing the applications and workloads on all those clusters.

So what do I mean by that? What are some of these problems? Well, think about the decisions we have to make when we deploy even one application, much less a set of interconnected applications, as the title of the talk says. First of all, where do we run clusters? Just because we have edge sites doesn't mean they all get edge clusters. Do you want to put a cluster at every single edge site? What if it's a tiny little site with one or two nodes? Do you want a whole cluster there? Do you want a control plane there? Maybe not. Once you've decided where to put those clusters, how do you decide which workloads go where? And once you've picked the place for a workload, how do you specialize the configuration for that particular site? If it's a one-node site or a 10,000-node site, you may want to configure your workload a little differently, and even if not, things like IP addresses will change from site to site.
So you can see there are a lot of problems that aren't addressed by "just stick a rack out there and hook it up to the cloud." It leaves a lot more problems to solve.

To make it a little more concrete, here's a very hand-wavy, simplified picture of what you might have out in the field if you're a telco, for example. At the very far edge, on the left side, you have radio towers. Each of those radio towers has a cabinet sitting at its base, with a little bit of power going to it and a handful of machines in it. That's hundreds of thousands of edge sites right there. Maybe a couple of miles from each of those towers, covering a certain area, you've got a small building with a couple thousand square feet and a bunch of racks. Maybe a little farther away, you've got a bigger data center that can hold more. And finally, there are maybe three or four dozen giant cloud regions in the world with acres and acres of machines.

So the decision of where to run a workload isn't just about using available capacity. It's about how close it needs to be to the end user. What are the latency requirements? Does the user need to interact with that workload at very low latency? And what is the cost of that? The base of the tower has very limited capacity, so it's effectively very high cost, and if you need to make a change, you've got hundreds of thousands of them out there, and it's very expensive to do.

Focusing in a little more on workloads: in the two boxes on the right, you'll see UPF, AMF, SMF. It doesn't actually matter what those mean; I'm not even going to tell you. What is important here is that these are different workloads needed for a 5G core, and they have a relationship with one another. You're going to run a whole bunch of UPFs a little closer to the user, and then this thing called the SMF is going to sit in the middle, and they're all going to talk to the SMF. So now we've gone from talking about a single workload to interconnected workloads. They have to talk to each other, and not just across sites but across tiers of sites. And those workloads aren't just interconnected, they're interrelated: if you double the number of UPFs, you're going to need to increase the memory and CPU of the SMF, or maybe provision another replica somewhere else. So that's what I mean when I say: imagine yourself deploying a set of interconnected, interrelated applications across all of this. It's pretty challenging.

Putting it in user terms, an example of where you might actually do this, and where we'll drill into the details of the problems we face, is private 5G. The telcos want to be able to roll out a more secure network that's just for your company. An example might be: I've got 20,000 trucks in the Northeast US, they're driving all over the place, and they need to send telemetry data back so I know where they are and whether they're functioning properly, and maybe I'll use it like a CB radio system too. So there are certain latency requirements for that, and certain bandwidth requirements, but they're pretty low.
And so, if we can do it properly, this should theoretically be cheaper than using the general-purpose network and, of course, more secure. We want our customers to come to us with that sort of statement of what they want, and then we have to figure out how to implement it.

Just to do that for day one, we run through some of those problems I described in the abstract, but in a more concrete way. We need to identify the sites in the Northeast: what types of sites are there? Are they edge sites? Are they cloud regions? Which workload will run in each one? What infrastructure is needed? How do we configure the nodes? There's usually a reason we're running a workload at the edge; it's not a generic workload, it has something special it needs to do, which may involve a special sensor and maybe a GPU that processes the data from that sensor. Those things are often not very cloud native or fungible, so you end up having to reboot nodes and do other work to prepare them, or in telco cases you have to set up special networking. So there's often a node preparation step. There are all of these problems just on day one.

And even if we figure all that out and get it all deployed, we have day two to worry about, which is more complexity. We have to monitor and make sure things stay the way we've asked them to be, and make sure there's no configuration drift. We have to handle changes to the topology. The customer calls up and says, hey, every time my trucks drive up I-95 past this point, I lose track of them. So we need to build out more capacity along that corridor, which means adding more UPFs, which means we need a bigger SMF. We have to be able to track all of that and figure it all out.

And if that's not enough, it gets worse. Every single one of these things I've talked about, the topology, the cluster creation, the workload configuration, the workload manifests in Kubernetes, is handled by a different system. Your topology might just be a spreadsheet that says here are the sites I have, here's the capacity available, I'm going to put clusters in these places. Then you have to call cloud provider APIs to instantiate those clusters. Then you have to call the Kubernetes API server on each of those clusters to deploy a workload there. Every single layer is a different system, and you may have vendor-specific systems too.

Think about how the UPF and the SMF know about each other. In this particular case, it's a telco thing: telcos don't use DNS, because we all know that outages are always DNS, and being a DNS person, I can say that. So they fix IP addresses in these places to take a little bit of the complexity out. But that means you need some central source of truth. These aren't cluster IPs, local to a cluster; they span many clusters. So you need an IPAM, an IP address management system, to keep track of all of those IPs and supply the right one in the right config at the right time in the right place. And not just in the Kubernetes manifest, but in the actual ConfigMap that represents the configuration of that particular workload, which needs to contain a list of these IP addresses. So how do we do that?
Today, we do it with some vendor-proprietary management system that probably SSHes into the workload, not even into the node, into the actual workload, and makes changes and runs commands against it. So tons of complexity, tons of difficulty.

All right, so what do we do about it? Why am I here? I could have just taken the last 15 minutes and said "it's really hard" and skipped all that, but I hope it gives you a flavor for why it's hard, and it helps guide where we want to go with it. We were faced with this question from our customers, not just telcos, but retail edge and other customers. We gave it a lot of thought and came up with three basic ideas for how we can dramatically reduce the complexity and, hopefully, take this seemingly intractable problem and make it tractable.

The first is to stop all this nonsense I was just describing, with many different interfaces actuating the different layers of the system, and come up with a single unified platform for automation across all of those layers. The second is that whatever that unified platform is, it really needs to solve day two, and one of the great lessons from Kubernetes is that the way to solve day two is declarative configuration with active reconciliation. That means I need to be able to say "here's how I want the world to be," and have controllers or agents out there looking at that intent, looking at what the world actually is, and working to bring them together. The third insight is that there's just too much configuration for people to understand or manage, so we need to make that configuration processable by machines, and not just in a shallow way. We need to be able to build automations that deeply understand the configuration and can do things like recognize that scaling up this workload will have an impact on the configuration of that other workload. For machines to do that, they have to understand the configuration in a pretty deep way.

All right, those are the insights, the theory. How do we actually make it happen? The first two we've decided to address with Kubernetes everywhere. That doesn't mean we're going to rewrite every vendor's software to be natively Kubernetes; that will never happen, or it will take 20 or 30 years. What it means is that we need a layer, an API, that represents those configurations as Kubernetes resources, and then actuators, or controllers, that actually apply that configuration to the lower layer. This is not some great insight of ours; it's what everybody here has probably been doing for four or five years with operators and CRDs. The point is that we apply it consistently across all of these layers, and that intent-based, declarative management is the style we use throughout. And we picked Kubernetes specifically, not just because we wanted to be able to come to KubeCon, but because it's a well-understood, extensible framework for building these kinds of platforms, with at least some mileage under its belt and a great ecosystem.

The third bullet we're addressing with something we call configuration as data. This is a new approach to configuration management, and you may have played with pieces of it already.
Kustomize is sort of our first foray into this, but it has its challenges and limitations, and we're trying to make it better with some new tooling and technology. This is a recognition that existing configuration management and templating systems like Helm have their place and their role, but they're not effective at the level of scale we're talking about. Part of the reason is that every single Helm chart effectively creates a new API: the values file itself is a new API specific to that individual package. It's really, really hard to write automations that have to deal with a bespoke interface for every single thing they want to automate. In fact, it's not just hard; at scale it's effectively impossible.

So the idea here is based on four essential principles. One: represent the configuration in a well-defined, structured data model. We use KRM, the Kubernetes Resource Model, for that, which shouldn't shock anybody here. Two: use a GitOps model, so put the configuration into a versioned store prior to the live state. The reason is that it enables things like undo and rollback, which is important, but it also gives us a pre-actuation, pre-runtime place where we can coordinate changes to a configuration. A key enabler here is this idea of collaborative coordination between independent actors on the configuration; I'll explain that a little more in a minute. Three: tools operate on the config, and those tools are reusable. Templating systems intermingle your code and your configuration: they treat the configuration as text, the code is wrapped all around it, you feed in some inputs, and it spits out some hopefully well-formed YAML at the end. That doesn't work very well at scale, and it doesn't give you reusable, testable pieces of code. I tend to think of it like the journey the industry made away from writing individual bespoke programs with bespoke data structures. To produce your TPS report in COBOL, or whatever, you had to write a very specific program that processed a very specific data structure and output something. Then somebody said, hey, let's make streams in Unix: everything is just a stream. Then they put a little structure on that, line breaks, and all of a sudden we have line editors and wc and all these tools that can operate on any stream of data that follows this very simple structure. Add column breaks and you can build a relational database management system on top of a very simple structure. Just a little structure enables highly reusable tools. In a similar way, configuration as data says: let's put a little structure in place. Actually, it's a lot of structure when you think about the Kubernetes resource model with its metadata and so on. Then we can build tools that expect that structure to be there, can make assumptions, and can operate on it. It ends up being quite powerful. The last main point on configuration as data, and I could talk about this for hours so I have to be careful not to run out of time, is that clients interact via APIs. We put an API layer on top of that storage instead of having clients interact with the storage directly.
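Before I move on, here's a rough sketch of what "reusable tools over structured data" looks like in a kpt package: the package declares a pipeline of generic functions that operate on whatever KRM resources the package contains. The package name, label key, and function version tags below are just illustrative, not a prescribed set.

```yaml
apiVersion: kpt.dev/v1
kind: Kptfile
metadata:
  name: upf-package            # hypothetical package name
pipeline:
  mutators:
    # Reusable function: sets labels on every resource in the package,
    # without knowing or caring what those resources are.
    - image: gcr.io/kpt-fn/set-labels:v0.2.0
      configMap:
        example.com/region: us-northeast    # illustrative label key/value
  validators:
    # Reusable function: checks that the resources are well-formed KRM.
    - image: gcr.io/kpt-fn/kubeval:v0.3.0
```

Because each function sees plain KRM rather than templated text, the same labeling or validation function works on any package, which is exactly the Unix streams analogy: a little shared structure, highly reusable tools.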
Putting the API layer on top of the storage allows for much better concurrency and consistent transformations of the configuration by these independent actors; I'll give an example of the independent actors later. So that's our approach. Today we have a suite of open source tools, kpt, Porch, and Config Sync, that embodies our current work toward configuration as data, and that's our solution.

This is a lot to accomplish, and Google Cloud did not feel it was something we should try to do all on our own, which is why these are open source projects. But we went a step further: we joined up with the Linux Foundation and created a new top-level Linux Foundation project called Nephio, and we got a bunch of our friends in the industry to join us in the effort. When we launched back in April we had about two dozen companies; now I think there are 60 or 70. Telcos are a lot of the representation, mostly because they operate at a scale where they run into these problems. A lot of these problems you don't hit until you hit scale: everything seems fine, and then you try to get to a certain level of scale and you're toast. So we have a lot of telcos and their vendors, but it really is not a telco-specific problem. It's more of an edge fleet problem, and it applies to edge applications as well.

The goal of Nephio is to implement the vision we just talked about. At a very high level, and I'm pretty low on time if I want to leave any for questions, we build an orchestration cluster: a single management cluster that contains these APIs sitting on top of the storage layer that manages the configuration, and we use those APIs to implement Kubernetes controllers that operate on those configurations. I'll use the word "package" a lot for these configurations. A package is a little bit like a Helm chart, except that it's all KRM inside; there are no free-form templates or anything like that, so it's all machine processable. The idea is that you can take the packages your off-the-shelf vendor produces for some edge application, clone them into your environment, modify them collaboratively with automation, meaning these controllers, and with human input, and then push them to a Git repo, where they get realized on an API server via a Git syncer. We're using Config Sync in our reference implementation, but something like Argo CD works just as well.

So that's the platform-level architecture we need in order to solve this problem, but we also need content for that platform to process. By content I mean the packages, or the models, that represent the things we want to create. To make that a little more concrete, this diagram shows three vertical swim lanes. The left-hand one is infrastructure. The idea is that we need a model for infrastructure: we need kpt packages that represent a cluster, that represent a network, and we need to be able to run them through that configuration process and realize them. We're not going to produce new actuators for clusters in Nephio; we already have Cluster API, we have Kubernetes Config Connector for Google Cloud, we have ACK for AWS, Azure has its own, and we have Crossplane.
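As a sketch of what such an infrastructure package might contain, here's an ordinary Cluster API Cluster resource. The names and CIDR are placeholders I've made up, and the control plane and infrastructure kinds depend entirely on which provider or actuator you use.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: edge-site-0231                  # hypothetical edge site name
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["10.240.0.0/16"]     # illustrative pod CIDR
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: edge-site-0231-control-plane
  infrastructureRef:
    # Provider-specific kind; this varies with the actuator you pair it with
    # (a Cluster API provider, Config Connector, ACK, Crossplane, and so on).
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: DockerCluster
    name: edge-site-0231
```

Because this is just KRM inside a package, the same cloning and specialization machinery used for workloads can stamp out one of these per site.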
There are a number of ecosystem tools that play that actuation role; what Nephio does is provide the linkages from our automations to those different tools. So that's the idea for the first swim lane.

The middle swim lane is the workloads. Remember, we said you've got to deploy a workload across a set of clusters: first you deploy the set of clusters, that's the first swim lane, then you deploy the workloads onto them, that's the second. For ordinary edge applications, those are just ordinary Kubernetes resources. In telco we can take it a little further, because there are these special models of what a UPF, an AMF, and an SMF are, and different vendors can implement them, but effectively that's more content for this platform to process. The third swim lane is application config. This is farther out, but the idea is what we talked about earlier: an application may need to be reconfigured when a new workload that's related to it comes up. We need a way to actuate that through KRM, but there's a lot of variability in how that's done, so that's a couple of years out.

Okay, so that and the previous slide cover the platform, the content, and how we move content through a Nephio platform. But what does the user interaction look like? I know this is pretty abstract and can be hard to grasp. From a user point of view, we tend to think of a few different types of users. One is the ISV, the independent software vendor, who makes and sells packaged software that they want customers to be able to deploy on their edge sites. Another is an organizational platform team, or in the telco case something like a network designer. They may take that off-the-shelf package, clone it into their environment, and customize it for their particular organization: they add policies, they add configuration that points to central services, they configure the IAM they use, things like that, which apply to anything running in their organization. And then you've got your app teams, or your deployers, who further specialize that package by cloning it into their environment.

So we're using multiple Git repositories to clone this package, but it's not just a fork like on GitHub. It's done in a special way, and this is part of the kpt functionality for configuration as data: when we clone, we keep track of the upstream, and not just which repository it came from, but the actual tagged revision of that upstream. This is how we start to address day two. Imagine you take the packaged software, clone it into your local environment, and create your own version of it; you've got 50 app teams that each clone that into their own environments, and then, we'll get there in a second, they fan out across 10,000 clusters. Now the upstream software vendor says, oh, there's a security vulnerability, here's a new image, and all of a sudden we need to ripple that day-two operation all the way out to those 10,000 clusters. Because we've tracked the upstreams, we can automatically, in a machine-processable or even completely automated way, propose changes to those configurations all the way down at the edge, with the new image in them, and we can canary them and roll them out.
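Concretely, each cloned package carries a record of where it came from in its Kptfile. Here's a rough sketch of what that looks like; the repository URLs, package names, and versions are placeholders.

```yaml
apiVersion: kpt.dev/v1
kind: Kptfile
metadata:
  name: upf-us-northeast            # hypothetical cloned package
upstream:
  type: git
  git:
    repo: https://github.com/example/vendor-packages   # placeholder upstream repo
    directory: /upf
    ref: v1.0.2                     # the tagged upstream revision we cloned
  updateStrategy: resource-merge    # how upstream changes are merged on update
upstreamLock:
  type: git
  git:
    repo: https://github.com/example/vendor-packages
    directory: /upf
    ref: v1.0.2
    commit: 3f9c2d1                 # exact commit the clone was taken from (illustrative)
```

When the vendor tags a new revision with the patched image, tooling can walk every package whose upstream points at that source, merge the change, and propose an updated revision for each downstream clone.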
So that tracking of upstreams, that cloning of packages, and the machine-processable nature of the configurations can enable some really incredible automations. Let me take one more minute to describe one more piece of that automation.

After the app team deploys that package, one of the things I talked about earlier is that we really want humans and machines to be able to collaborate on the configuration until it comes together as one complete configuration. To make that really specific: I don't want a giant script that takes every possible input and tries to create 10,000 instances of a package across 10,000 environments. What I want is individual automations that each understand a very narrow thing. As an example, I can include in a package a little resource that says: I need an IP address here, and that IP address should be based on the fact that I'm this type of workload, that I'm connecting to this particular network, which might be determined by the cluster I'm being deployed into, and that I'm in this particular site, this particular geographical region. When we fan out the variants of that package across those 10,000 clusters, those three pieces of information, or at least two of them, are going to vary for each instance of the package. And we've actually created separate instances of those packages, those configurations; we're not taking a values file and spraying 10,000 API servers with ephemeral configuration data. We're storing in Git the actual configuration that's going to run on that particular cluster.

And there's a controller sitting in the management cluster, watching for new package revisions that contain that resource that says "I need an IP address." That controller picks up the configuration, sees that you need an IP address and what the three parameters are, and goes and talks to my IPAM, the central IP address management system my organization already has, which has its own API. It allocates an IP address from it and injects it into that configuration, and it does that 10,000 times, with 10,000 different IP addresses, across 10,000 clusters. So I built that IPAM integration once, tested it once, and I can use it across every single workload I deploy across my entire organization forevermore. That's the kind of focused, narrow automation and reusability I'm looking to get out of this.
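To sketch what that little request resource might look like, here's a hypothetical example. The group, kind, and field names are invented purely for illustration and are not Nephio's actual API; the point is the shape: three small parameters that vary per clone, which a narrow controller can act on.

```yaml
# Hypothetical "I need an IP address" request carried inside a package.
# Group, kind, and field names are invented for illustration.
apiVersion: example.nephio.org/v1alpha1
kind: IPRequest
metadata:
  name: upf-n3-interface
spec:
  workloadType: upf            # what kind of workload is asking
  network: n3                  # which network it attaches to; set by the target cluster
  site: us-northeast-0231      # which site or region this clone is for
status:
  address: ""                  # the IPAM controller allocates this and commits it
                               # back into the stored package before it is published
```

A controller in the orchestration cluster watches draft package revisions for this kind, calls the organization's IPAM API with those three parameters, and writes the allocated address back into the package, once per clone, before the package is published.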
So that's the big-picture story of Nephio. We have another event coming up in mid-November, and we'd love for anybody to join us there; we're going to have a full six-hour workshop. The information is all on the wiki, and there's a bunch of information here. Any questions?

Yes, sir. Sorry, it's really hard to hear; can you speak up? Oh, here, good, better. Hello. Hey, thanks by the way, great talk. What I was asking about is, at one point in the talk you mentioned users can interact with these rollouts too, so something like Argo for syncing configuration would work. What would that look like in this setup? You had Config Sync there. So, I'm not really talking about Argo Workflows here, just the Git syncer piece. Essentially, and I'm not an Argo expert, so I can't say in detail, I do know that some of our customers want to use it in place of Config Sync, and it should be able to serve that role. But Nephio is new; we're still building it. Several of our community members have expressed interest in that as well, so I'm looking to them to build that integration. Essentially, the final completed configuration gets pushed into a Git repository, but it has to get from the Git repository to the actual API server somehow, and we need a Git syncer agent to do that. That's the role I would see it playing. There may be other integrations possible as well.

Yes. Oh, okay, over there. In the black sweater and then the blue sweater. How does it integrate with OSS and BSS systems? Because you talked about the left-side package management, right? So do we have APIs that those systems can integrate with? That's a good question, and it's something we're looking at; again, I'll punt that to the community, we're looking to the community to solve it. There is a desire to implement some of the standard APIs. Personally, in my opinion, and I've talked, for example, to folks from some of the service orchestration companies, they say, well, whatever API you have is fine; it can be a Kubernetes API. They don't really care, right? They're going to build an integration anyway, and they're happy to build it against a Kubernetes API. We did talk early on in the community about building some interfaces that provide standards-based APIs, and if necessary, that's something we can do. But really we're downstream of all of those, we're southbound of all of those, so we may need to present some standards-based northbound API.

Sir, in the blue. Do you have any canary capability? Is it rolling out to all the clusters at the same time, or can we roll out to maybe 10 clusters first, test it, and then roll out to the rest? I'm sorry, can you repeat that, I couldn't quite follow. Do you need to roll out to all the clusters at the same time, or can we...? No, definitely not. The idea is that each of these packages goes through lifecycle stages. There's a draft stage: while all this editing is happening, with controllers and humans jumping in, I think one of our community members calls it choreography, the package revision is in a draft state. We actually modeled this after readiness gates for pods, if you're familiar with those. Pods have conditions, and when a certain set of conditions has been met, we flip a readiness gate to true, and that's when the endpoints controller adds the pod to a service. We do exactly the same thing; we like to learn lessons from other people. Package revisions have conditions, and once a certain set of conditions is met, which is definable by the package author, the revision can move to a published state. Once it's published, it will actually end up on the cluster. So we can either have a rollout controller that toggles the published state, or a rollout controller that adjusts the Git syncer configuration in the edge cluster to change the commit it's pinned to. There are different ways we can actuate rollout, but absolutely, rollout is part of it; we're new and just haven't gotten there yet.

I think I'm out of time. One more. Can I take one more question? All right. Hi, great talk, by the way. I want to ask: this is all using Git repositories, right?
So how do you think about that? Are there any discussions regarding OCI, so that instead of using Git you use OCI, because OCI is more scalable that way? Yes, absolutely. In fact, Porch, which is the API layer for the configuration, it's sort of kpt as a service, where kpt is the CLI tool, is the API surface we put on top of Git, and it can also write to OCI. For a lot of this edge type of work, we'll likely end up using OCI; it's a subtlety most people don't get into, so I just talk about GitOps, but yes, absolutely. Good question. All right, well, thank you all very much. Enjoy the rest of your day.