I'm Ricardo Gasparetto and I work for Vodafone Group in the network architecture team. With me is Tom Kivlin. Hi, good afternoon everyone, or good morning still. Tom Kivlin, principal cloud architect, same team as Ricardo. So today we're going to give you a presentation on the lessons we learned from deploying Kubernetes for 5G.

I'll give you some brief context first. Vodafone, like many telcos here I'm sure, is modernising the network. We're launching a 5G core for 5G standalone in a number of markets; we just went live with the first few CNFs in the UK and in Romania, and many other markets are in the pipeline, so we're well underway. And we're obviously using Kubernetes, which is why we're here: to support and manage these new network functions, all of which are coming containerised. We are building a Kubernetes-based telco cloud, and it hasn't proven as straightforward as we hoped it would be, or as it is for vanilla IT-like platforms. We have had to add a number of telco-specific features and tackle some more novel issues that are specific to network functions. I will go through a couple of these and give you some examples of the more important ones in a minute.

So how are we doing it? We had to reorganise and restructure how our business does things and how our teams collaborate with each other, but also how we ourselves in the architecture team work. For example, among many things, we're treating documentation differently: we're documenting our designs and blueprints with a central blueprint as a single source of truth. These architecture and design documents, as Josh was mentioning about low-level designs as code, we also treat as code, which means using software development techniques to release blueprints more frequently, to track changes, and to collect issues from all of our stakeholders across markets, teams, engineering and even vendors. We're also contributing to an open source telco cloud blueprint: I'm the workstream lead of Anuket RA2, which is the open source version of that design. Everybody is welcome to check it out and contribute if you will. RA2 specifically is the project I'm looking after, the Kubernetes-based reference architecture; you will see a lot of design decisions, gotchas and insights in there. There's also an OpenStack-based project in parallel, and Anuket has many other software and specification projects as well.

Back to Vodafone: we have also created a central onboarding process for our vendors and their CNFs, so that we can establish the cloud-nativeness of the software we're onboarding and its compatibility with the features and specifics of our environments, while ensuring that the CNFs are designed and dimensioned using cloud native principles. Much more importantly, we also define the lifecycle management of the CNFs. Treating CNFs as cattle, not pets, means that the automation of their lifecycle has to work straight away. Instantiation and configuration of the network function are the basics, but things like upgrades have been taking a lot of our time recently, along with scaling, resiliency to cluster operations, and so on. But let me show you just a few examples of the things we learned.
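As a rough, purely illustrative sketch of what "blueprints as code" can mean in practice, here is one way a merge-request pipeline could lint blueprint requirement entries before a release is cut. The file layout, requirement ID format and field names below are assumptions made up for the example, not Vodafone's or Anuket's actual tooling.

```python
"""Hypothetical blueprint-as-code lint step.

Assumes each blueprint chapter is a Markdown file containing requirement
rows such as: | req.net.001 | Mandatory | The platform must ... |
The layout and ID scheme are illustrative only.
"""
import re
import sys
from pathlib import Path

REQ_ROW = re.compile(r"^\|\s*(req\.[a-z]+\.\d{3})\s*\|\s*(Mandatory|Optional)\s*\|")

def lint_blueprint(doc_root: Path) -> list[str]:
    """Return a list of problems found in the blueprint sources."""
    problems, seen_ids = [], set()
    for md in sorted(doc_root.glob("**/*.md")):
        for lineno, line in enumerate(md.read_text().splitlines(), start=1):
            match = REQ_ROW.match(line)
            if not match:
                continue
            req_id = match.group(1)
            if req_id in seen_ids:
                problems.append(f"{md}:{lineno}: duplicate requirement id {req_id}")
            seen_ids.add(req_id)
    if not seen_ids:
        problems.append(f"no requirement rows found under {doc_root}")
    return problems

if __name__ == "__main__":
    issues = lint_blueprint(Path(sys.argv[1] if len(sys.argv) > 1 else "doc"))
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)
```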
OK, this is a simplified overview of a Kubernetes-based telco cloud platform. I'll give you an overview of maybe two main areas that we have been working on over the past few months and years: one being multi-tenancy and network multi-homing, which are tightly related, and the other being the standardization of a catalog for things like node profiles and add-ons, among other things.

I'll start with this one. To avoid a sprawl of custom node sizing and custom cluster configurations, we've come up with a set of standardized node profiles and configurations; you will see some of these feeding through into Anuket as well, for example. The environment involved, as you can imagine, is very diverse: multiple tenants, multiple vendors, multiple markets doing different things. So ensuring things like segregation, multi-tenancy and controls within a secure telco cloud is of critical importance for us. We're trying to limit diverse and custom designs with, for example, two main profiles, for basic and network-intensive nodes, supporting control and signalling plane versus data plane applications. These are then broken down into multiple flavors, for example with and without hyperthreading; we see that hyperthreading is very beneficial to some of both our control plane and data plane functions, so enabling it or not is quite important. And then, for network-intensive applications, the types of acceleration: are we using virtual switches, are we using SR-IOV? What exactly the application can expect from the platform is built into the catalog this way. A sketch of such a catalog entry follows below.

The other thing I wanted to highlight is that we have multiple networks, multi-VRF environments, where we have requirements to connect our network functions, and therefore the clusters where they're hosted, to different systems located in different parts of the network. That is a very common requirement. Sometimes this is solved with a single network interface, especially when network functions don't support multiplexing. With external routing and a single network interface, we can use the native, vanilla Kubernetes networking model to connect pods to an external routing object. That lets us use Kubernetes techniques like network policies, but we still have the external firewalling problem that was mentioned before: there's no such thing as an advanced egress IP in the vanilla, let's say, Kubernetes distributions. That is forcing us to segregate applications that don't support multi-homing into dedicated clusters, adding to the overhead.

The other way to solve this is to have multiple network interfaces, as you see here, with different networks connected to different interfaces. But that comes with its own set of problems, such as the fact that these networks are not treated as first-class citizens: we can't apply network policies to them, we can't create services on them, we can't do what we would like to do with the, let's say, first-class citizen objects of Kubernetes. Tom will talk about the multi-network enhancement proposal, the KEP, but this is something we are actively working on and hope to keep working on with the community. Other things I haven't mentioned include, for example, the management of external networking and so on, but I will take any questions if you have any. So, here you go.
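To make the node-profile catalog idea a bit more concrete, here is a minimal, purely illustrative sketch in Python. The profile names, label keys and attribute values are assumptions invented for the example, not the actual Vodafone or Anuket catalog entries; the idea is simply that a workload declares which standardized profile it needs instead of requesting a custom node design.

```python
"""Hypothetical node-profile catalog sketch (names and values are made up)."""

NODE_PROFILES = {
    # Control/signalling-plane workloads: no special data-plane acceleration.
    "basic": {
        "labels": {"example.com/node-profile": "basic"},
        "hyperthreading": True,
        "hugepages": None,
        "acceleration": None,
    },
    # Data-plane workloads: hugepages plus SR-IOV; hyperthreading is a per-flavor
    # choice because some packet-processing functions benefit from it and some don't.
    "network-intensive": {
        "labels": {"example.com/node-profile": "network-intensive"},
        "hyperthreading": False,
        "hugepages": "1Gi",
        "acceleration": "sriov",
    },
}

def node_selector_for(profile_name: str) -> dict[str, str]:
    """Return the nodeSelector a CNF chart would use to land on that profile."""
    return dict(NODE_PROFILES[profile_name]["labels"])

if __name__ == "__main__":
    print(node_selector_for("network-intensive"))
```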
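And for the multi-homing workaround described above, here is a hedged sketch of how secondary interfaces are typically attached today with Multus. The network name, VLAN, address range and image are made up for illustration; the point is that the extra interface is requested through an annotation, which is exactly why it is not a first-class Kubernetes object with Services or NetworkPolicies.

```python
"""Sketch of today's multi-homing workaround using Multus secondary interfaces."""
import json

# A Multus NetworkAttachmentDefinition pointing at a macvlan CNI config
# for, say, a signalling VRF (all values illustrative).
signalling_net = {
    "apiVersion": "k8s.cni.cncf.io/v1",
    "kind": "NetworkAttachmentDefinition",
    "metadata": {"name": "signalling-vrf", "namespace": "cnf-a"},
    "spec": {
        "config": json.dumps({
            "cniVersion": "0.3.1",
            "type": "macvlan",
            "master": "bond0.100",
            "ipam": {"type": "whereabouts", "range": "192.0.2.0/26"},
        })
    },
}

# The pod requests the secondary interface through an annotation, so the
# interface is invisible to native Kubernetes networking constructs.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "amf-example",
        "namespace": "cnf-a",
        "annotations": {"k8s.v1.cni.cncf.io/networks": "signalling-vrf"},
    },
    "spec": {"containers": [{"name": "app", "image": "registry.example.com/amf:1.0"}]},
}

print(json.dumps(signalling_net, indent=2))
print(json.dumps(pod, indent=2))
```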
Over to Tom. Thank you. Thanks, Ricardo. So those were some lessons about building the cloud; as operators we don't build the CNFs, but we do operate them, and here are some lessons we've learned through launching our 5G services.

The first is release cadence, which was mentioned earlier: aligning releases between the community, the platform that we're building, the CNF vendors and what they've validated their software against. There's a balance between freshness and stability; that's been a challenge and just something we've had to work through with all the parties.

Cloud native design patterns is an interesting one. As Ricardo mentioned, across the markets and the domains we operate in, we've got lots of different vendors and lots of different solutions. We still tend to see patterns such as guaranteed pod sizing, large pods, and multiple concerns per pod. It's starting to change, we've noticed, and things like the CNF Test Suite, the CNF Certification Program and all of those best practices we're trying to build will help towards that, I think, so we can move towards burstable pods and a more flexible approach to the networking stack.

But also configuration management and lifecycle management. As Ricardo mentioned, it's been months, if not years, of design work with the vendors, the platform teams and so on to get to launch, and not all of that is about just deploying the day-one thing; it's also understanding how we lifecycle-manage the platform afterwards, and how we lifecycle-manage the CNFs. Unfortunately we're still seeing some legacy element-management-type approaches: NETCONF, SSHing in, that type of thing. I think there's a general understanding that things like the operator approach and cloud-native controllers for the CNFs are a good way to go. And whilst I'm not particularly involved, Ricardo and some of my colleagues are in Nephio, and that's potentially bringing some of these new thought processes to the table as well.

So that's a load of things about the lessons we've learned, the problems and challenges we've had, the things we've faced. The question is: what do we do about it? I don't think any of us are arrogant enough to suggest we know the solution, but I think the answer is that we all need to work together, as has been mentioned in a few of the talks so far. There's a community of operators, of CNF vendors, of cloud vendors, platform vendors and the open source communities themselves, and we can all improve the situations we mentioned. Buck mentioned the collisions earlier on; I think we recognise all of those. For the operators, as was mentioned in the Swisscom talk, all the technology in the world can't solve the ways of working that we need to modernise. With the CNF vendors, we need to work together to modernise the way that CNFs are lifecycle-managed and configured, so we move away from the legacy approaches. And for the communities, I think it's quite easy for us as operators to sit there and go "well, we need this, we need that"; we all need to be part of communities like the multi-network KEP, which one of my colleagues, Apos, is participating in. I know there's a lot of people in the room driving that, and I think that's another good thing to work towards: making those multiple interfaces first-class citizens, as Ricardo said. So there's a lot of things we can do as a community.
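As a purely illustrative aside on the "guaranteed versus burstable" point: in Kubernetes a pod only gets the Guaranteed QoS class when every container's requests equal its limits, while setting requests below limits yields the Burstable class and lets the scheduler pack nodes more efficiently. The container names, images and sizes below are assumptions made up for the example.

```python
"""Guaranteed vs burstable pod sizing, with made-up container specs."""

guaranteed_container = {
    "name": "upf-worker",
    "image": "registry.example.com/upf:1.0",
    "resources": {
        # requests == limits for both CPU and memory -> Guaranteed QoS
        "requests": {"cpu": "8", "memory": "16Gi"},
        "limits": {"cpu": "8", "memory": "16Gi"},
    },
}

burstable_container = {
    "name": "smf-api",
    "image": "registry.example.com/smf:1.0",
    "resources": {
        # requests < limits -> Burstable QoS; the pod can use idle node capacity
        "requests": {"cpu": "500m", "memory": "1Gi"},
        "limits": {"cpu": "2", "memory": "4Gi"},
    },
}

def qos_class(container: dict) -> str:
    """Simplified QoS classification for a single-container pod (ignores BestEffort)."""
    res = container["resources"]
    return "Guaranteed" if res["requests"] == res["limits"] else "Burstable"

if __name__ == "__main__":
    print(qos_class(guaranteed_container), qos_class(burstable_container))
```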
But I think the key is that we need to keep working together and keep building that community, to drive those improvements and build the solutions. I've added some links; I know there were some links in some of the earlier presentations as well. The multi-network KEP, for those of you who aren't aware of it: I'm not fully up to date with it, but I understand it's looking at an improvement to Kubernetes to introduce multiple interfaces on pods as a first-class citizen, rather than something that, as it is today with Multus, requires a separate solution to enable. The CNF Working Group, for those who don't know, documents best practices; a bunch of us are organising the community event this afternoon, and I hope to see some of you there, that'd be nice. The CNF Test Suite, so that's the testing, I think the harness presentation mentioned it as well, which was great; that's what's behind the CNF Certification Programme. Anuket, as Ricardo mentioned, has been running for a few years now, documenting specifications, and it informs and drives our blueprints quite heavily. I mentioned Nephio as a potential future for the lifecycle management of CNFs. Something I haven't mentioned there is Sylva, which is a bit of a mistake; that's certainly something on our radar. I was involved in the early stages, and as an open source community build of a platform it's an excellent way for us operators to contribute back some of the solutions that are needed for the problems we face. So, yeah, thank you. We'll do Q&A individually afterwards, I assume.

Okay, thanks Tom, thanks Ricardo. Stay there. First of all, we are running quite ahead of schedule, and I got a notification that it's actually not only about the people in this room but also the people who would like to come to this particular slot, so we'll use the time we're gaining after this short coffee break. But before we go for the break, I would like to ask Tom and Ricardo something. I'm quite intrigued about your experience of the multi-vendor environment; I guess you, as we all do, have multi-vendor CNF set-ups for your service chains. So, first question: what's your experience, did you achieve, or do you have, multi-tenancy of multiple vendors inside a single cluster? And the second question is: how do you synchronize the upgrade process? It's not only about a single CNF and the infra being synchronized, but across the multiple vendors; you have a colorful set-up, I guess. So I'd be interested to hear your reflections on that.

I'll take that one. OK. So, multi-tenancy is, let's say, tackled in different ways. In some clusters we managed to achieve multi-tenancy where multiple applications from the same developer or vendor can coexist. Our requirement for that is that multi-homing needs to be supported by the application, so that external security devices and external security policies, applied somewhere outside of the Kubernetes control plane's remit, can tell which application is talking towards external systems. That's an absolute mandatory requirement. But, as you asked, multi-vendor clusters are something we don't do. What we do is manage the lifecycle of Kubernetes clusters so that we can automate the deployment of many dedicated Kubernetes clusters on a single hardware or virtualization platform, depending on whether we do bare-metal-based Kubernetes or VM-based Kubernetes.
But automating the lifecycle of clusters allows us to maintain, manage, upgrade and scale multiple clusters for multiple applications separately and individually. That's how we separate multiple vendors today.

And then, how do we make everybody agree on upgrades? That's actually a mix of stick and carrot. We ensure backward compatibility for all of the management components that we provide as part of the platform. We support, for example, multiple Kubernetes versions at the same time, so that upgrades can be staggered across time if a vendor or a solution needs more time on an older Kubernetes release; provided it is among the supported ones, we can do that. And then the platform upgrades have to come. Kubernetes releases in the community are now every four months, which is good, and there's a longer, 12-month window of community support; but obviously, as you can imagine, we go with commercial distributions of Kubernetes, and these have extended support windows. That allows us to have overlapping releases, which in turn allows upgrades at different times for different markets, different applications, and even different things in the same environment.

So it's a mix. On one hand we're forcing people to upgrade at our pace: there's going to be some delay from the community releases to the commercial distributions, but the pace has to be the same. We're telling our vendors, our engineering teams and our operations teams that upgrades every couple of Kubernetes releases, so roughly every eight months, have to be part of their planning, strategy and mentality going forward; much more frequently than we did with VNFs. And then backward compatibility is the carrot: we try not to break things. As you've seen in the architecture here, we have management components, like the Kubernetes control plane and the CaaS manager, which provides the lifecycle management for Kubernetes clusters, but we also follow the ETSI architecture for NFV orchestration and the lifecycle management of the network functions. Those management components that are part of the platform are also part of the upgrades and the lifecycle of the overall thing, so it's important that we don't break anything when touching these components. Tom, do you have anything to add?

I think all I'd say is that it's a balance, because multi-tenancy within a cluster is kind of seen as a dream, as a goal, but it brings an awful lot of complexity to the upgrade process. Having the overhead of some extra control plane nodes to ease that part of the complexity is a trade-off that may be worth making. But I think automating the clusters is a huge part of the answer to how we make it as efficient as we can.

So you are essentially pushing your pace of upgrades, but allowing an opt-out or a delay within a comfortable time window so that everybody can catch up. Exactly.
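A purely illustrative sketch of the per-vendor, staggered-upgrade model described above, assuming a declarative (Cluster-API-style) lifecycle manager in which an upgrade is just a version bump. The cluster names, versions and supported window are made up for the example.

```python
"""Sketch of staggered, per-vendor cluster upgrades (all names/versions made up)."""

SUPPORTED_VERSIONS = ["v1.25.11", "v1.26.6", "v1.27.3"]  # overlapping support window

fleet = {
    # One dedicated cluster per vendor/application, as described in the talk.
    "vendor-a-5gc-core": {"version": "v1.26.6"},
    "vendor-b-ims": {"version": "v1.25.11"},  # still on the oldest supported release
}

def plan_upgrade(cluster: str, target: str) -> dict:
    """Validate a staggered upgrade request against the supported window."""
    if target not in SUPPORTED_VERSIONS:
        raise ValueError(f"{target} is outside the supported window {SUPPORTED_VERSIONS}")
    current = fleet[cluster]["version"]
    return {"cluster": cluster, "from": current, "to": target}

if __name__ == "__main__":
    # Vendor B catches up in its own time, within the supported window.
    print(plan_upgrade("vendor-b-ims", "v1.26.6"))
```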