Thank you and welcome, everyone. We're going to talk about a telco CNF journey from zero to millions. I'm Shahat Rao, a solutions manager at Ericsson, working on delivering end-to-end solution architecture and containerized network functions to various operators. I've been working with cloud since 2011, initially with OpenStack in the VNF world, and now moving to the CNF world with Kubernetes.

Hi, I'm Abdul Hanan. I've been at Ericsson for 12 years, delivering primarily core control plane applications for our customers, from 2G to 5G, and from purpose-built hardware to now cloud native.

Let's begin our journey. We all know what a cloud native application is, right? We're aware of similar principles across the spectrum: loosely coupled services that talk to each other over well-defined interfaces, with all the benefits that go with that. A telco containerized network function is a little different from a typical cloud native application, in the sense that in addition to what a CNA provides, it also needs to cater to certain other important requirements: high performance, low latency, faster packet processing, for which we use SR-IOV or DPDK, as well as support for multiple networks. That's not to say all of these requirements apply to the very same CNF at the same time; it's a mix and match of these requirements. And there are different areas a CNF can belong to: it can be a 5G core CNF, an IMS CNF, a charging CNF, or a RAN CNF. Based on which subgroup you belong to, you have different requirements to cater to when you do the deployment.

Here we have put together some numbers from some cloud native deployments at public US operators, taken all together.
So that's 21 million unique subscribers, and similarly 33 million unique IP sessions spread all across the US over multiple different deployments, plus 60k-plus unique API calls. There's one particular deployment depicted here: we took numbers from a single deployment covering this location earlier this summer, where there were 1.5 million subscribers on the node. Externally, that adds up to 1 million packets per second that it had to process, and that does not include the internal microservice communication; it's just the traffic toward the external network.

Taking a step back after that: how did we end up with cloud native? We were living in a very rosy world of purpose-built hardware. No problems, everything was fine, but we decided to disaggregate, right? So we went through the whole VNF transition, and then somebody decided it wasn't complex enough, so we're going to do this. Well, regardless, here we are. The benefits, of course, we're not going to talk about; that case has already been made. Some of the challenges were discussed earlier, I believe in the elephant talk as well: there are integration challenges, there are R&D costs involved, and there is performance optimization, the fact that you can put anything anywhere but of course need to optimize for what you need.

At the end of the day, you have to satisfy what's at the bottom, right? You have to satisfy all the 911 calls. Around 500,000 are made in the US every day; I think that adds up to 240 million in the whole year, and about 80% of them are on the wireless network. That's like one call per adult in the US per year. And that call has to be made. It cannot be dropped; it has to go through. It has to talk to multiple microservices within the CNF, it has to go to other CNFs in your network, and you don't know where it's going, but it has to be handled. It cannot be dropped.
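As a quick sanity check on those deployment figures, here is a back-of-envelope sketch. Only the subscriber count and packet rate come from the talk; the derived values are our own illustrative arithmetic, not quoted numbers.

```python
# Back-of-envelope on the single-node deployment figures quoted above.
# SUBSCRIBERS and EXTERNAL_PPS are the talk's numbers; the derived
# values are illustrative arithmetic, not measured data.
SUBSCRIBERS = 1_500_000      # subscribers served by the node
EXTERNAL_PPS = 1_000_000     # external packets/second (internal mesh excluded)

pps_per_subscriber = EXTERNAL_PPS / SUBSCRIBERS   # ~0.67 pkt/s per subscriber
packets_per_day = EXTERNAL_PPS * 86_400           # ~8.64e10 external pkts/day

print(f"{pps_per_subscriber:.2f} pkt/s per subscriber")
print(f"{packets_per_day:,} external packets per day")
```

Even at well under one external packet per second per subscriber, the aggregate is tens of billions of packets a day on a single node, before counting any pod-to-pod traffic.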
So that's the primary requirement, I think the most stringent requirement we have right now. But on the left side we also have this whole ecosystem of devices we have to cater to: iPhones, Samsungs, the underlying chipsets from Qualcomm and MediaTek, laptops, drone use cases, and so on. And all the devices behave differently. Despite the fact that we have specs and everything, in different scenarios devices do behave differently, and you have to handle that behavior, because different use cases bring different requirements. Then, taking from the previous slide, we also have these different configurations that people want to deploy in their networks. And from the operations perspective, what we're building is not just for North America, it's global; the maturity level of the operational tools that different operators have around the world differs, and the application has to work with that as well. So it's quite a complex picture today that we have to work with, and with that I believe I hand back to you.

So let's look at this journey, right? I hope that got across all the complexity that's involved, but then how do you build a CNF? You have to go from the PNF, to the VNF, and then the CNF. Typically you use a lift-and-shift strategy, because you've already built a lot of the underlying framework in your prior network functions, and lift and shift is typically what's recommended. As we increase our maturity in cloud native, the lift and shift goes away and you slowly replace it with a purely twelve-factor approach. From an infrastructure perspective, it's not straightforward, because as you were saying, across the world you've got different data centers.
You've got data centers at the edge, which may not run more than a few servers. You've got data centers where space may be at a premium, so you need to handle that. And you've got very large data centers, for example in the hyperscalers, where you have virtually unlimited capacity. So your application, your CNF, should cater to all these different aspects as part of the development process.

And then comes the CI/CD part. CI/CD is its own journey; I think that's why there's a separate session on CI/CD here. CI/CD is not so easy to integrate when it goes into an operator's world. If you split it up, the CI part, continuous integration, which the development team looks at in terms of testing and integration, is in good shape and being followed through. But the CD part is still something that is lacking.

I just want to add that when we are taking an existing network function, operators typically ask for feature parity right off the bat: you're replacing a network function that's working in my network, and all you're doing is making it a containerized function, so I need feature parity, and I also need time to market. Those are also considerations that go into the application development. And if you're introducing a new function, then of course the question is, will it fit in my existing network, in my existing environment? So the new has to adapt to the old, in a sense, as well.

Exactly. So now that we have built the CNF, how do we take it into production? Typically you do the crawl-walk-run kind of thing, right?
You crawl by first putting it into your own lab, if available, getting some basic testing going and validating the functionality of the product. Then you take it into an operator's lab, where you not only do the basic functionality testing, but more importantly, because operators and CSPs don't have a single vendor satisfying every function, you usually have multiple vendors involved, so you need to work well with your peers in the community as well. That's the first step of testing there. Then you need to integrate with the various southbound and northbound systems that exist in the operator's environment for FCAPS: fault monitoring, alarming, their ticketing system, and so on. That kind of testing happens in the operator's lab too.

Now we have not a fully baked good, but one good enough that all it needs is a nice crust on top. We take it, put it into the oven, turn it to broil, and hit it with the load it's expected to handle. This is where you use a lot of simulators, because you can't necessarily reproduce real-world traffic in a lab; you don't put your entire RAN deployment in the lab. So you use simulators, hit your CNF with that kind of load, and then you certify it. Once we're finished with these different labs, we take this baked CNF and put it into a production environment for folks to devour, and that's what happens. That's where the numbers come in, and they're growing every day. One thing to note: not all operators have all these different labs, but even if a lab isn't physical, it exists from a functional or conceptual perspective; that stage is always there.

Then, okay, challenges. Yes. Once you're done, well, throughout the whole journey really, even while you're doing all the validation.
You're not really validating just the CNF; you're validating the whole solution, top to bottom, left to right. As you were saying, there are different vendors involved too, and they're also plugging into your network function. You're integrating with them, doing all the signaling: the million packets your control plane function is sending to the gateway, and so on; your policy, your charging, everything is involved. So at the end of the day you are validating the entire solution, which means you do have to talk to a lot of people, and there is a spectrum of expectations you have to satisfy. The closer something sits to your own network function, the higher the expectations you'll have to handle and manage. If policy is sitting far away from you, maybe it's not that big a problem; but if you're sitting on a shared cloud, the application folks have one set of expectations, while the folks who have been dealing with the cloud have different expectations, different ways of working, different approaches when you're troubleshooting with them, different ways of looking at the problem. And all of that, of course, comes from their past experience; everybody brings the background they're coming from, and in some cases those backgrounds end up colliding. There are also parity expectations: people assume this CNF will behave just like my VNF that's been in the system for five years. No, right? The terminology has changed, and the way it works is now different. So that was one.
I think the most important challenge, at the end of the day, is working with the pre-existing conditions, the expectations that folks have.

Secondly, as I was mentioning, different operators are different. Whoever wants to deploy, even private networks, will have their own configurations. Some operators will be able to buy a hundred computers to throw at it. Some will say: the switch already exists, it physically exists, it can only take this much power, it can only handle one more rack, that's it. You're only getting a rack's worth of compute and nothing else, so how do I minimize the footprint? Some are willing to put their applications in the public cloud. So you have all these different kinds of configurations. One use case that actually came up: I want to deploy an app and a backup app, but the backup should sit in its own unique boxed-in cloud that exists only for this application, with no sharing. So you have all these different styles people want to deploy with, you have to deal with them as well, and that does become a very interesting challenge.

The third one is about the protocol mix; that's more of an FYI. In telco we're used to the UDP protocol, because GTPv2 is the famous control plane protocol that runs over UDP. In our 5G systems we're moving to service-based HTTP, relying on TCP transport, and that sometimes catches people off guard. I'm used to UDP: fire and forget, load-balance every packet. But TCP is a stream, it's client-server, and now the traffic is stuck on one connection, it's not moving. Why is it not moving? Why are all the packets going over this stream all the time? I upgraded, one of my server pods restarted, why is it not establishing any new connections?
Well, it's a server pod; it isn't the server that establishes connections, the clients have to, and the clients have already moved somewhere else. So there are some different behaviors. Nothing new here, but I've seen it catch people off guard that we're now using TCP and the behavior is different, and it's something you have to deal with.

And then, if you look at it, it's people, process, tools, technology, right? You always have challenges in processes, and the big challenge we're facing right now in process is that operators have adopted a lift-and-shift approach for their processes as well, mirroring the application development strategy. They have lifted the exact processes they follow in the virtual world into the container world. They expect this pod to be running on this node come what may, and that never happens. Their back-end systems are built for just that purpose too: the back-end systems also expect that this particular pod of your CNF runs on this node throughout its life cycle. The challenge we have with that is that Kubernetes will never allow it. Once a node reboots, it finds the best possible location and keeps moving the pods around. So we have to cater to that in the process as well. The processes are still on a journey toward maturity, if I can use that word again; they are not yet mature enough to reflect how cloud native should work. They're getting there, but they're still stuck. So when operators talk to you, they often don't talk in cloud native terminology; they talk in PNF terminology, and that's because they are also handling applications on the PNF side, the VNF side, and the CNF side, right?
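Going back to the protocol mix for a moment: the UDP-versus-TCP surprise can be sketched with plain sockets. This is a toy loopback illustration of our own, not the 5G SBI stack; the port and payloads are invented.

```python
import socket
import threading

HOST = "127.0.0.1"

def one_shot_tcp_server(ready, ports):
    """Serve a single request, then 'restart' by closing everything."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((HOST, 0))                  # ephemeral port
    ports.append(srv.getsockname()[1])
    srv.listen(1)
    ready.set()
    conn, _ = srv.accept()
    conn.recv(64)
    conn.sendall(b"ok")
    conn.close()                         # the long-lived stream dies with the pod
    srv.close()

ready, ports = threading.Event(), []
t = threading.Thread(target=one_shot_tcp_server, args=(ready, ports))
t.start()
ready.wait()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect((HOST, ports[0]))
cli.sendall(b"request-1")
first = cli.recv(64)                     # works while the stream is alive
t.join()
cli.settimeout(1)
try:
    cli.sendall(b"request-2")            # may land in the OS send buffer...
    after = cli.recv(64)                 # ...but the peer is gone: EOF or reset
except OSError:
    after = b""
# after == b"": the client must open a brand-new connection. Reconnecting is
# the CLIENT's job, which is the "why is it not establishing?" surprise.

# UDP is connectionless: each datagram stands alone, so a restarted peer is
# invisible to the sender (fire and forget; no stream state to go stale).
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"gtp-style-message", (HOST, 9))  # no error even with no listener
```

The design point is that a TCP stream carries state on both ends; when the server side restarts, every client holding the old stream sees a dead connection and must re-establish, whereas UDP senders never notice.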
So that part is very challenging in terms of the processes involved, because we need to adhere to their requirements.

Then the final thing is upgrades, and the fatigue they bring. Typically at CSPs and operators you can't just go and operate willy-nilly. You need to take a maintenance window, you need to ensure traffic is offloaded and there's no impact to subscribers, so that you don't get hit with a 911-call outage and get fined by the FCC, those kinds of things. All operators are very careful when it comes to doing upgrades, and that slows the upgrade process down. At the same time, as is well known in the community, we need to upgrade, and as regularly as possible; we can't just stop and say we're not going to upgrade for an entire year. That is what leads to upgrade fatigue: if you deploy in hundreds of sites across a US production network, and each upgrade takes a few weeks, then by the time you finish the hundreds of sites you're actually back at the first site. The folks supporting it end up in a continuous upgrade cycle, and that's what causes them not to like the CNF; it makes them hesitant about upgrades.

So what really worked in our case, given all the challenges we faced, was the fact that the key stakeholders we interfaced with were actually pushing for the CNFs, and they supported us.
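The upgrade-fatigue arithmetic above can be put into a hedged back-of-envelope sketch. The fleet size, weeks per site, and parallelism below are assumptions chosen for illustration; only "hundreds of sites", "a few weeks" per upgrade, and the roughly year-long upstream support window come from the talk.

```python
import math

SITES = 200            # assumed fleet size ("hundreds of sites")
WEEKS_PER_SITE = 2     # assumed duration of one site's upgrade ("a few weeks")
SUPPORT_WEEKS = 52     # ~12 months of upstream Kubernetes support per release

def rollout_weeks(sites, weeks_per_site, parallel_sites):
    """Total elapsed weeks if `parallel_sites` sites are upgraded at once."""
    return math.ceil(sites / parallel_sites) * weeks_per_site

serial = rollout_weeks(SITES, WEEKS_PER_SITE, 1)   # one site at a time
par4 = rollout_weeks(SITES, WEEKS_PER_SITE, 4)     # four crews in parallel

print(f"serial rollout: {serial} weeks, 4-way parallel: {par4} weeks")
```

Under these assumptions even the 4-way-parallel rollout (100 weeks) overshoots the ~52-week support window, so the fleet re-enters an upgrade cycle before the previous one finishes, which is the continuous-upgrade treadmill the talk describes and why both more parallelism and longer support help.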
It's not so easy unless you have a true partner in crime on the other end who believes in the same journey, the same path you want to take. The advantage we had, and we got lucky that this held across multiple operators, is that the key stakeholders believed in the potential of CNFs, in the potential of everything we are doing in taking these into containerized network functions. They knew the end goal, and that's what helped drive us through some of these people and process challenges. If not for them, it would not have been so easy for us to actually deliver the CNF.

I just talked about upgrade fatigue, but upgrades, even though they were painful and we had to bake them through the different labs, eventually started to work. We are able to upgrade our CNFs, and the more we show that the upgrades work, show the time they took, show things like rolling upgrades and all the good features that come with Kubernetes, the more trust it builds in the folks managing it. That trust is very important to chip away at the upgrade fatigue, and it's what helps us push more automation, add more parallelism, and reduce the upgrade cycles.

Obviously Kubernetes is really good at what it does, and it helped us out in multiple ways. We had a human error that brought down multiple worker nodes, but the pods moved and the subscribers were not impacted, right?
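As a toy sketch of that node-failure story (the pod and node names are invented, and the real Kubernetes scheduler weighs resources, affinity rules, and disruption budgets rather than this naive spread-evenly policy), here is the shape of what happens when worker nodes are lost:

```python
def reschedule(placement, failed, healthy):
    """Move each pod off a failed node onto the least-loaded healthy node."""
    load = {n: sum(1 for v in placement.values() if v == n) for n in healthy}
    for pod, node in placement.items():
        if node in failed:
            target = min(load, key=load.get)   # naive spread-evenly policy
            placement[pod] = target
            load[target] += 1
    return placement

# 8 session-manager pods spread over 4 worker nodes (hypothetical names).
pods = {f"sessmgr-{i}": f"node-{i % 4}" for i in range(8)}
after = reschedule(pods, failed={"node-0", "node-1"},
                   healthy=["node-2", "node-3"])

# Every pod is running again on a surviving node, and the load stays even:
# this is the baked-in resiliency that kept subscribers unimpacted.
assert set(after.values()) == {"node-2", "node-3"}
```

The contrast with the VM world is that this respreading is automatic; nobody has to go in and manually rebuild the failed capacity before service resumes.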
That's just one example. The kind of resiliency that is baked into Kubernetes gives us an advantage that was not there before, because in earlier cases, if something happened and your VM went down, it brought down that entire set of subscribers, and we had to get in and physically, manually do things to get it up and running again. In our case, Kubernetes helped make it all more resilient.

Yeah, and I think the last one for me, and I can say this personally: troubleshooting is painful, but the tools we had, especially VictoriaMetrics, Prometheus, and Grafana to collect all the data, made a real difference. The same kind of problem that took us months to troubleshoot a year or so ago, this time we could narrow down in a matter of days or weeks, just because we have these monitoring and observability tools. Not only are we getting the data, we can manipulate the data on the fly. I no longer have to eyeball the CPU load across all 96 cores; I can just go straight to the mean, and build my own dashboards on the fly. That was extremely helpful for us, especially when we were doing the load test verification in the labs, trying to push twice as many subscribers as you would typically see. The tooling we have available really helps a lot, and it was not present in the rosy old world.

All right, so, call to action. Based on the challenges and what worked, I think it's very clear: please invest in your people and processes. The technology exists, but it has changed; it is different. It's not a question of better or worse.
It is different. So, on the people side: the operations folks are typically the last to see the technology, but they are among its main users; they use the technology day in, day out. People ask me, why are you taking away my SSH access? It worked, I had a very good life, I could print whatever I needed. But there are different tools now; they work differently, they are better, they are more secure, and so on. So please invest in those folks as well. And the same goes for the processes we talked about earlier; they need to be uplifted along with the technology.

Building off the last comment I had: the apps are now heavily dependent on the infra, and we lose sight of what's happening very quickly. A packet leaves a pod and then it's gone; you don't know where. So fault triage is still a challenge, I think, even despite all the good tooling we have. And as a call to action: if anybody can help with how to triage a fault that's observed on the application but caused by something that happened on the infra, how do we narrow that down to the root cause?

Yeah, and finally, we talked a little about upgrade fatigue and the reason for upgrading so constantly, and one of the primary reasons is support. It's not looked at so often, right? Kubernetes currently supports a release for 12 months, with about two more months of extended support, and that is not sufficient, not just in the telco industry; if you start looking outside as well, in finance and health care, they cannot afford such short support periods either. So the call to action is for the community to start looking at longer-term support for Kubernetes. That would help, because we don't want to start forking Kubernetes; nobody wants to fork Kubernetes. We want to contribute back to the community and make sure we follow the community path, right?
So that's the final call to action, and with that we end our talk. Thank you all so much, and in the meanwhile, if you want to send any feedback, please scan the QR code. Time for questions, if there are any.

It's not a question, it's a comment: there is a Kubernetes LTS working group, so you can join them and help them bring long-term support to Kubernetes.

All right, thank you. So that call to action is already being implemented. Any other questions? Over there.

Are we actually going the wrong way by doing long-term support for Kubernetes? I mean, you talked about how hard it is, the upgrade fatigue and so on. What we should be doing is improving how we do upgrades, so that they're not so manual, so we're not using maintenance windows all the time, so that they can be rolled out easily, and so that upgrades become something that happens without anybody really even noticing, right? I understand that in the short term you can't get there overnight, but we really need to figure out ways to get there, right?

Correct, and the reason for asking for long-term support was exactly to answer some of the questions you've posed there. We need to get to a place where upgrades happen seamlessly, but that journey is still far away. You need to first build trust, and the reason it's far away is things like the regulatory requirements for 911 calls. You cannot afford to drop them, so what happens if you're doing an upgrade, let's say in off-peak hours, and a 911 call comes in right then? You have to be careful to handle that. We are looking at ways to improve the upgrade, not denying that, but we are still some way off, and before we can get there we do need support at the Kubernetes level. And I also want to say one more thing, right?
There are two upgrades we're talking about: one is the platform, which is Kubernetes, and the second is the applications, and the two are not traveling at the same speed. So if we get more support at the Kubernetes level, which gives me a stable base, then I can start pushing the applications toward similar support terms. That's why this call to action. Any other questions? All right, thank you very much. Awesome. Thank you.