Hello and good morning, everyone. We are excited to tell you more about our cloud-native journey. It's fantastic to be in this lovely city of Amsterdam, and after these very inspiring lightning talks, Alessio and I are looking forward to an amazing day and to sharing with you the experience of the cloud-native journey we've had so far at Swisscom. I'm Josh Hiller, product owner at Swisscom, and I'm very passionate about DevOps and about bringing together people and ideas to build innovative solutions. And with me is...

Yeah, I'm Alessio, senior DevOps engineer at Swisscom. It's my first time at KubeCon, but I hope I will be able to give you some technical details on how we try to move away from spreadsheets towards everything as code. So let's get started.

So, back to the title. First we want to explain... oh, no. What's that? Oh no, my teammate put me on the on-call service for our 5G core, so now Opsgenie is calling me. I will take the call... sorry, I missed it. Alessio will explain to you what's actually going on.

What's going on here is that someone, probably by mistake, deleted CCRC, which is our NRF function. I cannot... here, maybe... yes, as you can see here, there is a CCRC namespace that was created three minutes ago, which means it had been deleted and is now back alive. This was on our on-premises infrastructure. And now, if I manage to find my mouse... yes, you will see that it's exactly the same situation here for the NRF in the AWS cluster we have: a freshly created namespace. Thanks to our automation we don't have to worry about that, and we can continue with our talk; we will see the result at the end.
So back to you, Josh. Yeah, so why are we doing this? First, we really want to improve our time to market to build new services for our customers, and we want to benefit from cloud elasticity, so we can build our telco cloud on different clouds: public cloud, private cloud. With this we also want to improve our innovation, to build more reliable services and to operate our services at scale. But all this new technology is arriving, and we really need the skills to be able to manage and build this automation.

For this we have our journey from telco to techco, and there we have four pillars. The first is really about simplicity: telco infrastructure and services are increasing in complexity, so we need to simplify to be able to build new services and products fast and get them out to our customers. The next pillar is cloud nativeness. It's not about religion; it's really about the sustainable operational benefits we want to achieve with the elasticity of the cloud. For this we also really need automation: we cannot use spreadsheets anymore, we need everything as code, described in our repository, so we can push it to the different infrastructures. And to do this we also need to change how we work. It's not enough that one person knows about a certain capability and the next person knows about the next capability; we really need to work together to make this automation work.

So how do we do this? First, we have one backlog for the whole 5G core we build at Swisscom, which really helps to prioritize and to share knowledge between all the different teams. We really encourage people to work on tasks of other teams and to share their knowledge: a truly boundaryless collaboration.
We bring together networking engineers, Kubernetes experts, and application specialists to build this new 5G core together. And we have started to experiment way faster: we test something, we fail, and we try another solution, so we rapidly find new solutions for the problems we are facing. The last point is the close collaboration with our vendor, so we can give them fast feedback and push them in the right direction to build more cloud-native workloads.

With this, I now want to go into more technical detail. To make all this work, and to be able to spend more time building new products and services for our customers, we really need end-to-end automation, so we don't have to spend our time on tedious, repetitive work anymore. To give you a short introduction to our automation pipeline: first, there is an overarching pipeline (today we use Jenkins for that), and then we have a few sub-pipelines triggering different functionalities. We get the software packages from our vendor to our software gateway. From there, the sourcing pipeline monitors whether there is any new software, takes it if needed, decomposes it, and pushes all the artifacts, the Helm charts and the images, to our Artifactory. We also run security scans, so we get alarms if there is any vulnerability in the software we receive from our vendors. And then, most importantly, our DevOps team manages everything as code in our Git repository.
So our full service is fully described in Git, nowhere else. Then we have our different clouds, public and private, and there we have our Kubernetes clusters. We use an operator called Flux, which goes to Git, reads the description of what to deploy, fetches the Helm charts and images from the Artifactory, and synchronizes them to the Kubernetes cluster. With this we can deploy our 5G core fully automatically. When a CNF is deployed, it notifies the overarching pipeline, which triggers some post-deployment steps. The configuration, the business logic of the CNF: Ansible also fetches this configuration from Git and pushes it over NETCONF to the different CNFs. We also have additional pipelines to configure an element manager or a security manager, so they know about the newly deployed CNF, can communicate with it, and get the information needed for their capabilities.

Then one of the most important parts is our testing pipeline. There we have different tools we can trigger, for example chaos testing with Litmus, which was mentioned before. We can trigger it after an upgrade or after a config change, and we also run this testing continuously, as we will show you later, against individual CNFs, as a system under test, or as end-to-end tests. Also very important is our observability capability, where we collect all the metrics and logs. We have a dedicated cluster for the long-term storage of all our data, and this cluster is managed by the same pipeline as well. We use our internal monitoring-as-a-service to create the dashboards and also the alarms. Unfortunately we were not able to hear it before, but that would be Opsgenie calling me to say that the namespace of the pods and our apps got deleted.
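To make the Flux part above a bit more concrete, here is a hedged sketch of what such a Git-driven deployment could look like. All names, URLs, and versions are invented for illustration; this is not the actual Swisscom setup.

```yaml
# Flux watches a Git branch describing one cluster ...
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: cnf-deployments
  namespace: flux-system
spec:
  interval: 1m
  url: https://git.example.com/5g-core/cnf-deployments.git
  ref:
    branch: cluster-prod-01          # one branch per cluster
---
# ... and reconciles a CNF Helm chart pulled from Artifactory.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: nrf
  namespace: ccrc
spec:
  interval: 5m
  chart:
    spec:
      chart: nrf
      version: "1.2.3"
      sourceRef:
        kind: HelmRepository         # backed by the Artifactory Helm repo
        name: artifactory
        namespace: flux-system
```

A change merged to the cluster's branch is then picked up and applied by Flux, with no manual deployment step in between.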
With this, I would like to hand over to Alessio.

Thank you, Josh, for the great introduction. I would now like to go into some more technical details about the Git repositories we have and how we manage them. The first one I would like to talk about is the IP low-level design repository, which is one of the milestone changes we lately introduced. It's an Ansible inventory where we store all the IP-related information for all the applications, the infrastructure, and all the tools we use to manage our 5G core. It's really what enabled us to further automate everything else. On the left side you can see the Ansible inventory, whose hosts are the Kubernetes clusters we manage, divided into groups according to their characteristics. On the right side you can see an example of how we store the parameters: a plain YAML file with a hierarchical structure. As you can imagine, this is far away from what we were used to in the past with spreadsheets, where you have a spreadsheet somewhere in SharePoint with a freestyle structure that anyone can go and modify. Now that these parameters are stored in Git, you have a centralized place, a peer-review process to introduce changes, and the one unique place from which you get parameters. The way you can retrieve them is described at the bottom of the slide.
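As a hedged sketch of such an inventory: the hostnames, group names, parameters, and addresses below are all invented, not the real low-level design, but they illustrate the hierarchical YAML structure and the retrieval path described here.

```yaml
# inventory.yml: hypothetical IP low-level design excerpt.
# The inventory hosts are the Kubernetes clusters we manage,
# grouped according to their characteristics.
all:
  children:
    dev_clusters:
      hosts:
        cluster-dev-01:
          oam_vip: 192.0.2.10        # per-cluster parameter
    prod_clusters:
      hosts:
        cluster-prod-01:
          oam_vip: 192.0.2.20

# group_vars/all.yml: parameters shared across every cluster.
ntp_servers:
  - 192.0.2.1
  - 192.0.2.2

# A generic script can read the same data, for example:
#   ansible-inventory -i inventory.yml --host cluster-prod-01 | jq -r '.oam_vip'
```

The same path-plus-name lookup works natively inside playbooks and Jinja templates, so every consumer reads from the one source of truth.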
So if you are within an Ansible playbook or a Jinja template, it's natively supported: you reference the path and the name of the parameter you look for. It's as well quite easy from a generic script, for example Python or whatever: you can use the ansible-inventory command piped into a jq parser, again with the path and the variable name, to get the value you look for. This is what we do not only for the IP-related parameters but for all the parameters of the applications and infrastructure we manage, as I will show in the next slides.

Let me now go into some more detail about the code base for our CNFs. As you can see from the top-right corner, at the moment we have one repository for each CNF we manage. This was not the case at the beginning, when we had an all-in-one repository. However, that was not manageable at all; you can imagine the amount of PRs coming in to introduce different changes in the different network functions. It was not scalable or manageable, so we decided to split into one repo per CNF. The structure of each repo is anyway the same, and it's the one depicted in this slide. The first folder is the ansible folder, where we store the application low-level design: all the parameters that belong to the application and are not IP-related. It's again an Ansible inventory, where in this case the hosts are the different instances of one of the CNFs we manage. In the example you have CCPC, the policy control function, and you see all the different instances we have. As I said before, we store parameters per host, when they are specific to one CNF instance, or in group vars, when they are shared across CNF instances. In the same folder we also have the Ansible playbooks and Jinja templates we use to produce the configuration and the Kubernetes deployment files.
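As a rough illustration of that render step: the group name, paths, and template contents below are invented, and the real playbooks and templates are vendor- and CNF-specific, but the shape is the same.

```yaml
# Hypothetical playbook: fill Jinja templates from the low-level design
# inventories and write the resulting files back into the Git structure.
- name: Render deployment and configuration files for each CNF instance
  hosts: ccpc_instances              # invented group of CNF instances
  gather_facts: false
  tasks:
    - name: Produce day-zero values file (Kubernetes/Helm input)
      ansible.builtin.template:
        src: templates/values.yaml.j2            # e.g. "oamVip: {{ oam_vip }}"
        dest: "rendered/{{ inventory_hostname }}/values.yaml"
      delegate_to: localhost

    - name: Produce day-one configuration (later pushed via NETCONF)
      ansible.builtin.template:
        src: templates/day1-config.xml.j2
        dest: "rendered/{{ inventory_hostname }}/day1-config.xml"
      delegate_to: localhost
```

Because every instance is rendered from the same template, a change is expressed once and materializes consistently for all clusters.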
So there are Ansible playbooks that fetch values from the different low-level design parameters and populate templates to produce the application files, which are again stored in Git with the structure you can see in the slide. For example, here we have all the files for day zero, which are the Kubernetes resources, and the day-one configuration files, which are the files Josh showed you before that are pushed via NETCONF.

Regarding the branching strategy we have in place: we have one branch for each cluster we manage across the different sites, so several branches each for dev, staging, and production. How the merging strategy works is depicted in these slides. The idea is that when we want to introduce a change, whatever it is, an upgrade or a new feature, we start clean from a feature branch and produce all the files from the templates for all the different clusters. When all these changes are ready, we raise a merge request to move from the feature branch to one of the development branches, that is, to one of the dev clusters. We perform some automated tests, and when the feature is validated to be working as intended, we move from the first dev cluster to the other dev clusters. When moving from dev to staging, it's the same process: a lot of testing, and when it's validated on one staging cluster, it spreads across all the other staging clusters and finally, eventually, hopefully, to production. What is really important here is that with the templates we really gained a lot of velocity and we reduced the failure rate, because the structure of the change you introduce, whatever it is, is always the same, and the parameters are gathered from the low-level design inventories.

Going now to the infrastructure repos.
Here, as we said before, we manage two kinds of infrastructure. The on-premises one is deployed through the package we get from our vendor: Ericsson Cloud Container Distribution (CCD), a flavor of Kubernetes deployed on top of OpenStack VMs. In this case we also have an Ansible inventory, so we store the parameters for CCD in a specific repo, and again Jinja templates and playbooks to produce the deployment files, the environment files you may be familiar with from OpenStack, which are then deployed by a pipeline. For the AWS case, we store in Git as well all the parameters for the workflows and Step Functions we use to deploy all the needed resources in AWS: EKS clusters, VPCs, and whatever else we need.

Going now to another repo, the one we call the common repo. This is another milestone repo, a kind of glue among all the other repositories, because here we store common artifacts, for example the Jenkins pipeline we use to trigger the day-one configuration via NETCONF. We store as well a bunch of Ansible playbooks that are used to configure some other systems, for example the gateways, the routing, and the firewall rules we need to operate our 5G core. There is as well the definition of the Flux operator, which, as Josh showed at the beginning, is the operator we use to
deploy our Kubernetes apps.

Just a quick note on how we manage the rest of the applications, which are different application kinds, like the ones we use for testing or for our observability stack. All of those are of course also stored in Git and managed the same way as the others: inventories, Flux, and eventually some Ansible playbooks and pipelines.

But how does all of this come together to form what Josh introduced at the beginning? This is what I tried to depict in this picture. The idea is that at the start there is a manual input of the parameters into the different low-level design repos. At the moment it's manual, but it could be automated as well, for example via an IPAM. Once all those parameters are ready, the automation kicks in. For the infrastructure, we have a playbook that pulls in the information, populates templates, and produces configuration files, and then, with another pipeline, the infrastructure is deployed. When the infrastructure is ready, other playbooks again pull the information from the application and IP low-level designs, populate templates, and produce configuration files: day-zero files, which in our language are, let's say, the Kubernetes files, and day-one files, the NETCONF configuration files. Flux can now take those files from Git and deploy them on our clusters, as well as a bunch of tools, for example the External Secrets Operator.
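A hedged sketch of what such an ExternalSecret could look like; the store name, Vault path, and key names are invented, and the API version may differ between operator releases.

```yaml
# Hypothetical ExternalSecret: keeps a CNF credential in sync
# between HashiCorp Vault and a Kubernetes Secret.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: nrf-tls
  namespace: ccrc
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend              # points at the Vault instance
  target:
    name: nrf-tls                    # the Kubernetes Secret kept in sync
  data:
    - secretKey: tls.key
      remoteRef:
        key: 5g-core/nrf             # invented Vault KV path
        property: private_key
```

This keeps the secret material itself out of Git while the desired state stays fully declarative.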
We use it to sync secrets from HashiCorp Vault. Then there is the day-one pipeline, which configures the CNF via NETCONF, and when the CNF is ready, a bunch of test cases are executed. To complete the picture, we have as well some other repos we manage, for example the one where we define alerting rules; one of those alerts is the one we received at the beginning of this presentation, alarming us that a namespace was deleted. And lastly, this repo here on the upper-left side, where we have all the other tools we deploy, as I said, with Flux.

After all of that, our automation, I hope, should have worked as expected. Let me now check on the dashboard whether everything is back as intended. We prepared this dashboard here for KubeCon... then my view... okay, the VPN, I hope it's working again. Yes. What you can see here is the Grafana dashboard. With the metrics we get from the cluster, we see that there was a period where, of course, we lost a lot of pods, because the namespace was deleted, and now the situation is more or less back to normal. Unfortunately, we don't have all the NFs registered back yet, but this is real life; sometimes it takes time to get all the NFs registered back into the NRF. What I would like to show you as well... I think we have problems with the VPN, but the idea was to show you here some log entries about... yeah, it's even worse now. Anyway, we should have seen some log entries about the changes we made, the NETCONF configurations that were pushed, as well as some information about the testing. Let me see if I can recover this situation... probably it's reconnecting.
I don't know why. Really sorry about that. That should have... okay, yeah, anyway, let me maybe try quickly... no, I will not make it. Really sorry. And with that, let's switch back to the presentation for some conclusions from Josh.

Yeah, so really important there: we are also able to show the test progress, so you would have seen how many tests were successfully completed after this redeployment. We don't need to touch anything to get the CNF back, or to deploy a new CNF; that can now be fully automated. And with this, basically, our conclusion is that automation outruns spreadsheets. We will no longer work with a text file or something stored on an individual laptop; that's not possible anymore in such an environment. We can really benefit from cloud elasticity, especially if the workload supports it, so you can deploy it wherever you need it. It's important to have this low-level design as code, together with a smart repository structure. And you need to bring together the good engineers from all domains to figure out the best solutions for the problems you need to solve; that's what really enables the transformation along with the technology. So for us, we have started our journey from telco to techco. And with that, thank you for listening. If there is still time, we can take a question, or we move to the next talk.

The audio setup doesn't allow the audience to ask questions, so I'm going to play that role. Thanks for this presentation, very insightful. I'm quite curious: how do you tie this in, for example the change process with the merge requests promoting from environment to environment, into your existing, well-established change management process in the company?

So that has not been done yet.
So there we really try to go new ways. We are working with our change management to see how we can automate changes, and which changes really need to go through the formal change process versus what can just be done automatically. Because with how we work now, it's not possible anymore that someone understands why a particular change is going to production, or even has the time to click somewhere to accept it, or to ask why it needs to go to production at all. We really want to get to a change capability where we bring hundreds of changes a week, or even a day, to production, so it becomes impossible to know exactly what each individual change is doing. And because we have this structure and all the testing, we are very confident that everything that gets pushed to production is not breaking something.

Thank you very much. Thanks Josh, thanks Alessio. Next on the stage, please give them a round of applause.