Good morning, everybody. It's really a pleasure to be here with you virtually at GitOpsCon. My name is Michel Severa; I'm leading the cloud native 5G DevOps team at Deutsche Telekom, and today I would like to discuss our topic, which is managing legacy to cloud native with GitOps at Deutsche Telekom. I would like to use the context of our 5G journey to describe the benefits of such a transition to a cloud native model and, of course, the challenges related to it as well.

In the course of this presentation I would like to take you through three steps. Step number one is answering the question of why we need a cloud native production model at all. Once we cover this, we will focus on the "what" section, where I would like to talk a little bit about the desired state concept and the twelve factors. Finally we will go to the "how" section; here the focus will be more on challenges and opportunities, but we will also do a deep dive into GitOps-based operations, including structures and flows.

So first of all, let's try to answer the question of why we need a cloud native production model at all, and a very good background for this question is the Kubernetes documentary movie.
In this movie you will spot an interesting topic, which is the desired state concept. Desired state is very much related to Promise Theory, proposed by Mark Burgess, for understanding resiliency in the context of faults, errors, and tolerance within systems. The very important discovery was that you could maintain this desired state by automating the reconciliation between the desired state and the running state within your cluster. Having this in mind, we can really achieve increased agility and productivity, improved scalability, lower cost, and, finally, completely getting out of vendor lock-in. We call it "melting the cheese on the pizza": unifying the approach to managing all systems, independently of which vendor delivers each system.

The desired state concept was a big benefit and a big milestone for IT systems, and it was absolutely obvious that sooner or later the telco segment would take a look at this concept as well, especially since telco was also on a journey. That journey started with very box-based systems, like ATCA platforms, where hardware and software come from the same vendor. Later we had the virtualization step, shown in the middle of the slide: network function virtualization means that each network function is represented by a VM. That is still very much a silo approach, because if you think about a box like an MME or a gateway represented by a VM, it is very aggregated; it is still a silo, and you cannot benefit from a microservice-based architecture. However, it was a very important milestone, because that was the moment when we saw the decoupling of hardware and software in the telco industry. It was quite natural that the next big step would be even more decoupling: from the assets perspective, hardware and software are still fully decoupled, but in terms of the software we have further decomposition of the network functions into small modules.

So now, if I think about a component like Istio for a service mesh, I can use this very same service mesh component for each network function completely separately, and I don't need to deploy or manage it from the application perspective; I am very much focusing on my microservices themselves and taking the benefit of the disaggregation. This disaggregation can be nicely explained when we compare the architectures: on the left side you see the virtualized silo, on the right side the web-scale cloud native model. In the virtualized silo, in every network function, things like lifecycle APIs or the database are an integral part of the application; on the right side they are completely extracted, so the application developer can really focus on the key aspects. If you think about the AMF, that will be the termination of gNB functions; things like the service mesh, alarming, or database management are no longer part of the AMF. That is a very important point: it basically means we are now talking about a solution which is not yet another black box, and it is also no longer a single point of responsibility for one vendor; now you need to consider all the layers and all the involved parties.

However, there is a very important remark I would like to make here: running cloud native does not mean just running containers. That is a typical mistake; in a lot of slides and a lot of presentations in the industry you will find the claim that things are cloud native just because they are running in containers. You could absolutely run a very silo-style application in a containerized framework, but it would not be cloud native. Running cloud native means that we need to bring in the entire automation, specifically GitOps-based application and automation management with CI/CD pipelines; that is the prerequisite for being cloud native, not just running containerized applications. This is very nicely presented in the Cloud Native Manifesto: An Operator View, by the NGMN Alliance.
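The reconciliation of desired and running state described above can be sketched as a toy control loop. This is purely illustrative (real controllers such as those in Kubernetes or Flux watch resources and act continuously; the state dictionaries and action names here are invented for the example):

```python
# Toy reconciliation: compute the actions needed to converge a running
# state onto a declared desired state. Purely illustrative; real
# controllers (Kubernetes, Flux) are event-driven and far more involved.

def reconcile(desired: dict, running: dict) -> dict:
    """Diff desired vs. running state and return corrective actions."""
    actions = {}
    for name, spec in desired.items():
        if running.get(name) != spec:
            actions[name] = ("create_or_update", spec)
    for name in running:
        if name not in desired:
            actions[name] = ("delete", None)
    return actions

# Example: the repo declares two network functions, the cluster runs
# one stale instance and one component that should no longer exist.
desired = {"smf": {"replicas": 2}, "upf": {"replicas": 1}}
running = {"smf": {"replicas": 1}, "legacy-nf": {"replicas": 1}}
print(reconcile(desired, running))
```

Running the loop repeatedly until the diff is empty is exactly what "transferring desired state into running state" means in practice.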
In that manifesto you will see exactly the list of points that are critical for an application to be considered cloud native. In order to understand the biggest difference between the cloud native model and the legacy model for telco, we need to understand how classical management of a legacy network looks. Think about a site, and this site contains a lot of systems, each coming from a different vendor, system 1 to system n. Each of these systems has very specific vendor configuration management, and each has a completely different concept of desired state, a desired state which can be achieved by restoring a backup. Now imagine there is a change request: someone wants to introduce a new feature. This new feature needs to be translated from your head, through a protein-based interface, through your hands, towards the keyboard, and you need to go separately from one system to the other, using completely different procedures. Also, in the case of disaster recovery, if you lose this site and want to recreate it, you need to do it completely separately, with a different procedure for each vendor. It basically means there is no way to achieve a desired network state in an automated way in such a model; also, the time for reconciliation, meaning turning the desired state into a running state, is very long.

That is very different from the cloud native concept, which is based on GitOps. Here everything, including the infrastructure, is treated as code. It sits on the left side, and we have a single source of truth, represented by what we have in Git, as the desired network state. On the right side we have our running state, and in between we have CI/CD pipelines. In this model we have automation of transferring the desired network state into the running network state, and that is the essential difference from the legacy model, where there is no possibility of such automation: you move from a very imperative way of implementing things to a declarative way based on GitOps.

In our environment, the element responsible for this CI/CD automation, for the reconciliation, is Flux. Flux provides exactly the capability of reconciling the desired state and the running state in our cluster. Whenever we change anything in the repo (let's imagine I'm adding a new DNN, which might be just a configuration change, or I'm adding yet another new slice in my network, so a new instance of SMF and UPF, for example), the source controller detects it and triggers the entire reconciliation process. That gives us the chance for fully automated reconciliation in the network.

It is also very important to understand that in order to be declared cloud native, we need to avoid vendor-specific network element managers. It means that applications should follow the so-called twelve-factor principles, and a twelve-factor application always stores configuration in environment variables. Let's imagine, based on this slide, that I want to change parameters in my application, say to create a new APN. I can go to my cluster repo on the left side and change the values; Flux detects it and triggers the reconciliation, using only standardized Kubernetes APIs. As you can see, in this entire process, up to the change of the ConfigMaps reaching the application, there is no involvement of any northbound interface; by the way, there is no such northbound interface at all, it is simply not present. The basic principle here is that every application, independently of whether it is a vendor-specific microservice for the AMF or a generic PaaS component like Istio, is managed in exactly the same way. That was the "melting the cheese on the pizza" concept, and if an application really follows the twelve-factor principles, we can achieve exactly that; that's very important for our architecture.
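The twelve-factor configuration rule just mentioned can be illustrated with a minimal sketch. The variable names below (APN_NAME, DNN, LOG_LEVEL) are hypothetical, not the interface of any real network function:

```python
import os

# Twelve-factor style: configuration lives in the environment, not in
# vendor-specific element managers or files baked into the image.
# All variable names here are illustrative assumptions.

def load_config() -> dict:
    return {
        "apn": os.environ.get("APN_NAME", "internet"),
        "dnn": os.environ.get("DNN", "default"),
        "log_level": os.environ.get("LOG_LEVEL", "info"),
    }

# In Kubernetes these values would typically be injected from a
# ConfigMap; here we simulate a config change by setting the variable.
os.environ["APN_NAME"] = "enterprise-apn"
print(load_config()["apn"])  # prints: enterprise-apn
```

Because the application only reads its environment, changing a value in the Git repo and letting Flux update the ConfigMap is enough; no northbound interface or vendor manager is involved.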
We have covered the why and what sections; now let's focus on the how: in essence, how to introduce the new cloud native mindset. There is a nice quote that the most dangerous phrase in the language is "we've always done it this way." This was also a very big risk on our side: we knew that if we tried to port all the legacy principles into a new cloud native deployment, it would cause a lot of challenges. So in our case it was essential to start with a new, clean design paradigm: moving away from the world of boxes; using declarative deployments with the "don't repeat yourself" principle; using a canonical source of truth; and relying on GitLab RBAC to support the telco processes. That was all very essential.

It was also absolutely critical to finally get rid of unnecessary practices. In a legacy network you will typically have one major software release upgrade per year; instead, in cloud native you should disaggregate the risk of the upgrade process by making a lot of smaller changes. In our case we make frequent changes, even a few per week, but on different levels and on different components, and that significantly de-risks the overall operations.

It was also very essential to focus on the core competences on the engineering side. If you think about a typical network operator, you have telco experts and you have cloud experts, and they don't speak each other's language. Telco experts typically know everything about 3GPP protocols and have experience with classical telco management, but they have no clue about, and no experience with, the new cloud native principles. Vice versa, cloud experts have very solid GitOps framework experience, with declarative deployments and web-scale applications, but they basically don't speak telco language. For us it was very essential to create a hybrid team, mixing cloud and telco competences together, and to figure out the answers to key questions like: what can we change on both sides, how can we start with a greenfield, how can we empower the teams on both sides, and, very importantly, how can we celebrate mistakes? Unfortunately, during the course of the project we had a lot of cases where the network was somehow affected, or we had severe issues in the lab, but each such learning, each such discovery, was a good opportunity to make improvements the next time, with an iterative approach.

So we ended up with a system which takes the best of two worlds. On one side, all the best cloud native principles, with CI/CD pipelines and Flux-based reconciliation, which is presented on the upper part of this slide; and on the other side, very classical telco integration points, in essence things like the RAN, or Spirent for traffic emulation and performance tests. What is also very essential is that the lifecycle management of these systems works independently for all the components: you can manage the AMF lifecycle, the SMF lifecycle, and at the same time things like the PaaS components, completely independently.

It is also very important not to underestimate the complexity. On one side, the new cloud native operating model brings a lot of opportunities, like de-risking the upgrades through higher frequency; on the other side, we have a lot of challenges related to the architecture itself. In essence, things like rolling upgrades, and in-service software upgrades which need to happen without impact on the service, really create a lot of problems, and unfortunately the benchmark against the legacy systems is really very tough. Because legacy systems were developed over years, they have really very good quality KPIs: something like one or two minutes of downtime per year is absolutely possible. In cloud native, I am joking a little bit that a single rolling upgrade can consume those two minutes
or three minutes, in case things go wrong, in just one day; and that is definitely a challenge which needs to be resolved in the upcoming months. Also, things like end-to-end troubleshooting and tracing need to be completely redesigned and rethought; they are very different from the legacy systems.

I would also like to mention the so-called butterfly effect in highly distributed systems. Because we have a very high frequency of changes, and because we have a very high number of microservices with a lot of dependencies, the challenge is how to assure service quality and reliability towards our customers in such an environment. In legacy systems the typical approach is that you test in the lab, and once all the testing is done, smoke and regression tests, you do the upgrade, and that's pretty much it. In cloud native environments with a high number of microservices, the strong recommendation is to use the non-stop testing concept. Non-stop testing assumes that you keep running all those tests in production as well, and this can provide a very early warning that something is wrong after you make a change. Even if a change was fully tested in the lab, you cannot be sure it will always be issue-free, because the number of possibilities and the number of dependencies is very high.

It is also super important to mention that root cause automation needs to be done differently. With a very high number of microservices, and with the complexity we are talking about, a manual approach where you just deliver PCAPs and human beings analyze them definitely does not scale. One of the solutions is, of course, to use AIOps: basically, to feed all the logs and all the data points into one system, one big data lake, and then run AI-based algorithms on top of it to predict and to support root cause automation. OK, time for the final conclusions.
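The non-stop testing idea can be sketched as a small synthetic probe that runs continuously and fires an early warning on repeated failures. The `check` callable is a stand-in for a real synthetic test (for example, a scripted UE registration); nothing here is an actual Deutsche Telekom tool:

```python
# Sketch of "non-stop testing": keep running a small synthetic check,
# even in production, and raise an early warning when quality degrades.
# The check function is an assumed placeholder for a real probe.

def probe(check, iterations: int, alert_threshold: int = 3) -> bool:
    """Return True if `check` fails `alert_threshold` times in a row."""
    consecutive = 0
    for _ in range(iterations):
        if check():
            consecutive = 0
        else:
            consecutive += 1
            if consecutive >= alert_threshold:
                return True  # early warning: service quality degrading
    return False

# A healthy service never trips the alarm:
print(probe(lambda: True, 100))   # False
# A persistently failing check trips it quickly:
print(probe(lambda: False, 5))    # True
```

In a real deployment the probe would run as its own workload, feed its results into the monitoring stack, and page an operator instead of returning a boolean.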
First of all, running applications in containers is definitely not enough to declare an application cloud native; we need to follow the twelve-factor principles, in essence storing all configuration in environment variables. That's point number one. Second, GitOps as a framework is an essential component of the transition for telco, and we need the ability to convert the desired network state into the running network state, not only in daily operations but also in the case of emergency and backup recovery. The third point, which is also very important, is that we should follow declarative deployments using this desired network state, and we should definitely avoid any imperative actions; there are still a lot of systems considered to be cloud native that rely on a lot of imperative scripting, and that will always cause issues, in essence during disaster recovery. And last but not least, I mentioned the so-called butterfly effect, which is very much related to highly distributed systems. In order to avoid it there are two recommended approaches. First, we should use proactive service assurance, basically the so-called non-stop testing, even in production; so even if we tested things in the lab, we should keep testing the system non-stop. Second, we should not rely on a manual troubleshooting process, because the data is overwhelming and the scale is beyond human capacity; the use of AIOps and those kinds of solutions is very much recommended.

I would like to say a big thank you for your time. In case of any questions, I am absolutely available; you can find me on LinkedIn. I hope you will enjoy the rest of the presentations. Thank you very much!