Hello everybody. This is Dorian, and I'm Roman. We are from a really good company, maybe you can already see it from the colors: Deutsche Telekom. And we want to talk about observability, how we do monitoring in the cloud-native telco world, and how we can tie this all together.

Okay, let's look at our short agenda. The question is: how do we collect everything? How can we get the metrics? How can we evaluate them, and once we have them, what do we do with them? And of course an outlook to the future.

Okay, here you can see the ecosystem we're talking about. You can see that there are multiple clusters from Das Schiff. And as you can see, we don't only have 5G components; you see there is also one called MME, so the good old 4G. We want to support our customers there as well. You can also see that everything is deployed via GitOps. And of course there is this central instance where we push the metrics and logs and everything else; it's also a platform, and underneath there's Elastic, so an environment to analyze that data.

Okay, so what is observability? It's events, and we have different kinds of events there: we have the logs, we have the metrics, we have some alerts, and, also really important, we have traces. Next we talk about each of these steps, about log shipping, metrics, alerts, and then we have a special guest. And now we'll pass over to Dorian, who will say something about log shipping.

So, thank you. For log shipping in Kubernetes there are not a lot of ways. The picture you see on the slide is taken directly from the Kubernetes documentation. Basically, you have the option to run one sidecar per pod, collect the logs there and send them to some sink. Of course, it's also possible that the application itself is able to stream to something, in our case Elasticsearch; unfortunately that's not the case everywhere. Somewhere it's possible, somewhere not, so we need a more unified approach. And of course you can also log to disk directly on the worker node, collect the data there and send it to some logging backend. As I said, the slide is directly from the Kubernetes documentation, so not really a surprise.

We tried several of the log shipping agents available. At the start of the journey we had Fluent Bit, and it just worked, it was fine. But our platform provider, the Kubernetes-as-a-service provider, which is Das Schiff at Deutsche Telekom, wants to move to Vector, a pretty new project under the CNCF umbrella. We had some trouble getting it right, and then we moved to Filebeat, because we have at least two engineers who are really knowledgeable about Filebeat, so we moved back to that one. But we're already planning to go back to Vector once we can and once we find a bit of time. I'm not quite sure whether the journey ends there, or whether we need to iterate further and maybe at the end of the day we will be back at Fluent Bit. Who knows, let's see.

What are the problems? Why do we iterate through so many log shippers? Well, security is really a problem if you think it through: mounting the whole of /var/log/containers or /var/log/pods into some kind of application running inside the cluster is of course against need-to-know principles. And maybe we are able to get some audit logs.
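Just to make the node-agent variant concrete: a minimal sketch, assuming a shipper that has the node's /var/log/containers mounted and an Elasticsearch bulk endpoint to send to. This is toy code, not our Filebeat or Vector setup, and the URL, index name and field names are made up.

```python
# Toy illustration of the node-agent pattern from the Kubernetes docs slide:
# a process with /var/log/containers mounted reads every pod's log lines on
# the node and forwards them to an Elasticsearch bulk endpoint.
# Endpoint, index name and field names are assumptions, not our real config.
import glob
import json
import os

import requests

ES_BULK_URL = os.environ.get("ES_BULK_URL", "https://elastic.example.internal:9200/_bulk")


def ship_node_logs(log_dir: str = "/var/log/containers") -> None:
    """Read every container log file on this node and bulk-index its lines."""
    bulk_lines = []
    for path in glob.glob(os.path.join(log_dir, "*.log")):
        # File names look like <pod>_<namespace>_<container>-<id>.log
        source_file = os.path.basename(path)
        with open(path, "r", errors="replace") as handle:
            for line in handle:
                bulk_lines.append(json.dumps({"index": {"_index": "k8s-logs"}}))
                bulk_lines.append(json.dumps({"source_file": source_file,
                                              "message": line.rstrip()}))
    if not bulk_lines:
        return
    response = requests.post(
        ES_BULK_URL,
        data=("\n".join(bulk_lines) + "\n").encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        timeout=30,
    )
    response.raise_for_status()


if __name__ == "__main__":
    ship_node_logs()
```

Exactly because such an agent can read the logs of every namespace on the node, that mount is a very broad permission, which is the need-to-know problem just mentioned.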
Maybe there's some stuff in the logs which should really not be in there and not be sent to an analytics solution. Well, there are a lot of open questions, because to be honest I don't see a perfect solution anywhere in the Kubernetes environment right now. Then of course: what does the platform provider want to do? If they want to push Vector, we will obviously try to comply. One more thing is support: as you may be aware, the company Elastic just stopped the support for their Helm charts, which we're using to deploy, so we have a small problem there; let's see if we can solve that. And also maturity, because when we started using Vector for the first time, it was not even possible to send data to multiple endpoints. So if you have multiple endpoints for high availability, Vector was only able to speak to one of them. Not a problem in the Kubernetes world, but in this case the logging sink was not running on Kubernetes, so it didn't have a single endpoint but several, for high availability, and Vector was not able to support that. There are more problems, but I'm just scratching the surface here.

So let's talk about metrics. That's a different kind of beast, less in volume in terms of data storage, but there's still a lot there. Inside the box we have the solution provided by Das Schiff again: they run Prometheus, and the pods expose their metrics there. Thanos is a front end to query it, and Das Schiff also provides Grafana, which is running per cluster. And if Grafana can get the data from there, it is of course also possible to scrape the entire data set and send it to the logging solution again, which is Elasticsearch.

And again, we had huge problems there, with cardinality. Inside Prometheus we found something like 70,000 time series, and querying them and sending them to another system is a problem. This is the memory usage of the Thanos that we are using to query, and you can see, when we tried to just scrape everything there and send it away, Thanos was using something like 4.6 gigabytes of RAM during the queries and then slowly getting back to normal. And we had queries that took twenty-something seconds to go through and send the data back to us. If you want to query this data every 30 seconds, you can already see it's becoming a problem; if there are, I don't know, 5,000 time series more, it's not going to work out. It's possible to mitigate some of that, but I think cardinality is one of the biggest issues with Prometheus. There are some interesting talks about it; maybe we'll find some answers.

And of course there's a special type of events, and those are alerts. For one, we've of course used Alertmanager from Prometheus to create alarms. But also, and I just picked this vendor out of the multiple vendors that we have because it's already public knowledge anyway, Mavenir for example is creating their own alerts, and that's good, the software does it. The problem is what their alerts look like: which fields they have and which metadata they have is quite different from the other alerts that we created in Alertmanager, and the problem is always how to unify that again. Having one central place to do it, in this case Elasticsearch, and then creating even more alerts out of alerts, is one of the solutions. And there's one special type of event that we like: every Git commit to the main branch, which will result in a deployment, is sent there as well.
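As a sketch of how such a deployment marker could be sent, say from a CI job after the merge to main: one small document per deployment is enough for the dashboards to draw a line. The URL, index name and field names here are illustrative assumptions, not our exact schema.

```python
# Sketch: index a "deployment marker" event for a Git commit into Elasticsearch
# so dashboards can draw a vertical line at that point in time.
# URL, index name and field names are illustrative assumptions.
import datetime
import os

import requests

ES_URL = os.environ.get("ES_URL", "https://elastic.example.internal:9200")


def push_deployment_marker(cluster: str, commit_sha: str, message: str) -> None:
    document = {
        "@timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "event_kind": "deployment",
        "cluster": cluster,
        "git_commit": commit_sha,
        "git_message": message,
    }
    response = requests.post(f"{ES_URL}/deployment-markers/_doc", json=document, timeout=10)
    response.raise_for_status()


if __name__ == "__main__":
    push_deployment_marker(
        cluster="cluster-01",  # hypothetical cluster name
        commit_sha=os.environ.get("CI_COMMIT_SHA", "unknown"),
        message="bump AMF chart to 1.2.3",  # hypothetical commit message
    )
```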
So you can see these red lines, and the red lines indicate that at this moment in time there was a commit to some cluster. That means if you can see that something apparently is wrong, the log volume increases tenfold or something, but you see this red line, you at least know where to start looking.

And I've talked about the huge differences in quality of log formats and so on and so forth, and we found one special guest which is quite helpful for getting a more unified look at things, and Roman will present that one.

Okay. Before I come to this special guest, I want to talk a little bit about our network functions. We of course have a multi-vendor network function system here, and the question now is: they need to talk to each other, but how do we achieve some observability between them? In the good old legacy world, everyone just pulled a pcap, right? But if we look at 5G standalone with all these new services, we have RFC 1945. Does anyone know what RFC 1945 is? Raise your hands. No one? Maybe you're too young. It's HTTP 1.0. So actually HTTP is even older, and HTTP 1.0 is from 1996, by the way. 5G standalone, of course, is actually using RFC 7540, which is HTTP/2 from 2015.

So we decided to put something in the middle to abstract the communication between those network functions, and for the time being we chose Istio, for example. The gray boxes here are showing the sidecars, which allow us to abstract certain things, like security and so on. And with those sidecars we also have the possibility to use tracing solutions like Zipkin or Jaeger, for example. But our current way is that we are trying to use the access log first. That means every message which goes from one network function to another will be logged, and here you can see an example of that. This is already JSON, because JSON makes it easier to parse in a useful manner, especially if you have it in Elastic, for example. You can see here the method, the protocol, the network function, the path it's using, so it was a registration message; you see whether it was successful, you even have the duration. So lots of things, and you see I put some dots there, it's even bigger, and if there's an error, like a 400 or a 500 or whatever, you would then also see a root cause there.
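To give a rough idea of such an entry: the field names below follow the usual Envoy JSON access-log keys, and the values are invented for illustration, not taken from the slide.

```python
# Illustrative only: roughly what one Istio/Envoy sidecar access-log entry for
# an SBI call can look like once it has been parsed as JSON.
import json

access_log_entry = {
    "start_time": "2024-03-12T09:14:03.512Z",
    "method": "POST",
    "protocol": "HTTP/2",
    "path": "/nausf-auth/v1/ue-authentications",  # which service operation was called
    "authority": "ausf.5gc.example.internal",     # the network function being addressed
    "response_code": 201,                         # success or failure of the request
    "duration": 12,                               # milliseconds spent on the call
    "response_flags": "-",                        # root-cause hint on 4xx/5xx responses
    "upstream_cluster": "outbound|8080||ausf.5gc.svc.cluster.local",
}

print(json.dumps(access_log_entry, indent=2))
```

The nice part is that every field becomes queryable in Elastic, so you can filter, for example, for all 5xx responses towards one network function.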
Yeah, of course, if every message gives us a log entry, this also means it could be a problem from a performance point of view. Currently we are using this approach, but maybe we are going forward to those more enhanced ways of tracing for the SBI communication.

Okay, then let's jump to evaluating all this data, and Dorian is taking care of that. So we have a lot of data in all the places, and we have to deal with the fact that 5G is of course new, and maybe we have new vendors in there. What do we do with the data? The most important question is: is it working yet? Can I use it? Usually you can go from the bottom up and check: is the infra okay? Are the servers running? Is the connection to the RAN even there, or has it broken somehow?

Then you go up to the Kubernetes level, and you can check simple things such as: are our pods even up, are some of them not running? And if everything seems okay, then the only thing left is of course to look at the application logs themselves, the transaction logs like the 5G signaling protocol, and of course the real application logs as well, where maybe they report a problem, maybe they don't. But if you do all these steps, you really get a great understanding of why it is happening and what you can do to change whatever you don't like about it, for example high latency or crashing applications.

The other option, which we are also using, is to receive the answer directly. And the answer is obviously always 42; the problem is that we don't really know the question. So we have machine learning in place, and it tells us: hey, something is out of the ordinary, something doesn't really look right, please take a look. But maybe we don't even know why that's happening, and worse, we don't even know what to do against it. So this is the top-down approach, and the other is the most bottom-up approach, and they have to meet somewhere in the middle. I'll explain a bit more a few slides from now, so please bear with me.

There's one metric which I like. This is showing the successful and failed registrations at the AMF level, and unfortunately the lower bars, the really tiny ones, that's like 8%, 10% or something, those are the successes. So this looks like a horrible state, right? Everything is failing, and so on and so forth. The problem here is: if a device, for example, attempts to register with the AMF and is rejected for valid reasons, for example it's not allowed to use 5G SA or something, this is counted as a failure. So what does it tell us? Well, we have a lot of data in here, but this data is not really that useful to determine whether the application is working, because this can be perfectly fine; it could be 100% KPIs on everything else, just not on this one. So we did some evaluation, and it's cool to look at, but not that useful, in my opinion. So this was the evaluate step.

About the machine learning: for example, we receive emails like this one. This is again from the central platform; the machine learning job tells us: hey, there's a metric, it's called LTE-something, so we know it's related to the MME component, to 4G, and it's way higher than usual. These are registration attempts in this case. But the problem is, if you receive this out of the blue and you've never seen it before, you have the same problem again. Okay, you know something there is wrong, but how do you fix it? Where does it come from? And then you need to go back through the steps and trace it back, in the worst case to the hardware level, or maybe to some broken connection to the RAN.

So of course we need to take action, and now it's going to be a lot broader than before. What do we need to do if we have all the data? Well, we've seen that we need to automate all the interactions, especially with everything legacy, especially with things like the RAN, and of course also with the paperwork. Because every time we do a change, and we want to do small, incremental changes, we need to notify a lot of systems. We need to notify our own first line and say: hey, don't worry, this is a planned change, we have everything under control. We can trigger this, for example, from Helm chart hooks; if it's a pre-upgrade hook, we can just silence the alarms that are inevitably going to show up, and we can collect it from Kubernetes events.
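For the alarm-silencing part, a pre-upgrade hook could call the Alertmanager v2 API roughly like this. It is only a sketch: the Alertmanager URL and the cluster label we match on are assumptions about the setup.

```python
# Sketch of what a pre-upgrade hook could do: create an Alertmanager silence
# for the cluster being changed so the planned change does not page anyone.
# The Alertmanager URL and the "cluster" matcher label are assumptions.
import datetime
import os

import requests

ALERTMANAGER_URL = os.environ.get("ALERTMANAGER_URL", "http://alertmanager.monitoring:9093")


def silence_cluster_alerts(cluster: str, duration_minutes: int = 60) -> str:
    now = datetime.datetime.now(datetime.timezone.utc)
    silence = {
        "matchers": [{"name": "cluster", "value": cluster, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + datetime.timedelta(minutes=duration_minutes)).isoformat(),
        "createdBy": "helm-pre-upgrade-hook",
        "comment": "Planned change, see the Git commit / change record",
    }
    response = requests.post(f"{ALERTMANAGER_URL}/api/v2/silences", json=silence, timeout=10)
    response.raise_for_status()
    # Keep the ID so a post-upgrade hook could expire the silence early.
    return response.json()["silenceID"]


if __name__ == "__main__":
    print(silence_cluster_alerts(os.environ.get("CLUSTER_NAME", "cluster-01")))
```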
And, one small nitpick, even the change requests, right? We have a change advisory board; a change needs to be created, sent to them, and then they need to be asked: hey, is this okay, can we do it in this time frame? Well, if we do eight, or maybe fifty, changes in a day, which is maybe the goal with all the cloud-nativeness, we need to automate that as well. There's no way around it if you have multiple clusters and want to work like that; it's in no way possible to do this manually, to create some Excel sheets or PDF files and send them via email.

And of course, one thing you have to do is exercise. You need to do worker node rolls, meaning replacing old machines with new machines, with probably a higher kernel patch level or something else changed. You have stuff like Chaos Mesh and friends to do some chaos testing: just kill some pods and see if the application is as resilient as you are hoping. Do a lot of small changes to the environments and really get to know: is this a change I could do in the middle of the day when everyone is on their phones, or do I need to maybe shift it to the night? Of course that's not what we want, but it might be the current situation. And of course you really need to check: is the change hot-reloadable, meaning can I roll it out without the application even taking a restart, or does it include a pod restart? And if the pod restarts, what does that mean for the customer? Because that's the only thing we should really be worried about in the end, right?

As an example, at one point we ran these node rolls, completely replacing the entire hardware underneath us, for like five days in a row. So afterwards we really had all the data: hey, what do we know, what can go wrong, what doesn't usually go wrong? And of course also to train the machine learning, because it should be able to know: hey, this is a node roll, and this is the expected result.

A bit of an outlook: talking about how we deploy, about tooling, and some key takeaways, hopefully. So, deployment strategy. There's one proven way in which everything can work, and that is of course: you take one core out of service, you drain it, you wait until there is no customer left running any session on it, and then you delete everything, and then you let, for example, the worker node replacement happen, or you just run the software update that you're supposed to run.
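A very rough sketch of how the cordon-and-drain part of that procedure could be scripted. The node names are placeholders, and the session check is only hinted at: in reality it has to look at the subscriber sessions of the network functions, not just at pods.

```python
# Sketch of the "take it out of service first" step: cordon the workers of one
# core instance, wait for subscriber sessions to move away, then drain, so the
# node replacement or software update runs with no traffic on it.
import subprocess
import time


def cordon_node(node: str) -> None:
    # Mark the node unschedulable so nothing new lands on it.
    subprocess.run(["kubectl", "cordon", node], check=True)


def drain_node(node: str) -> None:
    # Evict the remaining pods; daemonsets stay, emptyDir data may be dropped.
    subprocess.run(
        ["kubectl", "drain", node,
         "--ignore-daemonsets", "--delete-emptydir-data", "--timeout=15m"],
        check=True,
    )


def wait_until_sessions_are_gone(poll_seconds: int = 30) -> None:
    # Placeholder: the real check would poll the network function's own session
    # counters (for example via its metrics) until they reach zero.
    time.sleep(poll_seconds)


if __name__ == "__main__":
    nodes = ["core1-worker-0", "core1-worker-1"]  # hypothetical node names
    for node in nodes:
        cordon_node(node)
    wait_until_sessions_are_gone()
    for node in nodes:
        drain_node(node)
    # Now the node roll or software update can run on the drained nodes.
```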
Then you wait a bit, and you profit, because if something goes wrong, there are no customers on it. No one cares; you can always bring it up later once you're sure that it's sane and the config is working and so on and so forth. But the problem is that this takes a lot of time, and if you want to keep up to date, for example with security patches, you're not going to be able to do deployments only once every couple of weeks or so. So of course we really want to be cloud native, not because that is the goal in itself, but because the goal is to be able to fix, for example, security problems fast.

And then, and please take this with some humor, because this is a collection of the worst things that happened in the past, it did not all happen at the same time and it's not to be taken that seriously: you start your rolling upgrade of either infra or application, and then you run into a Linux kernel bug that you never heard about before and that only happens in your specific circumstances. And then maybe you notice that there's some undocumented breaking change in some component, proprietary or open source, and suddenly nothing is working anymore. That's kind of bad. Then you notice that you have a problem with Multus; I believe this is a part where applications are not really cloud native, and let's see if the Kubernetes path will change it. And then you get the call from your boss over the fixed line, because mobile is no longer working. That's a problem. And then you promise that next time it will really work, next time it will be better, and hopefully you're right, because I think that's still the goal we want to reach.

So, one short excursion about tooling: we're really interested right now in Cilium. We're really interested in the enhancement request about multi-network requirements that will make multi-networks a first-class citizen; we already heard about it earlier today. Of course NWDAF. Maybe log shipping; I told you before, this isn't the end of the road there. Weaveworks is working on Helm cluster state drift detection, which is interesting. And in the end, maybe it will turn out we'll have to write our own operator. Maybe it's worth it. Let's see.

So, some key takeaways. Observability is never finished; this was the main topic, right? Let's see how far we will go and where it will bring us. The cloud ecosystem is huge and heterogeneous: we have, I don't know, hundreds of pods, different kinds of software, some proprietary, some open source; some of them have problems, some introduce new problems, some fix old ones, who knows, there is always a big change. And cloud-nativeness is not a goal, it's a tool. We want to have the tools so that we are not forced to get up in the middle of the night to do updates; we want to do them during the day. And we want to be fast, for example with security patches, and this is maybe the right tool for it. Let's see. So, cloud native is really difficult, but desirable; that's where we want to go. And the interface to legacy, including all the paperwork, will not go away, so the only solution is to automate most of the things, all of the things, and that's something we need to tackle as well.

So, thank you for your attention. I'm not sure about the question part, it's going to be tight, but of course we are also up for questions if you have some later.

Okay, I again need to improvise. Thanks, Dorian, thanks, Roman, for the presentation. I obviously have some insights into what these guys are doing and might ask unfair questions. But can you reflect on it? It's obvious how it's going now.
What were the major hurdles that you needed to overcome to get to this stage? So, how did it all begin? Give us a little bit of a retrospective: what did you use before this, and how is this helping? Talking about observability, or in general.

Such a broad question. I joined this project in the middle of last year, and the project was moving so fast that there was little in place: there was alerting, and there was the cluster-local Grafana that everyone was looking at all of the time. That of course means you need to log into one specific Grafana for every cluster. So I think the biggest win was moving everyone to one common sink, where you can see all of the clusters in one dashboard and can determine: hey, I don't even need to look at that one, because it looks perfectly healthy. So I think that was the biggest win on that side. Tooling-wise, I don't think we're at the end of the road; there's still a lot of work ahead, especially in syncing with first line and the other colleagues. Not really sure if that answers the question.

Yeah, definitely. Another one: you mentioned the tracing and the SBI interface, and that's okay, but how about the telco protocol tracing, not within the cluster but end to end? What's the experience, what's the way of working today, and where is it going?

That's the problem of the legacy protocols, of course, SCTP, Diameter and all that stuff, right? So with our 5G SA core, including the 4G part, we are still forced to use pcaps, or tracing in that manner. It's as simple as before: you put something in a switch, and then you just get a copy of the same packets the network function will receive. So still the old way. Hopefully we're getting better. Of course, this is also an effort for the vendor, to implement fancier things which then don't need anything on the network side. Because if you imagine you want to scale it, or you want to bring it to new servers, implementing something at that level can be hard: a technician needs to go out, or do it via remote access or whatever, and needs to configure that. So everything which can be done in an automated fashion is of course the best way to do it.

Thanks, Roman. Last one, or actually it's more of a comment: in Kubernetes, logging to standard out from the container, from the pod, is kind of the expected, no-brainer, normal thing. Was that in your case always like that?

Ah, asking the mean questions. So I will not name a specific vendor, but basically I've been told that if they cannot log directly to some host mount path, they are not able to have any insights anymore. This was some time in the past; we've fully moved to standard out, but we had to push the vendors in this case. And I think it's a good sign that they're being pushed and that they allow themselves to be pushed, and hopefully it will help everyone. We don't want to dictate company-specific solutions; we think there are general solutions that will help everyone, including the vendor, and that's what we're usually pushing for.

And in the slide from Vodafone, we had this hand-in-hand approach, right? That's exactly what we need here. We, as Telekom or as Vodafone, do not go to Ericsson and co. and say: please implement it exactly like that. That is not working. We need to do it hand in hand and move forward, maybe even in a DevOps way, with the infinity loop.
So, yeah, I think that's the goal we need to work towards, and that's also the paradigm shift needed in the heads of all the technical people at the vendors and of the managers there. We need to, yeah.