Hi guys, my name is Falk Bednarczyk. I'm a senior engineer at Software Mind, together with Jakub. Hello, my name is Jakub, and I'm also a system engineer at Software Mind.

On today's agenda: we will talk about what observability is, then we'll go briefly through the 5G network architecture, and then we'll try to combine the 5G network and observability based on our lab experience. Then we'll go through a small review of observability tools, and then we'll show you a demo combining metrics, logs, and tracing as observability itself, based on our lab and the observability stack that we use. At the end, if we have enough time, we'll get to some conclusions and a question and answer section.

Okay, so, introduction to observability. In general, when we have monolithic applications, we use alerts to trigger actions whenever our involvement is needed from the operations perspective. So let's imagine that we have a monolithic application and we introduce some alerts. It means that we introduce some state logic: the application is up, it's down, it performs above or below some threshold. We are triggered in a yes/no fashion: we are informed that something goes wrong, and someone has to check what's inside. So if we have an application, we log into some virtual machine and check the logs, and mostly that's how it works. We use SNMP, text messages, or email as channels to inform operators that something went wrong. This is the common approach from the old days.

Another level of monitoring is application performance monitoring. We instrument our application with some metrics, those metrics can be collected by external tools, and those tools can present the metrics on dashboards. We can use those dashboards to extend our visibility and understanding of the system, so we can track performance metrics, detect errors, and even spot suspicious behavior from users. Here we are not informed that something went wrong and that some operational action has to be taken; we just see, in an easy way, what the traffic looks like, what the user behavior is, and so on, just by checking metrics and dashboards. And what's cool, we can model the user experience with data, so we can also present it later on a graph or something.

APM and monitoring with simple alerts are cool when you have only a monolithic application and no scattered environment. The game changer is microservices. The microservice concept brings additional flexibility, but it also makes monitoring such applications more complex. Just imagine that you need to build your application from small bricks called microservices. Each one is a small piece of application logic, very limited from the business perspective, that you combine into a mesh of such microservices.
They are connected through some API, in general it could be a REST API, and the mesh of communication and interaction between those microservices has to be tracked somehow. This brings additional monitoring complexity for scattered infrastructure and scattered environments like microservices, and this is why we would like to present observability to you.

There are three pillars of observability. First of all, metrics, known from APM: aggregatable data measured over time that you can simply collect, analyze, visualize on a dashboard, and so on. Then logging: a log is an event that occurs in the application, so we can understand what happens in the application, check logs by timestamp, and validate what the state of the application was during some period of time. And something new, something that makes observability complete, is tracing. The fact that our system is built from microservices simply forces us to monitor the application as a whole, from the end user's perspective. Because the application is built as a block of microservices, all interaction between them goes as a chain of interactions, which means that the whole delay, the whole response time, is the sum of the latencies of each microservice separately.

And when we talk about microservices in the scope of the 5G network: the service-based architecture philosophy brings us into the microservice world. I mean that when the 5G network was introduced, the idea was that SBA provides a modular framework that gives us very small business functions to build our 5G network from. So it is, simplified, a microservice world, right?
So we can build small applications that are called functions. Those functions are connected to each other by a service mesh, by interactions, and for 5G, in the core network, a REST API was defined for this communication channel.

It's easy to detect problems if you have just one instance. But just imagine that you have not a few but hundreds of containers running in your setup. Each container produces a lot of data as logs, and each container is part of a chain, talking to different containers, to different microservices, at the same time. This produces a lot of valuable data, but it also makes it complex to collect the data, to visualize it, and to prepare it so that it is helpful for you as an operator. I mean that if you lose visibility of those logs, you may simply lose information, which can get you into trouble: you miss some bottlenecks, security concerns, or service breakups. This is where observability kicks in, and hopefully we will be able to show you in a live demo how it more or less works.

But now let's quickly move on to our lab setup, where we deployed a whole observability stack. Firstly, we need to say that if you think about observability, there are plenty of tools that can do almost the same job.

Okay, so firstly, we deployed our setup on... there should be three nodes. Yeah, a three-node Proxmox cluster where we deployed our machines. Everything is managed in the IaC philosophy, so infrastructure as code: we deployed every machine with Terraform combined with Packer, and everything is configured via Ansible. We deployed the whole stack on Kubernetes, in Kubernetes namespaces, and what's cool about this setup is that we have separate namespaces for observability, monitoring, or logging. We can duplicate these namespaces and sets of pods to other clusters, just to connect them to the observability stack.

So now let's quickly move to the observability stack. As Kuba said, if we asked each of you what observability tools you know, you could name different tools and you would all be right, because there are many tools available on the market. However, at Software Mind we focus on open source projects. So today we're going to present just one family of observability tools, let's say the one around Grafana. We stick to the Grafana stack mostly because it's well integrated with itself.

Firstly, we start with Grafana, which is a tool to visualize our gathered data. What's important to note is that Grafana does not act as a data source. We need some component that gathers the data; Grafana only visualizes it. So we use Prometheus as a data source. Prometheus gathers the metrics from all the applications in our Kubernetes cluster via HTTP endpoints. Kuba, would you mind explaining how, more or less, Prometheus and the applications work together?

Yeah, so all our applications expose an HTTP endpoint where Prometheus requests the metrics, and then it remote-writes them to Thanos. So the application has to be instrumented with Prometheus metrics, to be ready to deliver those metrics to Prometheus. And what's interesting, you are using Thanos here.
Yes, Thanos, because Prometheus is a very good data source and it gathers the metrics very well, but it is not well optimized for long-term data. If we want to analyze data from a longer period of time, we need something that stores it properly. So Thanos is a solution for long-term data storage. And what's cool about Thanos is that it uses the same query language as Prometheus. That's because Thanos builds on Prometheus, so we can query Prometheus as well as Thanos with the same queries.

Okay, so we have metrics. Now let's add logs to our system. For that we use Grafana Loki, a log aggregation system that gathers logs from all the pods. How it works is that our application prints its logs to standard output, and Kubernetes gathers those logs and sends them to Loki. Yeah, through an agent; it can be Fluentd or Promtail or anything. So it's a centralized place where we can simply verify all logs from all applications and containers. Just imagine that instead of going to Kubernetes for the logs of each container separately, you can search the logs in one place and check if something goes wrong. Exactly, as you can see now: Loki shows, let's say, container logs, right?

So we have logs and metrics; what's lacking are traces. For tracing we use Grafana Tempo. Grafana Tempo is a distributed tracing system, and it relies on the OpenTelemetry protocol, so our applications should be instrumented with OpenTelemetry protocol support. It gathers all the requests and sends them to the OpenTelemetry Collector, and the OpenTelemetry Collector then sends the collected traces, in the form of chunks of spans and correlated traces, to some exporters. As the exporter target we use Grafana Tempo, which allows us to present them beautifully.
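To make the idea of spans grouped into traces concrete, here is a minimal Python sketch. It assumes each span is reduced to a trace ID, a service name, and the time spent inside that service. The `Span` class, the `summarize` function, and the sample numbers are hypothetical; they are not taken from the lab or from Tempo's data model, where spans are nested and carry start and end timestamps.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str   # shared by every span that belongs to one request
    service: str    # network function that handled this hop
    self_ms: int    # time spent inside this service only (simplification)


def summarize(spans):
    """Group spans by trace and return {trace_id: (total_ms, slowest_service)}."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span.trace_id].append(span)

    summary = {}
    for trace_id, group in by_trace.items():
        total_ms = sum(span.self_ms for span in group)
        slowest = max(group, key=lambda span: span.self_ms)
        summary[trace_id] = (total_ms, slowest.service)
    return summary


# Made-up sample data: two requests passing through different chains.
spans = [
    Span("t1", "amf", 4),
    Span("t1", "ausf", 6),
    Span("t1", "udm", 9),
    Span("t1", "udr", 21),
    Span("t2", "amf", 3),
    Span("t2", "ausf", 5),
]

for trace_id, (total_ms, slowest) in summarize(spans).items():
    print(f"{trace_id}: total {total_ms} ms, slowest hop: {slowest}")
```

The grouping key is the trace ID, which is why spans that do not share one cannot be combined into a single end-to-end view.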
You can even see the whole execution time of a single request, and the time of a single request is the sum of the latencies of each microservice here. So you can even track which microservice took most of the request handling time. This is one of the crucial things that makes observability so important in the case of distributed systems.

Next, as we have metrics, tracing, and logging, we can now catch the context of the whole problem or situation in our deployments. So we have all three pillars of observability. What's important, because I probably missed it due to stress, is context. Having logs, having metrics, and having distributed tracing, you can catch the context of what happened in the network at a particular time, so we can detect issues, troubleshoot bottlenecks, or at least understand what our application, our system, is doing in the distributed or microservice world.

One important thing which is also deployed in our cluster is a service mesh, done by Istio. Why do we need a service mesh here, in 5G? Because we faced some problems with implementing observability tools: we figured out that our product does not support the OpenTelemetry protocol by itself. As a workaround we applied Istio as a service mesh. A service mesh is an extra layer, I would say: between your application and the outside world we put an additional network layer, which is Istio. First, we can control what gets into and out of the application. Second, this extra layer gives us additional tools with which we can monitor and track what happens in the network.
It means that since Istio is aware of all requests that come into and go out of the containers, we can track them. Istio also has the ability to equip our application with OpenTelemetry data and send it onward, so we can at least track some span correlation between the microservices. So it's not only a security enhancement, but also a useful observability tool.

Okay, so as we've set sail into the broad observability world, we wanted to experience as much of it as possible, so we also deployed Jaeger in our setup to test it and compare it with Grafana Tempo. I guess this is the whole stack, so let's go to a short demo. Firstly, we'd like to show you Kiali, which beautifully visualizes our service mesh and the applications deployed.

Okay guys, so Kiali is a web interface for Istio; probably some of you already know it. What's cool is that, based on the flows between microservices, you can track and even understand the architecture of your system. You probably don't remember it, but the architecture was already displayed a few slides before. As you can see, each service here is connected to the SCP node, the SCP function, and this function simply works as an extra connection proxy towards the other microservices. Next, if you click on a service, you can see some additional information from the traffic perspective, or if you click on an application, you also see the connections from that application to each other microservice. You can also track some extra metrics and so on. So this is an additional tool that gives you an overview of the system, and this one also uses Jaeger.
And those traces, produced by OpenTelemetry, are sent to our Grafana.

All right guys, I think we can continue. We'd like to present a scenario regarding UE registration. We have a simulator running that is trying to register to our control plane, so we'll see the interaction with AMF. AMF, getting a request from the UE, will ask AUSF for authentication. The authentication process looks like this: in the chain we have the Unified Data Management function (UDM). UDM will ask the Unified Data Repository (UDR), where the subscriber data are stored, to get the data. Then AUSF will respond to AMF, and AMF will respond to the UE. So, having a trace, hopefully we should be able to show you this flow.

This is, I would say, the network flow, more or less simplified, without the SCP in the path. We see the authentication: AUSF will ask UDM to generate auth data, then UDM will query UDR and get a response, and then AUSF should respond. If everything goes right, the 5G AKA confirmation endpoint will be reached to exchange the keys and complete the secure connection. After this procedure, authentication should be successful and we can proceed with the other registration steps.

So guys, we have already collected a trace for you. If we look at a connection here, we can see that AMF is calling AUSF. Can you see it well? It's calling the UE authentication endpoint, and the response was a 201 status code. Then we can see that AMF was triggered here as well. And what it was asking from UDM, a 200 here, is for security information, generate auth data. We should also see that UDM asks UDR, to continue.

One thing that you may notice is that we have only the span context available. We don't have a correlation context.
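For readers unfamiliar with the term: the context that links spans into one trace is usually passed from service to service in a request header. The sketch below assumes the W3C Trace Context `traceparent` header format; the talk does not say which propagation format the lab uses, and the `child_traceparent` helper is hypothetical. It shows what an application has to do for its outgoing call to stay in the same trace as the incoming one: keep the trace ID and replace only the span ID.

```python
import re
import secrets

# W3C Trace Context, version 00: "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>"
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_traceparent(incoming):
    """Return (trace_id, traceparent header value for the outgoing request).

    If the incoming header is missing or malformed, a new trace is started,
    which is exactly the case where spans can no longer be glued together.
    """
    match = TRACEPARENT_RE.match(incoming or "")
    if match:
        trace_id, _parent_id, flags = match.groups()
    else:
        trace_id, flags = secrets.token_hex(16), "01"

    span_id = secrets.token_hex(8)  # new ID for this service's own span
    return trace_id, f"00-{trace_id}-{span_id}-{flags}"


# Example header value taken from the W3C Trace Context specification.
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
trace_id, outgoing = child_traceparent(incoming)
print("same trace:", trace_id)
print("outgoing header:", outgoing)
```

A sidecar proxy can create spans for the requests it sees, but it cannot know which outgoing request belongs to which incoming one unless the application forwards this context itself.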
This is crucial, and it is one of the, let's say, dark sides of having only the Istio service mesh, with OpenTelemetry support coming only from there. If an application is written from scratch, developers should simply include OpenTelemetry as well, so we could have a correlation context from the beginning, right?

Okay, but could you tell us what a correlation context is?

It's some metadata that glues our distributed trace together. As was said, a trace is a sum of spans that you can follow together in a single trace. Without the correlation context, we need to be more aware ourselves that this trace depends on this one, and this one depends on that one, and so on. So it makes our life a little bit more complicated, but we can still see those chains of microservices and the latencies summed up in the response.

One more thing we'd like to cover here is a scenario where we have a dashboard armed with some alerts. We are simply catching 404 status codes from the HTTP traffic of our microservices here, and if a 404 status code appears, we trigger an alert. One of the possible scenarios is that you do not have the user in the database, and some agent that should not be registered at all tries to register to our 5G network. Okay, so let's proceed.

Right now we are tracking no errors, no 404 errors at all. The green heart means the alert is okay; the alert is not triggered. In a moment we should see a 404 occur, just because our user is not registered in our database and the flow is a little bit different. As you can see, right now the number of errors is going up, so we have one occurrence. Yeah, and now the alarm is in the pending state. Okay, now it's triggered. So it was in the pending state, and then it is triggered, which means that something went wrong.
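The pending-then-triggered behavior seen in the demo can be sketched as a small state machine. This is a simplified model, not Grafana's implementation: it assumes the rule is evaluated at fixed intervals and must stay breached for a number of consecutive evaluations before it fires, whereas Grafana works with time durations and has further states. The `alert_states` function and its parameters are hypothetical.

```python
def alert_states(error_counts, threshold=0, pending_evals=2):
    """Return the alert state after each evaluation of the 404 counter.

    The alert is 'normal' while the count is at or below the threshold,
    'pending' for the first `pending_evals` consecutive breaches, and
    'firing' once the breach has lasted longer than that.
    """
    states = []
    consecutive_breaches = 0
    for count in error_counts:
        if count > threshold:
            consecutive_breaches += 1
            if consecutive_breaches > pending_evals:
                states.append("firing")
            else:
                states.append("pending")
        else:
            consecutive_breaches = 0  # condition cleared, start over
            states.append("normal")
    return states


# Made-up evaluation results: number of 404 responses seen per interval.
print(alert_states([0, 0, 1, 1, 1, 0]))
```

The pending phase is a design choice that trades a short notification delay for fewer false alarms from one-off errors.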
So you should get an alert through a channel defined in Grafana, so it could be, you know... And so we know that we need to get to work and try to figure out what happened. We can use Explore here; we are using OTLP here, last five minutes. So I'm looking for traces with a 404.

What also needs to be said is that we prepared these views and queries beforehand, but these queries are really easy to build. We just don't want to add the extra latency of searching for them; that's why you have those. As we can see, UDM was asking UDR for some data. Let's see what happened. The status code is 404, and we can see that there's an IMSI in the request for subscription data, and that the result was unsuccessful: not found.

So we can go to Grafana Loki now. Since we know that our application is UDR, let's select UDR. We need to change the time too, to the last five minutes. Okay, so we have UDR, right? This one just visualizes the volume of all the logs. As you can see, we've got some more logs here, the density increased, so let's check the logs here. And here we go: a warning, cannot find the subscriber permanent identifier in the database.

So this is one of the possibilities for combining your metrics, your distributed tracing, and the logs from the logging system to figure out what happened and troubleshoot your microservice world. We can really enhance and speed up the process of finding the solution to our problem, because now we know that it's UDR that caused the problem, not any other network function.

Well, this is a simple example, just to try to somehow convince you that with such a set of tools you can simply know more about your applications in the 5G core. Last but not least, references.
So if you get interested about those Those are an article that we were using to build our You know presentation and to be prepared for our meetings today. All right. Thank you for now