Hello everyone, welcome to our talk on how observability brings clarity in the 5G world. My name is Praveen George. I'm a senior principal engineer working in the 5G system test team at Affirmed Networks. With me is my colleague Yamini Sridharan, a senior principal engineer leading the platform-as-a-service team at Affirmed. Affirmed Networks is a telco company. As of April 2020, we are a fully owned subsidiary of Microsoft, and from November 1st, 2020 we will be completely integrated into the Microsoft Azure team.

The agenda for today's talk is to first provide an overview of the 5G network, then look at the key observability requirements in 5G and how they are satisfied using the CNCF observability stack, especially for metric collection, logging, and tracing. We will also see demos as we move along.

The telecom network architecture, the various network elements in the network, the protocols they support, the interfaces each network element exposes, and so on are defined by a standardization body called 3GPP. In 5G, the network elements are called network functions (NFs), and they are deployed in different Kubernetes clusters in different geographic locations. Each network function contains at least 10 to 15 different types of microservices.

As I mentioned, in a 5G core network there are various virtual network functions deployed in multiple Kubernetes clusters across different geographic locations. This is a very simplified network diagram of the 5G core. Whenever a user equipment tries to register with the network, or attaches to the network, a series of 3GPP-defined messages are exchanged between various network functions over the 3GPP-defined interfaces before the user equipment can browse data or make a phone call.

This brings us to our first requirement: we should be able to continuously monitor the performance of multiple network functions. Also, as messages are exchanged between different network functions over different network interfaces, it is highly likely that a fault can occur anywhere in the network at any point in time. This brings our next requirement: we should be able to detect problems at a very early stage through some sort of automated alerting mechanism. In a typical legacy network, people would be staring at different GUIs looking for anomalies or alarms; we need a way to automate that process here.

The other requirement is that once a problem is detected, we should be able to troubleshoot and resolve it as early as possible, so that there is minimal impact on customers like us. For that, all the relevant debug information, whether it's stats, logs, or traces, should be readily available in a centralized location such as a dashboard, so that it's available to the developers or support people debugging the issue.

Let's now look at one of the important key features in 5G, called network slicing. With the advent of 5G, the applications that access the 5G network have different requirements.
There are some applications, like gaming, which need high-speed internet connectivity. Then there are IoT-related applications which need a huge number of devices connected to the network, but which may not use the complete bandwidth or need high-speed bandwidth at any point in time. And then there are more mission-critical applications, for example an autonomous driving system, which need high-speed connectivity as well as very high reliability in the network; there should be near-zero latency for those applications. Since it is not practical to provide a separate 5G network for each kind of application, the solution is network slicing: the underlying 5G network is logically sliced, and depending on an application's subscription and other parameters, it is allowed to use the network functions that are part of a particular slice. This brings our next requirement: although we can collect data from different network functions, we should be able to aggregate that data based on the slice from which it is coming.

Let's now see how these key 5G observability requirements are satisfied using the CNCF observability stack. Yamini will walk us through that.

Thank you, Praveen. Hi everyone, I am Yamini Sridharan. I work as a principal software engineer at Affirmed Networks, a Microsoft company. In today's session we will see how the CNCF observability stack brings clarity to the 5G world. As Praveen mentioned earlier, the 5G world consists of different NFs deployed in different geographic locations. The challenges we faced when we set out to bring clarity to the 5G world are monitoring the performance of NFs, fault detection, and end-to-end call tracing. A few CNCF observability components help us resolve these challenges and bring more clarity to the 5G world.

First, monitoring the performance of NFs in the 5G world. Prometheus and Grafana have helped us bring clarity here. In this slide we have a few sample NFs, like the SMF and AMF, and we would like to monitor the performance of the SMF. The Prometheus client library helps us create the gauges, counters, and histograms, and exposes the URI from which the Prometheus server pulls the metrics. These metrics are helpful in monitoring the performance of an NF. For example, suppose we would like to know the number of requests that reached the SMF. Every time a request reaches the SMF, the Prometheus library increments a counter, and the value of that counter tells us how many requests have reached the SMF. So the metric values give us clarity on the performance of the NFs.

Prometheus pulls the metrics and saves them in its time-series database. The alert rules have expressions that combine metric values with arithmetic operations; Prometheus evaluates them at a regular interval and fires alerts to the Alertmanager, which forwards them to an alert receiver, for example Slack or email. Grafana is a popular visualization tool which has data sources and various visualization plugins. We have used a few of the plugins, for example the graph, gauge (speedometer), and table panels, and they give the end user a visual rendering of the metric values, which brings more clarity on the performance of the NFs.
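To make the instrumentation step concrete, here is a minimal sketch in Go, using the Prometheus client library, of how a microservice might register a counter with a slice label and expose it for scraping. The metric name smf_requests_total, the slice value, and the port are illustrative placeholders, not the actual names used in our NFs.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical counter: total requests handled by the SMF,
// labelled by the network slice the request belongs to.
var requestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "smf_requests_total", // illustrative metric name
		Help: "Number of requests that reached the SMF.",
	},
	[]string{"slice"},
)

func handler(w http.ResponseWriter, r *http.Request) {
	// Increment the counter for the slice this request belongs to;
	// "slice-a" stands in for a real slice identifier.
	requestsTotal.WithLabelValues("slice-a").Inc()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/process", handler)
	// Expose the URI that the Prometheus server scrapes.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```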
Let's see a demo of how we use Prometheus and Grafana to bring more clarity in monitoring the performance of NFs in the 5G world. Let's look into the Prometheus server and how it brings that clarity. This particular metric is from the AMF-CC microservice, a 5G-related microservice, and the metric carries the microservice name, AMF-CC, as well as the message transfer request. So this metric gives us clarity on how many message transfer requests were received on this slice. As Praveen mentioned earlier in his slide about network slicing, the slice label here corresponds to the network slice; per this metric, we have received this many message transfer requests on this particular slice. Like this, we have many metrics pulled from each container of the 5G deployment and saved in the time-series database by Prometheus, available for aggregation, alert rules, and any other arithmetic expressions.

A recording rule is a rule whose expression can be any arithmetic expression that PromQL supports. For example, here we take the sum of a particular metric without considering one of its labels. Once we define a recording rule in Prometheus, the record name itself becomes a metric and is stored in the time-series database. When we click on it, this metric that was a recording rule appears as a metric name with the corresponding per-slice labels and values, showing how many events occurred on each slice. In this manner we get clarity on the performance of an NF in the 5G world.

We also have alert rules. An alert rule has an alert name and an expression. The expression uses the metric name or record name in a customized expression written per the business logic; it is evaluated every minute per the 'for' duration here, with label matching, and the description and summary are displayed in the receiver configured in the Alertmanager. If the expression evaluates to true, Prometheus starts firing alerts to the Alertmanager, and the Alertmanager forwards the alerts to the configured receiver; the metric name used in the full expression yields the value that is checked against the expression's threshold. This is an example of an alert rule, and because of this alerting system, if the performance of an NF is going down or behaving unexpectedly, we get notified and can go look for more clarity in Prometheus or in Grafana.
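As a rough sketch, a per-slice recording rule and an alert rule like the ones shown in the demo might look like this; the metric name, record name, threshold, and labels are illustrative, continuing the hypothetical smf_requests_total counter from the earlier sketch.

```yaml
groups:
  - name: slice-aggregation
    rules:
      # Recording rule: aggregate the per-pod request rate across all
      # labels except "slice"; the record name then behaves like a
      # normal metric stored in the time-series database.
      - record: slice:smf_request_rate:sum
        expr: sum without (pod, instance) (rate(smf_requests_total[5m]))

  - name: slice-alerts
    rules:
      - alert: SliceRequestRateLow
        # Made-up threshold: fire when a slice's request rate drops
        # below 1 request/second and stays there for a minute.
        expr: slice:smf_request_rate:sum < 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Low request rate on slice {{ $labels.slice }}"
          description: "SMF request rate for this slice dropped below the expected level."
```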
Let's look at the Grafana instance that corresponds to this Prometheus. We have configured this Grafana with multiple dashboards, one per container and microservice, along with the data sources. Grafana supports many data sources by default, but we use Prometheus. The Prometheus data source is configured with an auto-provisioning method using the Grafana API, and the dashboards, one for each microservice, are provisioned in Grafana the same way; for both Grafana dashboards and data sources we do automatic provisioning through the Grafana API.

Take the Grafana master dashboard, for example. By viewing the dashboard we come to know the metric values and, in a visual way, how successful things are. We achieve this by writing queries: once we select the data source, the query is a PromQL query which queries Prometheus; the metric name mentioned here exists in Prometheus, and the panel shows the value according to what is available there. For more clarity, this particular dashboard also has Grafana tables as well as more queries which give more clarity on at what time and in what phase the particular metric values occurred. For example, wherever we place the cursor on a graph, we see the exact metric value at that point, which gives us more clarity and a way of visualizing the metrics.

Now we move to the next challenge: centralized logging. The SMF and AMF are 5G NFs which can go down or develop a fault. The NFs emit structured, actionable logs to standard out and standard error, and Fluentd, the forwarder, collects these outputs and forwards the logs to Elasticsearch, which stores them. ElastAlert is an alerting component which has alerting rules configured with receivers; the receivers can be Slack, email, or a webhook. Each ElastAlert rule is evaluated against the structured, actionable logs, and when the rule's condition is satisfied for particular events, the alerts are forwarded to the receivers. If we want more clarity in the logs, Kibana is a user-friendly UI that provides it. Let's see a quick demo of Kibana.

Now let's look at Kibana, which gives more clarity in visualizing the logs for fault detection. The structured, actionable logs are saved in Elasticsearch, and ElastAlert, which holds the rules, queries Elasticsearch; when a rule matches a log, it starts forwarding alerts to the ElastAlert receiver, and Kibana shows clearly which log produced the actionable event. These are a few logs from our NFs. The logs are in JSON format, and once we expand one, we see every label and value and the severity level. Kibana gives us more flexibility through indexes: we can create index patterns and use our own filters. These filters are very useful when searching for a particular value or label in a huge number of logs, and these are the already configured available fields.

Now let's look at the end-to-end call tracing challenge we had to solve to bring clarity to 5G. For example, we establish a call in the 5G world and need to debug the reason for a call drop in order to rectify the issue. The same SMF and AMF are the 5G NFs into which we incorporate OpenTracing. OpenTracing creates the traces and spans; these traces and spans are collected by the Jaeger agent and forwarded to the Elasticsearch database, Elasticsearch stores them, and the Jaeger UI displays the traces and spans for more clarity.
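Going back to the Grafana auto-provisioning step mentioned above, a minimal sketch of creating the Prometheus data source through the Grafana HTTP API might look like the following Go snippet; the Grafana URL, credentials, and service names are placeholders for whatever the deployment actually uses. Dashboards can be pushed the same way via the dashboards API endpoint.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Data source definition sent to Grafana's HTTP API;
	// names and URLs are illustrative placeholders.
	payload := []byte(`{
		"name":      "Prometheus",
		"type":      "prometheus",
		"url":       "http://prometheus-server:9090",
		"access":    "proxy",
		"isDefault": true
	}`)

	req, err := http.NewRequest("POST",
		"http://grafana:3000/api/datasources", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.SetBasicAuth("admin", "admin") // placeholder credentials

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("data source provisioning status:", resp.Status)
}
```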
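Similarly, an ElastAlert rule of the kind described above might look roughly like this sketch; the index name, query fields, event threshold, and webhook URL are all illustrative.

```yaml
# Illustrative ElastAlert rule: fire when error-level SMF logs
# appear too frequently in the logs Fluentd forwarded to Elasticsearch.
name: smf-error-burst
type: frequency        # alert when num_events occur within timeframe
index: fluentd-*       # index written by the Fluentd forwarder
num_events: 5
timeframe:
  minutes: 1
filter:
  - query:
      query_string:
        query: "severity:ERROR AND microservice:smf"
alert:
  - slack              # receivers can also be email or a webhook
slack_webhook_url: "https://hooks.slack.com/services/placeholder"
```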
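And for the tracing pipeline just described, here is a minimal sketch of instrumenting a microservice with OpenTracing and the Jaeger Go client; the service name, agent address, and span names are assumptions for illustration, not our actual instrumentation.

```go
package main

import (
	"log"

	"github.com/opentracing/opentracing-go"
	"github.com/uber/jaeger-client-go/config"
)

func main() {
	cfg := &config.Configuration{
		ServiceName: "amf-cc", // illustrative microservice name
		Sampler: &config.SamplerConfig{
			Type:  "const",
			Param: 1, // sample every request (fine for a demo)
		},
		Reporter: &config.ReporterConfig{
			// Spans are sent to the Jaeger agent, which forwards
			// them to the storage backend (Elasticsearch here).
			LocalAgentHostPort: "jaeger-agent:6831",
		},
	}
	tracer, closer, err := cfg.NewTracer()
	if err != nil {
		log.Fatal(err)
	}
	defer closer.Close()
	opentracing.SetGlobalTracer(tracer)

	// Root span for a 3GPP procedure; child spans created from its
	// context appear nested under it in the Jaeger UI.
	span := tracer.StartSpan("pdu-session-establishment")
	child := tracer.StartSpan("discover-smf",
		opentracing.ChildOf(span.Context()))
	// Span logs capture internal events, similar to the Kibana logs.
	child.LogKV("event", "nssf-query", "status", "ok")
	child.Finish()
	span.Finish()
}
```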
A demo of Jaeger will be provided by my colleague Praveen. Thank you.

Before I start with the end-to-end tracing demo, let me ask you one thing: how many of you have faced issues with your mobile network connection? It could be different issues: we are not able to browse the internet, or our friends or relatives have complained that they were trying to reach us but the calls were going to voicemail. We have all faced those kinds of issues, and it's common that we would call the customer support center to raise a ticket for them to look into what is going on with our connection.

In a legacy network, if you raise a ticket with the customer support center, they collect logs from different nodes, look into the logs, and see that node 1 is healthy; then they look into the next set of logs for node 2 and see that node 2 is healthy; then they go to the next set of logs or debug information from node 3, and eventually they identify the problem. This means the debugging process is difficult and takes time, which ultimately impacts customers like us. So we are trying to see how we can use Jaeger tracing to make the debugging process easier.

Whenever a customer calls the customer support center and raises a ticket, each user equipment has a unique identifier, and using that identifier we can start Jaeger tracing in the network. Here, the access network terminates on the AMF, so on the AMF we can enable Jaeger tracing using the command-line interface. The requirement is that we should be able to trace messages exchanged across all the network functions in the network: we should be able to capture the traces towards the NSSF, NRF, AUSF, UDM, and SMF across the network. Also, each network function has at least 10 to 15 different microservice types, and we should be able to trace the messages exchanged between the different microservice types as well.

Now let's see how we can utilize Jaeger tracing to debug a problem reported to the customer support center. In the interest of saving time, I already ran a call at the back end and collected the trace. There is one trace for a 3GPP-defined procedure called the PDU session establishment procedure, and it has 37 spans. Let's look into the details of this trace. I mentioned before that the traces are started at the AMF, specifically at the AMF's first microservice; when the message is passed from there to other microservices within an NF, or across NFs, all those traces appear as child spans. The first child span is a message towards AMF call control, and from AMF call control the message is sent to the NSSF, a separate NF, for SMF discovery; once the discovery happens, the message is sent to the SMF NF itself.

On any span we can capture the various internal events in the log section; these are similar to the logs we see in Kibana and give more insight into what is happening whenever a message is received at a particular microservice. The trace also captures the duration, the delay or latency, associated with each event happening within an NF. And as I said, I am simulating a problem where a user has raised a ticket: in this PDU session establishment procedure, the last child span is for discovering a PCF.
Here, as we can see, there is an error discovering a PCF from the NRF: there is no PCF configured. This is why calls are failing for the reporting customer. The failure could be due to different reasons; it could be some configuration missing, or it could be that the PCF was not dynamically registered with the NRF. But at least we got a starting point as to what is actually going wrong in the network. Now we can browse the data in Kibana, look into more specific details, and then take corrective actions.

Another important aspect of Jaeger tracing, as I mentioned before, is that it helps us understand the latency or the bottlenecks in the network, especially how long each network function takes to process a particular message. 5G is all about low latency, right? So we can see where the bottleneck is and then address that problem as well.

That concludes our demo and our talk. We hope you all enjoyed it, and thanks for your time.