So, let's get started. This talk is about how we can improve monitoring and observability for Kubernetes using open source tools. My name is Nilesh. I've been an enterprise architect for one of the leading banks here in Singapore, and I'm also what they call a Microsoft Most Valuable Professional, or MVP, for the last six years. Recently, for my contributions to the Docker community, I was also awarded the Docker Captain recognition. You can find me on various social media, I've put the links here, along with a link to my YouTube channel where I talk about Kubernetes and various cloud native topics.

Let's get started with today's talk. Has anybody seen this Kubernetes iceberg picture before? If you're seeing it for the first time, and maybe you can't see it clearly at the back, what it shows is the different layers of tools, technologies and knowledge needed when you start working with Kubernetes and want to make it production grade. Starting from the top, it's very easy to create a deployment, get your services up and running, have multiple replicas, add a ConfigMap. But as you go down the layers, the complexity keeps increasing. It becomes quite challenging even for an experienced Kubernetes administrator or developer to manage things like stateful services, custom resource definitions and other advanced topics. That is the challenge most companies run into when they start running Kubernetes in production.

So how do we handle this? One way is to look at the guidance CNCF provides. These are some of the tools and technologies they have published on what they call the Cloud Native Trail Map. If you are new to CNCF, these are projects that have matured, and you can start from them depending on what you need: containerization, CI/CD, observability, then service discovery, networking and so on. It's a good way to start. They also publish the CNCF landscape. What I've shown here is just the slice of the landscape related to observability. If you go to the CNCF landscape site (landscape.cncf.io), you'll find it is quite wide: it covers databases and many other categories, and lists the projects available in each of them. Today we'll focus on the observability landscape.

Some time back, and this is almost four years old now, there was also a CNCF observability radar. I haven't found an updated version, but if you have heard of the Technology Radar published by ThoughtWorks, this is along similar lines: CNCF gives its view on what you can adopt, what you should trial and what you should assess, based on the maturity and state of each project. Obviously, over the last four years the projects listed here have moved through different phases, and some of them would have gone from assess to trial, or from trial to adopt. We also heard about this in the morning sessions: the three pillars of observability, logs, metrics and traces. That is what we will see in this session as well, how we can use open source tools to bring these three things to Kubernetes.
Let's start with centralized logging and try to understand why we need it. When we build cloud native applications, there are several areas where centralized logging becomes necessary. Let's focus first on container-based applications. We might need application-specific logs: I come from a banking background, and in many industries we need to retain logs for a period of time, six, seven, even ten years depending on the regulator. With the default logging mechanism, the logs are not available for that long, so we need a way to retain them for longer. That is one aspect.

The other one is workloads being rescheduled onto different nodes during application restarts and updates. Traditionally, with, say, three-tier applications, it was quite easy to collect the logs and store them in one central place, or take a backup and archive them somewhere. But cloud native applications are very dynamic in nature. We might enable autoscaling; we saw in the previous demo how Knative was used to dynamically scale the number of instances. In such cases, if for some reason the workload gets scheduled onto a different node, there is a potential that you lose the logs from the node where your application was running; they might not be there after a certain period of time. For the Kubernetes infrastructure itself, some cloud providers don't do an in-place upgrade: they replace your node with a new one, drain the old node and shift the workload over. Again, if the old node is gone, your logs are gone. And if we are talking about PaaS services, not just Kubernetes or container based, we also lose control over where and how the logs are stored. When we use infrastructure as a service or platform as a service from a cloud provider, we have very little control over where those logs live; we might just get a service that says "you can fetch your logs from here", without the finer control we have in our own data centers. These are some of the reasons why we need centralized logging.

If you've heard of the set of recommendations called the twelve-factor app, centralized logging is one of the recommended practices there. It has since been extended into the fifteen-factor app, but what it essentially says is that you should treat your logs as an event stream: instead of storing them physically on a machine, you stream them out, and your infrastructure components store them in a centralized location.

In this demo I'm going to use a financial services application with five different microservices: a backend service, an account service, an authentication service, a forex service and a transaction service. All these services write out their logs, which Loki, an open source log aggregation project, collects, and I'm using the OpenTelemetry Collector as the pipeline that ships this telemetry to the backends. To visualize the logs, I'll be using Grafana. So let's go into a demo of this and see how it works.
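Before switching over, here is a minimal sketch of what the collector side of such a pipeline can look like, assuming an OpenTelemetry Collector with an OTLP receiver, the contrib Loki exporter for logs and the Prometheus remote-write exporter for metrics; the service names and endpoints are illustrative assumptions, not the exact configuration used in the demo.

```yaml
# otel-collector-config.yaml -- hedged sketch, not the demo's exact config
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}

exporters:
  # Push log records to Loki's HTTP push API (endpoint is an assumption)
  loki:
    endpoint: http://loki-gateway.monitoring.svc:3100/loki/api/v1/push
  # Push metrics to Prometheus via remote write (endpoint is an assumption)
  prometheusremotewrite:
    endpoint: http://prometheus.monitoring.svc:9090/api/v1/write

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [loki]
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```

With that picture in mind, here in the demo I have Postman, which I will use to generate some workload.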
So I have this backend service, which acts as an entry point for all the other microservices and makes calls to the dependent services. If you have used Postman, one feature I like is collections, which I can use to generate different kinds of requests: POST requests, GET calls, and I can even call the Prometheus endpoints here. I can also say how many iterations to run and specify a delay between requests. Here I'm going to run this for 500 iterations. You can see the counter going up here, along with the responses. This particular collection has calls that go through as normal calls, and there is also a failing call to simulate errors. So you can see most of the responses are 200, but there is also a 404 error thrown by one of the API calls.

Now, if I go into the Grafana dashboard here, I have a connection to Loki, and Loki is the log aggregator: it collects the logs from all these services and makes them available to Grafana, which I use to visualize them. Here you can see that I have a filter for the backend service; the label associated with that service is backend-service-deployment. Below, you can see a heat map of how the logs are generated. I have been testing this since morning, just populating some logs, and you can also see the log lines at the bottom. If we want to look at a specific log entry, we can drill down into it. For example, somewhere here you can find an entry saying "get customer details from account service called", with start and end information. This is all real time. What this means is that I don't have to go into individual nodes or look at individual pods; all I have to do is configure Loki so that it collects logs from all of my services and exposes them to Grafana. That is the application-specific side. I can do the same for the infrastructure: in here I can build another query, and in the labels I can look for a node. I've got a three-node Kubernetes cluster here, so I can pick one of the nodes and get the logs coming from that particular node. So with this one component, I get logs from both the application side and the infrastructure side.

The next thing we are going to look at is metrics. Again, why do we need metrics? Metrics help us collect data over time at an aggregated level. Logs are good for knowing what happened at the level of an individual transaction; metrics are useful for knowing how resources are being used over a period of time. From an application point of view, we might use metrics to track resource usage, to understand scaling needs, whether the application needs to scale and whether we should enable autoscaling, and to monitor any anomalies or outliers. From a Kubernetes point of view, we can look at metrics for CPU and RAM usage, the health of the APIs, and again for autoscaling, which can happen both at the pod or application level and at the cluster level. These metrics help us identify when we need those kinds of features. The same goes for PaaS services: we want to monitor resource usage, see the scaling needs, and if there are any bottlenecks, metrics help us identify them.
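To make the pod-level autoscaling point concrete, here is a hedged sketch of a HorizontalPodAutoscaler that scales one deployment on CPU utilization; the deployment name and thresholds are illustrative assumptions, not taken from the demo.

```yaml
# Hedged sketch: scale a (hypothetical) backend-service deployment on CPU usage
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-service        # assumption: the deployment's name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # scale out when average CPU crosses 70%
```

The same resource metrics that drive a dashboard can drive a decision like this one, which is why collecting them matters even before you turn autoscaling on.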
So for the metrics demo, again, I'm going to use Prometheus and Grafana with the same application, and I use what is called the kube-prometheus-stack. If you have used Helm charts to deploy third-party applications, I find this one very handy: kube-prometheus-stack bundles Prometheus, Grafana, Alertmanager and the related components into one Helm chart, and it sets up everything for us, including some default dashboards. For the application-specific services, the five microservices I have here, I use a custom resource definition, or CRD, called a ServiceMonitor to collect the metrics. The ServiceMonitor collects the metrics in the Prometheus format, and then, using the OpenTelemetry Collector, they are pushed into the metrics store, which is Prometheus. And again, we use Grafana to visualize them.

So let's switch back to demo mode and see this in action. Here is an example of a resource-level dashboard, which again comes by default with kube-prometheus-stack. I'm looking at compute resources at the cluster level. There are various other dashboards that come with it, but all of this is provided out of the box: I can see what my CPU utilization is, what the requests are, what the limits are. This can be split across different sources as well; in this case I'm only using Prometheus as the source, but I can add multiple sources, and all of this is customizable. All the widgets here are based on data coming from Prometheus metrics, so I can go and edit these visualizations, use the Prometheus query language, PromQL, to specify which query to use, and change the visualization type. This is something I find very handy: to get started with monitoring the cluster state, you don't have to spend much time, you can use the built-in visualizations and community-created dashboards and get going very quickly. If you want to know which other dashboards are there, you can go into the dashboards section, and you can see dashboards for Alertmanager, CoreDNS, the API server, networking and so on. All these different aspects of our Kubernetes cluster can be monitored very easily using these dashboards.

I'm also using a dashboard provided by Istio for application-specific metrics. Here, again, if I generate the workload, we can see in real time the duration of my HTTP requests, the throughput, and how the different processes are consuming memory and threads, all visualized in real time using these dashboards. Think of trying to do this manually without something like this; if you were to do it with legacy technologies, I'm pretty sure it would take quite a long time to come up with anything comparable. So that's the metrics side.

To integrate these metrics, again, I'm using OpenTelemetry, which has support for different languages. My application is built using Spring Boot, which is a Java-based framework, and here you can see that it can automatically instrument Java applications. So I don't have to make changes to the code for this to get started.
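Two of the Kubernetes resources just mentioned, sketched in a hedged form: a ServiceMonitor telling Prometheus how to scrape one of the microservices, and the kind of Instrumentation resource the OpenTelemetry Operator uses for automatic Java instrumentation. The labels, ports, paths and endpoints here are assumptions for illustration, not the exact manifests from the demo.

```yaml
# Hedged sketch: scrape a (hypothetical) backend-service via the Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backend-service
  labels:
    release: kube-prometheus-stack      # assumption: label the Prometheus instance selects on
spec:
  selector:
    matchLabels:
      app: backend-service              # assumption: the Service's label
  endpoints:
    - port: http
      path: /actuator/prometheus        # assumption: Spring Boot Actuator + Micrometer endpoint
      interval: 15s
---
# Hedged sketch: auto-instrumentation resource for the OpenTelemetry Operator
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector.monitoring.svc:4317   # assumption: collector's OTLP gRPC endpoint
  propagators:
    - tracecontext
    - baggage
  java: {}   # use the operator's default Java agent image
```

Pods then opt in to the Java agent injection with an annotation such as instrumentation.opentelemetry.io/inject-java: "true" on the pod template, which is what makes the "no code changes" part work.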
So, as that sketch suggests, I define a custom resource of kind Instrumentation and specify the language. Java is one of the supported languages where it can automatically instrument the application and send the telemetry data to whatever backend we want to use. Now, with that, let's move on to the next part.

The last part is about distributed tracing. An earlier session talked about Jaeger and distributed tracing, and this is the same concept. In terms of understanding the complexity of the application: when we do microservices-based development, there can be a lot of calls going from one microservice to another, which in turn calls another microservice, and we need visibility into how a request flows between them. Distributed tracing helps us understand that complexity, what calls are going where, and with the traces we can also see how much time each call takes. So if there is a performance issue, or we want to monitor and optimize performance, distributed traces help us identify the bottlenecks. And obviously, when there are problems in production, we want to debug them. Take a scenario where you have five replicas of a service running and there is a problem on one particular node where one of them runs: distributed tracing can help you identify the specific node where that problem is happening. I'll show you an example of how this can be identified.

In this setup, I'm using the same financial services app with Jaeger as the distributed tracing backend. There is a Micrometer component that emits the telemetry, there is a Jaeger operator I've deployed in the Kubernetes cluster, and the OpenTelemetry Collector forwards the traces to Jaeger. We visualize all of this in the Jaeger user interface.

So let's see this in action. Again, let me run the same 500 iterations against the backend service and switch over to the Jaeger UI. Here I have all five services, the account service, authentication service, backend service, forex service and transaction service, and all the traces related to these services are available in Jaeger. We'll focus on the backend service, which is the entry point. When I limit this search to 20 traces, we can see the 20 most recent traces from a few seconds ago, and in this particular case we can see that there were four spans involved and two of them had errors. In such scenarios we can drill down to the service call or method call level and find out where exactly the error happened, and of course we have the complete timing of how long each of these operations took. That's the failure scenario. For the successful ones, we can also find cases that took longer. Let's increase the limit to 200 traces. Here we can see another API call which has 11 spans and took about 20 milliseconds. Again, I can drill down in the same way and identify how much time is taken for each hop between services. So we see that the backend service is calling the authentication service here.
Then that returns the authenticated user information, then it goes into the account service and then into the transaction service. Each of these calls we are able to trace right from the beginning to the end, and we can see the time taken for each span of the call.

So in terms of end-to-end observability, what we saw in this demo is that we have these five microservices, we use Loki for log aggregation with Grafana to visualize the logs, for metrics we use the Prometheus and Grafana combination, and for distributed traces we use Jaeger. All of this is tied together using OpenTelemetry, which is more like a standard: it doesn't tie you to any of these specific tools. So if we want to switch from one of these backends to a different one, OpenTelemetry allows us to do that, and it can be done with configuration changes.

I like to use the analogy of the right tool for the right purpose here. If you look at many of these tools and projects, they try to do many things at once. Take modes of transport: if you want to move from one place to another, you could use a bicycle, a motorbike, a car, and different factors influence why you would choose one over the other. If it's the holiday season and you want to spend more time with your family, you might choose a cruise; if you want to reach your destination quickly, you might take a plane. It all depends on your requirement and what the right tool for that purpose is. In earlier demos I've used ELK, for example, for log aggregation, and there might be cases where ELK fits your scenario better and you would rather use the Elastic Stack than Loki. So this is not guidance that says you should use only these tools; my suggestion is, your mileage may vary, find out what the right tool for your purpose is and use that.

In summary, I would say that modern cloud native applications need different ways of addressing observability and monitoring. We can't rely on the legacy approach of just looking at CPU and memory usage, and these modern tools help us enhance the observability of our applications. As applications become more and more distributed, we need different kinds of monitoring, and log aggregation, metrics and distributed tracing help us do that. So I hope you found what I demonstrated useful.

There are some challenges when you get into this. If you're not using something like OpenTelemetry and you use these tools separately, individually, you might end up with too many agents running and different SDKs to integrate with; that's one of the challenges. The way you overcome that is by using something like OpenTelemetry, which lets you consolidate all of that in one place: you can see all your configuration in one place, and it becomes easier to maintain and manage your observability stack. The other one is instrumentation and vendor lock-in. Here, again, the suggestion is to use something that is an open standard.
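To make the "switch backends with a configuration change" point concrete, here is a hedged sketch of how a traces pipeline could be added to the same collector configuration, exporting spans to Jaeger over OTLP; the Jaeger service name and port are assumptions.

```yaml
# Hedged sketch: add a traces pipeline to the collector and point it at Jaeger
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector.observability.svc:4317   # assumption: Jaeger's OTLP gRPC port
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
```

Moving to a different tracing backend would mean replacing only that exporter block, with the instrumented services untouched, which is exactly the lock-in question I want to touch on next.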
If you don't want vendor lock-in, try to stick to tools and products that support open standards, so that it's easier for you to migrate in the future if you want to. Things like cloud native logs, cloud native metrics and cloud native traces are newer concerns, and for those you might have to use something recent, things like Fluentd, Prometheus, Thanos; we saw examples in the earlier demos. When you use open technologies like these, they are easier to migrate to, and easier to manage, in my opinion. And when you need a single pane of glass, Grafana is the most commonly used option nowadays, so that is what I prefer to use.

With that, I think I've come to the end of this session. In case you want references, I've put some links here, and these slides are already published to SlideShare and SpeakerDeck, so you can find the links there. I've also put links to my earlier talks in other forums like Reactor, and there is also a wonderful series of talks by Usain Dalai on Prometheus and Grafana. If you are interested in implementing any of this in your projects, feel free to refer to them. The source code for the demo I showed is available on GitHub, and there is also a markdown file; if you want to reproduce this, you can recreate the setup from that markdown. The SlideShare and SpeakerDeck links are there for this talk as well as for my earlier talks.

So thank you, thanks for coming. I think we still have some three and a half minutes for Q&A, happy to take any questions before we break for lunch. Yes? Sorry, I haven't heard about it. The thing is, all these catchwords keep coming up nowadays; look at how the 4 Rs became the 6 Rs and the 7 Rs, for example. Over a period of time, as technology evolves, these kinds of things keep getting added. The twelve-factor app is a classic example: when it started, maybe 2010 or 2013, there were twelve factors people were talking about; now there are fifteen, and maybe six months down the line we'll hear about eighteen. So I haven't heard about that fourth one, but it's quite possible it's there, and it's good if it is. Thanks. Any other questions? If not, thank you. I hope you enjoy the rest of the day.