Hi everyone, welcome to our talk on federated monitoring, leveraging open source technologies. First, let me introduce ourselves. I'm Akhil John, a cloud support engineer working with Platform9, and I have my colleague here with me. Hi, this is Sanchik Patuk; I work with Akhil as a cloud support engineer at Platform9 on platforms and systems.

So let's get started with why we need monitoring and what we want to monitor. As we know, monitoring is very important in making sure the workloads deployed on our infrastructure are working as expected and the business is running smoothly, and in understanding which services are bringing in profit or loss to the company. That's where monitoring really helps: it's how we identify the value and the effectiveness of our resources. There are four major signals that we need to monitor from our systems: latency, traffic, errors, and saturation. I'll explain a bit more on each. Latency is about identifying whether a system is slow to respond to queries or requests. With traffic, we can understand how many requests and responses are flowing through the system, so we can identify the pattern of the traffic flow. Errors tell us what kinds of failures are being reported in the system, and this is a very important factor: in a complex distributed environment it is not easy to understand, because the errors and logs have to be collected from many distributed systems. The last signal is saturation, which is about identifying the trend: it gives us an idea of which services or applications have been most or least heavily used, and based on that information we can scale up or change the design of the business accordingly. So saturation also plays a very important role in monitoring.

Let's move to the next slide, where we discuss the challenges with primitive monitoring systems. The major challenge with primitive monitoring was that we were not able to monitor the applications and services running on the infrastructure. In the primitive style, agents would be installed on a host and would only collect metrics of that specific host, while various applications were running on it. With those details alone, it becomes very difficult to understand which specific service or application has been consuming most of the resources, or whether there is an issue in the business logic or application logic that is contributing to a problem. Such information could not be fetched with the primitive style of monitoring. When it comes to a modern distributed environment, say a DevOps setup, it becomes very complex to handle and needs more automation. There are hundreds of processes running on the server nodes in an interconnected fashion, so maintaining such a setup to run smoothly and without any downtime was very challenging.
So let's imagine a complex infrastructure with many servers distributed over different locations, and needing insight into what is happening at the hardware level or the application level: errors, response latency, hardware overload, or running out of resources. With the primitive monitoring style this situation makes it very difficult to trace what is actually causing a problem; one service can crash and cause a failure in other services. When there are so many moving pieces, there is a need to quickly identify what exactly went out of control, and debugging becomes time-consuming because the only data we have is host-level metrics. We also won't be able to attain very low downtime or zero downtime. So just to conclude: we need more than just metrics and host data, as applications these days run in complex distributed environments. We have microservices running together, and the primitive style of monitoring simply no longer matches the distributed environments that we have. Let's move to the next slide, which talks about the VM revolution and how applications evolved; I'll let my colleague talk more about this.

Okay, thank you, Akhil. As my colleague mentioned, applications have now transitioned from a monolithic architecture to a microservices architecture. Let's go through the journey of how that revolution took place. Over the past decade or so, we have seen widespread adoption of VMware and other virtual machine platforms, which did revolutionize the way infrastructure was built at large scale, and with it the way that infrastructure was monitored. But if we're being honest, it did very little to change the way applications were run or consumed. The virtual machine revolution added new infrastructure metrics to monitor, but overall we stayed the same in terms of application design and the way applications were developed, run, tested, and consumed by devices and users. In the traditional architectural pattern, which most organizations in the world run even today, the application is built as a single autonomous unit, with every component having dependencies on the others. That requires extensive manual testing and major development work after each upgrade, and it doesn't allow the application components to be monitored individually or to be independent of each other. After the VM revolution came the advent of cloud, which was a major breakthrough in the industry, with Platform-as-a-Service and Infrastructure-as-a-Service coming into the picture. These trends added some more metrics to monitor, but they also added a certain complexity to the overall environment in which applications were run. The application architectures fundamentally remained the same, even after the advent of cloud. What changed everything was containerization: things finally started to change from an application design perspective with the introduction of modern container platforms, primarily Docker, when it was introduced to the world starting in 2013.
Once Docker came into the picture, everybody started to understand what containers are and how they can help to create stateless applications that are individually smaller in size but ten times more impactful within the environment, in scale and number and in the way they are written, designed, and implemented within a containerized stack. This led to the beginnings of containerized infrastructure to run these container workloads, which are far more transient: stateless applications moved between host servers constantly, were more scalable, and were denser in their footprint within the infrastructure environment that companies have. In addition, container portability encouraged a lot of industries to adopt multi-cloud strategies, which added even more complexity to existing environments. All in all, applications over the past few years have grown significantly more complicated; so complicated, in fact, that you can't take the monitoring strategies that worked for virtual machines in the past and expect them to work for containers and containerized setups as well.

So containerized applications came, and with them came Kubernetes as the biggest container orchestration platform, which everybody has started to implement over the last few years. But Kubernetes brings along its own complexity, its own concepts, and its own fundamentals that are needed to run your applications in a containerized fashion. Running production applications on containerized Kubernetes infrastructure requires you to keep track of your clusters, your pods, your containers, and your apps. Then you have CSI plugins coming in; you have container networking (CNI) setups with different options like Flannel, Calico, and Weave Net; you have autoscaling features coming in if you are provisioning on a cloud provider like AWS or Azure. There are so many different moving parts in a complex architecture, all intertwined with each other, which makes it really difficult to run production-grade applications; and multiply this by the hundreds or thousands of clusters that many large companies run today in their setups. With that complexity, all of these different things have to run in sync with each other. In Kubernetes, you have applications that continuously scale up and then scale down to zero, and you need to keep track of that. You have load balancing involved to navigate traffic across application deployments. There are memory leak problems within containers; certain applications are single-threaded in the way they are written, which results in slow API response times out of their methods or functions. The architectural components of Kubernetes themselves reach certain limits: you have out-of-memory kills of pods, and you have CrashLoopBackOff errors happening across deployments.
You have etcd quorum issues happening across your master node components. You have resource contention continuously happening at large scale, depending on your underlying infrastructure. All of these issues amount to your cluster performance and uptime being impacted to a large extent if you don't have a solid monitoring system or a solid monitoring strategy in place. With that, I'll let my colleague Akhil speak about what observability is, what the different pillars of observability are, and how we can approach Kubernetes and containerized infrastructure monitoring using the correct methodology.

So I would like to speak about the pillars of observability. We categorize three major pillars of observability: metrics, logging, and tracing. We can call metrics the "what", logging the "why", and tracing the "where"; logging, tracing, and monitoring are different words around the same process. Metrics are a numeric representation of data measured over an interval of time, and they provide insight into the system's behavior. We have to choose metrics such as CPU usage, memory usage, and disk I/O; these metrics really help in having insight into the system. Application metrics, combined with a service discovery feature in your monitoring tool, are essential, because application metrics are scoped to a single system, making it hard to understand anything other than what's happening inside that particular system. For metrics we use open source tools like Prometheus: Prometheus scrapes metrics and stores time series in its database, we have collectors such as node exporters which help with data collection, and we have Alertmanager sending out alerts at specific thresholds. So metrics really help in surfacing that information. But as we know, metrics are not the only factor from which we can conclude observability, so we have logging. The purpose of logging is to track errors, reports, and related data in a centralized way. Generally, logs are stored either locally or remotely in centralized storage for analysis. The use case where this comes in: suppose an engineer reports that some application or service has been behaving abnormally. We can look into the logs, and the logs give us more clarity; that's how we can troubleshoot and understand what went wrong. Logging should be used in applications, especially if they provide a crucial function, so we have to make sure logging is also taken into consideration. We have open source tools associated with this, like Elasticsearch and Logstash, which are widely used.
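To make the logging pillar a bit more concrete, here is a minimal sketch, using only Python's standard library, of an application emitting one JSON object per log line so that a centralized pipeline (Logstash, for example) can parse it easily; the service name and field names are illustrative assumptions, not from the talk's demo.

```python
# Minimal structured-logging sketch using only the Python standard library.
# One JSON object per line is easy for a centralized pipeline to parse;
# the "service" value and field names here are illustrative.
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "checkout",  # illustrative service name
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.error("payment gateway timeout after 3 retries")
```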
Now moving to traces. While logs provide an overview of discrete, event-triggered records, traces cover a much wider, continuous view of an application. The goal of tracing is to follow a program's flow and data progression; by tracing through a stack, developers can identify bottlenecks and focus on improving performance. Put simply, tracing tells the story of a transaction or workflow as it propagates through a distributed system. One widely used example is a service mesh: in a service mesh we have many distributed microservices, and we can trace them using open source tools such as Kiali or Jaeger, which are widely used for tracing. So those are the three major pillars of observability.

Moving ahead, when it comes to Kubernetes we should take a three-tiered approach to monitoring. First, monitor your infrastructure: collect metrics about infrastructure availability, utilization, capacity, and performance. This is one of the approaches we should take, because the infrastructure plays a vital role and, as you know, Kubernetes is deployed in a distributed way. The second important factor is checking your logs, but you can't simply log everything from every application; we have to be cautious about which sources of logs are taken into consideration. Take a critical software stack, for example: those logs will help us understand what actually happened if there is a disruption in the infrastructure. So we have to identify which specific logs need to be enabled; the logs are very important too. The last one is using APM, application performance management. At the end of the day we are deploying our applications in the Kubernetes environment, and APM has the capability to show the performance of an application end to end, including the end user and each part of the application: the application dependencies, how the application is accessed, common issues, and collaboration across the team and the support environment. All of these factors come into the picture when we talk about application performance management. A lot of APM tools are available, and we have to make sure we integrate them. APM tools alone won't solve all of the Kubernetes monitoring challenges, but they can help determine where issues exist, who the issues impact, and what the root cause might be. So APM is one of the crucial approaches to Kubernetes monitoring.

Now moving on, let's discuss the key pillars. There are three key pillars to adopt in order to monitor Kubernetes more efficiently, and I'll quickly run through them. First, we should have a dedicated team, meaning a team that actually owns Kubernetes monitoring. Even if we don't have a formal ownership structure, it needs to be clear who should be monitoring the cluster and who has to respond when something goes wrong. Having a dedicated team, or at least making sure the responsibility has been properly handed over, is very important for Kubernetes monitoring; that's why we consider it one of the major pillars. The second point is that monitoring has to be user-focused, not CPU-focused. With AWS-style infrastructure we pay based on usage, but when it comes to monitoring we should think user-first, because the end goal is always to keep the users happy, not to make the computers happy.
So we have to give specific focus to the users; in simple terms, user satisfaction, customer satisfaction, should be the end goal. This is another key pillar of monitoring, and it's how we keep the focus on the experience we deliver. The other key factor is integrating your tools, because Kubernetes monitoring involves so many layers and becomes very complex, and we cannot collect and analyze all of the metrics in just one tool. We should have a series of tools, and we have to integrate them so we can identify which tools and which logs we should specifically be looking at based on our requirements, so that we can deliver the customer experience and make sure our business is running smoothly. Integrating all the required tools is an important factor. So those are the three main key pillars for monitoring Kubernetes more effectively. Now I'll let my colleague talk about the start of the Kubernetes journey, the dashboard, and the tools that we use for monitoring.

Thank you. So we saw the different terminologies around monitoring and metrics across legacy infrastructure, cloud infrastructure, and Kubernetes; we saw how virtual machines and containers came into the picture and changed the way applications were designed, why the need for Kubernetes came in, and why the need for Kubernetes monitoring came in. We just saw the different aspects, from a theoretical and organizational perspective, that organizations should try to implement as a starting point when they begin their Kubernetes and application monitoring journey. Now let's switch gears and dive into some of the different tools that are available in the market, how those tools can help organizations implement monitoring in a better way following the different monitoring strategies we discussed, and how they can integrate all these open source tools and tweak them in a way that suits their infrastructure and applications best. In terms of the start of the journey: when anybody starts on a Kubernetes journey and tries to containerize applications coming from legacy or virtualized infrastructure, the best place to start is with a small footprint. Run a single cluster with a few nodes to begin with, and try to get a grasp of what Kubernetes is and what the different components are. Integrated into the Kubernetes cluster architecture is the Kubernetes dashboard. The dashboard gives a very fundamental picture of the metrics for a beginner to look at: it provides CPU usage and memory usage per resource, or in an aggregated fashion per namespace or across namespaces, over a period of time for your cluster. Is this sufficient for a production-grade setup? Obviously not. But for a user that's new to the Kubernetes space, it gives you a very structured start to monitoring your workloads and applications in terms of their CPU and memory usage and consumption.
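The same CPU and memory numbers the dashboard shows can be fetched programmatically. Here is a rough sketch, assuming metrics-server is installed in the cluster and the official `kubernetes` Python client is configured; it is an illustration of the metrics.k8s.io API rather than anything shown in the talk's demo.

```python
# Sketch: read the per-container CPU/memory usage that the Kubernetes
# dashboard visualizes, straight from the metrics.k8s.io API.
# Assumes metrics-server is running in the cluster.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod
api = client.CustomObjectsApi()

pod_metrics = api.list_cluster_custom_object(
    group="metrics.k8s.io", version="v1beta1", plural="pods"
)
for pod in pod_metrics["items"]:
    for c in pod["containers"]:
        print(pod["metadata"]["namespace"], pod["metadata"]["name"],
              c["usage"]["cpu"], c["usage"]["memory"])
```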
Now moving to the big beast from a monitoring perspective, which is Prometheus; that's what our next few slides will mostly be about. Over the past few years, side by side with Kubernetes, came Prometheus, which has now become the mainstream monitoring tool of choice in the containerized and microservices world. So let's dive into the different components involved in Prometheus. At its core is the Prometheus server, which does the actual monitoring work. The Prometheus stack consists of, first, a time series database, which stores all the metric data, like current CPU usage or the number of exceptions in an application, in a structured fashion over a period of time that you designate. The second component is the data retrieval worker, which is responsible for pulling metrics from the applications, services, or target resources in your infrastructure and storing them in the time series database. Once the data has been retrieved and stored, there is a web server, an API endpoint, that accepts queries for the stored data. The web server component is used to display the data in the form of a dashboard or UI, either through Prometheus's own dashboard or through the other data visualization tool that goes hand in hand with it, which is Grafana.

Unlike other monitoring systems, where all the servers push their metrics into a centralized collection platform, Prometheus does it differently: Prometheus pulls the data from the instances or target resources that you want to monitor. What happens with a push system is that it creates a high load of traffic, resulting in the monitoring itself becoming the bottleneck; because you are pushing data continuously to a single collection point, that point is going to get hampered. A push system also requires additional software or tooling on each target to push those metrics to the centralized collection platform. This is where Prometheus acts differently. The pull system also provides better detection of, and insight into, whether services are up and running, since Prometheus actively scrapes its targets. For very short-lived jobs, where a pull-based system isn't going to help, Prometheus also provides a Pushgateway concept, where you can do the legacy, push-style way of collecting metrics, but only as an add-on feature for jobs which run for a very short time, where pulling doesn't actually make sense.
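As a minimal sketch of that Pushgateway pattern for short-lived jobs, here is what it can look like with the Python prometheus_client library; the gateway address and job name are assumptions, so point them at wherever your Pushgateway actually runs.

```python
# Sketch: a short-lived batch job pushes one metric to the Pushgateway just
# before it exits; Prometheus then scrapes the gateway on its normal
# pull schedule. Gateway address and job name are illustrative.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()

last_success = Gauge(
    "batch_job_last_success_unixtime",
    "Unix timestamp of the last successful run of the batch job",
    registry=registry,
)
last_success.set_to_current_time()

push_to_gateway("pushgateway.example.local:9091",
                job="nightly-batch", registry=registry)
```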
It is extremely easy to deploy Prometheus, Grafana, and Alertmanager using Helm charts; that's the easiest way to deploy. You can go to the Helm website, where there is a stable Prometheus chart repository; pull in those charts, and in a matter of minutes you'll have Prometheus, Grafana, and Alertmanager all running on your cluster, ready for you to start pointing at the things you want to monitor.

Now, what does Prometheus monitor? We understood how it works, but why do we need Prometheus and what does it monitor? It could be an entire Linux or Windows server, a single application, an Apache server, or a service like a database. The things being monitored by Prometheus are called targets. Each target has different units of monitoring: for a Linux or Windows server you have CPU statistics, memory, and disk space usage; for a standard application you have the request count, exception count, and request duration, covering all the API calls that go through the application. These units that any consumer would want to monitor for a specific target are what Prometheus defines as metrics, and these metrics are what get saved into the Prometheus time series database. So you have a target, the target has units, these units are called metrics, and the metrics get saved into the time series database. A very simple concept. The format in which these metrics get collected and stored is a very simple, human-readable, text-based format with two attributes: TYPE and HELP. HELP is the description of what the metric is, and TYPE has its own categories. You can have a counter type, which records how many times a certain thing happened; a gauge type, which can go up and down, for monitoring a value that rises and falls; and the third type is the histogram, which lets you monitor how long something took to complete or how big a request was. These are the metric attributes that every Prometheus metric has. Moving on: how does Prometheus collect these metrics from the targets? We understood what a metric is and what a target is, so how does Prometheus collect them? Prometheus pulls metric data from the targets over HTTP, by default from the host address followed by /metrics. So the targets must expose that /metrics endpoint, and the data available at that endpoint must be in the format Prometheus expects. All the targets that Prometheus is going to scrape should have that /metrics endpoint, with the data in the form that Prometheus understands, for Prometheus to be able to scrape it.
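As a rough illustration of those three metric types and the exposition format, here is a minimal sketch using the Python prometheus_client library; the metric names and port are made up. Hitting http://localhost:8000/metrics while it runs shows the text format just described, complete with the # HELP and # TYPE lines.

```python
# Sketch of the three Prometheus metric types, exposed at /metrics.
import random
import time
from prometheus_client import start_http_server, Counter, Gauge, Histogram

REQUESTS = Counter("app_requests_total", "How many requests were handled")
IN_FLIGHT = Gauge("app_requests_in_flight", "Requests currently being handled")
LATENCY = Histogram("app_request_duration_seconds", "Time taken per request")

start_http_server(8000)  # serves the /metrics endpoint for Prometheus to scrape
while True:
    REQUESTS.inc()                       # counter: only ever goes up
    IN_FLIGHT.set(random.randint(0, 5))  # gauge: goes up and down
    LATENCY.observe(random.random())     # histogram: buckets durations
    time.sleep(1)
```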
Some servers already expose this /metrics endpoint by default, so there is no additional work needed to ensure the data is in the format Prometheus wants to collect. These are called clients. There are client libraries with a lot of different options that can be integrated using the language of your choice, giving you direct Prometheus integration within the application itself. For example, with a Java application you can wire up the Prometheus client library during application development itself and have out-of-the-box metrics for your application, ready for Prometheus to scrape. Then you have metrics collected from these target applications out of the box; you don't need to do anything to ensure the data is in the format Prometheus understands. If you don't have that kind of application, that's fine, because not all the services or servers you want Prometheus to monitor will have a native Prometheus endpoint in the required format. That's where another component comes into the picture, called an exporter. An exporter can be a script or a service that fetches metrics from the target, converts them into the format Prometheus understands, and then exposes the converted data at its own /metrics endpoint, where Prometheus can scrape it. So if you have an exporter running alongside an application which is not a native Prometheus-instrumented application, the exporter does the job for you: it converts the data into the format Prometheus understands and exposes it at its own /metrics endpoint for Prometheus to scrape. Prometheus has a list of such exporters for all sorts of different services, systems, and applications, and that list keeps growing. Just as an example, if you want to monitor a Linux server, you can simply download and execute the node exporter, which converts the metrics of that particular server and exposes them at the /metrics endpoint, which Prometheus can then scrape. These exporters are also available as Docker images, so you can run them as sidecar containers inside your application pods if you want, so that you don't need to write the instrumentation into the application itself when you're designing it.
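To show the shape of such an exporter, here is a minimal sketch in Python; the legacy service, its URL, and its JSON stats format are hypothetical, invented purely for illustration.

```python
# Sketch of a tiny custom exporter: it polls a hypothetical third-party
# service that reports stats as JSON at http://legacy-app:8080/stats,
# converts them to Prometheus metrics, and serves them at :9100/metrics.
import time
import requests
from prometheus_client import start_http_server
from prometheus_client.core import GaugeMetricFamily, REGISTRY

class LegacyAppCollector:
    def collect(self):
        # Fetched fresh on every Prometheus scrape.
        stats = requests.get("http://legacy-app:8080/stats", timeout=5).json()
        yield GaugeMetricFamily(
            "legacyapp_active_sessions",
            "Number of active sessions reported by the legacy app",
            value=stats["active_sessions"],
        )

REGISTRY.register(LegacyAppCollector())
start_http_server(9100)  # exposes the converted data at /metrics
while True:
    time.sleep(60)
```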
So now Prometheus has collected all the metrics from the targets and stored them in its time series database; what do we do next with that? We need these metrics to be used in a way that provides the end users, or the administration team of an organization, with alerts; that's our end goal. That's where Alertmanager comes into the picture. The Prometheus server pushes alerts to Alertmanager, which is responsible for firing those alerts via email, via a Slack channel, or via any receiver you want, based on rules. You configure alerting rules, and once the thresholds of those rules are met, Prometheus pushes the alerts to Alertmanager; when the condition has been met, the alert is fired to the receiver you have configured for that rule, typically whichever admin is watching your monitoring stack.

Next is a basic demonstration video that shows a node exporter running inside a Kubernetes node, with Prometheus running locally. This is Prometheus's own dashboard, and you can see the different metrics available, such as the exporter's collector scrape durations in milliseconds. The node exporter is one type of Prometheus exporter, for hardware and operating system metrics, with its own set of pluggable metric collectors. It allows you to measure various machine resources, as I said: memory, disk utilization, CPU utilization, and so on. It scrapes this information from the existing node or device on which it gets deployed. In terms of deployment, Prometheus and Alertmanager come in the form of Kubernetes deployments: you will have certain deployments running the Prometheus server, you will have Grafana deployed as a standalone deployment, and you'll also have something called kube-state-metrics, which monitors the health of your running resources, including the Prometheus resources and the other cluster resources internally. Whatever you have running, your controller manager, your kube-scheduler, and so on, all of that gets monitored via the kube-state-metrics endpoint. All in all, when you deploy Prometheus you cover your bases to a large extent: you monitor your application resources, and along with that you monitor your Kubernetes resources as well, out of the box.

Moving on to Grafana. Grafana is kind of an extension of Prometheus: a visualization tool. You can collect the metrics out of Prometheus and then visualize them using Grafana, which has some awesome preloaded dashboards, and you can create your own dashboards as well. Prometheus allows you to query its metric data on these targets using the web server API, as I discussed earlier, and it does that using the PromQL query language. Prometheus has its own PromQL query language, which is quite easy to understand, but you need to get the hang of it to write the queries if you want to create your own dashboards. You can use the Prometheus UI, or you can use Grafana, which uses PromQL in the background.
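For a feel of what a Grafana panel does behind the scenes, here is a minimal sketch of issuing a PromQL query against the Prometheus HTTP API with Python's requests library; the server address and metric name are assumptions, so substitute whatever your instance actually scrapes.

```python
# Sketch: an instant PromQL query against Prometheus's HTTP API --
# the same kind of expression a Grafana panel would run.
import requests

PROM = "http://localhost:9090"  # assumed local Prometheus server

resp = requests.get(
    f"{PROM}/api/v1/query",
    params={"query": 'rate(node_cpu_seconds_total{mode!="idle"}[5m])'},
)
for series in resp.json()["data"]["result"]:
    # Each result carries its label set and the latest sampled value.
    print(series["metric"], "=", series["value"][1])
```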
Grafana can also import data from other sources; it's not limited to Prometheus. It can fetch data out of Elasticsearch or your ELK stack, or an AWS RDS instance, or MySQL. So you can have data coming in from all these different sources and build your own visualization over all of them in a single pane of glass: data from a Kubernetes cluster alongside data coming from your ELK stack, all visualized in one place.

The next demonstration is one where, on a standalone single-master-node Kubernetes cluster, I have a Linkerd deployment running. This is the Linkerd dashboard that comes along with the deployment itself. As you can see, it has its own Prometheus pod running, which comes out of the box when you deploy Linkerd, and it has its own Grafana dashboard as well. This gives you a picture of how Grafana looks: you can see memory usage and network I/O for your pods and the network pressure across the namespace, and then you can specifically target a particular deployment you want to monitor in any namespace, to see the request rate for that application, its latency, and so on. So, as I discussed, you can have applications with the client libraries built in, with Prometheus integration as part of the application itself, or you can have Prometheus running as separate standalone pods shipped alongside the application, as Linkerd does, with all the Linkerd metrics coming in and a Grafana dashboard deployed as a pod along with it. That gives you a complete, separate monitoring stack dedicated to your application. So this is how, in a general way, the integration works: Prometheus pulls the metrics from the applications and pushes alerts to Alertmanager; Grafana uses PromQL to show the data in visualizations; and you've got the node exporters and other exporters pulling data out of third-party applications. All in all, you get a sense of how things are integrated from a complete stack perspective.

Moving on to federation. Prometheus has its own feature whereby you can federate Prometheus servers. Federation is basically a hierarchical structure: each cluster can set up its own Prometheus server, which is responsible for scraping the data within that cluster, and one can set up a higher-level Prometheus server which is then responsible for scraping the data from those individual Prometheus servers. The federated Prometheus server will also monitor the individual Prometheus instances' own metrics as well. There are two categories. The first is hierarchical federation, where the federation topology resembles a tree: a higher-level, global Prometheus server collects aggregated time series data from a large number of subordinate Prometheus servers. The second is cross-service federation, where a Prometheus server for one service is configured to scrape selected data from another service's Prometheus server, to enable alerting and queries against both datasets within a single server. In terms of configuring it: any given Prometheus server can expose the /federate endpoint, which allows retrieving the current value of selected time series from that server; a match[] URL parameter must be specified to select the series you want to expose. So to federate metrics from one server to another, we configure the destination Prometheus server to scrape the /federate endpoint of the source server, while enabling the honor_labels option so that the labels exposed by the source server are not overwritten.
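To see what the federating server actually consumes, here is a sketch of fetching the /federate endpoint by hand with Python's requests library; the source server's address and job label are assumptions.

```python
# Sketch: fetch /federate manually, with match[] selectors choosing which
# series to expose -- the same request a federating Prometheus would make.
import requests

resp = requests.get(
    "http://cluster-a-prometheus:9090/federate",  # assumed source server
    params={"match[]": ['{job="kubernetes-nodes"}', 'up']},
)
# The payload is plain-text Prometheus exposition format, ready to be
# scraped and re-stored by the higher-level server.
print(resp.text)
```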
What federation provides is a way to capture selective, aggregated time series data from each individual cluster, which ensures that less space is consumed on that global Prometheus server and gives you aggregated data that you can keep for a longer retention period, say a year. Federation also provides a unified display of the data, so you don't need to go and look at each individual cluster: you have the data at a single point, aggregated, containing what you need, and you can even have federated dashboards running across all the clusters whose data has been collected and federated. Quickly moving on to the ELK stack; I'll let my colleague Akhil tell you about this.

So that covered Prometheus, how it works, and how it can be federated. Now let's discuss a different stack, called ELK: Elasticsearch, Logstash, and Kibana, three tools stacked together, which is why we call it a stack. It really helps to take data from any kind of source, formatted or unformatted, structured or unstructured, and it helps in searching, analyzing, and visualizing that data to determine patterns in real time. So this is also one of the monitoring stacks we use. As you saw with Prometheus and Grafana, in a similar way with the ELK stack we take the data, search it using Elasticsearch, and then show it in a visualization format. With the help of centralized logging you can identify the problems of an application, because you can look at all the logs from a single platform. It's a different platform altogether compared with Prometheus and Grafana. Let me explain the three components of this stack; I'll quickly run through them here, starting with Elasticsearch. You can call Elasticsearch the heart of this software stack: it is basically a NoSQL database that stores data in a way optimized for quick lookup, offers queries for better data analysis, and can index heterogeneous data. It offers real-time search, so we can find documents right after they are indexed; that's how it provides real-time search. You can update documents and add more data to them in real time, which is a plus compared with the other monitoring stacks. Another advantage is that it offers geolocation support and multi-language support, and it has multi-document APIs for handling individual records.
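As a minimal sketch of that centralized log search, here is what indexing and querying a log entry can look like with the official Python Elasticsearch client (8.x-style keyword arguments assumed); the index name, document shape, and cluster address are all illustrative.

```python
# Sketch: index one log entry and search it back out of Elasticsearch.
# In a real ELK pipeline, Logstash would do the indexing for us.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed cluster address

es.index(index="app-logs", document={
    "service": "checkout",
    "level": "ERROR",
    "message": "payment gateway timeout",
    "@timestamp": "2021-06-01T12:00:00Z",
})

# Find errors from a single service, across every host at once.
hits = es.search(index="app-logs", query={
    "bool": {"must": [
        {"match": {"service": "checkout"}},
        {"match": {"level": "ERROR"}},
    ]}
})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["message"])
```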
Putting it together, we can say Elasticsearch is the heart of the software stack, taking care of the heavy lifting. Logstash is another piece of the ELK stack: it fetches the input and feeds it to Elasticsearch. It gathers and collects data and makes it available for further use, it helps to clean the data, and it can support a huge array of data types. In other words, it processes log messages, enriches them, and sends them on to the destination; that is the major functionality of Logstash, which acts as the first stage of the pipeline. Once the logs have been fetched and fed to Elasticsearch, we need to see them in a visualization format in a simple manner, and that's where Kibana comes in. Kibana is the tool used for visualization in the ELK stack. It has the functionality to explore large volumes of data, and it has extensive dashboards whose main features are graphs and diagrams. Where Grafana reads metrics, Kibana shows visualizations over the indexed data in its own way, and we can use Kibana for searching and, mainly, for integrating with the Elasticsearch data contained in the instance. The other nice thing about Kibana and the ELK stack is that it can run independent of the platform; it doesn't have to be on a specific platform. It also provides real-time visualization of indexed data, and like I said it has multiple language support; it runs on Node.js, and the necessary packages come along with the installation, so its installation is pretty straightforward and we can easily integrate it with our monitoring stacks. As for ELK's use cases: we can use it for logging, and it can handle a big volume of data, which is a key consideration when selecting a monitoring stack; it offers long-term data storage; and we can implement it as a clustered solution across a distributed setup. So that's the gist of the ELK stack compared with Prometheus.

Moving to the next slide, let's discuss how Prometheus does not solve everything; we're trying to understand whether Prometheus solves all of the problems that we have. The major difference is that Prometheus is used for metric collection, system monitoring, and setting up alerts based on those metrics, whereas ELK is used for all types of data: performing different types of analytics on that data, searching over the data, and visualizing it. Those are the two major differences between Prometheus and the ELK stack. So does Prometheus meet all the requirements that we have, or do we need a better tool? We have to say that Prometheus does not provide some of the capabilities that you would expect from a fully-fledged monitoring-as-a-service platform. I'll list a few capabilities that Prometheus is missing: multi-tenancy, authentication and authorization, and built-in long-term storage. These are the major parts Prometheus is lacking, and I'll elaborate a bit more on those points. Starting with long-term storage: an individual Prometheus instance provides durable storage of time series data, but it does not act as distributed storage.
There is no replication or automatic repair of the data, which means that the durability guarantees are restricted to those of a single machine. Prometheus does offer a remote-write API that can be used to pipe time series data to other systems, but that becomes more difficult and adds extra overhead. So this is one of the major gaps Prometheus has. The other factor is a global view of the data. As described, each Prometheus instance has an isolated data storage unit. As we saw earlier, Prometheus instances can be federated, but that adds some level of complexity to the Prometheus setup again, when the whole reason we moved to these tools was to make things simpler to deploy and maintain; with federation, the complexity level of setting up Prometheus increases again. Prometheus simply isn't built as a distributed database, which means there is no simple path to achieving a single, global view of the data. That's the second missing piece. And then the other one is multi-tenancy. Prometheus by itself has no built-in concept of a tenant, which means it does not provide any sort of fine-grained control of tenant-specific data access or resource usage quotas. If I have a distributed environment across different clusters, the data that is scraped will not carry any information about a tenant, so from the Prometheus dashboard you won't be able to see or categorize groups in a centralized view. So the multi-tenancy feature is missing in Prometheus.

How can we fill these gaps? That's where we can take up the next tool, called Cortex. It is kind of an extension, a tool that fills the gaps where Prometheus is lacking. Cortex is an open-source Prometheus-as-a-service platform that seeks to fill these gaps and thereby provide a complete, secure, multi-tenant Prometheus experience; it's kind of an upgraded version, I would say. As a Prometheus-as-a-service platform, Cortex fills in all these crucial gaps and provides a complete out-of-the-box solution for even the most demanding monitoring and observability use cases. Let's take a couple of the ways Cortex fills the gaps. It supports long-term storage systems out of the box, which Prometheus was missing; the major storage backends supported by Cortex include Cassandra, AWS S3, and Google Cloud Bigtable. It also offers a global view of the Prometheus time series data, including the data in long-term storage, which greatly expands the usefulness of PromQL for analytical purposes: if I have a multi-cluster setup, as an administrator or a customer I would be much happier to see everything in a centralized format where I can see all the data that I need. That's the functionality Cortex offers. And, as I said, it has built-in multi-tenancy features: all the Prometheus metrics that pass through Cortex are associated with a specific tenant. I'll describe that shortly, but first let me show the architectural structure: how Cortex is deployed and the major components involved in Cortex.
As you can see here, the slide shows the architecture: what Cortex is, the major components involved in achieving this, how we can integrate Prometheus with it in terms of sending alerts, and what plugins we can add and integrate. I'll talk about a couple of the major components which play a crucial role in Cortex. The essential functionality is split up into single-purpose components that can be independently scaled, which is a very important factor here. There are four major components in Cortex that achieve this: the distributor, the ruler, the ingester, and the querier. The distributor handles the time series data written to Cortex by Prometheus instances via the Prometheus remote-write API; the incoming data is automatically replicated and sharded and sent to multiple Cortex ingesters in parallel. The ingesters receive the time series data from the distributor nodes and write it to the long-term storage backend, compressing the data into Prometheus chunks for efficiency. The ruler executes rules and generates alerts, making use of Alertmanager, which is installed with Cortex by default; Alertmanager can then be used to send alerts to the admins or associates when something spikes, based on the rules that are set. And then we have the querier, which handles PromQL queries from clients, abstracting over both the recent in-memory data in the ingesters and the samples in long-term storage. So Cortex really helps, overall, in solving the major gaps Prometheus was missing. That was a very quick walkthrough of Cortex.

Now I'll mention the other key factors that we saw in the previous slides. In the earlier diagram we saw that Cortex completes the Prometheus monitoring system, right? To adopt it in an existing setup, we just need to reconfigure our existing Prometheus instances to remote-write to the Cortex cluster, and Cortex takes care of the rest. Like I said, it's an extended version of Prometheus in which all the missing gaps have been filled. So let's discuss the multi-tenancy that is implemented with Cortex. Single-tenant systems are fine for small use cases and non-production environments, but for a large organization with a large number of teams, the use cases, the environments, and the systems all become more complex. To meet the isolation requirements of such organizations, Cortex provides multi-tenancy not as an add-on or a plug-in, but rather as a first-class feature. For example, all time series data that arrives in Cortex from a Prometheus instance is marked as belonging to a specific tenant in the request metadata; that's how the multi-tenancy feature is brought in with Cortex. The data that has been shipped from Prometheus for a given tenant can only be queried by that same tenant, and you can use Alertmanager, configured per tenant, to send out alerts, integrating with Slack or any other channel to notify the admins. So each tenant has its own view of the Prometheus-centric world at its disposal.
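As a rough sketch of what that tenant separation looks like from a client's point of view: Cortex identifies the tenant from the X-Scope-OrgID request header. The cluster address, the HTTP prefix (Cortex commonly serves its Prometheus-compatible API under /prometheus), and the tenant names below are all assumptions for illustration.

```python
# Sketch: the same PromQL query issued as two different tenants returns
# two completely separate datasets, because Cortex scopes all data by
# the X-Scope-OrgID header.
import requests

CORTEX = "http://cortex.example.local:9009"  # assumed Cortex endpoint

for tenant in ("team-payments", "team-search"):
    resp = requests.get(
        f"{CORTEX}/prometheus/api/v1/query",
        params={"query": "up"},
        headers={"X-Scope-OrgID": tenant},
    )
    print(tenant, resp.json()["data"]["result"])
```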
We generally do not use Cortex in a single-tenant fashion, because that adds extra overhead without serving its purpose; Cortex is generally used for multi-tenancy, in very complex distributed environments, in all the places where the features Prometheus lacks really matter, and we can scale it out independently for a large pool of tenants at any time. The other important thing I'd like to call out is that the components in Cortex can be independently scaled: all the major components, like the distributor, the ingester, and the ruler, can be individually configured or tuned based on the requirements we have, for example the way we have to send alerts and the traffic Cortex has to manage. So that's a very quick view of Cortex. Moving on: what would self-service Kubernetes monitoring look like in an ideal scenario? I'll let my colleague take it over.

Hey. So we just saw all the different tools that are available in the open source world right now, and there are many others besides these, being developed on a daily basis. But from an organizational perspective, rather than running all these services yourself, with your own infrastructure, your own applications, and your own monitoring team doing all the heavy lifting, what would be an ideal scenario or an ideal solution to all these problems and concepts that we saw? It is basically having a self-service, SaaS-managed Kubernetes offering, which provides an easy, self-service way to access the monitoring data. The software-as-a-service solution would collect the data automatically for its users, which they could view anytime using Grafana or Prometheus dashboards built into the Kubernetes offering itself, at the click of a button for the relevant cluster. It would also help if it had a visual alarm overview where one could keep track of alerts and subscribe to them. All in all, you wouldn't have to worry about anything: you just deploy your cluster, and you have the monitoring in place as well, to ensure that your application runs 100% of the time. Along with that, if this SaaS-managed offering also had its own internal monitoring of your Kubernetes setup, that would make things even better. The solution would then offer a managed experience by having its own internal monitoring, which would alert the customer to specific scenarios based on the data it collects, so that customers are notified proactively. Essentially, this provides the best of both: the ability for organizations and customers to dive deep into the monitoring data themselves, via all the tools that are exposed, and a hands-off approach, letting the internal monitoring do the work, so that the clusters can run on-premises or in any multi-cloud environment without having to worry about these specifics. That's the ideal scenario and the ideal solution that needs to be in place for organizations to leverage, so that they can focus on designing and developing their applications and not worry about everything else. Essentially, it provides the best of both worlds.
And that's what we basically wanted to cover: a brief overview of all the different problems that are out there, how to solve those problems, what tools are available, and what can be done by anybody who's starting out in this space, just as an overview. With that, it's a wrap. We'll definitely take any questions that you have. Thank you so much for your time, and thank you for being with us.