Hello, everyone. Good morning, good afternoon, good evening, wherever you are. I'm Michael, waving over from Germany, this time virtually, and today I want to dive into practical Kubernetes monitoring with Prometheus with you. Before we start: you can reach me as dnsmichi on social media, that's d-n-s-m-i-c-h-i. And without further ado, let's jump into action.

When it comes to monitoring, we first need to understand that we have to move away from the traditional host-and-service object model. Back then, with Nagios, we had state-based monitoring; now we have metric names, tags, volatile container names, and similar things coming up, and we need to understand that microservices and distributed systems require a new approach to monitoring. That is what we want to dive into today: we have Kubernetes, containers, and microservices, so how do we monitor that? Changing ideas and habits is often hard, but here are a few observations. Microservices run in containers, which can run anywhere in the cluster, so you don't actually know the IP address, and you cannot really say it has a hostname which I can ping to confirm it's okay. The other thing is that a business process model doesn't work here: saying "two out of three replicas means the state is still okay" clashes with how Kubernetes itself detects the health state, repairs it, or even adds more resources to the pools to ensure the application keeps running. Sometimes a service appears broken from the outside and in the next second it is already fixed again; in a traditional monitoring system,
this would be treated as flapping between states and would generate alerts, which is not applicable to Kubernetes and microservices. Similarly, mapped states like OK, WARNING, CRITICAL don't necessarily fit Kubernetes objects inside the cluster. So forget everything you know about this when approaching Kubernetes monitoring, and focus on metrics, logs, and everything else that is highly dynamic.

Now, where should we start our monitoring journey? Should we run Prometheus inside the cluster or outside? What is needed? Because if the Kubernetes cluster goes down and Prometheus is running inside it, we don't have any monitoring. On the other hand, for external monitoring we might need to open up the APIs if we don't have sidecars, and there is a potential for security problems, because we would expose privileged access and other things that are not really intended by design. The intended design is to run Prometheus inside the Kubernetes cluster, use the security policies that are in there, communicate with the API server, and follow the best practices with regard to permissions. If I need external monitoring, I can install ping probes to monitor the monitor, which I need to do with a traditional monitoring system anyway.

Now let's dive into Prometheus itself. What is it? You probably know it already, but how can we combine it with Kubernetes monitoring? The bigger picture is: Prometheus scrapes HTTP endpoints and collects the metrics from there. At a certain point we need to define the Prometheus targets, and one thing that comes into play with Kubernetes is service discovery, with which Prometheus discovers the targets in Kubernetes.
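To make this concrete, here is a minimal sketch of a scrape job using Prometheus' built-in Kubernetes service discovery. The job name and the opt-in annotation are my assumptions for illustration, not something from the talk:

```yaml
# prometheus.yml (fragment): discover all pods via the Kubernetes API
# and scrape only those annotated with prometheus.io/scrape: "true".
scrape_configs:
  - job_name: kubernetes-pods          # illustrative job name
    kubernetes_sd_configs:
      - role: pod                      # discover every pod in the cluster
    relabel_configs:
      # keep only pods that opt in via the annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # carry namespace and pod name over as target labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

The `__meta_kubernetes_*` labels are provided by the discovery mechanism and can be mapped onto target labels with relabeling, which is how volatile pod names stay manageable.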
Using file-based service discovery, for example, Prometheus then automatically adds the scrape targets to the monitoring, polls them at an interval, and everything is fine. With all the collected data, we of course need to define alert rules for sending notifications and alerts, and we can query the data with PromQL from the Prometheus web UI, from Grafana, or from other API clients.

When it comes to querying data, keep practicing PromQL, because it really comes in handy when learning Kubernetes monitoring: know the basics, and use PromLens, for example, as a tool to test your queries. The other thing I mentioned is service discovery, which really means using a dynamic list of targets that Prometheus can then add to the monitoring. Keep this in mind and understand how it works with Kubernetes and Prometheus.

But let's get into action: we want to monitor Kubernetes, so what is actually needed to do that? This is where the Prometheus Operator, and specifically kube-prometheus, comes into play. It allows you to install as one package the components which you would otherwise install manually via the operator, via Helm charts, and so on. kube-prometheus uses the Prometheus Operator to install Prometheus in a high-availability setup, next to Alertmanager and the node exporters. It also adds a Prometheus adapter for the Kubernetes metrics API, so you don't need the metrics server anymore. You also get kube-state-metrics pre-installed; we will see what it does in a bit.
Grafana is also included, giving you dashboards and views in addition. And this is the icing on the cake: we get pre-configured cluster monitoring for Kubernetes, meaning Grafana dashboards and alerting mixins come pre-installed, so we can make use of them right away and have that five-minute success of "I deploy the Prometheus Operator with kube-prometheus and I can start playing, I can start investigating what's going on in my Kubernetes cluster." In addition, kube-prometheus allows us to dynamically create service discovery objects, which is helpful for deploying our own applications and dynamically adding their target endpoints, for example the applications' /metrics endpoints.

The installation is pretty straightforward: you follow the documentation, apply the manifests, wait a bit, and basically everything is fine. To familiarize yourself with the deployment, the first step is to add port forwarding to have the ports available; for a production environment, use an ingress controller like NGINX or something else to expose the HTTP frontends. One of the frontends is the Prometheus UI itself, which you can use for basic queries, for visualizing the charts, and so on, basically for testing and inspecting what Prometheus is doing. For more advanced user interfaces, dashboards, and panels, you get Grafana pre-installed, which comes with a default password and so on. And last but not least, we also get the Alertmanager UI for that very reason: when we trigger an alert, we can investigate it there.

Now, when it comes to specifically monitoring Kubernetes with Prometheus, the great thing is that kube-prometheus provides a long list of Grafana dashboards which we can inspect to see what is going on, for example the Kubernetes API server or the compute resources in the cluster.
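Before we look at the dashboards in more detail: the installation and port-forwarding steps just described can be sketched as commands. The manifest paths and service names below follow the kube-prometheus repository defaults, so adjust them to your checkout and setup:

```shell
# Apply the kube-prometheus manifests: CRDs and namespace first,
# then the actual monitoring stack.
kubectl apply --server-side -f manifests/setup
kubectl wait --for condition=Established --all CustomResourceDefinition --namespace=monitoring
kubectl apply -f manifests/

# Forward the UIs to localhost for a first look.
kubectl --namespace monitoring port-forward svc/prometheus-k8s 9090
kubectl --namespace monitoring port-forward svc/grafana 3000
kubectl --namespace monitoring port-forward svc/alertmanager-main 9093
```

With the forwards in place, Prometheus, Grafana, and Alertmanager are reachable on localhost.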
We can access the kubelet dashboards, and we can get the Prometheus overview to see the health of the monitoring itself, and many more things that come out of the box. This is effectively a production view on the Kubernetes cluster already, so oftentimes it is wise to just use it as such and then learn from the queries and the panels to create your own dashboards and panels in Grafana and monitor the Kubernetes cluster even further.

The other thing that comes to mind: if I deploy my own application into the cluster, I also want to monitor it. My example is a fork of the Prometheus demo service from Julius Volz, which exposes some synthetic metrics. Using Deployment and Service resources, we can run it in the Kubernetes cluster and later start monitoring it. I would totally encourage you to try it out later: deploy it into the Kubernetes cluster and then monitor the demo service. Monitoring it means we need to create a new custom resource called ServiceMonitor, to make Prometheus aware that there is a new service to be discovered and to ensure that it picks up the metrics endpoints and starts monitoring them. Since three replicas were defined, we might see two of them already up as scrape targets while one is still unknown, which means that a little later all three targets will be available. Based on that, you can go ahead and create a new Grafana dashboard and panel, select the Prometheus data source, search for a specific CPU metric, for example, and then use a PromQL query to fetch the data and visualize the graph.

The other thing you get out of the box with kube-prometheus is container metrics: the kubelets provide an embedded cAdvisor exporter at the endpoint /metrics/cadvisor, which allows you to monitor resource
usage, out-of-memory kills, and so on. Building on that gives you the insight to say: hey, there is a spike, something is continuously going wrong, and you can really see when, for example, specific namespaces are using too many resources.

Another thing we get out of the box with kube-prometheus is kube-state-metrics, which queries the Kubernetes API server for specific health states: the Deployments, the nodes, the Pods. We can query the replicas available and their status, and everything is installed out of the box, which is super helpful for getting that insight without any further manual actions.

Now, this is what we get as the basics; we didn't do or modify anything yet. This is just out-of-the-box monitoring: you install the Prometheus Operator with kube-prometheus and you have everything available. But what if we want to customize it? kube-prometheus is written in jsonnet, so you can develop your own rules and dashboards and add them. You can also monitor other namespaces; I tried this recently to monitor the etcd namespace in our Kubernetes cluster. You create your own jsonnet file, modify it, generate the YAML configuration, and deploy that into your Kubernetes cluster again. And this is an ongoing process: you can modify it, add even more things to it, and really use kube-prometheus as a basis with your customizations on top, so you don't need to reinvent the wheel all the time.

The other thing around more advanced monitoring that I want to bring to your attention is: what else can we do with the monitoring data, with the collected metrics? One of the things in the SRE book from Google is the four golden signals of monitoring.
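Coming back to the demo service for a moment: a minimal ServiceMonitor for it might look roughly like this. The names, labels, namespace, and port here are my assumptions for illustration, not taken from the actual demo deployment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-demo-service       # illustrative name
  namespace: monitoring
  labels:
    app: prometheus-demo-service
spec:
  selector:
    matchLabels:
      app: prometheus-demo-service    # must match the Service's labels
  namespaceSelector:
    matchNames:
      - default                       # namespace the demo service runs in
  endpoints:
    - port: http                      # named port on the Service
      path: /metrics                  # where the demo service exposes metrics
      interval: 30s
```

Once applied, the Prometheus Operator reconciles this object into scrape configuration, and the demo service's replicas show up as targets.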
The golden signals are what you need to see immediately when something is going wrong with your application: latency, traffic, errors, and saturation of the service. Oftentimes you need to think about code instrumentation, actually modifying your code and exposing the metrics for Prometheus, to see the HTTP requests, to see anything that has been failing, and so on, and you need to correlate that in order to immediately see that something is really going off. If you want to dive into instrumenting your own applications, there is an example for Python which I created a while ago for a workshop. Beyond that, I would encourage you to check out all the Prometheus client libraries, for Ruby, for Rust, for Go, for anything that comes to mind, and if yours is not there, look into the existing examples for how to instrument your application, deploy it, and have it monitored.

Now, monitoring and metrics are fine, but what about alerts and escalations? As mentioned before, the Prometheus Alertmanager comes pre-installed with kube-prometheus. We can use Prometheus alert rules, that is, PromQL queries with a certain threshold and time window applied, and Alertmanager ensures that there are no duplicated alerts and that alerts which have already fired are suppressed.
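Coming back to instrumentation for a moment: here is a minimal sketch in Python using the official prometheus_client library; the metric and label names are made up for the example, and the handler only simulates work:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# A request counter and a latency histogram cover two of the four
# golden signals (traffic and latency); names here are illustrative.
REQUESTS = Counter("app_http_requests", "Total HTTP requests",
                   ["method", "path", "status"])
LATENCY = Histogram("app_http_request_duration_seconds",
                    "HTTP request latency in seconds", ["path"])


def handle_request(method="GET", path="/"):
    """Simulated request handler that records traffic and latency."""
    with LATENCY.labels(path=path).time():
        time.sleep(random.uniform(0.001, 0.005))  # pretend to do work
    REQUESTS.labels(method=method, path=path, status="200").inc()


if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    for _ in range(5):
        handle_request()
```

From there, a ServiceMonitor pointed at the application's /metrics port brings these series into Prometheus, and PromQL functions like `rate()` turn the counter into a traffic signal.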
You can acknowledge and silence specific alerts for a given time, you have high availability available, and the transports are email or chat systems; everything you already know from other systems works similarly here. The main difference when defining the alert rules is that we need a PrometheusRule custom resource, which wraps the Prometheus rules format and ensures that the alerts are deployed into the Kubernetes cluster, into the Prometheus configuration. This is rather straightforward, and the linked documentation for the Prometheus Operator is pretty awesome.

Now, jumping into a topic related to alerts: SLA, SLO, and SLI. We define a specific service level agreement, potentially with customers, of, say, 99.5% availability. Our objective is even higher: we want it to be 99.9% available. For achieving that, we need certain indicators: what is hurting our production system, what is the error budget, and what indicates that? Is it errors, latency, saturation, and so on? When we start thinking about this, we also need to think about SLOs and build our alerting based on them. SLOs can be difficult: you write a lot of code, you need to define the PromQL queries, the alerting rules for Alertmanager, and so on, so it needs a common specification. Most recently, OpenSLO was formed to unify this, to define a specification for SLOs, so that SLO generators, sloth for example, always generate the defined format. For SLO management, Pyrra was announced most recently, which gives you a UI to manage SLOs for Prometheus, so you don't need to manually copy-paste stuff to manage SLOs for Prometheus and Kubernetes. Now, SLOs with Prometheus can be used for even more than metrics and alert calculation: they also enable quality gates.
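As a quick aside before the quality gates: the PrometheusRule object I mentioned might look roughly like this. The alert name, expression, and threshold are invented for illustration, and the labels assume the kube-prometheus defaults for rule selection:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: demo-service-alerts           # illustrative name
  namespace: monitoring
  labels:
    prometheus: k8s                   # so the operator's rule selector picks it up
    role: alert-rules
spec:
  groups:
    - name: demo-service
      rules:
        - alert: DemoServiceHighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "More than 5% of requests are failing"
```

The `spec.groups` section is the familiar Prometheus rules format; the operator deploys it into the running Prometheus configuration.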
When something is deployed into staging and the SLO is failing in my staging environment, I want a quality gate, for example with Keptn: CI/CD goes red and fails just because the SLO failed, and I don't want to deploy any performance regressions into production and then debug them at 3 a.m. and burn out from that. So this is one thing to keep in mind. The other thing, and I could talk about it for hours: for GitLab.com, the SaaS platform, our infrastructure teams have defined their own metrics catalog, which is something similar to jsonnet, and this also gets generated. The great thing about generating all these things is that we generate the alerts and the dashboards, which allows us to define the key metrics and thresholds, and also add the runbooks and observability tools. So when someone is on call and an SLO alert fires, they can immediately see what is going on and, hopefully, what is needed to fix the problem, so that the production incident gets resolved rather quickly.

Now, last but not least, we need to shift from just monitoring to observability. We have metrics with Prometheus, the exporters, the sidecars; we have the applications with metrics instrumentation, and also integrations outside of Kubernetes. But this is probably not enough for running operations and SRE. So in that regard we shift to: let's keep an eye on logs. Maybe we have central log management available: either Elasticsearch with Beats as sidecars or in DaemonSets, with auto-discovery, and Kibana as a frontend, or a more lightweight approach with Loki and Promtail or Fluentd, using Grafana as the frontend. This is something to keep an eye on and make up your mind about, next to the metrics inside your Kubernetes cluster. The other thing I want to bring to your attention is traces, which
work a little differently from logs and allow you to really get distributed insights into application performance as well, for example to investigate slow queries. The tools in that regard are Jaeger tracing, Grafana Tempo, and OpenTelemetry; this is something to check out. There is also continuous profiling, providing application performance insights for developers and perhaps completing the four pillars of observability in the future.

Last but not least, secure monitoring is a hot topic, and it is needed. You have probably thought about automation, Infrastructure as Code, and GitOps: using the Prometheus Operator with kube-prometheus, extending and customizing it with jsonnet, using Infrastructure as Code for inventory and service discovery with Terraform and the like, deploying continuously, and practicing GitOps with the monitoring in your Kubernetes cluster. The other thing is security monitoring, and security means you want to detect vulnerabilities before they hit production. This covers not only code vulnerabilities but also container image security scanning with dependencies and so on, and Kubernetes security in the sense of cluster image scanning, but also ensuring that policies are applied, like Open Policy Agent and Kyverno policies. Speaking of Kyverno policies, I want to bring its Prometheus exporter and the Grafana dashboards to your attention, which allow you to immediately see which policy was triggered, when it failed, when someone tried to circumvent it, and so on, which is super helpful for production environments.
Now, last but not least, some closing ideas and thoughts. Maybe think about writing the data from Prometheus to outside of Kubernetes, for example using remote write with the Grafana Agent. Look into the GitLab SRE runbooks to see how it's being done, check out OpenTelemetry to unify metrics, logs, and traces, and continue looking into continuous profiling.

The resources which helped me over the past years to change from traditional monitoring to cloud-native monitoring with Kubernetes and Prometheus are all linked here. Special thanks to Julius from PromLabs for the trainings; they are totally awesome. I would encourage you to use the exercises in the slide deck to learn even more asynchronously, and to think about what to prepare, what to instrument, what to observe, and how to iterate on specific things: throw in chaos engineering, think about quality gates, and optimize everything that is there, so you sleep well at night, everything is hopefully in good shape, and your monitoring is not red.

Thanks for your attention. I'm happy to answer any questions, and I look forward to seeing you online and hopefully soon in person again.