Hello, welcome to the Agile track. I just heard that the tracks have been switched, so if anyone is accidentally sitting here expecting the front-end track, don't worry. This is not a very deep-down technical code talk; it's more of a light overview, an intermediate talk about how you can do CI/CD with your monitoring configuration. So let's get started. First of all, good Sunday morning. I was not expecting a huge crowd like this. It is great to see that most of you are waking up early and coming to a conference on a Sunday. So hello. I'm the only person standing between you and your Sunday lunch, so I'll try to make it quick. Who am I? Why am I talking about this? I'm Aditya Konade. I'm a site reliability engineer at Red Hat, and I work on Kubernetes and OpenShift teams. I'm also a Red Hat Certified Architect and a Certified Kubernetes Administrator, from which you can guess that I have an ops background but also work on development, which brings me to my next point: I can talk about this forever. I love to geek out about observability and monitoring and metrics, but we only have 20 minutes and the time is running fast. So we'll make it quick; please pay attention, we have a lot of info to cover. All the developers in the room, please raise your hands. Yeah, yeah, only 10 developers? No, I'm very sure there are more. Raise your hands, don't be lazy, guys. It's Sunday, raise your hands, okay. Now we have more. All the real operations folks, raise your hands. Yay, I really like how we have a 50-50 split, but this is the Agile track, guys. We don't need to discriminate between the developers and the operations people here. We want to be talking about DevOps, how we can collaborate, and how we can do both things. So let's talk about DevOps from a monitoring perspective and what it means for a developer or an operations person to be in the monitoring field.
In the DevOps model, developers own their applications in production, and if you have any experience owning your application in production, you know what on-call means. You know how you get paged in the middle of the night and then have no clue what to do; we'll talk about that and about fixing it. To own the application, you need to know how it works and if it works, and when it does not work, what is making it not work. And that brings us to monitoring. What is monitoring? What do you think about monitoring? Let me grab a few keywords from you so that we can make the session interactive. What is monitoring for you? Anyone? Health check. Health check, good. What else? It's hard. It's hard, yeah, I like that. It's totally hard. Metrics. Metrics, yes, that's very much what I'm trying to say. One more, we can take one more. What? Graphs. Graphs, yes, visualizations and those nice Grafana dashboards, right? So let's take a step back before we go to monitoring: what is observability of your application? If you are a developer running your code in production and owning that code in production, you need to know what the application health looks like. You need to know, if the application fails, what the failure points or failure modes are, and all of that is encompassed by a thing called observability. It's a concept, a very high-level concept, not a very targeted thing. There are three pillars to observability: metrics, logging and tracing. You cannot do with only one of these; if you need complete observability, you need, kind of, all three. But today we are going to focus on metrics, because monitoring usually deals with metrics, and logging and tracing are huge separate topics which could have their own track. So we are going to focus on monitoring today. What are the goals of monitoring? I think someone said health checks.
So the first thing we want to do in monitoring is watch the app. Let's say there's an app called foobar. We want to watch the app and see if it is alive. What does it mean for an app to be alive? It means that the app is healthy, it is serving requests, it is doing what you intend it to do. And if the foobar app is alive, gather some information about it that can help you in the long term: get some data, store the data in a persistent data store so that you can look at those very nice Grafana dashboards later. And if the app is dead, you don't care about the information, just sound the alarm, have someone look at it, run your reconcile loop, have your AIOps do the magic and reconcile it. Something like that. So this is simple, right? You need to watch the app and check it. If it does not run, just restart it; if it runs, just gather information. Why do we need to complicate this? Why do we need CI, CD and GitOps and all the magic and unicorns here? The historical way of monitoring has been that the dev team, the developers who raised their hands, will create an application and toss it over the wall to the ops team, which will now take that application and run it on their favorite server. And then you babysit it for the next 100 years, depending on what the support lifecycle looks like. But why is this a problem? Can anyone tell me why this is a problem? If you have been in the dev-and-ops fight, the historical fight; I'm a cloud kid, so I don't know how the dev-and-ops fight works. So tell me, what is the problem with the dev and ops split? New releases. New releases, yes. The ops folks almost always have no clue what the application code looks like, because they did not program it. They might have some health checks to black-box monitor the application and have a guesstimate of the application workflow, but they almost never know what the application code looks like. And that leads to gaps in observability.
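The "app is dead, sound the alarm" part maps directly onto a Prometheus alerting rule. Here is a minimal sketch, assuming a hypothetical scrape job named foobar; Prometheus records an `up` series of 1 or 0 for every target it scrapes:

```yaml
# Hypothetical alerting rule: fire when the foobar app stops answering scrapes.
groups:
  - name: foobar-availability
    rules:
      - alert: FoobarDown
        # `up` is 1 while the target answers scrapes, 0 when it does not.
        expr: up{job="foobar"} == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "foobar has been unreachable for 5 minutes"
```

The `for: 5m` clause is the "have someone look at it" threshold: the condition has to hold for five minutes before anyone gets paged, which filters out single failed scrapes.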
So if you're running an application in production and expecting someone else, who does not know the application code, to own it and fix it when it goes wrong, you have some problems and some gaps in observability, because the code is not instrumented to represent the actual health of the application. And the right person to instrument that code, to write the health checks, is you, not the ops person who is just a black-box monitor for your application. Have you ever tried manually adding 1000-plus services to your monitoring system? That is crazy. Trust me, I have not, by the way, thankfully. The old model works, kind of, for monoliths, where you have one big huge binary, a few GBs in size; you compile it, you put it on a server, you run it forever, and then you know what the application looks like. But in the new world, it looks like this. So when it looks like this, you have to make sure that each of these tiny dots is running the same way your monolith was, which is difficult, because you need to have an overview of how this service is interacting with that service and how they work together. And the ops person will just freak out, because he or she does not know how the application is interacting and behaving. There is also the problem of toil. If you're not familiar with toil, that's a term which was popularized by Google in their SRE book; it means repetitive, mundane tasks which over a period of time build fatigue in you, and then you stop performing optimally. So let's fix this. Why are we not fixing this? To fix this, we have a thing, but we have a problem. What monitoring tool are you using right now? Other than Prometheus, tell me the old monitoring tool names. We are using Zabbix. Zabbix, yes. Zabbix plus one, anyone? Zabbix, yeah, monolith, Nagios, yeah, Nagios, yeah, all of them, yeah.
So the problem with all of these, and I see people starting to gossip and rant about the problems they've had with this monitoring, the problem, as you know, is that these tools are not very dynamic. They were not created for a dynamic world. They were created for a static world where you have 10 machines. 10 machines, you just go there, run a daemon and read /proc: look at the CPU, look at the memory. But in the cloud-native world, we do not want that. We can kind of take it for granted, if you're using a cloud provider, that if your disks fail, the cloud provider will take care of it. You do still want to monitor that, but you do not want to operate at that level. You want to focus more on the business logic and your application level. And what do we do with that? Since the microservices are so dynamic and your releases are happening probably a thousand times per day, the old traditional monitoring systems have difficulty keeping up with the application configuration that you have. So what do we do about this? And that brings us to the actual topic of the talk, which is Prometheus. And now I would like to see how many of you are using Prometheus. Yay, I see a few people, which is really nice, who raised their hand both for Zabbix and Prometheus. So yeah, that's what we are trying to do as well, and how we did it is the story. Prometheus is an open-source monitoring tool which was created at SoundCloud for their monitoring needs, obviously the huge microservices diagram that you saw earlier. They had a similar story and they wanted to fix it. They found that the traditional monitoring tooling was not working for them; they needed something special. They created Prometheus, and it has grown quite a large community now, and it is one of the native things you can use with Kubernetes. This is not a Prometheus 101, but let's go over some of the magic that you can do.
If you are not familiar with Prometheus concepts or Kubernetes concepts, don't worry, I'll try to explain them very briefly. Again, going back to what the traditional monitoring problems look like: they were not designed for dynamic workloads, they had big fat daemons, and GitOps with them? Good luck. If you have ever tried putting a Zabbix configuration into Git and trying to apply it via Ansible and reloading it, and also doing your Zabbix upgrades via Git and Ansible, it gets pretty complex very easily, and it's a stateful application. So you cannot just restart the Zabbix server; you need to make sure that you have that replica for the MySQL, and I don't want to get into the technical details, but traditional monitoring systems were not meant to be restarted and hot-swapped all the time. Not so with Prometheus: it can be hot-reloaded. It has the config as code in a config file, and I have 10 minutes, so I'll rush through. It has the config as code and it is built for the cloud-native world. You can restart it as many times as you want; it will come back up and work like it did before. It has service discovery and integrations with Kubernetes; you can read the rest. One of the really nice points about Prometheus is that it has federation and easy sharding. So if you have multiple clusters, you can federate your Prometheus, and that's the correct plural, by the way: you say Prometheus, not Prometheuses. You can do sharding and federation across all of your clusters. You can have a central data store, like Marcel talked about in the AIOps talk; you can have a huge central data store and keep all of your metrics there, which is really neat, and which you cannot do with Zabbix, by the way, or with most of the traditional monitoring systems; I'm not trying to target Zabbix, but you cannot do it. The binary is just 35 MB.
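Federation in practice is just another scrape job: a central Prometheus pulls selected series from each cluster's Prometheus over its /federate endpoint. A sketch, with hypothetical hostnames and an illustrative match[] selector:

```yaml
# Sketch of a federation scrape job on a central Prometheus.
# The hostnames and the match[] selector here are made up for illustration.
scrape_configs:
  - job_name: federate
    honor_labels: true          # keep the original labels from each cluster
    metrics_path: /federate
    params:
      "match[]":
        - '{job="kubernetes-pods"}'   # which series to pull from each cluster
    static_configs:
      - targets:
          - prometheus.cluster-a.example.com:9090
          - prometheus.cluster-b.example.com:9090
```

With `honor_labels: true`, the central server preserves the labels attached by each cluster's Prometheus instead of overwriting them, so you can still tell which cluster a series came from.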
You can run it inside a container on your laptop, and it works on a pull model, which is really neat, because what you can now do, if you have metrics endpoints, is run the binary on your laptop and do your development and testing from your laptop without having to worry about going to the server and scraping metrics from there. Let's focus; I have only 10 minutes. Let's focus on CI and CD now. How do you do GitOps and CI/CD for Prometheus? It's very easy. Prometheus has a concept of exporters. An exporter is a program that exposes your metrics somewhere, okay? That's all you need to know for this talk. For exporters that live on VMs or bare metal, where we would traditionally run a daemon, you can now use the node exporter, and you can also write custom exporters if you need more information than what the node exporter provides. For that, use Ansible. That's our favorite tool to deploy node exporters. It's very easy: you just put the node exporter config in an Ansible playbook or role, and then deploy it to your VMs. It's pretty neat. For exporters that run on Kubernetes, use Kubernetes. I'm not even kidding. They are so lightweight, you can just run them in a pod. And beyond that, if you are trying to collect application metrics, this is where it gets interesting. If you're trying to collect application metrics, use the Prometheus Operator. Are any of you already using the Prometheus Operator? Yes, no? Just one. If you're using Prometheus and you are not using the Prometheus Operator, you are missing out on a lot of fun. So you should use the Prometheus Operator, and here's why. The operator is a control loop that takes care of all the magic. It takes care of creating the Prometheus instance, and it takes care of other things, which we'll see. This is what the Prometheus Operator setup looks like. Here you see the Prometheus, the actual Prometheus that you have been running without a Prometheus Operator.
Here's the Prometheus server. Here's the operator, which acts if you create a Prometheus CR, which we'll see soon, and here's your service monitor. These are your pods, these are your services. A Service in Kubernetes, if you don't know, you can just imagine as your code running somewhere. This is your code running somewhere; it has a metrics endpoint. You have a ServiceMonitor and a Prometheus. Now see how this works. This is a Prometheus CR. It's a Kubernetes manifest, a custom resource in the monitoring.coreos.com/v1alpha1 API group. You don't need to look at all of it; what you need to look at is the version and the service monitor. So here what I'm saying is: I have created a Prometheus called prometheus-frontend with version v1.3.0, and I have defined a service monitor selector of tier: frontend. This is a key-value pair, so whatever ServiceMonitor matches the selector tier: frontend, this Prometheus will watch it, or include it in the configuration. Here I have a ServiceMonitor which has a label tier: frontend, and it has a selector of tier: frontend, watching the endpoints called web. So going back to this diagram: let's say this is the web service, and you have a port, 80, somewhere here, and you have a ServiceMonitor that is watching the web port. The magic that happens here, after you have defined and applied both of these, is that the Prometheus Operator will create the Prometheus, load it with the ServiceMonitor config and start watching the service. Does that click something about GitOps and config management? If it does not, wait for it. You can put all of this into Git, because they are just Kubernetes manifests. They are nothing but Kubernetes manifests, and you can put them right alongside your deployment.
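The pair of manifests described above might look roughly like this. The names, labels and version follow the example in the talk; note that the v1alpha1 API shown is the early one, and current Prometheus Operator releases use monitoring.coreos.com/v1 with a serviceMonitorSelector field instead:

```yaml
# Prometheus CR: the operator creates a Prometheus server from this,
# and includes every ServiceMonitor matching the selector below.
apiVersion: monitoring.coreos.com/v1alpha1
kind: Prometheus
metadata:
  name: prometheus-frontend
spec:
  version: v1.3.0
  serviceMonitors:
    - selector:
        matchLabels:
          tier: frontend
---
# ServiceMonitor: tells Prometheus which Services to scrape, and on which port.
apiVersion: monitoring.coreos.com/v1alpha1
kind: ServiceMonitor
metadata:
  name: frontend
  labels:
    tier: frontend        # matched by the Prometheus CR above
spec:
  selector:
    matchLabels:
      tier: frontend      # scrape Services carrying this label
  endpoints:
    - port: web           # the named Service port to scrape
```

Applying both of these is all it takes: the operator generates the scrape configuration, injects it into the Prometheus server, and reloads it.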
So if you see here, this is the cluster monitoring operator that goes into OpenShift, and here we have the deployment manifest, which contains the pod spec and other things. We have the service, which goes on top of the deployment to expose the pods, and this is the ServiceMonitor, which watches the deployment through the service and goes to the pods. The way Prometheus works is on a pull model, as in it scrapes every pod, so you don't have to worry. All of the configuration, and all of the configuration swapping, which means generating your configuration, putting the configuration into Prometheus and reloading the Prometheus server after that, is managed by the Prometheus Operator. Which also means that if you change this ServiceMonitor, or the deployment, or the service, all of that can be integrated into your GitOps flow or into your CI/CD flow as it is, without having to make any other changes, without having to worry about Prometheus configurations or all sorts of other things. The pipelines part. Now we have the ServiceMonitor spec, we have the Prometheus spec, we have the code, we have the deployment manifest; how does all of this work together? All the configs are stored in Git. A commit to master kicks off Jenkins, which most of you working in the DevOps world will be familiar with. Jenkins runs a job, and the job can run the tests for the Prometheus configs, just like you run tests for your code. You should also be writing tests for the cardinality of your Prometheus metrics, which is a whole different topic and could fill a one-hour session. Use your favorite CD tool to deploy it to Kube. Simple, very simple. But what is the CD tool that you use? We have almost reached the point where we can show this meme and call it a success, but we had a problem to solve first.
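Testing Prometheus configs in CI can be as simple as running promtool in the Jenkins job. As a sketch, assume a hypothetical alerts.yaml containing a FoobarDown rule that fires when up{job="foobar"} == 0 holds for 5 minutes, with a severity: page label; a promtool unit-test file for it (run with `promtool test rules tests.yaml`) might look like:

```yaml
# Hypothetical promtool unit test; file names, the job name and the
# alert being tested are all illustrative.
rule_files:
  - alerts.yaml              # the alerting rules under test
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # the target answers two scrapes, then goes away
      - series: 'up{job="foobar", instance="foobar:8080"}'
        values: "1 1 0 0 0 0 0 0"
    alert_rule_test:
      - eval_time: 8m        # by now the rule's `for` duration has elapsed
        alertname: FoobarDown
        exp_alerts:
          - exp_labels:
              severity: page
              job: foobar
              instance: foobar:8080
```

`promtool check config` and `promtool check rules` are the cheaper first gates: they catch syntax errors before the unit tests run, and all three fit naturally into the same Jenkins stage.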
This is a slide that I just happen to like, so that you don't get bored of just me talking; look at this slide, you need to energize your enterprise and get yourself energized. The other questions to answer before I move on to the pipeline are alerting, long-term storage, visualization, and how you actually instrument your apps to work with this model. If you want to have a chat about those, ping me offline; we don't have as much time here as I expected. Our way of doing the CD, the actual CD, is that we have a tool called saasherder, and we also have a project called app-interface, which is a GraphQL wrapper over your application configuration. So what you can do is, you have a GraphQL endpoint where you can query all of your monitoring configuration, and saasherder is the CD part, where you have deployment manifests and a tracking repo. saasherder will take all of your tracking repos, pick up the Kubernetes manifests and apply them to your clusters. You're welcome to check that out on GitHub, and app-interface is a GraphQL wrapper over all of this. You can define JSON schemas and GraphQL schemas, and then you can query the monitoring configuration, like: hey, I have this application, what is the monitoring configuration for it? And you can do all sorts of magic with it; you can write integrations around it. This is something that we are targeting to develop right now; if the talk gets accepted, you will hear about this at KubeCon Barcelona next time. And the key benefits of this approach are reduced toil. Developers are very close to owning the metrics, and now you can entirely configure your monitoring stack via code. And since it's living in the same Git repository as your code itself, the developers have the power to choose which metrics go in, and the developers can now be on call.
It enables them, as opposed to having just a black box and an operations person watching the black box, hoping that it works, right? SREs can now look at high-level tooling like template generation. If you were at KubeCon, the point that Kelsey was trying to make there is that developers should be focusing on the business logic, not on writing Kubernetes manifests. So in this approach, you just write a small ServiceMonitor YAML spec, or perhaps your SRE team generates it for your applications, or you put it into a pod preset; you don't have to worry about it. You can now standardize on a metrics endpoint called healthy and have all of your metrics there, which makes life easier for all the SREs. And contribute back to the community: everyone is using Prometheus, so just contribute your tooling back and we'll all be happy. Where do we go from here? This was a short talk and we are almost out of time. Where we go from here is: please, please, please read this awesome blog by the folks at Weaveworks. I have linked it in the slide and I'll also link it in my slide deck. This is about GitOps and observability and how you can do it. It was published after I submitted the talk; it is way more comprehensive and it could get you started with GitOps for monitoring. And there was a Kelsey Hightower talk, as always; he was talking about monitoring and Kubernetes back in 2016, and it's surprising that we still find ourselves in a place where we are not using the full power of Prometheus and are still sticking to the traditional monitoring tools. I know it's hard, but you should all try to transition, and let us know how that works. On GitHub, go to prometheus, prometheus-operator, saasherder and qontract-server; these are the repos you should check out, and ping me if you have a problem with any of them. I work on all of these and try to help maintain them. And let's chat. I'm almost done. So this is my LinkedIn, this is my Twitter.
This is the time when you take out your phones, snap a photo, and talk to me. If you have any questions, let's take the questions first and then we can close off. Any questions? Okay. You mentioned that you are trying to make Prometheus work with Zabbix. Okay, okay. So the question was whether we have some plans to integrate Prometheus with Zabbix. We don't. We are transitioning to Prometheus, because we feel that's the new way and the right way to monitor cloud-native applications. We currently have an existing Zabbix stack that I maintain, but it's getting very hard to maintain when you have a lot of services, like 500, 600 services. So we are entirely transitioning to the Prometheus model and we have a migration mapped out. So that's it. Zabbix 4.0 is awesome, I have heard, so if you're still on Zabbix, please upgrade to 4.0 for the new features. Okay. We can take one more question; one more question and then we can chat offline, if you don't mind. Go. Do you use the cluster monitoring operator to monitor just the cluster, or also applications within the cluster? Okay. So the question is: do we use the cluster monitoring operator to monitor only the cluster, or also the applications? The cluster monitoring operator ships by default with OpenShift, is going to ship by default with OpenShift, and it monitors only the cluster; it has a set of predefined rules around the health of your cluster. If you want to monitor your applications, there is the Prometheus Operator, which you can add to your cluster and start your own Prometheus with. The cluster monitoring operator is a fixed set of rules and monitoring config for the cluster. Anyway, I think we are out of time. So thank you, and see you outside, maybe. Have a nice day.