Hello, hello everyone. Welcome to PromCon 2021. I hope you have a great time at this conference — and at KubeCon too — and yeah, we are excited to be here and show you some amazing scenarios related to Prometheus and the ecosystem.

Hello everybody, really excited to be here. My name is Anaïs Urlichs. I'm a site reliability engineer at Civo. Civo is a cloud computing company based on Kubernetes. Before that, I was working at Codefresh, which is a DevOps automation platform. I'm also a CNCF ambassador and I run a YouTube channel around a challenge called 100 Days of Kubernetes. If any of this got you curious, you can check out my Twitter, @urlichsanais — it's also linked there.

Yeah, my name is Bartek Płotka and I'm a principal software engineer at Red Hat. I'm part of the team which is responsible for observability and monitoring at Red Hat, including OpenShift as well, and we maintain lots of projects around that part of the infrastructure. I'm a Prometheus maintainer, I'm a co-author of the Thanos project, and I maintain other Go repositories all around GitHub. I don't have a YouTube channel; I sometimes write blog posts, I'm writing my book with O'Reilly, and I'm a CNCF SIG Observability tech lead as well. Yeah, welcome.

Awesome. What do we have prepared for you for the next 20 minutes or so? We have a little ping-pong application which we're going to use to show you different scenarios. In addition to that, we have a client that's going to request information from that application. Those are both microservice applications. On top of that, we are going to use Prometheus and Argo Rollouts. Argo Rollouts is going to automate our deployments, and Prometheus is going to help Argo Rollouts know what's happening during the rollouts and, in case something goes wrong, to revert our rollouts as well. So let's jump right into our demo and show you what we have set up.

As you can see here, we are in Katacoda. Katacoda is an online platform that allows you to spin up demo environments such as this one. You can also try out the scenario yourself — at the end we're going to give you all of the information so you can check out the repository, the code and the scenario. Now we have our Kubernetes cluster set up here with nothing on it right now, and we're going to just go ahead and create a namespace.

Now, to start with, every infrastructure has to have some monitoring, so we start installing our resources with the observability components. On the right you can see the Prometheus components. We install Prometheus — one binary — and it's supposed to gather metrics and exemplars; I will explain later what an exemplar is and why we need it. We can also do alerting on top of this, so it's super useful. Prometheus will gather those metrics and exemplars from any of the components, including the other components you can see on this diagram. We also have Grafana, which will be our single point of access — our UI, our frontend. We statically configure some dashboards that will allow us to see the progress of our rollout, the health of our system, and how it impacts (or not) our customers. And we have our secret agent here, Tempo, on the left. It's a really cool project, with an architecture similar to the other systems we know, like Thanos, Cortex or Loki, and with labeling mechanics similar to Prometheus. Its responsibility is to gather traces, which we will use in the later part of this demo.
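One detail worth calling out on the Prometheus side: for the exemplar part later on, Prometheus needs to be started with the exemplar-storage feature flag (`--enable-feature=exemplar-storage`, available since 2.26) and it needs to scrape the demo pods. As a very rough sketch of what such a configuration could look like — the job name, namespace and intervals below are illustrative assumptions, not necessarily what the Katacoda scenario ships:

```yaml
# prometheus.yml -- sketch only; names and intervals are placeholders
global:
  scrape_interval: 5s          # demo-style aggressive interval; 15s or more is typical
scrape_configs:
  - job_name: pingpong         # hypothetical job name for the ping app and pinger pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        regex: demo            # hypothetical namespace
        action: keep
```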
Now, we went ahead and applied those resources. They will take a few seconds to spin up completely, so in the meantime we're going to take a look at Argo Rollouts and apply our Argo Rollouts resources. What is Argo Rollouts exactly? It's an operator plus custom resource definitions, and both are used to deploy other resources — to make our rollouts happen and to use advanced deployment strategies for progressive delivery, such as blue-green deployments as well as canary deployments. In our case, we're going to focus on canary deployments.

So let's take a look at the Argo resources, at the YAML files that we're actually deploying and using with Argo Rollouts. At the beginning we have our Rollout YAML file. The Rollout is similar to a Deployment resource, which you are probably familiar with if you're at this conference, and it basically adds additional information on top of it. We can see here that we specify the container that we want to use and the replica count of our deployment — in this case, five replicas. If you're not using a service mesh, you have to use a minimum of five replicas so that Argo Rollouts is able to distribute the traffic and move traffic between the versions of your deployment. That is specified in the steps section: we're going to start the new deployment with a traffic weight of 20%. That means whenever we roll out a new version of our application, it first spins up just one pod and scales down one of the existing five pods of our initial deployment. And then, with an interval of first 60 seconds and then 30 seconds, it gradually rolls out the new version based on this resource.

Another interesting section is our analysis. This refers to an analysis template that is used to check whether or not our rollout is actually successful. If we don't have this section, Argo Rollouts will just gradually shift the traffic and roll out as-is: as long as the pods are running correctly within our cluster, it will go ahead and roll it out. That's obviously not what we want, because it would require us to manually check whether or not our rollout is okay. Instead, we want to use the analysis template together with Prometheus to automate this process.

Right. So in the rollout section above, we specify a certain template that we use, and our template is called low-error-low-latency. There we specify exactly what it means for a rollout to be successful or not, and in this demo we base those criteria on two metrics. One metric is "error rate lower than 20%", and the second one is "p90 latency lower than one second". What does that mean? Essentially, the error rate metric is supposed to give us an overview of the error rate of the pings happening in the overall system, and we expect it to be lower than 20% — so essentially 80% of the requests in our system have to be successful. Now, we run this check very often, every 20 seconds. Normally you would probably do this every minute or every five minutes, but we wanted to speed up this demo, and we also have a very aggressive scrape interval on Prometheus of five seconds — normally you should probably have 15.
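Before we look at the second metric, here is roughly what a Rollout manifest with those canary steps and the analysis reference could look like. Treat the names and the image as placeholders, and the weights after the first step as assumptions — this is a sketch, not necessarily the exact manifest from the Katacoda scenario:

```yaml
# Sketch only -- resource names, image and later weights are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  replicas: 5                        # minimum of 5 so traffic can be split without a mesh
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: example/app:initial # placeholder image
  strategy:
    canary:
      analysis:
        templates:
          - templateName: low-error-low-latency  # the AnalysisTemplate backed by Prometheus
      steps:
        - setWeight: 20              # start the new version with ~20% of traffic (1 of 5 pods)
        - pause: {duration: 60s}
        - setWeight: 40
        - pause: {duration: 30s}
        - setWeight: 60
        - pause: {duration: 30s}
        - setWeight: 80
        - pause: {duration: 30s}
```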
Now, that was only one metric. The second metric, "p90 latency lower than one second", observes the tail latency of our application, again across all customers, and we expect that tail latency to be lower than one second — so 90% of the users should see latency below that. If not, we claim the check has failed and the rollout gets reverted. And you can see the failure limit as well. This is an important additional setting: it tells our rollout to wait for three failures of this metric, and only three failures actually trigger the failure of the whole deployment. This is there to avoid reacting to spikes or flakiness.

And now you can see our diagram expanded with the two new resources we are rolling out. We have our Rollout — the rollout operator, actually a controller, to be precise — and it is managing those applications and distributing traffic. As I said, we don't need a service mesh for that, and we don't need a special ingress, although those are supported as well. Right now we are just using the plain rollout controller, which allows us to distribute the load. And you can see our rollout controller is also querying Prometheus from time to time to check latency and error rate.

Now that we've deployed those resources, we can check with the rollouts CLI whether or not our initial rollout has happened successfully. In this case, we can see that five pods of our initial deployment — of our initial image — are currently up and running, and it's marked as stable. And we can see that any further rollouts will use the canary deployment strategy.

Once all of our resources are up and running, we can go ahead and open our Grafana dashboard, which provides our universal UI and shows us exactly what's happening within our application and with the requests to it. We have several sections — three, to be precise — and different panels within each section. Let's reduce the time range to the past five minutes and have a look at the first section, the user experience. What does it actually tell us? It checks our ping application to see what percentage of pings from that application are successful, and right now we are at 95%. Additionally, in the next panel, we can see the overall latency of the pings from our ping application: the 50th percentile, so how many users are seeing latency below 0.4 seconds, and how many are seeing around 0.5 seconds — basically, how long it takes for the pong to come back from the ping application. We can also see here the overall success and error rate of our application: how many errors do our customers actually receive when they open our application, and what is the percentage.
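Coming back to the analysis template we just described for a second, a sketch of it could look roughly like this. The metric names, labels and the Prometheus address are assumptions for illustration; the idea is simply one query for the overall ping error rate and one for the p90 latency, each checked every 20 seconds with a failure limit of three:

```yaml
# Sketch only -- metric names, labels and the Prometheus address are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: low-error-low-latency
spec:
  metrics:
    - name: error-rate-lower-than-20p
      interval: 20s
      failureLimit: 3                    # tolerate up to three flaky checks before aborting
      successCondition: result[0] <= 0.2
      provider:
        prometheus:
          address: http://prometheus:9090   # placeholder address
          query: |
            sum(rate(pings_total{code!="200"}[1m]))
              /
            sum(rate(pings_total[1m]))
    - name: latency-90p-lower-than-1s
      interval: 20s
      failureLimit: 3
      successCondition: result[0] <= 1
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.9,
              sum by (le) (rate(ping_duration_seconds_bucket[1m])))
```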
Yes, and as I mentioned, this very first section is about the client experience: we take those metrics directly from the client, to be as close as possible to the experience users have. But there is also useful information we can get from our infrastructure, so the metrics below are from our servers. First of all, we have certain metrics from the server app directly: the same error rate — to make sure we see similar error and success rates between the client and the server — and we see the latency as well. The difference is that we now visualize latency with the second type of visualization you can do for a Prometheus histogram metric: instead of percentiles, you can show it as a heatmap. The heatmap is very natural and easy to read: essentially, the brighter the color, the more users are experiencing latency within a certain bucket. So we can see most of the users are experiencing latency between 300 milliseconds and 600 milliseconds, which is a pretty good experience overall. There are some requests, as you can see, which are even faster, but only a small amount of those. And on the right, it's similar to what we saw in the panel above — the error rate — except we hide the success rate, because from the SRE (Site Reliability Engineer) or DevOps perspective you really want to focus on what is important to you. Right now we care about how many errors each version exposes, so we separate by version, which in our case is really just the Docker image tag.

Last but not least, we have the rollout state. We take various metrics from the rollout controller directly. First of all, we see which rollouts are in progress. Then we can see the rollouts per version — essentially the number of replicas per each version that has been rolled out and deployed. And the last panel on the right is really about our rollout analysis runs: whether they were successful or not based on our Argo analysis template — which, as you remember, was measuring the latency and the error rate — and what those check results are over time. This will come in handy.

Awesome. So now that we have a general understanding of what our Grafana dashboard will tell us about our deployments over time, we can go ahead and trigger a new deployment through Argo Rollouts. We're just going to deploy a new version, and we named it, in this case, "errors" — really fancy. First off, we can already see within our Argo Rollouts section that one pod has spun up and our initial deployment is down to four pods. Then we can also see the analysis that is being performed — basically whether or not it's hitting the error rate threshold. We can see here, in our errors-per-version panel, that our new deployment is already showing some errors. At this point it was about the same as with our initial deployment, so there was nothing worrying yet. But then we can see that our overall success rate, which we want to keep above 80%, is slowly going down — now it's below 90 — so let's see how this develops. In terms of latency, we can see that both deployments are fairly similar; nothing has really changed there.

Now we can see that something has actually happened: Argo Rollouts realized that something is off and has terminated our rollout. We can see it basically stopped, and instead of slowly scaling down our new deployment, it cut it off right away and scaled our initial deployment back up to five replicas. We can have a look here at why this has happened: our "error rate lower than 20%" check has failed, and that caused this rollout to be terminated. So the good thing is that we are now back in business, right?
Everything recovered, and we were able to mitigate the problematic rollout without any interaction with the system, right? Our hands are right here — we didn't touch anything. It was detected and recovered automatically, so it's pretty awesome.

Awesome. So now that we know that something is really off with our error rate, we want to fix that, right? We want to deploy a new version which does not cause as many errors for our users, so that the error rate stays low for everybody on the client side. So we're going to go ahead and deploy a new version — in this case we're going to call our update "fix-slow" — and we're going to deploy this image and observe our rollouts again. It will show us our previous revisions, so we can see here that revision two, which was our "errors" deployment, has actually failed: the failed error-rate check caused the immediate termination of that rollout. We can additionally see that our new deployment has spun up, there's one pod running, and it has already passed two checks from our analysis template.

Now let's go back to our Grafana dashboard and see how everything is developing. Overall, here's our error rate slowly showing up in the panel. But what's happening over here? This is the tail latency. Most of the users are not impacted at all — the 50th percentile stays stable — but the tail latency, so the unlucky users, are really unlucky now, because it's not just a couple of milliseconds or 100 milliseconds more; it's suddenly two seconds or more to retrieve the pong when you are pinging, right? So something is really wrong with this deployment; hopefully our rollout will catch that. At the same time, we don't see much change within our errors section; it's about the same for both of our deployments. And now we can see right away that Argo Rollouts was not happy with that deployment: it didn't pass our checks, so it was scaled down right after two pods had been spun up from our new deployment. It reverted the rollout and brought back our initial deployment. We can see now that, with the rollout reverted to our initial deployment, the latency is back to what it was before. So we can relax again — everything is back to normal — but we still weren't able to deploy our update.

Yeah, and we are super frustrated. What's going on with our coding? I don't know — are we bad programmers or what? Why do all these rollouts keep failing? To know more, I would love to find out what exactly is wrong with this application. With errors it's usually easier to find the root cause, like in the previous deployment, but now we have a tricky situation: we have low latency for most of the users, but some users are experiencing an enormous wait time for the request. And imagine that we cannot really test this locally and find the root cause — we cannot reproduce it locally. So what do we do? This is where exemplars come in — Prometheus exemplars, and actually the whole story is that these are OpenMetrics exemplars, because Prometheus is using the OpenMetrics data model, which defines exemplars on metrics of different types. And we scrape that information directly from the applications, from both the ping app and the pinger.
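To give a feeling for what that looks like on the wire: in the OpenMetrics text format, an exemplar is attached to a sample after a `#`. The metric name, values and trace ID below are made up purely for illustration, and the histogram is abbreviated:

```
# TYPE ping_duration_seconds histogram
ping_duration_seconds_bucket{le="0.5"} 143
ping_duration_seconds_bucket{le="2.5"} 151 # {trace_id="4bf92f3577b34da6"} 2.01 1618884475.0
ping_duration_seconds_bucket{le="+Inf"} 151
```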
Now, this exemplar allows us to attach additional information to, for example, a latency observation. So along with the latency observation for every request, we can put something else — and what do we put there? In our case, we put the trace ID, because — I didn't mention this before — our applications are instrumented with tracing as well, so every ping is actually traceable. A trace is essentially information bound to the request flow across the different microservices. So now we have a link between the metrics and the trace, in the form of the exemplar: we can see extra metadata accompanying the observation. We can see that this particular unlucky user had a request that took above two seconds, and we would like to know the root cause. So let's go to Tempo, which is our tracing solution, and see what's going on.

Now we have the tracing UI visible, and we can see the trace — essentially the flow of the request between the different components and through the code itself. The pinger is the green line; it is our client, and it saw, what, 2.01 seconds of latency, and it experienced this latency because of some other things that happened below. So let's go to the child span. Now we have a different service, called "app" — it's our service, which is responding to the request — and we can see it also reported that long duration of two seconds. So something within the server maybe introduced this; let's go further. Our ping handler also sees two seconds — this is a different span, a function in the code itself, within the same trace. And when we go below, we can see that all of these report two seconds, including "adding latency based on probability" — that's a suspicious name, by the way. But maybe that's not the thing that actually introduced the latency, so let's look at the other spans in this trace. We can see that the last one, "write status" — which is essentially just responding with some bytes to the user — actually happened pretty fast, almost immediately; it was only something like 0.02 milliseconds. So based on this view we now know that the "adding latency based on probability" function was the one which introduced the latency.

And we know more about it because, well, we implemented this: when you look at the tags, there is additional information we could provide. What we did is essentially simulate some latency based on probability, and we can see this was the unlucky request, and it introduced exactly two seconds — we literally sleep for two seconds. We pick some random number — in this case it was 33 — and it fell within the probability that introduces this latency. So now we know exactly where in the code we need to go and fix it. That's a pretty powerful technique for finding the root cause of a problem.

And just to show you how we did that, we can jump super quickly to the code. It's a Golang application, but you can really do the same thing in other languages — this is what is super powerful about instrumentation. I use the most popular Prometheus Go client, client_golang, and then I use the OpenTelemetry Go instrumentation and combine those two together.
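The exact code is in the repository, but a minimal sketch of how those two can be combined — wrapping the HTTP client so that every observed latency carries the current trace ID as an exemplar — could look like this. The type and metric names are invented for illustration and are not the demo's actual identifiers:

```go
package app

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"go.opentelemetry.io/otel/trace"
)

// tracedTransport wraps an http.RoundTripper and records request latency in a
// Prometheus histogram, attaching the current trace ID as an exemplar whenever
// the request carries a sampled trace.
type tracedTransport struct {
	next    http.RoundTripper
	latency prometheus.Histogram // e.g. a "pinger_request_duration_seconds" histogram
}

func (t *tracedTransport) RoundTrip(r *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.next.RoundTrip(r)
	elapsed := time.Since(start).Seconds()

	// If the request context carries a sampled trace with a valid trace ID,
	// record the observation together with that trace ID as an exemplar.
	if sc := trace.SpanContextFromContext(r.Context()); sc.HasTraceID() && sc.IsSampled() {
		if eo, ok := t.latency.(prometheus.ExemplarObserver); ok {
			eo.ObserveWithExemplar(elapsed, prometheus.Labels{"traceID": sc.TraceID().String()})
			return resp, err
		}
	}
	t.latency.Observe(elapsed) // no (sampled) trace: plain observation without an exemplar
	return resp, err
}
```

One practical detail: in client_golang, exemplars only show up on the /metrics endpoint when the OpenMetrics format is served, for example by exposing metrics with promhttp.HandlerFor(registry, promhttp.HandlerOpts{EnableOpenMetrics: true}).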
So what we saw in the code just now is a client instrumentation abstraction, where we instrument the HTTP client with additional context. When the request is sent — actually after the round trip — we check, on line 95, whether we have a trace attached to this request. If we have a trace, we check that it has a trace ID, and we also check whether it is sampled, because the trace might be dropped if we have some kind of probabilistic sampling — in our case everything is sampled, but we check anyway. Only then do we put the trace ID into the exemplar, and we observe the duration of the request in the Prometheus histogram together with that trace ID. If we don't have a trace ID, we just observe without it. So that is the whole code that has to be written — not a very complex thing — to have an exemplar provided with the metric.

And then it is automatically picked up by Grafana: if we go back to Grafana, it automatically discovers that this metric has some exemplars. So back on our dashboard, once we edit the "client request latency per second" panel, you can see this exemplar toggle: Grafana automatically detects that for these metrics there are exemplars provided via the Prometheus API — which is just now, in the new version 2.26, available under a feature flag — and it visualizes them as interactive points we can click on. So it's a very powerful technique that I really recommend everyone to use. It's kind of amazing, and it gives us a super nice correlation between those signals.
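One small piece that makes those clickable exemplar links work is telling Grafana which exemplar label carries the trace ID and which data source (Tempo) to open it in. If you provision data sources from a file, that can look roughly like this — the URLs, UIDs and the traceID label name below are assumptions and have to match your own setup:

```yaml
# Grafana data source provisioning -- sketch, values are placeholders.
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3100
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: traceID          # must match the exemplar label set by the instrumentation
          datasourceUid: tempo   # jump straight from the exemplar to the trace in Tempo
```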
Awesome. So now that we know what's actually going wrong with our application, we can deploy another update and make sure it actually gets fixed. In this case, we're going to deploy our update called "best". It's currently spinning up the new deployment of our "best" version and scaling down the other one. Let's see how this translates to our dashboard and whether or not it's actually improving things. We can already see that our latency is going down — let's see if it also stays down. While we wait for this rollout to succeed or not, maybe we can go to our repository and show you how you can leverage all of this yourselves. Moreover, you can use the Katacoda URL you can see on the right: just go there, click it, and then you can experience this demo on your own.

Awesome. So let's jump back into our dashboard and see what's going on here. Yeah, there are no additional failures yet — there is one analysis still progressing, so we are not quite finished with this deployment, but we are super close. We can see that our latency is further decreasing, also at the high end, for the users who were quite unlucky before in terms of latency, and our overall error rate is staying as low as it was with our initial deployment. And now we can see that our initial deployment has completely scaled down and our new deployment is up with the full replica count of five. So it seems that our entire rollout has passed: our new deployment is now up and running and it has passed our checks. And we can see that there are no new errors from our analysis template being introduced — the red block is just showing the errors from previous rollouts, but our new analysis has passed successfully.

So we hope you liked it. Please check out the demo on your own, play with our resources, give us feedback on the GitHub repo, ask questions, and yeah, feel free to reach out to us on Twitter or anywhere, literally. Hopefully we have some time for questions. Awesome. Thank you everybody for listening.