I'm David, and I'm here to talk about monitoring the monitor, or: if a Prometheus falls, does it make a sound? You run Prometheus to monitor your service, but what monitors your Prometheus?

Probably the first query you come across when you're new to Prometheus is something like this: is Prometheus up, or rather, is it down? If the metric up with a job label matching Prometheus is zero, the query returns results, so you can use it in an alerting rule, say JobDown, with an expression of up{job="prometheus"} == 0, and it will raise an alert with an annotation like "Prometheus down" (sketches of this rule and the other configuration discussed here follow at the end of this section). That's quite simple, but it's not enough to actually monitor Prometheus itself, because it's not that simple: you can't monitor yourself with yourself.

Going back to basics, the architecture of a normal Prometheus setup is something like this: we have Prometheus talking to an Alertmanager, which sends alerts to some kind of alert receiver, and we only have one of each. The receiver is maybe something like PagerDuty, where someone else takes responsibility for actually making it reliable for you, but Prometheus and Alertmanager, unless you're using a managed service, are probably your responsibility. So a common setup is to run multiples of them: a pair of Prometheus instances monitoring the same targets, plus Alertmanager in a cluster mode of some kind, all ideally running on different machines. There's now some level of resiliency there, which is good.

But what happens if the receiver is down or unreachable? Alertmanager tries to deliver an alert, but it can't go anywhere: Prometheus raises an alert, say JobDown, and it never reaches the receiver. A common approach in the past was to have some kind of backup device connected directly to your server, which meant you could bypass the internet and use SMS, for example. Obviously it's a bit difficult to connect a phone to a server in the cloud.

So what people often do instead is invert the alerting: rather than alerting when something is down, you have a particular heartbeat alert that always exists and is always sent to the receiver, and the receiver, somewhere on the internet, knows it should expect that alert; if the alert stops arriving, the receiver raises an alert of its own. Essentially: if there isn't an alert, start alerting.

There are many ways of doing this. Healthchecks.io provide a service that does it, written in Python; you can run it yourself, or there's a cloud-hosted version. Dead Man's Snitch integrates with PagerDuty and is cloud-hosted. Karma, a web-based UI for Alertmanager, can also display a warning when a particular alert isn't present; that obviously doesn't page anyone, but it can show on a screen that there's a problem, which could be useful if you have a NOC. Alternatively, you can do something entirely custom.

So let's look at how we actually set up Alertmanager to talk to our heartbeat receiver. In the Alertmanager config we have a route that matches a label of severity: heartbeat and sends it to a particular heartbeat receiver, and you'll see in the example that the URL has an ID in it, which would be team-specific, or specific to each Prometheus instance monitored by the receiver at the other end. Unfortunately that means this Alertmanager file needs to contain an ID for everything that is monitored. That's not too difficult, it can be templated, and there are various other approaches, but it's still yet another thing to configure, and that configuration needs to be managed and so on. It's yet another moving part, essentially. Here's roughly what those pieces look like.
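First, the simple "is Prometheus down" rule from the start of the talk. This is a minimal sketch; the group name, for duration, and annotation text are just illustrative:

```yaml
# rules.yml: the basic "is Prometheus down" alert (names illustrative)
groups:
  - name: meta-monitoring
    rules:
      - alert: JobDown
        # "up" is 1 when the last scrape of a target succeeded, 0 when it failed
        expr: up{job="prometheus"} == 0
        for: 1m                  # tolerate a single failed scrape
        labels:
          severity: critical
        annotations:
          summary: Prometheus down
```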
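For the high-availability pair, each replica scrapes the same targets and fans its alerts out to every Alertmanager, and the Alertmanager cluster deduplicates the notifications. A sketch of the relevant prometheus.yml fragment, with hypothetical hostnames:

```yaml
# prometheus.yml: identical on both replicas of the HA pair (hostnames hypothetical)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-1:9093   # every alert goes to all Alertmanagers;
            - alertmanager-2:9093   # the cluster deduplicates notifications
```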
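And the heartbeat routing just described might look roughly like this in alertmanager.yml. The receiver URL and its per-team ID are hypothetical; the point is that an ID for every monitored Prometheus has to live in this one file:

```yaml
# alertmanager.yml: send the always-firing heartbeat to a dead man's switch
route:
  receiver: default
  routes:
    - match:
        severity: heartbeat
      receiver: heartbeat
      repeat_interval: 5m       # keep re-notifying while the heartbeat fires
receivers:
  - name: default               # normal paging configuration lives here
  - name: heartbeat
    webhook_configs:
      # the ID is specific to each team / Prometheus being watched
      - url: https://heartbeat.example.com/ping/team-a-id
```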
So instead, with PromMSD, we have the same alert that we had before, but you'll see there are some annotations starting with msd, which essentially tell PromMSD how it should behave: the activation time, some values to override, and the Alertmanagers to send alerts to. That last one is unfortunately the one thing PromMSD compromises on: it can't support dynamic Alertmanager discovery, because the Alertmanagers have to be specified in the alert itself, although potentially we could fix that with some changes elsewhere in the future. It does mean, though, that all the configuration for a team's alerting is contained in the alert itself, and nothing special is needed for heartbeat alerts. Teams would probably still have team-specific routing in the central Alertmanager, but they don't need separate configuration for heartbeats, which might otherwise get forgotten because it isn't used all the time. What then happens is that this raises two alerts, one for each member of our high-availability pair, and those go to PromMSD.

So let's actually see how this works. Over here I have the example configs that come with PromMSD, just a config directory, and I'm running four terminals. First of all, I'm running netcat listening on a random port; this is going to be the normal alert receiver, so we'll see the raw HTTP requests sent to it. The PromMSD configs directory has an Alertmanager config, alerts, and a Prometheus config, so I'm going to run Alertmanager using that provided config, and also run Prometheus. I haven't yet started PromMSD, so I'll do that too. We now have Prometheus, Alertmanager, and PromMSD all running.

First of all, let's go to Prometheus: if we look at the alerts UI, we see the expected-alert heartbeat is active, with all the activation annotations I discussed. You'll see in this case, though, that I've put the activation at one minute. Also notice this alert is not actually firing yet, because there's a for threshold of 30 seconds, just to make sure the Prometheus instance isn't flapping. So this alert is still pending... hopefully I've spoken for long enough... I have, and that alert is now firing.

So we now have an expected-alert heartbeat that is firing. What's happening to it? It's going to Alertmanager, which conveniently I have running here, and we can see the heartbeat alert with the relevant annotations on it. If we check where it's going, it's going to PromMSD, and our pager receiver has no alerts going to it. Then, if we go over to PromMSD, we see it knows about one Prometheus. In this case it's not running on Kubernetes, so there's no manifest or anything, it's just a Prometheus; in a real setup you'd obviously have a few more labels there, but for a demo this works.
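For reference, the expected-alert rule driving this demo looks roughly like this. It's a sketch from memory of the shipped examples: the msd_* annotation names and values here are assumptions, so check the example configs in the repository for the real ones:

```yaml
# rules.yml: a heartbeat alert carrying its own PromMSD configuration
# (msd_* annotation names and values are assumptions; see the prommsd examples)
groups:
  - name: prommsd-heartbeat
    rules:
      - alert: ExpectedAlertHeartbeat
        expr: vector(1)        # always true, so the alert always fires
        for: 30s               # guard against a flapping Prometheus instance
        labels:
          severity: heartbeat
        annotations:
          msd_activation: 1m   # how long PromMSD waits without the alert before paging
          msd_alertmanagers: http://alertmanager-1:9093,http://alertmanager-2:9093
```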
Back on the PromMSD page, you'll see it's saying it will activate in a few seconds. I've actually set this, I think, to repeat every five seconds, so if I keep reloading the page you'll see the countdown never gets below about 55 seconds. So now let's go to where we were running Prometheus, and I'll just kill it. OK, so I've now stopped Prometheus, and interestingly the alert is still active, because Alertmanager over here still knows about it for now. I've set the evaluation interval in the Prometheus config to 15 seconds, so if I carry on talking for about four times 15 seconds, we should eventually find that the alert stops being sent. Luckily this demo isn't live, so if this fails I'll just edit it. OK, it's now about to activate; I'm just hitting refresh here so you can see what's happening. There we go: it's gone red and says it has sent an alert. If we go back to Alertmanager and have a look, yes, the heartbeat has disappeared, and we now have a "no alert connectivity" alert, and if we check over here, we should actually get an alert delivered to us. So we've now been told we have no alert connectivity. That's how PromMSD works.

Obviously that's a very simple setup, and in reality you'd have a few more components involved, so the architecture of a real deployment might look something like this. You have three teams running Prometheus instances for their applications, which talk to an Alertmanager cluster; the Alertmanager cluster routes to PromMSD, as well as to things in the cloud for other alerting. There's also an infrastructure Prometheus which, rather than using the PromMSD running locally in the cluster, uses something in the cloud, which could be another instance of PromMSD running elsewhere, or one of the cloud monitoring services mentioned earlier. You'll also notice, if you follow the red line, that if PromMSD detects a problem, it sends it to Alertmanager, but it also sends it to a webhook receiver that goes straight to something elsewhere. That means PromMSD doesn't need to depend on anything other than a webhook receiver, which could run on the same machine, or even in the same pod as PromMSD on Kubernetes. And as I mentioned, the infrastructure Prometheus has separate monitoring, potentially in a different cluster or elsewhere, so the infrastructure team can be notified if everything is broken, while application teams can be notified by an explicit alert if their Prometheus is broken. If teams actually run in multiple clusters, maybe they don't need to be paged about their application being down when it's really an infrastructure problem, and you don't get a critical "everything is broken" alert when actually it's not all broken. So there's flexibility in how you set this up, which means you can make sure the alerts are actually actionable.

PromMSD is now open source, and it's available on our GitHub. Thank you to G-Research for supporting my work in open-sourcing this, and thank you for coming to this presentation.
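A footnote on the demo timing above: the roughly 60-second delay before Alertmanager dropped the heartbeat comes down to one setting. A minimal sketch of the relevant fragment, noting that the "about four evaluation intervals" expiry is a Prometheus implementation detail rather than anything configured here:

```yaml
# prometheus.yml: the demo's timing in one place
global:
  evaluation_interval: 15s   # rules evaluated, and firing alerts re-sent, every 15s
# Prometheus stamps each alert it sends with an end time a few (roughly four)
# evaluation intervals in the future. Kill Prometheus and the re-sends stop,
# so about 4 x 15s = 60s later Alertmanager considers the heartbeat resolved
# and PromMSD raises its "no alert connectivity" alert.
```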