All right. So before I start, did anyone attend DevOpsDays recently? One, two, three, four. So this talk is basically an extension of what we had at DevOpsDays. I presented it there in five minutes. It's really fun, right? It's very fun when you see the slides coming up on their own. Initially we wanted to do a full presentation, but I submitted a little bit too late, so they pushed us to an ignite talk. So before we start, I would like to do a shout-out to Vincent Honisby for hosting this event, and also to the organizers here, Nidap and Sigis, who are the guilty ones who made us come here, spend some time together and share some experience. This talk is going to be a little bit less technical. We're going to go through a practical case of how we actually migrated from Nagios to Prometheus. So who has used Nagios before? Okay. How many of you like Nagios? You do? Okay, I'm going to throw you out, then. All right, so let's get started. I see that you have some experience and you have probably experienced some pains, so you can share my empathy here. So let's start. Hi, I'm Antonio. I'm an SRE at Cloudflare, based here in Singapore. I've been working with Nidap, Sigis, Jason and Monika on our journey to get rid of Nagios and improve our monitoring. I'm going to go through this very fast, basically a quick introduction. So, SREs: we do monitoring, and it's our heartbeat, right? There were challenges when we scaled up; our infrastructure is massive, and the more we grew, the more challenges we got. So we're going to share here how we solved it, and the journey. Back in 2009 when Cloudflare started, we had barely 50 servers, a few data centers around the world, I think the primary one in the US, and we were growing very slowly. Back in 2009, having 100 servers was like, wow, you really rock, right? You have 100 bare-metal servers to manage, so your SSH loop would be quite long. So yeah, that's when we started using Nagios. I've been doing monitoring for as long as I've been in the IT industry, about 12 years. I started with SSH loops, then Nagios, then Zabbix, and now Prometheus. For me, Prometheus is kind of new. When I joined Cloudflare, I had no idea what Prometheus was; I had just barely heard of it. So I learned a lot, but I had kind of stuck to the old philosophy of using Nagios, black box monitoring and so on. That was actually the standard at Cloudflare at the time, and it was great at a smaller scale: it had a big community, plenty of documentation, plenty of plugins. But as we grew, the problems came, right? We had problems like missing monitoring points with Nagios. We had frequent crashes because Nagios could not handle the load; keep in mind that by that point it was running around 1,000 checks per server, right? Single point of failure: Nagios doesn't really have a good HA setup, so that's obviously an issue. Also, it's very hard to make changes in Nagios. Everything is kind of centralized, and the config file is big, with a lot of beautiful views that we will see. So that was more or less how our initial setup was: we had one central Nagios server and then we kept adding POPs. Not every point of presence is the same: some of them are very small, very tiny, and some of them are huge, right?
We have some POPs in North America that can have more than 300 or 400 servers, and we have some in Asia or in Africa that can have like three, right? So obviously that's a big difference. This is how our infrastructure looks now; each blue spot is one point of presence we have. I've been through this before, so I don't want to spend much time on it, but this is actually our scale, and our Nagios server was somewhere in there, in a secret location, right? So imagine all of these sending packets to Nagios from several locations, and the most lovely area was this one. Whoever has been in China before and experienced the Chinese network knows what I'm talking about. So yeah. Talking about problems again: we had issues with the high number of connections, we lost data in real time, and the Nagios configuration file is not optimal. Every time we make a change, we have to touch the configuration file, and very often we break something when we change it, whether it's for adding an alert, a notification group or any other parameter; Nagios is very easy to break. And then you are basically out of monitoring until you have fixed it, which is not ideal. This is how our configuration file looks. I don't want to look at it too long because I start to get a headache, but this is how a Nagios configuration file looks. So given these problems, we started to look for solutions. We needed something that had some sort of HA, an active-standby setup as Arseny also mentioned, and that can also monitor itself. Also something that has host and service dependencies and can come back quickly in case of failure or outage. We also want to send alerts to different destinations: we use our chat system, our pager system and so on. And something that is easy to customize, because our environment changes very fast, so we were really looking for flexibility in the solution. Also, we want something where, as we add nodes, machines or services, they auto-register every time, which saves a lot of work. So we chose, as you can probably guess, Prometheus. Why? Because it's robust, it has an HA setup, and it can handle millions and millions of time series. Millions of metrics means millions of alerts we can process. And as was shown before, it can create new alerts quickly, easily and flexibly using PromQL. Also, troubleshooting Prometheus is very easy. Prometheus users, have you ever had any issues with Prometheus at the service level, anyone? No one? That's why we are here, right? Because Prometheus is great. No, but really, it's just a Go program, which is very easy to install; you can very easily download it and run it, so it's really simple. It also has a big community and basically fulfils our scalability demands. As Neeraj said before, our infrastructure is growing every single day. When I started earlier this year, I think we had around 100 POPs. Now we have around 140? 130? So 130 POPs. Obviously, with this growth, we cannot stick with a non-scalable solution, right?
So, Nagios versus Prometheus. Nagios is basically based on scripts: it runs a script that gives you one output, which is 0, 1 or 2, right? Which is informative, it tells you something is wrong, but it doesn't tell you much. If you want to modify an alert, or make a small modification to a specific check in Nagios, you basically have to touch the script. If you want to monitor the number of requests, for instance, as was shown before, and you decide that 1,000 requests per second is way too high and you want to change it to 500, you have to change the script. You have to go and edit the check script in Nagios, which is in theory easy, but it's an overhead, because you have to actually deploy the script and push it to all the servers you have, and so on. In Prometheus, as you saw before, you just need a metric. It just requires a PromQL query, and then you adjust the alert on top of that metric and you're good to go. Okay, so let's migrate, right? We have Nagios, we decided on Prometheus, because Prometheus is great, so we're going to migrate. We actually migrated from Nagios to Prometheus overnight. No, but judging by some of your faces, some of you believed it; some of you were impressed. No, no, no, I'm lying. Well, actually it was not trivial. It took us a lot of time, for a few reasons. First of all, we cannot have a monitoring blackout. We can't, we simply can't. We need to ensure reliability between the two systems. You have Nagios, which is actually running, and then you add a new thing, Prometheus, which, yes, is great, but you don't want to make your whole production environment depend on a solution you just put on the table, which everyone thinks is great and we think works, while you have this Nagios that has been working for years and it works. So you basically want to have both systems running to see if the solution you're implementing actually covers 100% of your monitoring. And basically, you don't have space for downtime; you can't have downtime for your monitoring. If your monitoring is down, you are flying blind, right? So, comparing with Nagios again, this is how Nagios works, or used to work at least. You have the Nagios Remote Plugin Executor (NRPE), which communicates with the server, and the server will execute a script, right? Zabbix does it in a kind of similar way; it also relies on a client-server architecture. This is more or less old-school monitoring. So you can imagine that each of these checks is one script, and on each server we have more than 1,000, right? In our /usr/lib/nagios plugins directory, when you run ls, it takes a while. So yeah, as I said before, NRPE runs and gives you back those status values: OK, warning, critical, and unknown, I believe. So again: massive script deployment for each check, and the logic is built inside the script, inside the server. Every time you deploy something new, change a threshold or do anything, you have to validate the script. As we said before, even that simple threshold change creates an overhead.
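To make that concrete, here is a minimal sketch of the same kind of threshold as a Prometheus alerting rule, in the newer YAML rule format; the http_requests_total metric and its labels are assumptions for illustration. Moving from 1,000 to 500 requests per second is a one-line edit to the expression, not a script redeployment to every server.

```yaml
# Hypothetical alerting rule; metric name and labels are assumptions.
# Changing the threshold means editing the number in expr, nothing more.
groups:
  - name: request-rate
    rules:
      - alert: HighRequestRate
        expr: sum(rate(http_requests_total[5m])) by (instance) > 500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is serving more than 500 requests per second"
```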
And I'm going to go through this fast, as it has already been covered, but this is how the Prometheus side looks. It's mainly illustrative, to compare it with Nagios. You have the Alertmanager, the data-center Prometheus pushes the alerts to the Alertmanager, and the Alertmanager sends the notifications to the proper channel, right? This is the architecture we aimed for and have implemented. In each POP we have one Prometheus server, and then, running on our core infrastructure, we have redundancy with two more Prometheus servers. We basically use federation, so all the metrics are available there. When we migrate, some initial considerations we had to take: make sure the exporters actually expose the metrics on their endpoints; make sure the metrics are pulled by the Prometheus server (as I said, we have a hierarchical federation setup); and finally, the Alertmanager has to take the evaluated alert and push it, if required, to the proper channel. We have to ensure the whole path to the alert destination works smoothly. So how did we implement it, what are the steps we took to migrate all the alerts we had in Nagios to Prometheus? The implementation was: we take the alerts one by one and bring each of them from Nagios to Prometheus along the path I'm going to describe. Depending on the metric, we may need to deploy an exporter. As Arseny mentioned before, we have something like 15 exporters available; actually, Vin is writing a new one for BackupPC. As he said, it's fairly easy to write a new exporter. But of course we have services that don't have their own exporter, so we have to use the textfile exporter, which I'll mention later. Okay, so we've got the mechanism to get the metric, whatever it is. Then we need to make sure all the HTTP endpoints are showing the metric. Then we need to make sure all the metrics are being aggregated, so we see the metrics from every colo, not in a horizontal way but in a vertical way, right? And then you have to define the alert rules. Once you have the metric, you have to decide when to alert, and sometimes that's not trivial; it actually requires a lot of thinking. We need to make sure the logic matches our current monitoring points: the logic you had implemented in the script has to match 100% with the logic you implement in Prometheus. Once the alert and the metric are ready, we deploy it and push it to a test-notifications channel. Then we verify that the alert fires properly and escalates to the proper channels too, obviously. And when we've tested that everything looks fine and this alert is firing at the same time as the Nagios alert, then we can say we are done: we drop the alert from Nagios, reload Nagios, and count one less day of life for Nagios, right? So this is more or less how we are implementing the change. Any questions, anything? Yes.
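For reference, a hedged sketch of what that hierarchical federation can look like on a core Prometheus server; the job name, target hostnames and match[] selectors are assumptions rather than the actual configuration described in the talk.

```yaml
# Hypothetical federation job on a core Prometheus server, pulling
# selected series from the per-POP Prometheus instances.
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true            # keep the labels set by the POP-level servers
    metrics_path: /federate
    params:
      'match[]':
        - '{job="node"}'          # only pull the series needed upstream
        - '{__name__=~"job:.*"}'  # pre-aggregated recording rules
    static_configs:
      - targets:
          - 'prometheus.pop1.example.com:9090'
          - 'prometheus.pop2.example.com:9090'
```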
That's quite interesting. So how did you manage the runbooks? I believe this was per set of alerts or something. Once you defined the process, how did you do it in bulk, in batches? Because I guess you had a lot of alerts, a lot of metrics to export. So your question goes a little bit more into how we... How did you manage the process of transition? The process of transition for the runbooks, do you mean? Yeah. So, in Nagios you can define a link, you can define where your alert will lead you. In Prometheus, when you define an alert, there is a field where you can put a link to your runbook for each alert, so it's quite flexible there. Okay. And then, during the transition, when you redefined these metrics, did you do it manually, semi-manually, fully automatically, when you moved them from the definitions in Nagios to the Prometheus definitions? So the question is how we moved the alert definitions from Nagios to Prometheus, right? The answer is: we had to do it manually. We didn't parse the Nagios config file and auto-generate Prometheus rules from it, because the way we organize Prometheus has nothing to do with the Nagios organization. Some alerts were old, so some alerts we just removed, and for the others we brought the runbook over manually when we put them into Prometheus. Sadly, we couldn't come up with an automatic way, because the two systems are so different that there is basically no good mapping rule. Also, the logic sometimes needed to be changed. Keep in mind we had been running Nagios since 2009, so there were alerts for services that don't even exist anymore. So yeah, did I answer your question? Yeah. Any other questions? Yes. Did you have any binary checks, like "this service is up or down", and that's probably all you need? Yes. How did you migrate those from Nagios to Prometheus? Because Prometheus is all metrics, right? Did you just translate that into a zero or a one? Yes. So the question is: binary checks that just report OK or not OK, how do we migrate those? The answer is yes, we had those checks, and you can basically just use a metric which, as you said, is binary: a one or a zero. We also thought about, depending on the type of alert, using some magic number like -0.0999 to avoid the multidimensional cases, but that's actually not what we ended up doing. So, to your question: yes, we have a couple of alerts that are just a zero or a one. Yes. Let's continue. So: migrating the alerts. One of the things you must know, if you want to go on this journey too, is that Prometheus relies on exporters. Again, exporters expose all the metrics of your service. Keep in mind that exporters are separate processes, which is something you need to take care of. A separate process is also a security consideration: you have to configure firewalls, and running an extra process on a machine, even a small one, is always something you have to consider. We are a security company, so we have to look after that very carefully.
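Picking up the two points from that Q&A, runbook links and binary checks: a hedged sketch of a rule that alerts on the built-in up metric (1 when the last scrape succeeded, 0 when it didn't) and carries a runbook link as an annotation. The runbook_url key is just a naming convention, and the job label and URL are assumptions.

```yaml
# Hypothetical rule: binary "is it up" check plus a runbook link annotation.
groups:
  - name: availability
    rules:
      - alert: ExporterDown
        expr: up{job="node"} == 0      # up is set by Prometheus itself per scrape
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is not responding to scrapes"
          runbook_url: "https://wiki.example.com/runbooks/exporter-down"
```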
Some other considerations: the blackbox exporter is very commonly used for HTTP/TCP probe metrics. If you can use the blackbox exporter, it's simpler; I suggest you use it instead of overcomplicating things with a white box approach. You also need to know that all the metrics are scraped from the exporter at a given interval, so you have to define how often you actually want to scrape your metrics, and choose that very wisely. As I said, exporters require firewall changes to deploy, and you need to reserve a new port, because obviously each one runs on its own port, and that's not always possible. Sometimes there is no exporter for your service, and then you have to cook one up yourself; there isn't an exporter for every service you can imagine. For instance, a practical case for us: we still had a piece of old legacy software running called BackupPC, which is basically Perl. We were debating whether to use the textfile exporter or write our own exporter, and since we wanted to get rid of Nagios as soon as possible, we said we'd just use the textfile exporter and push the metrics there, and later we would work on a proper BackupPC exporter. Actually, Vin has some work done on it, so it's looking good. So, you need to know that the textfile exporter lets you export any metric; there's a cron job inside the server that writes it. If you need a specific metric for a service quickly, it's basically the fastest way you can go. The alternative is writing your own exporter, which can be a bit more overhead, but it totally depends on your case: if you're going to have tons of metrics to export, maybe the textfile exporter becomes a problem rather than a solution. This is a slide that I stole; it's basically for illustration, I've already been showing it, but it's actually taken from Matt Bostock's talk. He was basically the brain behind all the Prometheus infrastructure we have, and I really recommend you watch his conference video on YouTube if you're interested. Go to YouTube and type Prometheus Cloudflare and you will find him for sure, and maybe meet him in the future too, who knows? So, future challenges. We had a conversation before we started working on this project: we have an issue with freshness. We had a lot of problems because the way we collect metrics didn't allow us to know whether a metric was fresh or not. There was some brainstorming along the way, but apparently there is an update, and I think they claim it's in the next Prometheus version. Nagios can actually tell you if a metric or monitoring point is old; Prometheus, in this version, can't, but in version two I think it's already solved. That's it, thank you. That's the presentation. Do you have any further questions? I actually just want to add a little bit. Sure. So the main problem with freshness comes from using the textfile exporter: for a normal exporter, you can tell whether the target you are scraping is up or down, so you know whether the data is fresh. But if you're using the textfile exporter, it's just a file on disk, and even if the file stops changing, you don't know about it.
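To make the textfile approach concrete, and to anticipate the freshness point: a cron job on the host writes a *.prom file in the exposition format into the directory node_exporter is pointed at with its textfile-directory flag, and recent node_exporter versions also expose node_textfile_mtime_seconds per file, which a rule can use to catch a file that has stopped being updated. A hedged sketch; the metric name, paths and the one-hour threshold are all assumptions.

```yaml
# Hypothetical setup: a cron job writes
#   /var/lib/node_exporter/textfile/backuppc.prom
# containing exposition-format lines such as
#   backuppc_last_backup_age_seconds{host="web-1"} 93600
# and node_exporter runs with --collector.textfile.directory=/var/lib/node_exporter/textfile.
# The rule below flags the file going stale via node_textfile_mtime_seconds.
groups:
  - name: textfile-freshness
    rules:
      - alert: TextfileMetricsStale
        expr: time() - node_textfile_mtime_seconds{file="backuppc.prom"} > 3600
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Textfile metrics on {{ $labels.instance }} have not been updated for over an hour"
```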
One way you could do it is to monitor the modified timestamp of the file, but that actually gets way too complicated. Yeah, exactly. So the main problem is just for the textfile exporter. Exactly; if we went that way, we'd have to add yet another layer of monitoring for each single file, so it would be recursive forever, an infinite loop, right? Yeah. No more questions? Do you encrypt when you access your metrics endpoints? The question is whether we encrypt our endpoints. And the answer is no, we don't encrypt them at that level, because we rely on access controls and we have our security measures in place. So you mean you access your endpoints, when you scrape, through a TLS tunnel, is that what you are saying? So, are the metrics encrypted when they are travelling across the network? I don't think that's the case inside our anycast network. I think, to put it simply, we're using HTTPS for encryption. It's about the exporter endpoint itself: if you're saying you use HTTPS, the question is, do you do HTTPS for each exporter? So do you build it in, do you link your exporter with a TLS library? No, that's not correct. The way we architected our Prometheus deployment is that in each POP there is a Prometheus server which scrapes all the local devices, and that's not TLS at that point; that's on our internal network, so it's still plain HTTP. Okay, and then? And then, once we federate the metrics out, that's over HTTPS. Yeah. But again, if you're talking about the federation part, Prometheus itself does not do TLS; it sits behind a proxy. Okay. So Prometheus itself, and the client and the server, don't have TLS, so we do the proxying. And Prometheus scrapes through that. Yes. And Prometheus's Go HTTP client definitely supports scraping over HTTPS. Yeah. Okay. So basically, here the metrics are on plain HTTP, and then on HTTPS at this level: when you go out of the colo, the point of presence, the path the metrics travel is actually encrypted, yeah. And do you authenticate when you scrape? No. We don't do client authentication when we're scraping over HTTPS, but we have iptables rules that only allow a specific set of machines to scrape them. So, to give more background, Cloudflare is a CDN company, right? We run our own network and we are also a security company. We run our own network, we secure it with firewall rules, machine-level firewall rules, and we roll our own HTTPS. So I get what you're saying about authentication, but right now we don't expose anything sensitive; we don't expose any customer data at all through this, it's all just operational. Right. Yeah, and if this wasn't a valid question, I apologize. No, no, it is a valid question. It actually made me think a bit, because, working from the inside at the colo level, we don't use encryption.
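To illustrate the split just described, plain HTTP inside the colo and TLS once metrics leave it, here is a hedged sketch of what a scrape job over HTTPS can look like. The hostnames, the CA path and the idea of a TLS-terminating proxy in front of the POP Prometheus are assumptions based on the discussion, not the actual deployment.

```yaml
# Hypothetical core-level job scraping a POP's /federate endpoint over TLS.
scrape_configs:
  - job_name: 'federate-pops-tls'
    scheme: https                       # leave the colo over TLS
    metrics_path: /federate
    params:
      'match[]': ['{__name__=~"job:.*"}']
    tls_config:
      ca_file: /etc/prometheus/ca.pem   # CA for the TLS-terminating proxy in front of Prometheus
    static_configs:
      - targets: ['prometheus.pop1.example.com:443']
```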
Basically, with a simple HTTP GET you get all the metrics. So within the colo, when we aggregate, as Neeraj says, we have secured our network, so we don't actually need to add this overhead to the flow, right? Well, let's be very real: how many real-world customers do we know that use SNMP with TLS or something like that? Well, SNMP does not do TLS; SNMPv3, yes, SNMPv3 does encryption, it's not TLS, and some people do use it. Yeah, I have yet to see one. So yeah, in the real world, yeah. I have a different question. Yeah, sure. So it has been highlighted that data storage in Prometheus is not long-term; 15 days was mentioned. So I assume, is there anything like down-sampling, or is that all handled in the time series database? It's all handled there, and you can define your down-sampling intervals; you can set them as granular or as coarse as you want. Oh, okay. So I think the reason is that Prometheus's focus right now is not to be a long-term store, but it doesn't mean you cannot use it for long-term storage. The reason we have to limit it to 15 days is that we just have too many metrics, and we're not able to keep more than those 15 days; otherwise we'd just fill up the disks. Prometheus 2 does make very significant improvements in terms of storage, and they're providing much better remote-write support for ClickHouse, for CrateDB, for InfluxDB. Yeah. Then the tiering would be done by some sort of process local to the Prometheus server itself, right? Yes, a remote-write adapter that takes data from the Prometheus server and pushes it out. Yeah, Prometheus itself wants to remain fast. That's one philosophy that I think Prometheus has followed really well. For example, as was talked about, Telegraf also does scripts, right? Telegraf runs the scripts itself, gets whatever metrics, and then exports them out to Prometheus. The Prometheus node exporter also had this choice: it could have run these textfile scripts itself. But they said we want our node exporter to be really, really fast, and they don't want that dependency: if they don't control the script, they don't know how much time it will take, right? So Prometheus's focus has always been on speed. You write the text file, and processing a text file and reporting it back on the HTTP endpoint is really, really fast. So that's one thing about Prometheus, I guess: it is built to be fast. That is why we have been able to scale to this level, right? And that is why they did not focus on long-term storage at the start, because long-term storage comes with its own issues, like how you horizontally scale and all of that, and they didn't want to handle that at that particular point. Sorry, yeah. Any of us can answer. Do you have any redundancy with the Alertmanager? So right now we don't, but we are building redundancy, and I think it's for both Prometheus and Alertmanager: we can have redundancy, and Alertmanager can deduplicate between instances too. The thing is why we don't have it right now and why we're going to build it later. We do have a backup Alertmanager, but we don't have high availability in the sense that if one goes down, another comes up automatically; they're not active-active. It's an active-backup setup.
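Coming back to the long-term storage question above, a hedged sketch of shipping metrics out via remote write on recent Prometheus versions; the adapter URL is a placeholder, and local retention is set with a server flag (for example --storage.tsdb.retention=15d on Prometheus 2.x) rather than in this file.

```yaml
# Hypothetical remote-write target for long-term storage; local TSDB keeps ~15 days.
remote_write:
  - url: "http://remote-storage-adapter.example.com:9201/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'debug_.*'     # example: drop noisy debug metrics before shipping them out
        action: drop
```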
And the reason we do it this way is: look, what happens in Nagios is that all the alerting logic is either at the node or on the server, like the service goes up, down, warning, critical, whatever. What you can do with Prometheus is that, since we have colo-level Prometheus, we have Prometheus at each of our locations and we copy the alerting logic everywhere. So let's say you have thousands and thousands of alerts defined; Alertmanager does not need to care about those hundreds and thousands of things. The Prometheus at the local level cares about them, and it only sends to Alertmanager when an alert fires. So it's not very high-scale; it's not taking up so many resources that we need to run it active-active. That is why we are running in active-backup right now. So this backup still receives the alerts from all the other Prometheus servers? No. So then for situational awareness, if your active one is down, how in the end do you know your picture of the world? So there are two parts to that, and it's actually a very interesting question, because right now Alertmanager is our single point of failure in monitoring. There are two things. One is that Alertmanager has some mesh clustering work, but it's still in a beta state, so basically there's no HA setup available for Alertmanager at the moment. The other thing is that we actually monitor Alertmanager using Grafana: there's a Grafana alert that will trigger if Alertmanager goes down. And Prometheus is also scraping Alertmanager for its metrics, so if there's something wrong with Alertmanager we can also observe it from the Prometheus side; if the number of alerts being sent out doesn't line up with the alerts coming in, something is immediately wrong with Alertmanager. And then, as I said, we also have the alert in Grafana for Alertmanager going down. Okay. I think it's not about Alertmanager's availability, because once it's running, it's a pretty simple piece of software; once it's running, it will probably keep running, and if it has some bugs that affect its ability, one could guess they would be surfaced very quickly. So you can rely on the fact that it is running. But I think the problem might be: if you have some form of continuous integration for your alert rules, where you push rules into the thing and it gets deployed after some automated tests, which I heard Google does, but I don't know if anyone else really does, because I'm not sure how, except for maybe a simple linter, you can actually validate alerting rules. So what you could do, theoretically, is mess that up, and then you would be in some trouble, depending on how that turned out. Yeah. Something related to that, actually. So we have a simple setup that actually tests our whole flow: a simple job triggers an alert, which then goes through Alertmanager. We do a kind of follow-the-sun operation, so there are eight hours between the different operations teams around the world, and at the beginning of each shift there is an alert that triggers internally and then goes through Alertmanager to PagerDuty and on to the on-call person. That way the on-call person, who is actually expecting it, knows that our alerting and monitoring still works as they start their shift. Yeah, it's like a fire drill, right?
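A hedged sketch of what such an end-to-end drill alert can look like: an expression that always fires, routed down the on-call path, so silence means something in the pipeline broke. The alert name and the severity label are assumptions; the per-shift scheduling described above would live on the receiving side rather than in the rule.

```yaml
# Hypothetical "escalation drill" alert; vector(1) always evaluates, so it always fires.
groups:
  - name: meta-monitoring
    rules:
      - alert: MonitoringPipelineDrill
        expr: vector(1)
        labels:
          severity: drill
        annotations:
          summary: "End-to-end test alert; the on-call confirms delivery at the start of each shift"
```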
We call it a drill, an escalation drill, to make sure that all parts of our monitoring stack are working: Prometheus forwarding to Alertmanager, Alertmanager paging through PagerDuty, the whole cycle working. Yeah, it's on a schedule. So you're actually doing it every eight hours? Sorry? The whole flow, once per shift. Yes. Yes. We have three offices, San Francisco, London and Singapore, so the teams take eight hours each, and at the beginning of every shift it triggers. You're also testing the human. Yes, you're also testing whether the human is awake. Yes. If you're not able to answer, like your phone is off or something, it will go to the next person. Yeah, yeah. So, are there any more questions? Yeah, go ahead. Actually, ask this one while I go; he's probably here. OK. Yeah, just in case: I was also looking for some sort of escalation of alerts. Let's say a given alert is fired to the designated recipients, and you configured your Alertmanager to re-send it every hour, so the following hour it will send the same thing. But what I'm looking for is: if this same alert is still firing in the next hour, I would like to send it to, let's say, level two, someone who will poke this guy: why is this still firing? So, I mean, this is kind of not related to Prometheus itself, but I'll answer. This is more about how you want to handle alerts. What we do, for business-level alerts, and I'm saying this is strictly for business-level alerts, is handle that logic in PagerDuty. In Alertmanager you can define multiple kinds of receivers, and even multiple schedules in PagerDuty. So, for example, you can set up three schedules: one which goes immediately to the on-call person; one which has a 15-minute delay and then goes to escalation, or in your case it could be a one-hour delay; and one which has, say, a three-hour delay and can go to the CTO. We are one of those companies where alerts can escalate all the way up to the CTO; everyone is on call, even the developers. So it's that kind of thing, and we handle it in PagerDuty. We can set up custom things like: if the first person has not acknowledged, nothing has been done and the alert is still firing, then it goes to the escalation automatically. So this part is managed in PagerDuty, not in Alertmanager. Yeah, we have built this logic in PagerDuty, in the app itself. All right, yeah. Because that's also another... Yeah, as I said, it's not our case because we only have e-mail, so... You can send e-mails to PagerDuty. Or you can send mail to a mail-processing script, which will then keep some state; you'd need to do the wiring yourself, yeah. But probably you should rather pay some money for PagerDuty at that level. Yeah, that's your peace of mind, just in case. OK, PagerDuty can actually respond to e-mails now, if you want, and if it finds the e-mail to be a duplicate of an existing alert it will either create a new incident or just ignore it. It's probably not worth doing it over e-mail, though. Maybe creating Jira tickets: Alertmanager has an integration, actually, so when it finds an alert firing it can create a Jira ticket. Yeah, so we do...
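For illustration, a hedged sketch of Alertmanager routing along the lines discussed: chat by default, PagerDuty for pages, and a webhook that a hypothetical bridge service could turn into Jira tickets, since Alertmanager's native mechanism for that kind of integration is a webhook. The colo label, keys, channels and the jira-bridge URL are all assumptions.

```yaml
# Hypothetical Alertmanager routing; person-to-person escalation then lives in PagerDuty schedules.
route:
  receiver: ops-chat                 # default: low-urgency chat notifications
  group_by: ['alertname', 'colo']
  repeat_interval: 4h
  routes:
    - match:
        severity: page
      receiver: pagerduty-oncall
    - match:
        severity: ticket
      receiver: jira-webhook
receivers:
  - name: ops-chat
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#ops-alerts'
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'
  - name: jira-webhook
    webhook_configs:
      - url: 'http://jira-bridge.example.com/alert'   # hypothetical service that files tickets
```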
It might be something like: if the disk is going to fill up in four days, there's no need to wake someone up, right? So we file a Jira ticket and someone will come in and take a look. But if it's going to fill up in, like, the next hour, then it needs to be escalated, right? So that's the thing with Alertmanager: you can define multiple PagerDuty schedules, multiple receivers like Slack, HipChat, whatever, and you can configure things like "I want to file a Jira and send an alert and this." You can do all of that with Alertmanager. Or you want to do something like, hey, I want to notify the ops team, but also the dev team; that's also possible with Alertmanager. So, OK, I would like to kind of wrap things up, end this first meet-up, I guess. First of all, I'd like to thank Vincent. If you didn't know, Vincent is the captain, or the guy, of cloud-native computing in Singapore. The Cloud Native Computing Foundation is a foundation that was created under the Linux Foundation and now has backers like Google, Microsoft and Amazon; all the big players are behind it. And incidentally, Prometheus and Kubernetes are both under the Cloud Native Computing Foundation. Even our meet-up was sponsored by the Cloud Native Computing Foundation. So, as you can see, Prometheus has a big developer ecosystem and a push from major companies behind it, like Kubernetes does. And, as I said, it's cloud-native: many Kubernetes tools themselves come with native Prometheus endpoints for exposing metrics, right? Other ecosystems too, like Docker Swarm. Yes, the other ecosystems are getting there too, like Docker Swarm; Docker Swarm now has... Yeah? Yeah, I mean, at least I know that when I was still following it closely, it didn't work, but there have been several releases since and I think it's all good now. So, the point is, white box monitoring is here to stay. And from what we are seeing, all the major companies, Cloudflare, Google, Microsoft, all of them have put their weight behind Prometheus. And as we talked about, yes, Prometheus does not have all kinds of exporters, and Prometheus does not have long-term storage, but that's being worked on, right? It's a very new thing, and it's a very new way of thinking about monitoring too, which is what you must have seen. It's not just one-zero, critical-warning, right? It's about total visibility. It's about doing histograms, quantiles; did you know that it can also do prediction? It can actually do Holt-Winters: there are native functions in Prometheus for Holt-Winters modelling, which is a data analytics technique where you fit a smooth curve and see, even if something is not as smooth as linear, when a particular condition will be reached. So that's the direction we're heading, and of course we are going to do more meet-ups for Prometheus. So I would just say: go ahead and use it, try it out. You can try it out from the GitHub URL that I've shared. It's a pretty easy thing to set up. It's also available as a Docker container, which you can just run to start up an instance on your own machine.
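Going back to the disk-filling example above, a hedged sketch of how that ticket-versus-page split can be expressed with predict_linear(); the metric name follows recent node_exporter versions (older ones used node_filesystem_avail), and the lookback windows and severities are illustrative. PromQL's holt_winters() function is the smoothed-forecasting counterpart mentioned in the talk.

```yaml
# Hypothetical capacity alerts: ticket if the disk fills within four days, page if within the hour.
groups:
  - name: disk-capacity
    rules:
      - alert: DiskWillFillIn4Days
        expr: predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 4 * 24 * 3600) < 0
        for: 30m
        labels:
          severity: ticket       # routed to Jira; nobody gets woken up
      - alert: DiskWillFillIn1Hour
        expr: predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[1h], 3600) < 0
        for: 5m
        labels:
          severity: page         # routed to PagerDuty
```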
Yeah, just try it out and post on the meet-up page if you have issues or questions. And if you are interested in giving a talk, approach us. Yes. And, yeah, if you want to share your experience with Prometheus. Yeah, and if you're just starting out and you hate it and you go back to Nagios, share that as well. Ha, ha. That would be very interesting to know, yeah. It would, yeah. Thanks, guys. Thanks.
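For anyone who wants to try it out as suggested, a minimal docker-compose sketch; prom/prometheus is the official image, and the mounted prometheus.yml is whatever scrape configuration you want to experiment with.

```yaml
# Minimal local Prometheus for experimenting; UI will be on http://localhost:9090.
version: "3"
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
```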