I recently asked a buddy of mine if he had ever heard of Prometheus, and he said, yeah, that's the thing that people use to monitor Kubernetes clusters, right? Well, sort of. He was right that Prometheus has become the de facto standard for that particular use case, but I'm sure it's no surprise to this group that you can use Prometheus for so much more than that. My name is Will Gathwright. I'm the CTO at Talos, and today I'm going to lay out my personal experience using Prometheus as the mission-critical core of the tech stack of an IoT startup. So maybe I'll set the table a little bit here. Let's start by setting a few goals for commercial and industrial IoT in general, and then I'm going to talk about my specific use case. For many established companies out there, their IoT transformation is really just beginning. A commercial building represents a huge data store, but almost all of that data is locked up in little metal boxes and it's completely inaccessible. That being the state of the market, there's actually a lot of value in just making that data available and then performing some basic services on it. Maybe an example would help here, so I'll give you one from my own work. Talos uses IoT technologies for automatic fault detection and diagnostics of mission-critical equipment, and we're starting with HVACs, heating and air conditioning units. What we do is start off with a small sensor board. One of these gets installed into every HVAC cabinet, and this lets us send data from a few different electrical sensors and temperature sensors up to the cloud. And you can tell a surprising amount about the health of an HVAC unit just from a few temperature and electrical current readings.
So once this data is in the cloud, we can analyze it against a few different sets of rules. One set of rules was handcrafted by HVAC experts, and another was learned from the past performance of that particular unit and of similar units. Now, your HVAC in a commercial or industrial building is really a critical piece of mechanical machinery. Climate control is absolutely crucial to many industrial processes. And many buildings, especially here in the West, here in America, become completely uninhabitable when the HVAC goes down. One of the hallmarks of so-called class A office space is that you can't open the windows. You can't actually get ventilation into the building, by design. This is a selling feature. So you've got to leave if the HVAC dies. At the same time, your HVAC is really hard to monitor. If you think about an HVAC, it's typically up on the roof of the building, or someplace very difficult to get to, maybe tucked behind the building. And it can take a trained technician to really be able to see if the unit is healthy or not. Your typical person doesn't know enough about the refrigeration cycle to diagnose one of these things. So this represents a big operational risk, doesn't it? On the one hand, you have a piece of equipment that is completely critical to the functioning of a building. On the other hand, it's difficult and expensive to assess its health. Because of this trade-off, there are many companies that just choose to run their equipment to failure. And yeah, they have to deal with the occasional emergency. Sometimes it dies when you really wish it hadn't. But that's expensive. So the current state of the art, the current best practice in the industry, is to perform quarterly preventive maintenance.
So this is where a technician goes to a unit and lays on hands, really whether the unit needs it or not. And that's the cycle that we're trying to break here. By using some of these IoT devices, we can perform preventive maintenance essentially once a minute instead of once a quarter. That's going to make this equipment a lot more reliable, and it's also going to make technicians a lot more productive, because then they only have to service the units that actually need their help. But, maybe of most interest to this crowd, we can actually do one step better than that, can't we? We can take this new, very rich data set and track the time series data that comes off of it. Now, time series data is so much more useful than static data for assessing the health of any machine, because you can pick out trends. An HVAC has two very important, we'll call them pipes or lines, that carry the refrigerant. These are called the suction line and the liquid line. If you come up to me or any technician and you tell me that the difference in temperature between the suction line and the liquid line is 25 degrees Fahrenheit, then, I don't know, I have to tell you that the unit's probably fine. But if you tell me that the difference is 25 degrees Fahrenheit and it's falling at five degrees per hour, I can tell you that you're leaking refrigerant in a pretty severe way. You need to get that thing patched up right now. It's going to fail. So quick summary, what's the TL;DR? Industrial IoT is about exposing and analyzing time series data that used to be difficult or impossible to gather from real-world equipment. Some version of that statement was what my co-founder came to me with when we started Talos. And that was essentially my job.
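The point about trends can be made concrete with a small sketch. It assumes a handful of made-up readings of the suction/liquid temperature difference and an illustrative leak threshold; neither the numbers nor the threshold come from the talk beyond the "25 degrees, falling at five degrees per hour" example, and the function name is hypothetical.

```python
from typing import Sequence


def slope_per_hour(hours: Sequence[float], values: Sequence[float]) -> float:
    """Least-squares slope of `values` against time measured in hours."""
    n = len(hours)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired samples")
    mean_t = sum(hours) / n
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in zip(hours, values))
    den = sum((t - mean_t) ** 2 for t in hours)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return num / den


# Hypothetical readings: suction/liquid temperature difference in deg F.
hours = [0.0, 0.5, 1.0, 1.5, 2.0]
delta_f = [35.0, 32.5, 30.0, 27.5, 25.0]

# Assumed threshold for illustration only; a real rule would be tuned by experts.
LEAK_SLOPE_F_PER_HOUR = -3.0

trend = slope_per_hour(hours, delta_f)
leak_suspected = trend <= LEAK_SLOPE_F_PER_HOUR
```

The latest static reading (25 degrees) looks unremarkable on its own; only the slope over the window separates the healthy case from the leaking one.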
That was my charge: to figure out how to serve this data and track it and deal with it. And after setting up a few prototypes based on different types of tech stacks, we really had sort of an aha moment. At the most abstract, the goal is to track metrics about a metal box that is connected to the internet. That's what this does: it connects that metal box to the internet. So the aha moment is that servers, computer servers, internet servers, also happen to be metal boxes that are connected to the internet. They're out there in the world, and they're invisible to the people using them for the most part. And the world of, we'll call it site reliability engineering, is actually very, very good at this job. So my job then became to just see what kind of tools those folks use. And that's how I found Prometheus and its ecosystem. We use Prometheus, Cortex and Grafana, and this turned out to be a pretty perfect fit. So let's go on a quick tour here of what we built and how we use these tools. Now, it all starts with something that most people are pretty used to looking at. This is an interactive map. We use Google Maps. And instead of driving directions, we can now give directions to broken HVACs. These are HVAC units just like any others out there, except that we have these Talos Guardian sensor boards on them. There are only a few units pictured here in my demo, but it's easy to imagine this scaling to hundreds or thousands of units nationwide. You can just put them all on the map and have a good way to track them. The units that are working fine are here in green. So this is one here in my office, and I can confirm that I'm quite comfortable. And here across town, instead of having to drive to this unit, we can zoom in on it. And we can tell that this unit's in alarm.
This unit is having some trouble, so let's check it out. So here's the unit. I can tell you that we got data from this unit recently. We can, again, confirm its location. And I have the key metrics that I want coming off of here. I can see how much electrical current the unit's using. For you HVAC geeks out there, we've separated the compressor, which is sort of the beating heart of the HVAC, if you will, from the indoor fan motor and the outdoor fan motor. And here are the aforementioned liquid and suction lines, along with the outdoor air temperature that we're picking up from a local weather station. So we can run some basic analytics on this. We can see some faults. Here's the first fault: this compressor is on a lot of the time, even though it's only 70 degrees Fahrenheit. That's bad, because if the next day were to be, say, 90 degrees Fahrenheit, which is uncomfortably warm, the people in this building are going to be hot. So this is a classic use case. The people in this building are still comfortable on the day that I've selected this data from, so nobody knows there's a problem. But we do. I can predict a failure, because if the day gets much hotter, then the HVAC won't be able to keep up. It's a very bad sign. I can also see that the liquid line is much hotter than the outdoor air temperature. Here the liquid line is almost 100 degrees Fahrenheit and the outdoor temperature is 70. This tells me that the unit is having trouble rejecting heat back into the air, and so I would want to send a technician to this site to clean the condenser coil. And the last thing we can see here is that the suction line is actually a little bit too warm at 60 degrees.
You really want that at about 45 degrees, because it's impossible for an air conditioner to make 55 degree Fahrenheit air, which is what it's designed to do, if the suction line is hotter than that. We'll keep going. One other interesting thing that we were able to pick up on this unit was this example here. Here we see that the indoor fan motor is actually pegging out my sensor. My sensor can only detect 50 amps, and it's maxed it out. This is a condition called locked rotor amps. Without getting into too many of the details, you just have to know that this is a bad thing, but it's a very fleeting state. It lasts only for a very short period of time, and it's hard to detect unless you happen to be standing right there listening for it. And we were able to detect this thing. Pulling locked rotor amps twice in a row means the fan motor is going to die. You've got to go replace it. Fan motor dies, you've got no cold air. So it looks like this unit could use some service, and we were able to diagnose it from the comfort of my office all the way across town. That's the miracle of IoT. Now, fantastic use case, right? And Prometheus is actually a perfect fit for this use case. With a little bit of planning, you can get extremely powerful behavior for monitoring a fleet of IoT devices like this. Again, in my experience, and I don't pretend to be an expert at this stuff, there were two things that attracted me to Prometheus. The first one was really how easy it is to administer the alerts. All you have to do, let me pull this back up here, all you have to do is edit your YAML file and hit reload, and you get tons of this rich behavior for free. You get reliable evaluation, you get these alert delays, you get silences, deduplication, the whole deal. It all sort of works out of the box, which is really nice. It's really powerful. The second was the query language itself.
PromQL is really a remarkable tool. There's a little bit of a learning curve, but once you get your head wrapped around it, PromQL is really a rich and expressive and powerful language. So for the second part of this talk, I wanted to go over an example for you. Let's start to craft an alert. In the HVAC world, your suction line is what gets cold. That's what I mentioned earlier: the suction line is ideally at about 45 degrees. Once it hits 32 degrees, bad things start to happen, because at 32 Fahrenheit water starts to freeze. It will freeze out of the air. It's going to turn your HVAC into a block of ice. When the HVAC turns off, that block of ice is going to melt. Now you have a bunch of water. It can cause water damage. It can cause mold. You really want to avoid this. But these freeze-thaw cycles, I've seen them take place out in the wild over weeks and months, just because nobody knows they're happening. The unit's still pumping out cold air, so the people in the building don't notice. It's just really struggling. And now the building owner has a bunch of mold damage. It sucks. So ideally we could alert somebody if we detect that there is ice on the suction line. Let's look at how we might go about doing this. Before we even start to think about the rule, what's our metric going to look like? Well, the design decision that we made was to define a single metric per equipment type. So the type of equipment that we're dealing with here, this equipment right here that you see on the screen, is called a package unit. It's a rooftop package unit. It's one of these big metal boxes that you see on commercial rooftops.
And so let's define a gauge metric for the equipment type of a package unit. We'll call it equip_package. Now, all data from all package units that I'm monitoring will come to this metric, this gauge. And each of these sensor boards has an individual ID number. We call this the device EID, the device entity ID. So in order to distinguish between different data streams, we will define device EID equals whatever it is. Okay. This is how we can, again, distinguish data coming off of, let's say, this unit from this unit, from this unit, from the unit across town. Quick aside here: I'm sure that the observability nerds in the audience are going to be really quick to point out that this is technically unbounded cardinality. That's a very good catch. Very smart. Pat yourselves on the back, you are technically correct, which is of course the best kind of correct. We have made the business decision that if this startup is so successful that we have so many units out there that this becomes a problem, then we'll have plenty of money to pay one of you all to come in and patch this up for us. So, quick aside, but yes, that is a known danger of doing exactly this. The other thing we have to keep in mind, back to our metric design, is that there are many data points that come off of a single unit. We're doing the temperature of the suction line and the liquid line, and we're doing a few different CT amp readings. So we have to add another label. The label that we want here is going to be, let's say, point equals suction F. Okay. So here we go. Now we've individually found the gauge and the labels that define this sensor point for that particular unit. It's very precise, fantastic.
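To make the metric design concrete, here is a minimal sketch that renders one unit's readings in the Prometheus text exposition format using only the standard library. The exact spellings `equip_package`, `device_eid`, `point`, `suction_f`, `liquid_f` and `compressor_amps`, the device ID and the readings are all assumptions for illustration; the talk only names them aloud.

```python
from typing import Dict, List


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def sample_line(metric: str, labels: Dict[str, str], value: float) -> str:
    """Render one sample as: metric{label="value",...} value"""
    label_text = ",".join(
        f'{name}="{escape_label_value(val)}"' for name, val in sorted(labels.items())
    )
    return f"{metric}{{{label_text}}} {value}"


METRIC = "equip_package"  # assumed name: one gauge per equipment type

# Hypothetical readings from a single sensor board.
device_eid = "1001"
readings = {"suction_f": 41.5, "liquid_f": 96.0, "compressor_amps": 12.3}

lines: List[str] = [f"# TYPE {METRIC} gauge"]
for point, value in readings.items():
    labels = {"device_eid": device_eid, "point": point}
    lines.append(sample_line(METRIC, labels, value))

print("\n".join(lines))
```

Every new board adds a new `device_eid` value, which is exactly the unbounded-cardinality trade-off acknowledged in the talk.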
So if we just say that we're going to alert whenever this is less than 32, then all we have to do is add a little for clause and a bunch of labels, some annotations, and we're done, right? Ship it, go live. Well, maybe not quite. First of all, let's generalize this. This is an alert for one particular device, but all I have to do is blow away this piece of logic, the device EID matcher, and now this is valid for every package unit in the fleet. Very powerful. But unfortunately, the logic is still not quite right. What if this is, let's say, a cold winter day? No cooling is going to go on on a cold winter day, but the suction line is just a piece of copper exposed to the elements. At least where we live in Richmond, Virginia, it's going to get below freezing. So this condition will fire at that point. It's an alert, but in this case it would be a false alarm. Really bad. So let's put in a little bit more logic to prevent this. It's not an alert condition if the day happens to be cold. In fact, it's only an alert condition if the HVAC unit is running and then the suction line freezes. So no problem, let's add a little bit more logic. Let's say that we want this to be "and", and we want equip_package again, where in this case the compressor amps have to be greater than one. And the very eagle-eyed among you will realize that we have to do this on device EID. So now, for any device EID, it only compares the particular device EID here with the same device EID here. Now when the suction line is cold and the unit is using electrical power, meaning amps are greater than one, we're going to get an alert. One is a very small number of amps for a compressor to be using, so we know it's on. Fantastic. And this rule actually looks pretty good.
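The matching logic described above can be simulated in a few lines. This is only a sketch of the idea, not how Prometheus evaluates rules, and the metric and label spellings are assumed. Under those assumed names, the rule expression would look roughly like `equip_package{point="suction_f"} < 32 and on(device_eid) equip_package{point="compressor_amps"} > 1`.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]
Sample = Tuple[Labels, float]


def select(samples: List[Sample], point: str) -> List[Sample]:
    """Keep samples whose `point` label equals the given value."""
    return [(labels, value) for labels, value in samples if labels.get("point") == point]


def and_on_device(left: List[Sample], right: List[Sample]) -> List[Sample]:
    """Keep left-hand samples whose device_eid also appears on the right,
    mimicking the effect of `and on(device_eid)`."""
    right_ids = {labels["device_eid"] for labels, _ in right}
    return [(labels, value) for labels, value in left if labels["device_eid"] in right_ids]


# Hypothetical latest samples for three units.
samples: List[Sample] = [
    ({"device_eid": "1001", "point": "suction_f"}, 28.0),        # iced up while running
    ({"device_eid": "1001", "point": "compressor_amps"}, 14.2),
    ({"device_eid": "1002", "point": "suction_f"}, 25.0),        # cold winter day, unit off
    ({"device_eid": "1002", "point": "compressor_amps"}, 0.0),
    ({"device_eid": "1003", "point": "suction_f"}, 46.0),        # healthy and running
    ({"device_eid": "1003", "point": "compressor_amps"}, 13.0),
]

cold = [(l, v) for l, v in select(samples, "suction_f") if v < 32]
running = [(l, v) for l, v in select(samples, "compressor_amps") if v > 1]

firing = and_on_device(cold, running)
firing_ids = [labels["device_eid"] for labels, _ in firing]
print(firing_ids)
```

Unit 1002 is below freezing but drawing no current, so it is filtered out; only the unit that is both cold and running would alert.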
In fact, I put a version of this rule into production once upon a time. But when we did, we actually ran into a little bit of a gotcha. Fortunately, we caught the problem before it became an issue. And this one's a little bit more subtle. Data coming off of an IoT sensor, especially over a wireless connection, is not 100% reliable. Prometheus is not actually scraping the device in my use case; it's scraping an endpoint that lives in the cloud. So just because you scrape that endpoint doesn't mean you're getting fresh data. It could be that the sensor is failing to send data for whatever reason, and you can end up with stale data. A lot's been written about stale data in Prometheus, and I'm not going to retread all of that here. My greater point is that this notion of stale data is actually a pretty big issue in the IoT world. To work around this, I separately defined another metric that keeps track of the timestamp at which I received a data point. I do this in seconds since the epoch. We're going to call this metric feeds last seen: the last time that I got new data from a certain data feed. Again, it's tracked in seconds since the epoch. So in order to make this useful, I need to do this on device EID, and I need to take the rate of increase of this timestamp over some period of time, let's say 10 minutes, and ensure that it is greater than zero. All right. So again, what is this little bit of logic doing? It's making sure that I'm getting fresh data, because if I'm getting fresh data, the timestamp will update and the rate of increase will be greater than zero.
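The freshness check can be sketched as follows. This is a simplified stand-in for "rate of increase over the last 10 minutes is greater than zero"; it compares the first and last scraped values in the window and does not reproduce how PromQL's `rate()` is actually computed. The metric name `feeds_last_seen`, the one-minute scrape interval and the sample values are assumptions.

```python
from typing import List, Tuple

# (scrape time, scraped feeds_last_seen value), both in seconds since the epoch.
Scrape = Tuple[float, float]


def is_fresh(scrapes: List[Scrape], now: float, window_s: float = 600.0) -> bool:
    """True if the last-seen timestamp advanced within the window.

    Assumes `scrapes` is sorted by scrape time.
    """
    in_window = [value for ts, value in scrapes if now - window_s <= ts <= now]
    if len(in_window) < 2:
        return False  # not enough scrapes to see any change
    return in_window[-1] > in_window[0]


scrape_times = [float(t) for t in range(0, 601, 60)]  # one scrape per minute

# Healthy feed: the device reported about 5 seconds before each scrape.
fresh_scrapes = [(t, t - 5.0) for t in scrape_times]

# Stale feed: the cloud endpoint keeps exposing the same old timestamp.
stale_scrapes = [(t, 115.0) for t in scrape_times]

fresh_ok = is_fresh(fresh_scrapes, now=600.0)
stale_ok = is_fresh(stale_scrapes, now=600.0)
print(fresh_ok, stale_ok)
```

Under the assumed names, the corresponding extra rule clause would be along the lines of `and on(device_eid) rate(feeds_last_seen[10m]) > 0`.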
If I'm just reading the same data over and over again, the same exposed metric over and over again, then this won't increase, the rate of increase will be zero, and this alert won't fire. I could choose to set up a separate alert showing that I'm getting stale data if I want to. That's fantastic. So we want to fire this alert only if new data comes in to reaffirm that the frozen suction line is actually still frozen. Otherwise we could get a single cold point, and the stale metric could carry it forward for 10 minutes and fire a false alarm. Wonderful. Now we have an actually pretty nice, reliable alerting rule here. And what happens if we add, let's say, another 100 boards onto package units all over the place? The beauty here is that I don't have to do anything. We don't have to update this rule at all. It's really cool. I don't have to touch my config file. I don't even have to reload Prometheus. We get this all for free. That's very powerful and it's very scalable. This is a ton of functionality, all in a very small, lightweight package, and it's pretty easy to administer. The company's grown quite a bit since I first set all this stuff up, but Talos started life as a two-person startup. I was the only engineer. And the fact that a single person could stand up all this scalable infrastructure, really without any prior expertise in this kind of technology, I think shows how lightweight it is while still being very, very scalable. I really hope this has given you all a taste of how we run Prometheus for IoT. And again, I want the point of this talk not to be how a world expert might do this perfectly. The point of this talk is really to show our experience in standing up this system, again, with no prior expertise in the technology.
So I have no doubt that the PromQL wizards in the audience have probably come up with 10 different and better ways that we could have designed this data. You might be getting real agitated by now. And if this is you, please, I welcome your suggestions in this matter. And even if that doesn't describe you, I'll invite any questions that you have as well. If you want to get in touch with me, again, my name is Will Gathwright. I'm with Talos IoT. You can find us on Twitter, at IoT Talos. And my email is up here on the screen. It's just W Gathwright, my last name, at Talos IoT.com. Really appreciate it.