All right, well, welcome again to another OpenShift Commons briefing. This week we're really happy to have with us Thomas Wiest, who's been working at Red Hat for quite some time and is one of the folks behind the scenes at OpenShift.com. He's going to talk to us today about OpenShift and monitoring, and some of the lessons that we've learned running OpenShift at scale in production, and we're hoping to get your Q&A at the end of the session. If you have any questions after the session, you can always post them to the OpenShift Commons mailing list or send them to myself at bemueller at redhat.com and I'll try and find a way for you to get them answered. But please do use the chat room, and we will get started now. So Thomas, I'll let you introduce yourself and take it away.

Okay. Yeah, my name is Thomas Wiest. I've been on the OpenShift operations team since the very beginning. I was actually the first operations person hired onto the team, so I know all the skeletons and all the hardships that we've gone through in getting our monitoring up to par. Everything I'm going to be talking about today applies to the current OpenShift Online, which is running OpenShift version 2. That being said, we're very happy with our setup, and so we fully expect to use the exact same technologies and the same patterns and philosophies going forward with version 3.

So here's a quick history of our monitoring in OpenShift Online. We started with Nagios and collectd. Almost everyone knows what Nagios is; collectd is a little lesser known. It just collects values and then can graph them, so it's not an alerting infrastructure at all the way Nagios is, while Nagios doesn't really have the graphing aspect of values over time. So we were using those two as disparate things. We quickly outgrew this, though. We were using the Nagios SSH-based checks, and those have a number of problems. If your check takes longer than 30 seconds, by default it'll time out. You can go ahead and change these defaults, but there are a bunch of different defaults where checks would take a little bit longer, and so they would fail and then we'd get alerted and paged, and it was a pain when there was nothing actually wrong; the box was just slow for whatever reason on that one check. The checks also don't return values. They just return okay or warning or critical. So you can see over time if a check was failing on a consistent basis, but you couldn't see the value of whatever was being monitored over that time period. And that's why we would jump over to collectd. We would get the alert in Nagios saying, oh, our disk space is too low, and then we'd have to jump over to collectd and look at the disk space over time to see: is this a sudden problem? Has it gradually been going down? Those types of things. Since they weren't integrated with each other, some things had history where we could see them over time, but they didn't have corresponding alerts. And vice versa, we had alerts that didn't have corresponding history, things we really ought to have had history for, and so it was just a real, huge pain. So with all these things together, we decided it was time to see what other options were out there. One of our team members had actually worked with Zabbix before, and so he gave us a presentation on Zabbix: these are its strengths, this is how it works. And immediately we saw the value.
And so we transitioned to it over two years ago. And again, we're pretty happy with it. So I'm going to go into Zabbix a little bit. I'm trying not to make this a Zabbix training, but to understand our philosophy and how we work, you need to understand the technology at least at a high level, so I'm going to explain Zabbix a little bit in the beginning here.

The reason we picked Zabbix was because it has a very clean separation of concerns. I'll get into this a little bit more when I talk about the individual components, but I think you'll see once I explain those that the separation of concerns is very nice. It also has a very highly cohesive design, where each individual component is specialized to do one thing and one thing only. It's quite scalable. We have no scaling problems whatsoever, and we know exactly what we need to do to get it to scale higher than we currently are if we want that. It's very, very nice in its structure, and it really is integral to our monitoring efforts. We treat Zabbix as an extra team member, in that Zabbix handles a lot of things for us that we don't want a human to actually handle. So we treat it just like another team member, and it's one of the harder-working team members.

Now, note: the Zabbix web UI is not great. Zabbix the product, Zabbix the monitoring, does everything we want. It does it very scalably, very stably. We're very happy with Zabbix the product; the web UI that Zabbix presents, not great. So if you evaluate it, just know that the web UI is not great, but overall it's an excellent product. Don't let the web UI deter you from actually using Zabbix. You'll get used to it. It hurts at first, but you'll get used to it.

Okay, so I'm going to go through some of the high-level Zabbix concepts. The first thing to know about Zabbix is it has a thing called items, and these are just buckets where you throw values. For disk space usage, for instance, one of the things we monitor is the percent free. So there's an item, and you just take the percent free of a disk and shove that value into the item. That's all the item does: it stores the values as you update it, and it keeps them historically. The values that you pass in can be strings, integers, floats, any of the basic types. We try to only use integers because they are extremely space efficient. You can map them to enums in Zabbix, so even though they're integers, you don't have to remember that a one means this condition. You can tell Zabbix that a one means this string, and then when you see it in the web UI, it'll actually present you with that string. So you don't have to memorize all these values, which is really nice. And it makes it really easy to add triggers. Basically, for us, the vast majority of the time, zero is the success case and greater than zero is the failure case. So when we're writing things that populate items, we can just count up the number of errors, and if there were zero errors, then we're good. If there weren't, then we shove in the number of errors. And items have a retention policy. By default, at least with our defaults (we may have changed these), we keep every single value reported for one week.
So if you report every five minutes, it'll keep every single one of those values for an entire week. After that week, it just keeps trending data, so you can go back and see roughly what the value was over a given hour. We keep that trending data for one year, and then it falls off. These are all tunable and all changeable; you can set them however you'd like.

One thing that we use quite often is that Zabbix lets you populate an item for one host from another host. Basically, if I am running on a broker and I know something about a node in OpenShift, I can shove in values for that node. I'll talk about that a little bit more when we get to some of the things that we monitor specifically for OpenShift, but that's the idea: I can shove in values that are for a different host. It doesn't sound all that impressive, but it makes your monitoring really, really nice, because no matter which host detects the problem, you can shove the value in against the right host and have it handled there automatically.

And then every item can be graphed independently. In Nagios, like I was saying, the checks just say: is it currently failing or not? You don't actually get to see what your disk usage was over time, what your percent free was. Since the basic building block of monitoring in Zabbix is an item (you cannot monitor anything in Zabbix without it being an item), every single thing you monitor can be graphed, so every single thing you monitor, you can see over time. And I want to stress this because it is so important: when the crap hits the fan, what do you do? You go in, you look at the values of the items that are alerting, and you ask: is this a short-term thing? Did this just happen, or have we gradually been trending towards this? In a disaster scenario, this is very, very important.

Okay, so here's an example of an item. This is from our staging environment. Here you can see we have the disk free on /var, and it's a percentage. The percent free on /var; you can see this one has 76 percent. This is the last time the item was updated, so we have 76 percent free. And then this change column is the difference from the last time it was reported until now, and there was no difference; it reported 76.36 the last time also, so it's not really going down very much. And then you can click on the graph. I highlighted the graph because this is what you see when you click on it: that instance in our staging environment, the disk free item. And you can see up here it says seven days, so I'm looking at the graph over seven days. This is live, of course, but I can actually grab here and drag and get just this portion of the graph; if I just want to see what happened right here, I can do that. You can also go back in time: you can set the date and time between which you want to see the graph. It's extremely useful. And again, every single thing that you want to monitor has to have an item, and every single item you can do this with. So it just makes it very, very powerful.
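As a sketch of what populating an item actually looks like, here is roughly how one value could be pushed in with Zabbix's stock zabbix_sender utility. The server address, host name, and item key are illustrative, not the ones our team actually uses, and on the Zabbix side the item would be a trapper-type item:

```python
#!/usr/bin/env python
# Minimal sketch: compute "percent free" for a filesystem and shove the
# value into a Zabbix item using the stock zabbix_sender CLI.
import os
import subprocess

def percent_free(path):
    st = os.statvfs(path)
    return int(100.0 * st.f_bavail / st.f_blocks)

subprocess.check_call([
    "zabbix_sender",
    "-z", "zabbix.example.com",   # Zabbix server to send to (illustrative)
    "-s", "ex-node1.stage",       # monitored host the value is recorded against
    "-k", "disk.pfree[/var]",     # illustrative item key
    "-o", str(percent_free("/var")),
])
```

Because the -s flag names the host the value is recorded against, the machine running this script doesn't have to be that host; that's exactly the populate-an-item-for-another-host trick described above.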
Okay. Oh, and they have things called aggregate items, which are items that will add up other items, or run some equation on other items, and shove the result in as their own value. So if I wanted to see the percent free across all of stage, I would create an aggregate item that goes across all of the items for /var on all of the stage hosts. And then it itself would have a graph like this, where I could see the actual free space of the entire environment. We do those types of things as well, so we can not only see things on a per-instance basis, but on a per-host-type or per-environment basis too.

Okay, so that's the first thing: items. They're just buckets to throw values in. Now triggers, on the other hand, actually watch items. You put an equation in a trigger, and you say: if this equation becomes true, then alert. And how it alerts is that it has a state. It changes its own state from okay to problem, or to unknown. If the equation ends up being a divide by zero or something crazy like that, it'll flip into an unknown state, because it just doesn't know whether this is a problem or not; unknown basically means a bug in the trigger. A problem state, on the other hand, means: I've evaluated the equation, and the equation says this is a problem, so you need to check into it. Triggers also have a severity: when it fires, what kind of thing is this? Is it a warning, a high, a disaster? So triggers just watch items and run their equations against those items, and when the equations are true, they fire an event. That's all they do. I want to be clear about this: they're not the ones who actually go and try to fix things or page the admin or anything. All they do is flip themselves from okay to problem, or okay to unknown.

Okay, so here are examples of triggers. I'm using the same items as before so we can keep the same example. As you remember, it was that same node in stage, and what we have is /var here, and that's pfree, percent free. You can see here the severity is average, meaning it's not a disaster or a high; it's just something we want to know about. Zabbix actually named "average" what I would call warning. They do have a warning severity, but it sits below average, so we use average as kind of a warning case: basically, hey, go look at this. Again, that's their naming, and in my opinion it's not great, but whatever.

Okay. So: 10% free on /var. When we hit 10% free, we fire an average alert. And this is the equation over here. Basically, we look at /var's percent free over the last five times the item was populated, and we take the maximum value there. You have to kind of think through the logic, but basically, if the maximum value over the last five populated values is less than 11, that means every single one of them was 10% or below. And so then we fire this. We say: hey, it's been at or below the threshold for five values in a row, which means it didn't just dip down; it's been down for a few iterations of the check.
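As a rough sketch, the two expressions being described would look something like this in Zabbix trigger syntax, the first at average severity and the second at high (the host name and item key are illustrative, and max(#5) takes the maximum of the last five reported values):

```
{ex-node1.stage:disk.pfree[/var].max(#5)}<11
{ex-node1.stage:disk.pfree[/var].max(#5)}<6
```

Requiring all five of the most recent values to be under the threshold is what keeps a single momentary dip from paging anyone.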
When that fires, it means you need to go look at it. And up here we have the same exact one, but at 5%. So here you can see: less than 11 there, less than 6 here. We could do a less-than-or-equal 5, but we just tend to do less than 6. These two checks are identical except for the threshold value and the severity, which is high on this one. The severity change means that different things will run when this fires; that's why you use different severities, so that you can escalate them in different manners. Now again, the trigger only fires. It doesn't actually go and do that escalation. All it does is say: hey, my condition has been met, so I'm going to flip myself from okay to problem, at high severity.

So this is an interesting page, and it actually is very important in our environment. This is the 100 busiest triggers report, and here I have it for the week. So for the week, we saw these hosts with these triggers firing the most often. It counts status changes: going from okay to problem, back to okay, and back to problem. And again, this is staging, not production; I'm purposely showing you the staging stuff because it's more volatile and more interesting to look at. This is our authentication service, and it flipped state 222 times. Obviously this is something we're going to look at. And what we do on our team is look at this report every single week, as a team. We look at it and we say: okay, who knows about this? What happened here? How can we fix this? This helps us track down bugs in the product, because we get to say, oh, this should not be failing this many times. It helps us track down bugs in the triggers, bugs in our code that populates the items, and also, of course, real actual issues, right? And we can report to other teams, because again, these are all graphed. We have external services that we use, and we've done this a number of times: we can go to the graph, look at when the service was down, send an email with that picture and say, here are the times; maybe you can correlate with your monitoring, whether you guys were doing something or your service was down, and see if you can figure out what's going on. But this is what we are seeing from our end. It's very powerful stuff. I'm probably not conveying how amazing this is, but we're very happy with how this all works, how we're able to investigate things very quickly, and how we're able to see things over time.

Okay, so that's items and triggers. Items are buckets; triggers watch the buckets and run some algorithm against them. That's all they are. Okay, so: actions. These are the things that actually do something. Actions can watch one or more triggers. They sit there and watch the triggers, and whenever a trigger fires, they can say: okay, triggers with this severity, in this environment, or on these host groups, or whatever; when that fires, then I'm going to go run this, or I'm going to go do this. And they can do just about anything. They can page an admin. They can fire an auto-heal. They can run something on any box being monitored by the Zabbix agent.
So when a trigger fires, we can say: oh, well, that trigger fired for this host. As you saw over here, it actually shows the host that it fired on, right? So this host right here had a problem with app creation. Okay, that means we can fire an action on there to try to auto-heal. We can say: okay, well, usually that means this, and so we can go and try to heal it.

Okay, so this is just an overview, to cement it in your mind. Again, this is the high cohesion. Items only care about the data values. They don't care about firing alerts or doing anything else; all they care about is tracking the values over time and nuking them after a week or a year or whatever. The host populates the items: the host sends over the data, and the items sit there taking their data, doing nothing else. The triggers sit there and watch the items, and whenever they find a case that meets their condition, they fire. Triggers don't know about the host. Triggers don't know about actions. They just know about the items, and in certain cases they fire. Actions only know about triggers. Actions actually have no idea about items whatsoever. They just watch the triggers, and when one or more triggers meet certain criteria that you set up for the action, they fire a script. So items don't know about triggers or actions; that's what I mean about highly cohesive. Each thing cares about its own area, and that's it. It has a very clean separation of concerns; each component just doesn't care about any area that isn't immediately its own.

Now remember, triggers can watch one or more items, so you can actually make these triggers extremely complex. For instance, we can have an SSH check, and if SSH is failing, we put that in an item, and then we have a trigger watch that item and say: okay, fire, saying you can't SSH to this box, but only when this other item, which is a ping, says the box is up. So basically you can find cases where the box is pingable but not SSH-able. If it's not pingable, then it's probably not SSH-able either, and so you don't need to fire the SSH trigger; a separate ping trigger will fire instead. And then the actions can do different things based off of that. If it's the SSH trigger, you can have the agent stop and start SSH, and that might bring back SSH. But if it's the ping trigger, you know there's probably a problem with the host itself. We do all of our stuff in the cloud, so we can literally call into the cloud APIs and say: hey, restart this box, or hey, stop and start this box, which will actually put it on a new physical host. We can have it do different actions based off of those things. So you can make these triggers extremely complex. We try not to. It's nice that we have that power, but we like to do everything as simply as possible, because it makes tracking these things down very easy. That being said, we do have triggers that operate exactly like that, where they watch multiple items and fire only when certain combinations of conditions are met. And then actions can watch one or more triggers.
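To make that SSH-versus-ping case concrete, a trigger along those lines might combine the two items roughly like this (the item keys are illustrative, and the boolean operator is "&" in older Zabbix versions, "and" in newer ones):

```
{host1:ssh.check.last(0)}>0 & {host1:icmpping.last(0)}=1
```

If ping is failing too, this expression stays false and a separate ping trigger fires instead, which lets the action layer choose between restarting sshd and recycling the instance through the cloud API.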
And so you can have an action fire only when certain triggers are in certain states. So that's kind of our Zabbix flow, and that's how we operate. I'm going to go ahead and move on to our Zabbix statistics.

In our environment currently, we have over 165,000 items that we monitor. We have over 70,000 triggers watching those items. And we have over 500 new item values added per second; we're populating the buckets over 500 times per second. We use Zabbix a ton, and so we're very involved in it; we read a lot of blogs about it, a lot of performance blogs. We're currently at 500 new values per second, but that's not a limit we've hit; it's just the rate we're currently at. There are people online who have gotten Zabbix up to thousands per second, so we know that we have plenty of headroom there. And we're not even fully optimized. We actually know about certain things that we could do to optimize our flow even further; we just haven't done them because we haven't needed to, and, as with anything in computing, there are always bigger fish to fry.

Okay, so that's Zabbix in general, at a high level. It's such a key part that I wanted to explain it so you could see our philosophy and how we do things; it really does shape how we monitor our infrastructure. So: our philosophy in general. As operations people, we are expected to be developers. We subscribe heavily to the DevOps mentality, and not just because of PaaS. Obviously we're pushing a DevOps product and we're hosting a DevOps product, but we just believe in the philosophy. We think it's great. Every single operations person on our team is expected to code. And specifically in the monitoring realm, we code the things that gather the data and populate the items with it. We code the triggers; triggers are algorithms. It's not normal code, in that you don't open up an editor and write it, but it is an algorithm using expressions from programming, and so you have to understand how to program expressions to program triggers. Not to scare you off: you can make the triggers very, very simple. But if you want to get to the more complex things, you have to know how to code the triggers. And then, obviously, we code the auto-heals. For the auto-heals, Zabbix just fires a script; it just goes and runs something. Under certain conditions, you are expected to write that something. Zabbix doesn't ship it. If you want an auto-heal to actually go and attempt healing, then you go and write it. So we write that.

Another one of our general philosophies is that we monitor as much as reasonable. A lot of people say as much as possible, but we know there are things we could monitor that are possible but not necessarily useful. So as much as reasonable: that's our bar. And that's for a number of reasons. Obviously our time is limited. But also, shoving data values into Zabbix that don't really make sense slows down the database; it forces you to maintain a much larger database, you can't keep retention as long, and things like that. So, as you saw on the other slides, we do monitor a ton of stuff, but we take a practical approach to it. And the other thing that you saw; let's go back to the stats real fast.
So: 165,000 items being monitored, and 70,000 triggers. Obviously we have a lot more items than triggers, so plenty of items don't have corresponding triggers. And that's because we have things that we just watch. For instance, the number of currently running processes on a box. Now, obviously you can get to a point where there's probably a problem if you have too many processes, but it's just not likely to happen; you're likely to run out of RAM first, or the CPU will be pegged. And we do monitor all of those, and we do have triggers for all of those, and actions and everything. So basically, we monitor things that we want to be able to look at but don't necessarily care about alerting on at this time. If something becomes a problem, we will add a trigger; there's no problem there whatsoever. But as an initial thing, we just gather the data and watch it over time. And then, when we actually go to write the trigger, having that historical data really helps, because we don't have to do as much fine-tuning: we know historically these are the values this item has had, and so we can make a pretty good estimate of what a good number to trigger on is. So that's our philosophy there.

We also try to auto-heal as much as possible. We love the auto-heals. We never want an admin to get paged if a computer can fix the problem. Now, just because a computer fixes it doesn't mean we don't know it happened. That goes back to that report page where we see every single trigger that fired. If a trigger fired and an auto-heal happened, we still see the trigger on that page in our weekly review. So the fact that an admin didn't go investigate it doesn't mean we don't have insight into the fact that the problem happened.

Let me give you an example. We use MCollective, right? And we have an auto-heal action for when we can't talk to MCollective from our broker: if our broker can't talk to a node over MCollective for a certain period of time, the trigger fires, and then we have an auto-heal that goes and attempts a restart. All it does is restart the MCollective service. If that fixes it, great. Now let's say that was happening a ton; say something was crashing that daemon. Well, every single time the trigger fired, it would be counted, and at the end of the week we would say: oh my gosh, this daemon got restarted a hundred times, or two hundred times, or whatever. An admin, since we have that auto-heal, never had to go and actually fix it, but we still have insight that it happened, and we can graph every single one of these triggers. You can graph when the trigger fired, and so you can correlate it: is this happening because of a cron job? Is this happening because of a spike in app creates? We can correlate this data to find out exactly when it happened and find the likely causes. So we always want to auto-heal if possible. And you would not believe how many things you can auto-heal with very, very simple measures: daemon restarts, killing off a process, sending a signal to a process, general things like that. It's really interesting; a lot of our auto-heals are pretty simple. They're not as complex as you would think.
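In fact, an auto-heal like the MCollective restart described above can be nearly trivial. A minimal sketch, with the service name and the exit-code convention as assumptions rather than the team's actual script:

```python
#!/usr/bin/env python
# Sketch of a tiny auto-heal script that a Zabbix action could run:
# restart the MCollective daemon and report whether the restart worked.
import subprocess
import sys

def restart_service(name):
    # RHEL 6-era init scripts; newer hosts would use systemctl instead.
    return subprocess.call(["service", name, "restart"]) == 0

if __name__ == "__main__":
    healed = restart_service("mcollective")
    # A nonzero exit tells the calling action that the heal itself failed.
    # Either way, the trigger that fired still shows up in the weekly
    # top-100 report, so the event never becomes invisible.
    sys.exit(0 if healed else 1)
```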
So the whole system is kind of involved, and it's really interesting: you populate an item, the trigger fires, the action runs, and then your script, your heal, actually happens. But the heal itself is not all that complex.

Okay, so again, we look at the top 100 busiest triggers each week. Oh, and we also have a rule, and this is actually a rule in Zabbix, that if the auto-heal does not work, then after 10 minutes we page an admin. This is a general rule for every single auto-heal we have, so when we create a new auto-heal, we don't have to go and set this up for it and risk accidentally forgetting and then not getting paged. It's automatically there: if an auto-heal fires and it does not fix the problem within 10 minutes, we page an admin, period. It's really nice, because we can do a ton of stuff with auto-heals, and as long as they catch a large percentage of the cases, we're fine; otherwise we page the admin. And at the end of the week, when we do our top-100-triggers review, we also talk to the person who was on call answering the pages. They'll say: oh yes, I saw this, this, and this. And that lets us add more auto-heals. We are the ones who are on call, right? So it's in our best interest to go in there and create more auto-heals so that we can sleep at night. It's a great system. We like it quite a bit.

All right, so: scope. When we monitor, we like to do both individual-host and end-to-end checks. On the individual host, we monitor the OS, the network, the application. Specifically with OpenShift, we have app-create loops on the broker. We actually do app creates on the broker itself, as a regular user of course, not running as root; just a user contacting the local broker, so there's no network involved and nothing external to worry about. It runs an app-create loop and adds data to the item saying yes, I was able to create the app, or no, I was not, and then it cleans up the app and all that. So that's an individual-host check; obviously it's still putting the application on a node, so it is using the system, but it's running locally on the brokers. And then we also do end-to-end checks. We have an app-create loop that runs against the external interface too, so it actually goes through our front-end proxy layer, hits the brokers, and does the whole thing from that side.

An example of why this matters: if we take down half the brokers for maintenance, we'll take them out of the load-balancer rotation so they're no longer taking traffic, and then shut down the daemon and do whatever we need to do. When we shut down the daemon, the app-create loop starts failing on the individual host, but the end-to-end check never sees a failure whatsoever, because we've taken those brokers out of rotation. And that's why we like to do both: we can infer things from how the end-to-end checks and the individual-host checks behave together. If it's failing sporadically on the end-to-end check, we can usually see that same sporadic failure on an individual host.
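The shape of such an app-create loop is roughly the following. This is an illustration using the rhc client with made-up names, not the team's actual check (those are in the public repo mentioned later):

```python
#!/usr/bin/env python
# Illustrative app-create check: create and delete a throwaway app, then
# push the failure count into a Zabbix item, where 0 means healthy.
import subprocess

def run(*cmd):
    return subprocess.call(list(cmd)) == 0

failures = 0
if not run("rhc", "app", "create", "monitortest", "php-5.4", "--no-git"):
    failures += 1
# Clean up even if the create failed partway through.
run("rhc", "app", "delete", "monitortest", "--confirm")

subprocess.check_call([
    "zabbix_sender", "-z", "zabbix.example.com",
    "-s", "broker1.example.com",             # host the item belongs to
    "-k", "openshift.app_create.failures",   # illustrative item key
    "-o", str(failures),
])
```

The same loop pointed at the local broker gives the individual-host check; run through the external proxy layer, it becomes the end-to-end version.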
And when we see that same sporadic failure on an individual host, we can say: clearly it's this one, because it's failing sporadically on the individual host and that's what's causing the end-to-end check to fail as well.

Okay, so we also do web checks, end-to-end web checks. We'll do a light one: just hit a page, make sure it comes up, and run a regular expression against it to make sure it has certain words or whatever. And then we also do heavier ones that actually hit the database, to make sure the web tier can talk to the database and so on. Obviously we have more, but this is a broad outline of our individual-host checks and our end-to-end checks.

Okay, so the hosts themselves. Here's our philosophy on those: hosts of the same type should be as similar as possible. Brokers should all be identical to other brokers, nodes identical to other nodes, Mongo hosts identical to other Mongo hosts. Now, we don't just state that as a philosophy; we actually monitor for it. We do things like an RPM consistency check, where we check that the exact same RPMs are installed on every single host of the same type. Every single node in our production environment has the exact same RPMs at the exact same versions. When we do releases, we rely on this check to make sure everything got installed properly. We also monitor our SELinux module consistency. We actually had a problem early on where an RPM would have a bug in it and wouldn't properly remove or upgrade an SELinux module, so we check the actual list of modules, as well as the versions of the modules, to make sure they're consistent. And we check the SELinux file contexts too; we had problems with this early on as well. The interesting thing is, and I'm sure you've worked with SELinux at least a little, when SELinux is not consistent, you get weird behavior. An app being created on one node will succeed, but it'll fail on another, and you will be pulling your hair out trying to figure out why it works on this one but not the other. The world just does not make sense until you see: oh, it's because SELinux is not configured exactly the same on both hosts, and this host is denying that operation. So anyway, we put these checks in place, and now every single time hosts are inconsistent, we know about it right away and we go and fix the problem.

We also use config management. In version 2, we use Puppet; our entire production infrastructure is configured with Puppet. And in version 3, we're looking to use Ansible. These are very important things for keeping your instances as consistent as possible. If you talk to anyone who has run a large infrastructure, it's key not to have unique unicorns; you need these hosts to be as consistent as possible, otherwise you are simply not going to be able to scale.

Okay, so when it comes to security, this is our mentality: we monitor all hosts for new security updates. We actually use the yum security plugin for this. We run it and have it list the security updates, and then we send in an item with the number of security updates available. So again, zero means zero security updates, and we're good; if there are 20 or whatever, it'll show that. We alert on this in all the environments.
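A rough sketch of that security-updates check; the exact yum subcommand varies by RHEL version, and the item key is illustrative:

```python
#!/usr/bin/env python
# Sketch: count available security errata via the yum security plugin and
# push the count into a Zabbix item, where 0 is the success case.
import socket
import subprocess

# On RHEL 6 this comes from yum-plugin-security; depending on the yum
# version the listing subcommand is 'updateinfo list security' or
# 'list-security'.
out = subprocess.check_output(
    ["yum", "-q", "updateinfo", "list", "security"]).decode()
count = sum(1 for line in out.splitlines() if "RHSA-" in line)

subprocess.check_call([
    "zabbix_sender", "-z", "zabbix.example.com",
    "-s", socket.getfqdn(),          # this host
    "-k", "yum.security_updates",    # illustrative item key
    "-o", str(count),
])
```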
So every single environment gets alerted for every single host that has security updates. Now, what we do with that: we don't actually page out per host, because obviously we have lots of hosts. We do an aggregate across the environment, and we only page out when the aggregate across all of the hosts is greater than zero, and we only do that for prod. So the text of the page is basically: hey, there are some hosts in prod that need their security updates applied. We don't send the admin a page for every single host that triggers.

So we alert in all environments, but in stage we actually have an auto-heal set up. When the trigger fires, an action runs on the individual host to just go ahead and apply the update. Now, we love that, but it is a little scary, right? We don't like to do that in production, because what if that security update causes us downtime or whatever? So we auto-heal this in stage, and then it runs alongside all these other checks I was telling you about, the individual-host checks and the end-to-end checks, and it basically tests the security update automatically and sees if it's going to cause us problems. All of that is automatic, done by the system. And if it doesn't cause problems, then we go and apply the updates to prod. Now, I put here that we apply them to prod manually; it's of course scripted, but we kick it off manually, if that makes sense.

Okay, so here are some of the things we monitor on the brokers. I'm not going to go through all of these completely, but I'll call out one specific one. Oh, and I actually kind of did talk about this already: from the broker's perspective, we make sure we can talk to every single node. We do what's called mco ping, or MCollective ping: through MCollective, we ping all the nodes and make sure the broker can talk to them. And like I said, we actually fire an auto-heal off of this. The broker will impersonate a node and shove in values for that node: when it can't talk to a node, it shoves in a value saying, hey, this node is having problems with MCollective. And then Zabbix runs the auto-heal on that node specifically, and the auto-heal just goes in and restarts the daemon. So from the broker's perspective, we can actually affect the environment. This has worked out great, because MCollective sometimes has these types of issues where, if you change the firewall or whatever, MCollective won't reconnect to ActiveMQ properly. In that fashion, we can actually fix it, and then see at the end of the week how many times this type of thing happened.

I'll call out one other one, and that's capacity. We actually use oo-stats. We try to use the built-in tools as much as possible, and when they're not there, we build them; as an operations team, we build them and then work to get them into the product. oo-stats actually started off as an operations script that did something very similar, but oo-stats is much nicer than what we had.
We had coded something that got the statistics back for the districts, and then we would alert when the districts didn't have enough capacity. In that case, they didn't take our literal script; we gave them the script, they looked it over, saw how it was doing things, and then they wrote their own version. Apparently our code in that case wasn't good enough. In other cases, though, we've contributed a number of scripts to OpenShift that have been accepted, and accepted pretty much wholesale, with minor cleanups. There's one that's actually going in relatively soon called GearTop, and basically it's top, but gear-aware. If you've ever looked at top on a node, you'll know that it doesn't segregate the processes into the different gears; you just see all the processes as one list. Well, the question is: which gear is using the most CPU? Which gear is using the most memory? You can find that out with GearTop, because it works on a gear-by-gear basis. Top just does processes; GearTop does actual gears. So that one we wrote, and it's been contributed up to OpenShift Origin, and it's going in pretty soon; I actually think the merge request got merged. For OpenShift Enterprise, we'll wait for the next version to get it.

All right, so those are the things we monitor on the brokers. On the nodes, the biggest thing I'll call out is oo-accept-node. We have these oo-accept-something scripts; you saw over here we had oo-accept-broker. You seem to have lost your voice there; hang on a sec. So, we run them on an ongoing basis to make sure we're still in a good configuration, that a change didn't get pushed in Puppet that broke us. By doing this, we can catch these types of things in our staging environment: if we push a change that breaks something oo-accept-node is looking at, it will fire and we will see the alert in staging, before we actually go to production with it. So we love the oo-accept scripts. We monitor with them heavily.

Okay, and then on Mongo, these are the things that we monitor. Most of these come as part of the Mikoomi project. We actually took the Mikoomi project, which is written in PHP, and since for version 2 we did everything in Ruby, we ported it over to Ruby. But it's the same exact code, and it monitors the same types of things. It's great stuff. I highly recommend, if you're going to monitor OpenShift, looking at the Mikoomi project too. It gathers the stats and populates the items really well.

Okay, so here's some additional information. We have a blog post out there that talks about our OpenShift monitoring. We've had several customers ask us for the specific scripts we use to populate the items, and also for some of the templates we have for Zabbix, and so we actually open-sourced them. This blog post talks about us open-sourcing them and explains how our Git repo is laid out, and then here's a direct link to that repo. So you can actually see the exact same checks that we use, the ones I called out before. The vast majority of those we have open-sourced and put out there.
There were some that we didn't, simply because they only applied directly to us or were secret for some reason, but it's good stuff out there. And then we also have this monitoring-OpenShift-with-Zabbix white paper. It's a little old, but it actually takes you through setting up Zabbix from nothing to a point where you can monitor OpenShift. Again, it's a little old, but it's still really good content; the concepts in Zabbix haven't changed, and the concepts in OpenShift haven't changed.

Okay, I think... yeah, sorry, go ahead. So there have been a couple of questions in the chat, and thanks for that; I was looking for a pause in your breath so we could keep moving on. This has been great. Boris had asked: how do you go about the RPM consistency check?

Okay, that one. Yeah, it's an interesting one. What we do is we rely on RPM itself: we simply run rpm -qa and get the list, sort the list, and then compare the lists amongst the hosts. Now, those astute about how RPM works will know that there will be some RPMs that won't be consistent no matter what you do; for instance, the kernel, because by default RHEL keeps around three kernel RPMs, and if you're building hosts over time, those versions will overlap. So we have special cases where we say: either ignore this outright, or treat it in this particular way. And then, we actually don't do this, but you certainly could: RPM has a verification switch. If you wanted to get really, really strict about these RPMs, you could do the verification and check that the MD5 sums of all the files in the RPMs are still good. You can get down to that level. We don't do that, though; we simply do an rpm -qa, sort it, compare the lists, and exclude certain RPMs. Did that answer your question? He's still got himself on mute. Yes ma'am, he says.

All right, and then the next question we had was from Judd, and I think you might have mentioned this earlier. He's asking: is all of this on AWS, or do you also have gear? If so, what, and how is it monitored? And Judd is also asking if you're using pagerduty.com.

Okay, I'll take the easiest one. We are not using PagerDuty. We use the Zabbix built-in facilities for doing the actions and the escalations. We've actually evaluated PagerDuty to see if it would give us a lot of capabilities over and above what we have. We haven't gotten to that point yet, although we're not against it; what Zabbix provides is good enough for now. Okay, and the second one I will answer with... sorry, can you repeat the first question? Yeah, the way he said it was: is this all on AWS, or do you also have gear? If so, what, and how is it monitored?

Okay, so yes: this is all running on AWS. All of OpenShift Online version 2 runs on AWS. We have multiple regions inside of AWS, and in every single region we run in multiple availability zones. So we follow the best practices for availability zones, and then we have nodes in multiple regions, basically based on customer demand, honestly. So yes on that. And then the gears thing; I could be wrong, because gears and AWS don't really make sense together in that sentence, but I think what you were asking is: do we monitor the gears?
And so how we do the gears is this. Since we're providing an online service, we actually have legal requirements about what we can and cannot do with the customer data inside the gears. So what we do is monitor the web state of the gears. We have a script that runs every so often, and it gathers the HTTP return codes of the gear URLs; each gear has an application URL and a gear URL variable inside its environment, so we hit those URLs, get back the status codes, and store them. And then we can say: okay, over the last five runs, how many apps have changed state? If one app changes state, it's probably the application. But if 90% of the apps on that node change state, it's probably us. And that's the kind of monitoring we do on the gears: we basically monitor everything to make sure it's up, but, since we have the legal requirement, and also because we're not the ones pushing the code, there's no way to know that they didn't just push an update. How do we know they didn't push an update two days ago with a bug in their code, and that's what's causing their app to fail? There are many cases there that we simply can't know about. So we look for cases where we can determine that it's likely us and that we need to investigate, and we do that. I hope that answered it.

Yeah, sort of; if he meant gear as in hardware, not quite, but he likes the answer anyway, and he's in a noisy place and can't unmute himself. So thanks for that answer, because it answers another question someone was asking. So: if you were to give advice to someone who wants to monitor their user applications (I've used things like New Relic in the past), what do you suggest from your point of view, doing it at scale?

Yeah, New Relic is great. We actually do use New Relic for our brokers, and we use it more as a performance-tuning thing. We can actually look at the requests and watch them over time, so our developers use New Relic to optimize the broker and whatnot. We really like New Relic for the developers, and we in operations look at it as well. We don't currently alert on anything in New Relic, but we've considered it. So New Relic, yeah, is great.

But again, we're big fans of Zabbix, and our team has actually written a Zabbix cartridge, and it has a client-side sender. We wrote it, but it's part of the OpenShift contributed cartridges; it's not an official cartridge, but it's one of the contributed ones that you can install. So I would suggest installing that. You install it on a per-project basis, and then there's an embedded cartridge that will actually send the data. You add that embedded cartridge to each application you want to monitor, and that gives it to you. So that's how I would suggest doing it. Now, that would be per group: marketing would have their own, sales would have their own, and so on.
If you want one big setup that everyone uses, and you control the nodes, then yeah, I would do that. It's basically the same thing as the client-side case: inside the embedded cartridge, what it's using is called the sender, and all the sender does is take a value and shove it into a Zabbix item. So you could just change where the sender is pointing, to point at your main Zabbix server. Or, if you're using a different monitoring solution, I don't really have much advice for you. Well, actually, from a Nagios point of view, you would probably just SSH directly into the gear; you would give Nagios rights to the gears. That's probably how you'd do it with Nagios. But anyway, what I would do is have everything that you want to monitor for the application live inside the gears, simply because that keeps the gears isolated from each other. You still get the same cgroups data, and your monitoring overall won't kill the performance of the box, because it will be cgrouped along with the rest of the application. So that's what I would suggest: do it on a per-gear basis, and do it inside the gear.

Okay, then two more questions. Boris is asking: the Git repo that you've released with the Zabbix scripts, how complete is that compared to what you were actually running?

That's a good question. I actually don't know, sorry. I know that we have quite a bit of stuff that we don't have out there; I can't really ballpark a percentage. I will tell you this, though: for version 3, we're switching our policy, and we're actually doing open by default. So you can go out and see our current version 3 efforts today. Again, we're switching to Ansible for these types of things, and it's in our GitHub OpenShift organization: github.com/openshift/openshift-ansible. If you look in there, that is our current effort. Basically everything we do as an operations team that isn't secret, like our credentials and such, we're going to shove into that repository. So all of our monitoring checks will be there, all of our scripts for launching instances, everything that we're doing is going to be open by default. I don't know what the percentage is today, but for version 3, I would say it's going to be above 90%.

So that brings us to a great segue into the last philosophical question, which Judd is sort of asking: could you address the reasoning behind the switch from Puppet to Ansible?

Yeah, okay. I'm trying not to get into a flame war here. We really like Puppet, we do. We have some Puppet experts on our team who are great. Just about six months ago we implemented Puppet roles and profiles, which is kind of a recommended best practice, and it has been amazing; it has revolutionized our Puppet code base. We are happier with Puppet today than we've ever been, because of that switch and because of some other things. So it's nothing against Puppet. It's just that Puppet has no orchestration layer. What we had was Puppet for our config management, and we used PSSH for ad hoc commands.
We also used MCollective for command-and-control, multi-instance orchestration, and then SSH on an individual basis. And our SOPs, if you go look at them, are a mix of our MCollective stuff and PSSH and all these other things. What we saw with Ansible is that we could do all of that with a single tool. Ansible is great at config management. It's also great at command-and-control. It's great at instance creation; we actually had to write our own instance-creation script, whereas Ansible just does that out of the box. So basically we saw that we could unify on a single tool, and that really appealed to us. That's really the reason. There's nothing against Puppet other than the fact that it is not built for orchestration. We've also looked at the Puppet Razor stuff for instance creation, and it's interesting, but it's not quite there yet; at least in my opinion, it doesn't work quite as nicely as Ansible does.

So Boris has a couple more questions that we're going to try and squeak in here so we can keep it under an hour. He doesn't have a microphone, so I'll keep reading them for him. How do you go about finding top-traffic gears, for example? Are you getting any stats from cgroups, watchdogs, traffic monitoring, et cetera? And you did mention GearTop; is it coming with the next OSE update, hopefully?

I am 90% sure it is. I'm just going to say yes to this; I'm pretty sure it is. And yes, we actually do collect data from cgroups. We actually have a thing, and see, this is dangerous: we haven't actually open-sourced this yet. We are considering it; there are a lot of pieces involved in it, but we are considering it, and it's based on Ansible, so if and when we do, it will go in the Ansible repo I mentioned earlier. What we have is this: we do monitor all those things you talked about. We monitor the cgroups, and we monitor the overall CPU health of the machine. So let's say the CPU trigger goes off; we actually trigger on less than 5% CPU idle. What happens is that fires an auto-heal that runs on our command-and-control box, which then kicks off an Ansible playbook that goes onto the host and, using the cgroup data, gathers a bunch of candidate gears to move or to do different things with. If we're firing for CPU, the candidates will be the gears using the most CPU. And what we will do is actually move them: we'll pick one and do a gear move with oo-admin-move. Ansible will go onto the node, gather the data, then go onto the broker and kick off a move, moving the gear from one place to another. We do that for memory, and we do it for disk space used as well. For disk space, what's interesting is that we have a lot of idled gears, and idled gears only use disk space, so we can move just the idled gears and free up disk space. So we actually have kind of a wear-leveling thing that goes and auto-balances. Our scheduler on the broker is pretty good, but we oversubscribe quite a bit.
And so for these cases where we oversubscribe and it bites us, instead of having an admin go and actually fix it, we have this tool that goes and moves things around. A couple of caveats. One: we do not do this on the paid tiers. This is a free-tier-only thing. On the paid tiers, you pay us to keep your app stable, so we try very hard not to shut down your apps or move them or whatever. This is a free-tier thing: we oversubscribe the free tier, because it's free and every single box we set up there costs us money, right? And then we use these types of tools to bail us out when the oversubscription bites us. So that's kind of an example. Did that answer your question? Yes, we are looking at the things you asked about, and yes, we do have auto-heals baked in. And I do want to open-source these things, and when and if we do, they will be in the openshift-ansible repository.

All right. Well, I think that answered all of the questions that everybody had, and we're at the end of our hour. So I really want to thank you very much, Thomas. We are definitely having you back to talk again. This has been a great session, and people really have been asking wonderful questions. So this is definitely useful, and maybe we can even get some of the Zabbix folks next time to talk about new features and things like that as well. So, thank you very much.

All right, you guys all take care, and we will be talking to you again, probably next week, from OpenShift Commons. All right, take care.