Good morning, good afternoon, good evening, wherever you're hailing from. Welcome to another edition of Ask an OpenShift Admin office hours. We are here, myself, Chris Short, executive producer of this thing we call OpenShift TV, and I am joined by the one and only Andrew Sullivan. Andrew, how are you today? Wired. I'm ready to go. I've been up since like six. I've had about a whole pot of coffee at this point. I'm on pot number three because I've been up since four. My wife, she's an exercise instructor, and exercise people are a little crazy, so her first class starts at 4:30 a.m. So she was up at 3:45 this morning. Lovely. Yeah, I was not. Yeah, there's some kind of mental thing about, like, there's a huge difference mentally between 5:59 a.m. and earlier versus 6:00 a.m. I don't know what it is, but why does 5:59 a.m. feel so much earlier and so much harder to get up than 6:00 a.m.? Anyways, yeah. Yes. Thank you, Chris. I appreciate the introduction, and welcome to the Ask an OpenShift Admin office hours. So this is episode 31, where we'll be talking about Alertmanager and kind of the entire OpenShift monitoring stack, right? The things that are going on inside of there. So this is one of our office hours series of shows here on OpenShift TV. What that means is that we are here for you, for our audience, and we want you to ask us questions through whatever client you happen to be watching us on, Twitch or YouTube, either the OpenShift or the Red Hat YouTube channel. Feel free to ask us questions in chat at any point in time about whatever topic is top of mind for you. And I see that there's a couple of people chatting already, so please keep that up. Basically what we'll do is answer those questions to the best of our ability. If we can't answer those questions, we will be sure to follow up. Every week on Friday I post a follow-up blog post, so any follow-up questions or follow-up answers,
we can put them in there if I get them in time. And we'll also follow up on the following week's stream if we need to. And I also am not afraid to post retractions, corrections, et cetera, because Andrew frequently gets things not quite right, how shall we say? So, as I said, today our topic is alerting, monitoring, right, the sort of observability stack. And I am extremely happy to welcome our guest, Brian Gottfried. I realize I didn't ask you beforehand, so I hope that I pronounced that correctly, Brian. So, Brian, if you don't mind, please introduce yourself. Absolutely. And you nailed the name. Well done. Gott-fried, but it's easy. So, yes, my name is Brian Gottfried. I've been with Red Hat for about a year and a half now. I'm a DevOps guy in general: pipelines, app development, that sort of thing, but lately what I've been working on is a lot of logging, monitoring, and metering. How do you make sure that when your cluster and applications are up and running, they're actually staying healthy? So I'm very excited to be able to talk today about Alertmanager, about user workload monitoring, everything for making use of the existing metrics that ship with OpenShift and then sort of expanding that to meet your individual needs. So, happy to be here. Yeah, and this is a topic I'm excited about, and actually, as I said, I've been up for a while this morning, so I've already been doing a lot of research, and what I thought I knew about the OpenShift observability stack, turns out there's been a lot of changes. Yeah. So, for any Red Hatters watching, I actually did a presentation at Tech Exchange back in 2019, you know, in the before times, around the OpenShift monitoring stack. And since then, there has been a dramatic change in things. So I know, importantly, we want to cover some of the architecture, what it looks like, what are the different components.
And then, you know, Brian, I definitely want to see how all of that comes together to enable user workload monitoring. That was a huge one. I know people were asking about it for a long time, and that went GA in 4.6, if I remember correctly. So it's been there for a little bit; I just haven't had time to circle back around. So, Chris, before we get started, I see ourHope9 has been asking some questions. So ourHope9 is asking some very future-looking questions, and I will ask them to email me, short at redhat.com, as well as andrew.sullivan at redhat.com. I forget your, is that accurate? Sorry. Andrew.Sullivan. Yeah. Send us an email and we will get you the answers you're looking for, because what you're asking for is very much a product manager kind of question that I don't have the answer to in my head, nor do I know where to go look it up, so I have to ask the PMs. I'm sure Andrew's kind of in the same boat when it comes to Service Mesh. Yeah, I don't know the specific Istio version behind the OSSM version without, you know, going back and finding the same thing. I will take this opportunity to plug. When is it, Chris? The 24th is the What's Next? Or no, the 24th is the What's New? The What's New in OpenShift 4.8 is coming on the 24th. It will be live streamed here, wherever you're watching. Yes. So yeah, that's one way to find out. We have two types of presentations when it comes to that. What's New covers what is in, usually, the upcoming release, so in this case 4.8, and you'll see product management present on all of the cool stuff that's been added. And then we have What's Next, which is also product management doing a presentation on what's coming in future versions. And, as always, keep in mind that roadmap things are subject to change and all of that. So sometimes, if you were to go back and look at the last What's New and the current What's Next, you'll see some mismatches.
That's the nature of roadmaps, folks. So sorry, sorry about that. Yeah. But yes, ourHope9, please send us an email; we'll follow up on that and get you the answers that you need there. And just for everybody that was curious, we were discussing future-facing things. Don't try that at home on your own cluster. Or do. OKD is bleeding edge, right? You can do it on an OKD cluster, just don't do it in prod, and please don't call support if it breaks. Yeah. If we're doing it on air here and we say it's a future-facing thing, it's not going to work on a current version, just to let everybody know. Yeah. Yeah. So, Daniel, I see your question: any suggestions for long-term storage of Prometheus data, looking to keep historical data for a cluster for a year? Let me acknowledge that, and we'll push that one off until we let Brian take over, and we'll make sure that we address it in that segment, so please stay tuned while we go through, in just a moment, what I call the top-of-mind topics. Just things that have happened, things that I think are interesting or useful to you all, our audience, and we'll take just a couple minutes to go through those before we move on. We also see, Smitesh, that the links I dropped, or the Restream bot spammed, are for you, Smitesh, but there's some breakage in the way the links came out, so I'm going to fix that real quick. But basically, to answer your question, how does OpenShift make your orchestration easier than others, what's the key point: the key point is it is a full, like, cloud-native experience with OpenShift, whereas with some other distros you kind of have to go, okay, now I have Kubernetes, what's next? Right? Like, we've kind of had an opinionated view, using other, you know, CNCF-landscape-type projects to build a full-fledged cloud-native experience. That's the way I like to describe OpenShift.
Yeah, I usually compare it to RHEL, right? Red Hat for 25 years has had this relationship with the Linux community where we work upstream, we do a lot of things upstream, but for our customers, right, the enterprise customers, we bring a lot of additional testing and validation, and, you know, ultimately all of those things that we do to provide a supported Linux. Right, one that we can trust, one that you can trust, you know, in Red Hat Enterprise Linux. And OpenShift is the same way, right? We have that same relationship with the upstream Kubernetes, as well as all of the other projects and products that make up OpenShift. So OpenShift at its core is Kubernetes, and then we add that, you know, Red Hat supportability work, and then we add on a whole bunch of other things to make it a fully complete, robust, production-ready platform for your containers. Yeah. We want to get your workloads to production as fast as possible, basically. Yeah, and I see Chris linked a bunch of stuff there. At a minimum, just go to openshift.com, and especially try.openshift.com, and you can get immediately hands-on. You can start checking out what's available with OpenShift. Try. Learn. I'm just going to spam all this stuff today for some reason. So thank you. I'm sorry for the repeating messages; that didn't happen an hour ago, but now it's happening. So, there we go. A couple of things. So I think, where did this one come from? I think it came from Twitter. Chris, Twitter? Yes, a tweet. Yeah. Yeah, so actually I think there were two of them. So the first one is disconnected installs. There was somebody who had tweeted at Chris, who looped me into the tweets, about disconnected installs being hard, right? They're not fun to manage, primarily because of the image content source policies, so ICSP. And what that does for a disconnected install is you say, I'm going to mirror this content, and you use the...
Oh, now what's the command? It's not skopeo, it's... no, I think it's integrated with oc now; it's oc adm catalog mirror. So basically you point it at the source, you know, the OpenShift release, which is, for better or for worse, container images spread across I think four different registries. There's Quay, there's registry.redhat.io, there's catalog.redhat.com, there's like four different places that it's spread across. It pulls down all of that content, and it creates an image content source policy, which maps "this source image is now stored in this destination," or this new place. And for the cluster that ICSP is added to, when you, you know, say "I need this image," it doesn't know that it's actually coming from somewhere else; it automatically gets remapped to the new location. So I saw that tweet, I reached out to product management, I talked with both the installer PM as well as one of the PMs who works heavily with the disconnected stuff, and effectively the answer was, yeah, we know, and we're trying to do better. So I know that there's some roadmap stuff. Again, pay attention for when the What's New comes along. I know they're doing a lot of work to help make that process much simpler. So, one of those things, and let me share my screen here. You're the Chrome I want. So I'm going to share this, you're the Chrome I love. Let's turn on Do Not Disturb so that way I don't get things popping up. So if I go to, now, what was I going to search for? I want to search blog.openshift.com, or openshift.com/blog, and I want the... it is the update service, I think. So there is a thing. Yeah, the OpenShift Update Service. So we talked about this back when we did the disconnected install stream. We talked about it briefly back in the 4.5 days when this was initially announced; you can see 4.5.4. Additionally, it is still not GA yet.
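For a concrete picture of the mapping being described, here's a minimal sketch of an ImageContentSourcePolicy like the ones the mirror tooling generates. The mirror registry hostname and repository names here are hypothetical placeholders, not output from a real mirror run:

```yaml
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: example-mirror
spec:
  repositoryDigestMirrors:
    # When a pod asks for an image from the original source registry...
    - source: registry.redhat.io/redhat/redhat-operator-index
      mirrors:
        # ...the node transparently pulls from the disconnected mirror instead.
        - mirror-registry.example.com:5000/redhat/redhat-operator-index
```

Once this is applied, workloads keep referencing the original image names; the remapping to the mirror happens underneath them, which is exactly why the cluster "doesn't know" the content is coming from somewhere else.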
So this is one thing: when it does go GA, this will make that process much easier. Effectively, it connects into Cincinnati and takes all of the graph data, which is what maps versions, nodes to edges, and what's an eligible update and all that. So it takes that whole service and brings it offline, so that from your deployed OpenShift instance, you point at that update service and then you get that same, you know, over-the-air update experience, of just going in and clicking update, with a disconnected cluster that you get with a connected cluster. So that's one thing that will make it easier. But I also know that there is a rather significant effort. I think the slideshow that I had, the proposal, was something like 20 slides of "here are a number of pain points and here's how we can address those pain points." So I know that they're aware of those pain points, and I know that this process will be getting better over time. I'm not sure if you shared this link or not, Chris. I'm going to... oh, you did. Yeah, I did. You're on top of things. I'm on top of things today. I can't find the Rob Szumski Cincinnati one; that's what I was looking for. I've been avoiding that one, because we now have an official one. Oh, we do? Yeah, okay, good, on access. Okay, I need to... I basically need to create a Cincinnati shortcode at this point. I know. Let me see if I can. I'm going to the same place, so you'll probably beat me. You're logging in right now. Yeah, there is a... it is the OpenShift update. And we'll start with that and see what happens. That's going to be a lot of results for you. Yeah, yeah. Anyways, I'll find it eventually here. I'm looking. I'm looking, you just keep talking, buddy. But yeah, there is an actual... here we go. I've got to find the right... I just got to use the right browser. I've got it in my history. Oh, there you go. So we want that guy. And we'll post that in here.
So, yeah, this is the official one, like, Red Hat Labs. Those guys created this thing as a part of access.redhat.com, where you can see, oh, there's a stable-4.8 already. And you can see all those edges. So this graph, that data comes from Cincinnati, and that's where it's generated from. So effectively, the Update Service is what replicates all of that in an offline environment. Anyways, stay tuned. We'll have more information to share on that as time rolls forward. I expect to revisit disconnected in the not-too-distant future. And again, much like the last couple of weeks, if you have topic ideas, suggestions, requests, let me know. I'm doing planning for the next quarter, so anything you are interested in, I'm more than happy to accommodate. We've got some exciting and interesting shows coming up. We'll share those, we'll add them to the calendar as we know what they are, or as the dates get solidified, I should say. So that was the first one, and the second one was, somebody had asked about load balancer options. And this was the other one. I don't know if this came from Twitter, or if it came from an internal thread, or if it came from an email. But it was more or less, "hey, I've got an OpenShift cluster, and OpenShift requires a load balancer, so what are the options I need, or what are the options I have, for load balancing access to that?" So I'm going to first go to the docs, and I'm going to pick on, we'll go with bare metal. So if we scroll down in the docs, we eventually get down here to... where is it, the load balancer section. And you can see that there are two sets of load balancers: the API load balancer, and the application ingress load balancer, or *.apps. So this one is api.<cluster name>; this one is *.apps.<cluster name>.
So it doesn't have to be two physically or virtually, right, logically separate load balancers. It can be one single instance that you use for both of these functions, and that absolutely works just fine. But there's a number of different ways that you can do this. So the first one is not really a load balancer, and that is quite literally round-robin DNS. There is nothing that says it won't work to take a round-robin DNS entry for api.<cluster name> and make sure that, you know, the three entries there are your control plane nodes, and same thing for your *.apps, right? Make sure that it is the ingress, or excuse me, the infrastructure nodes that have an ingress controller instance. So you will encounter errors with *.apps if you have nodes that don't have an ingress controller as part of that DNS round-robin, because it's not doing any checking, right? It doesn't know whether or not there's something on the other end. It just says, hey, go to this host, and if there's no ingress controller, well, it's going to respond back with an error. So you would want to have, you know, infrastructure nodes that are hosting those ingress controllers in that instance, so you know where they are. But yeah, it works. It's not a load balancer, but it would work. You know, keep in mind, Christian, I can't see the chat. Christian's probably giggling if he's listening, about, you know, using DNS as a quote-unquote load balancer. So the next option, if you are using on-prem IPI, is the integrated, and I'm going to use air quotes, "load balancer." We've talked about this a number of times; I won't belabor the points, but effectively, you assign a virtual IP address, right, so just an IP address, that the DNS name for *.apps or api points to. And then keepalived on the cluster manages which node that virtual IP address is hosted on.
So this isn't really a load balancer, because all of the traffic goes to just one node in the cluster. The VIP lands on a node that has an ingress controller, so *.apps, and all of the ingress traffic is going through that one node. So if you, you know, shard, if you scale up the ingress controller from two to three to five to ten nodes, you're adding more instances that are basically available for keepalived to move that VIP, that virtual IP address, to. It doesn't distribute traffic across those. So with UPI, and in some scenarios, and we've talked about that before, I'll try and dig up the appropriate episodes, basically you want to use an external load balancer, and of course with non-integrated it's required. So that external load balancer can really be anything that you want, or that you are comfortable with, right, that is capable of providing the services that you need. So what do I mean by that? Usually there are two big things that you need to take into account with a load balancer. The first one is, of course, quite literally throughput. My applications need 74 terabits of throughput, right, to be able to access whatever they're doing. Okay, well, that means that you need to choose a load balancer that's capable of doing that, or a load balancer farm, or whatever it happens to be. I hope it's a farm at that point. Yeah. The second one is high availability. You know, so there are a number of options out there. Certainly partners like F5, Citrix, NGINX, etc., are all possibilities. You know, a lot of us on the Red Hat side will deploy HAProxy to, like, a RHEL VM and just use that. You can make that highly available by deploying two of those and using keepalived, so that you have that virtual IP address between the two load balancers. So that is certainly an option. Whether or not it's production-ready entirely depends on your needs, what it is that you're trying to do, so on and so forth.
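As a rough illustration of the HAProxy-on-a-RHEL-VM approach just described, the two frontends could look something like the fragment below. The hostnames, IP addresses, and node counts are made-up examples, not a production-ready configuration:

```
# /etc/haproxy/haproxy.cfg (fragment): TCP passthrough for both endpoints

# api.<cluster name> -> control plane nodes
frontend api
    bind *:6443
    mode tcp
    default_backend api-be

backend api-be
    mode tcp
    balance roundrobin
    server master0 192.168.1.10:6443 check
    server master1 192.168.1.11:6443 check
    server master2 192.168.1.12:6443 check

# *.apps.<cluster name> -> only nodes running an ingress controller
frontend apps-https
    bind *:443
    mode tcp
    default_backend apps-https-be

backend apps-https-be
    mode tcp
    balance roundrobin
    server infra0 192.168.1.20:443 check
    server infra1 192.168.1.21:443 check
```

Unlike the round-robin DNS option, the `check` directives mean a node with no ingress controller listening simply drops out of the rotation rather than returning errors to clients.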
So I'll also offer that it can be a little bit confusing sometimes with our partners. So if I go to, I'm going to pick on F5 here, if I go to catalog.redhat.com, and if I go to software and OpenShift operators, and then I search for F5. So we have this F5 operator inside of here, right? And when I deploy this, what's going to happen is I'm going to deploy a new ingress controller, that is, the F5 ingress controller, and it integrates directly with, you see, their F5 BIG-IP. And when I create a new ingress that is using the F5 ingress controller, it's reaching up and it's talking to their physical or virtual BIG-IP device and doing configuration at that level. This is not the same as if I were to configure an F5 load balancer to point to the default HAProxy-based OpenShift ingress controller. Right. So it can be a little bit confusing, a subtle distinction. But yeah, instead of pointing it at that, you know, catch-all kind of DNS entry, you are using the F5 to point directly at your apps' ingress. Yep. Yep. And there are positives and negatives to both of those, right? And Citrix works the same way, NGINX works the same way, right, there's a number of them. These are just the certified ones I tend to talk about. You know, there's others out there that absolutely can work. But just work with whoever your vendor is to understand if they have a Kubernetes ingress controller, whether or not that works with OpenShift, and whether or not you want to use their ingress controller, or if you just want to use their load balancer and the default OpenShift ingress controller. And I'm sure somebody is probably thinking, you know, hey, why would I choose their ingress controller over the OpenShift one? I leave that to the vendor, right. If you've got F5 up on the phone, you're talking to F5, they're the best ones to ask why their ingress controller is better than what OpenShift provides. And there's nothing wrong with that.
Okay, and the last one I have. So this is something that comes up, I won't say regularly, but not infrequently, and that is the time zone settings for the cluster. So I'm going to go back to the docs, and I want post-install, machine config, and configuring the chrony time service. So one of the things that you should do with your cluster, of course, is ensure that time synchronization is configured. Yes. It has to be within a certain window for the cluster to be able to successfully deploy, right; certificate times have to match. Certificates matter when it comes to time. Time matters to certificates. Exactly. And you don't want it to drift, for a number of different reasons, not the least of which is, in line with today's topic, being able to do things like troubleshooting and, you know, log correlation and all that other stuff. Time is important. So you definitely want to go in and configure chrony to synchronize. We've had a number of folks ask: right, I live in Eastern time, or Pacific time, or whatever time zone, and everything in my organization is on that time zone, but the OpenShift logs are in UTC. Can I configure the time zone? Right. So, unfortunately, not yet. There is an RFE for that. There's actually two of them. So I know that they're looking into how to do that, et cetera. I don't have a time frame. I'll dig up the RFE so that you all can look at it. Yeah, by default, the Kubernetes that comes from upstream, Kubernetes on GitHub, uses UTC and that's it, right? Like, there's no option to change the time zone, to my knowledge. So that's where it comes from, the original Kubernetes project, if you're curious. Yeah, and I know a lot of, like, I think Splunk offers the ability to localize, yeah, dynamically and stuff like that.
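For reference, the chrony configuration from that docs page is applied through a MachineConfig along these lines. The base64 payload below is a placeholder for an encoded chrony.conf pointing at your NTP servers, so treat this as a sketch of the shape, not a copy-paste config:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    # target the worker pool; a second copy with role: master covers control plane nodes
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-chrony
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
        - path: /etc/chrony.conf
          mode: 420   # decimal for 0644
          overwrite: true
          contents:
            # base64 of a chrony.conf listing your NTP server(s) -- placeholder value
            source: data:text/plain;charset=utf-8;base64,BASE64_ENCODED_CHRONY_CONF
```

Applying it triggers the Machine Config Operator to roll the change out node by node, which is the normal way to touch host-level files like this on OpenShift.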
I don't know if the Elasticsearch service, you know, the logging service in OpenShift, does or not. I feel like you could configure that, but we have an expert here. Yeah. Well, we're not talking about logging, necessarily, today. Right. Yeah. Although it is a part of the overall observability story. But yeah, so just be aware: unfortunately, today you cannot. And when I shut my mouth here in a moment, I'll take a moment and see if I can find the RFE. I'll post that into the chat if you want to take a look at it. Hopefully it's a public one. But yeah, it's not there yet. It is something we are aware of, but definitely ensure you configure the time synchronization, at a minimum, for a number of different reasons. Okay, that's all I got. I'm done with the top-of-mind stuff, so let's move on to today's topic. So, Chris, I haven't been paying attention to chat, really. I don't know if we have any... Okay. So just a reminder for any of y'all who are listening to us: please feel free at any point in time to ask us questions, and we'll address those as they come up. So, the TZ file, the tzdata file for Detroit, I know. I love it. It's great. I mean, it really tells you the history of time. Well, now, wasn't somebody saying that now there has to be one for Mars, because there's, you know, going to be computers that are... there's already computers on Mars. Yeah, and they measure it in sols. I know NASA does, but I'm not sure how. It would be fun to get NASA on to figure out, like, how are you doing time zone stuff? Yeah, like, I know there's OpenShift in space, like, yeah, what time zone do they use? UTC, probably, because everything is done by Zulu time, which is not Houston, not Houston time. No, they use UTC too. All right. So, Brian: OpenShift monitoring, OpenShift alerting. One of the first things that I know I had brought up is I think it would be good to do an overview and to look at kind of what the architecture is and what it looks like today
when we are deploying and using the OpenShift monitoring stack. So, you notice I conveniently had this link already up. I posted that one into the chat here. And what I really wanted to concentrate on is this diagram here, which can be a little intimidating, not the least of which because there's a lot of interesting words here. You know, we see this Thanos Querier here in the middle, right? People think Thanos, and of course they think the big purple guy with the, you know, with the wrinkly chin. They don't think of an alerting or a monitoring set of tools. So the first thing that I wanted to talk about is: what are the components of the OpenShift monitoring stack? Brian, if you want to describe those, you can take over the screen, or I'll bring up some of those components as you're talking about them. Yeah, no, I think this diagram's a great starting point, and feel free to keep driving the screen share for it. So essentially, you've got something that actually gathers data, right? Because you need to be getting data from everything that's running on OpenShift in order to be able to do anything with that data, and that's the Prometheus instances, both the platform one and, when you set up user workload monitoring, the user workload Prometheus. Then you need something to be able to view those, and that is Grafana for the platform one. And then you need something to be able to tell you when specific things are happening, because there's so much data coming into both of these Prometheus instances. Trying to monitor everything via, like, a Grafana dashboard is going to be trouble. Like, you need to know when specific things are happening that are critical to the health of your cluster, and that's where Alertmanager and the Thanos Ruler come in. So it's really this idea of: gather the data, then do things with the data.
You either want to be able to view it at a high glance and over time, and that's through Grafana, or you want to get quick "oh no, something's wrong" alerts, and that's where Alertmanager and the Thanos Ruler come in. So all of those components are sort of integrated. Prometheus and Alertmanager and Grafana are the sort of traditional stack; they've shipped with OpenShift for a while. The introduction of these Thanos components has been a little bit more recent, in the later half of the 4.x versions of OpenShift. Thanos is, I want to say, like a sister or a compatibility project with Prometheus. It implements a lot of the same Prometheus APIs, but it has a lot of functionality on top of that, designed for distributed, large-scale systems. And I know there was a question earlier about long-term storage; that's one of the things that Thanos does. I think when we dive into it a little bit more, we can talk about the different components of it. Really, what the Thanos Querier is doing in this specific case is acting as a central point for any of those components that need to go back and access data. Whether it's data that's shipping in from the cluster Prometheus or data that's being gathered by the user workload Prometheus, the Querier acts as a central point to be able to query that data. And, we don't have it right now, but if you were to also deploy some of the Thanos back-end stuff like long-term storage, the Querier allows you to pull data from in-memory storage, persistent storage within the cluster, and that long-term storage, kind of simultaneously and seamlessly, so you can get a query across a really long span of data without having to do any sort of stitching or combining across multiple queries.
So, as you mentioned, the diagram is large and complex, but thankfully most of it really works behind the scenes for you. When you're looking at a cluster, you're going to be querying stuff via Prometheus, running PromQL queries to get sort of ad hoc stuff. You can then take more complex queries, or queries that you know you're going to be looking at a lot, and make them into Grafana dashboards. And then you're also going to use those queries to say: I know the situation that I'm concerned about with the health of my cluster. I can represent that by monitoring this specific metric, and when it reaches this threshold, or when this metric changes from zero to one, or when it drops below a certain number, I want to get an alert about that. And that's how you end up setting up alerts for Alertmanager, or really, when you're setting up your own alerts, setting them up to go through the Thanos Ruler. You're still using PromQL to do it underneath the hood. It's just sending you out an alert that's more descriptive, rather than showing you a graph that has a sudden drop or sudden spike or something like that. Yeah. And so if we look at this diagram, we see a couple of things, and as with most things OpenShift-related, it starts with the Prometheus Operator. So I'm going to switch over to my cluster here, and what I want to look at: I'm going to go to projects and go to monitoring. So we have two projects that are inside of here, openshift-monitoring and openshift-user-workload-monitoring. So this block here, which is the OpenShift monitoring components, this is for monitoring all of the core OpenShift things, right, basically anything in a namespace that starts with openshift-. This is always installed, and you can't really mess with it, right? We don't support modifying the things that are going on inside of there. So if we look at this particular namespace, and I go to, not operators, I want pods.
We see a number of different things inside of here. Let's get that out of the way so it's a little easier to read. So here are all of our Alertmanager pods, right? We've got our node exporters; this is just a simple DaemonSet that goes and deploys all of those Prometheus exporters to each one of the nodes. The Prometheus pods, so on and so forth, and here's our Thanos Querier. So it deploys all of these components into that namespace and across all of the nodes, and then Prometheus pulls in those metrics from the node exporters and stores them. So the Prometheus Operator deploys Prometheus; it deploys Alertmanager. And I think, this is something that I always forget, and I don't know why, but Prometheus is where you actually configure the alerts; Alertmanager is only responsible for telling you about them. So Alertmanager handles how often alerts are fired, where you're actually sending them if they're going outside of Alertmanager, when you repeat them, whether you send resolved notifications for them, sort of managing how multiple alerts are firing. So there's a possibility, too: if, you know, there's some alert that says, like, the entire cluster is down, you probably don't want to be firing the 15 other alerts that are saying these individual components are also down, right? So you have the ability to set up a hierarchy of rules, so that if a certain alert fires, you're not getting bombarded with a bunch of other alerts that aren't relevant at the time. It offers a lot of control over the alerts that come out of Prometheus. Prometheus is kind of defining alerts on an individual level, and they're relatively sandboxed within each alert; Alertmanager lets you control the relationships between them. That makes sense. And that brings up one of my, I won't say favorite topics, but something I try to remind people of regularly, which is: alert fatigue is a thing. Yes.
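That "hierarchy of rules" maps to Alertmanager's inhibition rules. A minimal sketch, with made-up alert names, might look like this:

```yaml
# Fragment of an Alertmanager configuration: while ClusterDown is firing,
# suppress the component-level alerts it implies.
inhibit_rules:
  - source_match:
      alertname: ClusterDown
    target_match_re:
      alertname: (ApiServerDown|IngressDown|EtcdDown)
    # Only inhibit when both alerts carry the same cluster label,
    # so an outage in one cluster doesn't silence alerts from another.
    equal: ['cluster']
```

The `source_match` alert acts as the parent in the hierarchy; anything matching `target_match_re` with matching `equal` labels is held back while the parent fires.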
And you can build bad patterns through alert fatigue as well that will make you ignore good alerts. Yeah. How many times have we — you know, on some random website you get this pop-up, and it's just the same as 500 others, and you just click it away, and then — wait, that one was important. What did that say? Alerts are the same way: if you've got a bunch of noisy alerts that don't mean anything, it increases the likelihood that you're going to miss something that is important. I should also point out I didn't start at the beginning when I was talking about the operators. I started with Prometheus, but actually the cluster monitoring operator is kind of the first. It's responsible — you can see here — for deploying the telemetry client and the Thanos Querier, as well as the Prometheus side of things, and then the Prometheus operator goes and deploys Alertmanager. So that's the hierarchy of things there. Okay. So inside of my namespace here, I've got all of the pods, all of the things that are relevant to doing the core OpenShift monitoring process. And when I want to turn on user workload monitoring, I have to go and basically tell it: hey, we want to do user workload monitoring. So if I do the same thing — if I look in user workload monitoring — all I did here was an edit; let me see if I can find it. It's a ConfigMap, so I'm actually back inside openshift-monitoring. Yep, there we go — and then I want ConfigMaps, and it is cluster-monitoring-config. Thank you. So if we look at this guy: very, very simple. Literally, all the data is, is enableUserWorkload: true. In typical fashion, the operator then takes action and deploys all of those components, or redeploys them as necessary, for doing user workload monitoring. So come back up to our projects, go to monitoring. By default, without that configured, this would be an empty namespace. Let's look at pods.
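The ConfigMap being shown is small enough to quote in full. In recent OpenShift 4.x releases it looks like this (earlier releases used a tech-preview flag in a separate ConfigMap instead, so check the docs for your version):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```

Applying this is all it takes — the cluster monitoring operator notices the change and populates openshift-user-workload-monitoring with its own Prometheus and Thanos Ruler, as shown next.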
But since we enabled that, we now have this second Prometheus instance that is now running. And we also have the Thanos Ruler, which handles the alerting and recording rules — I think it functions much the same way. So it'll handle any alerting rules you set up for the new monitoring that you configure via user workload monitoring: the Thanos Ruler evaluates those rules and handles sending the resulting alerts out to various places. So it gets the same style of configuration — it's just a different component for it — and everything still does get routed back to the original Alertmanager. And so I want to take a moment to talk about Thanos in a little bit more detail, and in particular the what, why, and where of where Thanos is important. So Prometheus is really good at what it does: scraping metrics, aggregating them, being able to display back all of those charts and graphs — all of the data that's there — although there are arguably better tools for displaying the data. And it provides PromQL, the query language, to get all of that data out. But Prometheus is not good at multi-instance work — aggregating across multiple instances — or at long-term data retention. There's no concept of — and I'm probably going to use the wrong term here — effectively, summarization of stats. So, years ago — and it's still around — there was RRDtool, for storing and generating graphs of data. You have a short window of very fine-grained data that eats up a lot of space, and then that gets summarized: maybe you collect data points every 15 seconds, and then those get summarized into every five minutes, and those get summarized into every 60 minutes, and those into every day, and so on and so forth. So as you go down in granularity, it effectively consumes less space with slightly less detail. Prometheus is not necessarily great at that.
It's just not one of its things. So Thanos is really good for the multi-instance case — as you can see here, we're aggregating across the two instances; I'm pointing at the screen like you all can see me. So we're aggregating across, in this case, the two instances, and it also has the ability — and Brian, I'm going to hand off to you here — to offload those longer-term metrics and data. Yeah. Yeah, so actually — do you have the thanos.io page up right now? Yes — if you want to go to that. And then I think if you scroll down a little bit and we look for the components — Thanos's long-term storage is really a combination of several components; try the docs. And then, let's see here — the design; maybe inside the design. Oh yeah. So it's a combination of a bunch of different possible components, and one of those is the Thanos store component. What it does is take the time-series database data that is being stored, either directly in memory by a Prometheus or Thanos instance or in persistent storage, and every so often it grabs that and, like you were mentioning, it downsamples it. So it takes it from the exact timeframe and downsamples it — I think the defaults might be the raw resolution, five minutes, and an hour — so it has different scales at which it downsamples, and you can configure at what timeframe — how old, or how stale — that data needs to be before it starts downsampling, so you can make sure you have the appropriate granularity. But it does that automatically and compacts it down, and then you can push it off to the Thanos store, which is object storage — anything S3-compatible.
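The "anything S3-compatible" part is configured through a small object-storage config file that the Thanos components are pointed at. A sketch, with made-up bucket, endpoint, and credential values:

```yaml
# objstore.yml — handed to Thanos components via --objstore.config-file
type: S3
config:
  bucket: thanos-metrics           # hypothetical bucket name
  endpoint: s3.us-east-1.amazonaws.com
  access_key: EXAMPLE_ACCESS_KEY   # placeholder credentials
  secret_key: EXAMPLE_SECRET_KEY
```

The same format supports other backends (GCS, Azure, etc.) by changing `type` and the `config` keys accordingly.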
And it'll store it in there, and then your Thanos Querier — excuse me — can sit in front of both that Thanos store (and the object store behind it) as well as your in-memory data inside the Receiver itself, and query across them over time. So you can query for some metric over the course of a year and a half, and it'll pull all of that year and a half of data from the object store; and let's say your in-memory storage is four days — it'll also pull the last four days directly from Prometheus itself, or the Thanos Receiver, depending on what's behind it, and display those as a single continuous graph or set of table data. Being able to compact down like that — somebody asked earlier how you handle the vast amount of data you're getting, especially with the Prometheus that ships with OpenShift by default, because we ship a ton of metrics with it, which is great, but it also means a ton of storage if you're trying to store a long period. You can put Thanos into that and push some of that longer-term storage off to it, where it starts to compact it down. Object storage in and of itself is just a lot cheaper than persistent storage, but the compaction really gives you that multiplier effect to be able to store for long periods — really indefinitely. So it's very powerful in that way. And then, like you mentioned, the multi-tenancy aspect of it makes it a lot easier to seamlessly support all of those tenants and give whoever is querying that data a single point of entry for getting at it. They don't need to worry about "oh, I need data from the last six months, so I need to go and query the object store using this Thanos instance" — instead you just hit your Querier, and it seamlessly covers data coming from everywhere at the same time and stitches it all together. So — I don't want to put you on the spot — I want to spend about two minutes talking about Grafana and Observatorium.
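The compact-and-retain behavior described above is driven by the Thanos compactor's retention flags — roughly one retention period per downsampling resolution. A sketch of an invocation (the retention values are arbitrary examples, not recommendations):

```shell
# Compact and downsample blocks in the bucket; keep raw data 30 days,
# 5-minute-resolution data 90 days, and 1-hour-resolution data 1 year.
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=objstore.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=365d \
  --wait   # run continuously instead of a single pass
```

This is what makes "really indefinitely" practical: the oldest data survives only at coarse resolution, so the bucket grows far more slowly than raw retention would.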
And then — I want to address Daniel's question — whether you have an example you can show of how to configure that longer-term retention, or whether you want me to drive and show, or if you have something you can share: either way. And then I think there was another question down here asking if you have an example of alerting with Mattermost — or at least we can point to an example, maybe. And then, let's see — xrr: yeah, there is a question about monitoring of service mesh, and mTLS. So, yeah. So I'm going to talk just real briefly about Grafana and Observatorium — so, Brian, if you need to stage anything, I'll give you that time. So I do want to quickly highlight: if you've been using OpenShift for any period of time, when you look at the dashboards in the recent couple of versions, these have changed. It used to be that this would link directly out to Grafana, which we have here, and you would be able to see all of the dashboards — as soon as I move this out of my way — and I can go in here and look and see all of the things that are happening in my cluster, right, through Grafana. And my understanding is that we are moving away from this being the primary interface for reviewing metrics, with the primary reason being RBAC. Effectively — and I may be misunderstanding the way it was explained to me — Grafana does not have a way, or it's hard, to do RBAC that is integrated with OpenShift and namespaces, only allowing specific people to see their specific things, unlike when it's natively integrated into the dashboards. So I can come here to the etcd dashboard, and you see I get very much the same information. So for all of those default dashboards that Red Hat ships with OpenShift, effectively expect this interface, through the OpenShift UI, to become the primary interface for that.
You'll still be able to — and this Grafana instance is read-only, as far as I know; we don't support adding custom dashboards or anything like that. You can still deploy your own Grafana instance and create all of your custom dashboards and all that other stuff, just like you've always been able to, if you so choose. But yeah, just FYI, this is going to be the primary interface in the future. So the other thing I wanted to talk about quickly — which was one I was chatting with one of the engineers about; I didn't actually know about this — and that's Observatorium. Yeah, this is a new one on me too. Yeah, so Observatorium — which, I understand, they're trying to bring in as a CNCF project. This is effectively an aggregator for an aggregator, if we think about it in a certain way. Essentially, it is a long-term repository — and you can see here, it's meant to manage, integrate, and combine multiple existing projects like Thanos, Loki, so on and so forth. Lots of Avengers references there. So Observatorium is a massive data store and database for all of that metrics data that allows that aggregation to happen. And from what I understand, Red Hat uses this extensively with the managed services. So if you're using OpenShift Dedicated, or ROSA (Red Hat OpenShift Service on AWS), ARO (Azure Red Hat OpenShift), or ROKS, the IBM offering — basically all of those feed data into an Observatorium instance that is run by Red Hat, and that is how we provide those long-term analytics and other information out of that. So I don't have a lot more information on that — this is one that, like I said, I just learned about, so I'm interested in learning more. Maybe we can get some of the folks from that team on. And I think, Brian, you said that ACM can also use this. Yes — that's actually how ACM manages its multi-cluster metrics.
It deploys the Observatorium operator, and then the operator goes and sets up a Thanos instance with all the necessary components. That actually ties into — I think there was a question around seeing an example of setting up a Thanos store. That is, I think, probably the greatest weakness of Thanos right now: it's relatively new and relatively complex. So I don't trust myself to try to set that up in a short timeframe here — I'm sure we'd hit a lot of speed bumps. The Observatorium API is a great starting point for that. Or — if people don't mind me plugging ACM — I've been working with it recently with one of my clients, and it really makes it incredibly simple to get multi-cluster observability set up, but also to get that long-term storage set up. I think there would probably be some growing pains with getting Thanos set up within a single cluster on my own. So if you're trying to work out long-term storage and you're already operating multiple clusters, from a time perspective it may be easiest to look into ACM, because it's really going to abstract away a lot of the setup requirements for getting multi-cluster observability, as well as the long-term storage you might need. Certainly, if you only have one or two clusters and you just want to set up a Thanos instance for that, it may be worth the time to do it individually, but you are going to find that there's probably a lack of examples out there so far. It's just still a pretty new project, and it's still growing and evolving. So I personally prefer — I love having the BU developing behind me and supporting me in that way — but whatever floats your boat at the end of it. It's always nice. Yeah. All right, so Brian, do you want to share? Yeah, I'm going to share. So — time check — we've got 10 minutes left. Okay. So, a reminder for our audience: please submit any questions that you have.
I'm going to — while you're shifting stuff around, Brian — I see a couple of things in here. So, the RBAC question: yeah, I know Grafana has RBAC, so I need to find some details. I'll see if I can find them and include them in the blog post, or maybe we'll talk about it next week, as to the precise rationale there — I know it's changing, so I want to check. And I want to talk about xrr's question: is user workload monitoring with Istio-injected apps going to be supported with — I don't know what mTLS is. Mutual TLS. Okay. "At the moment there seems to be no way to scrape metrics on service mesh with mTLS enabled; the only solution we've seen is probably federation." Unless you know the answer to that — so, I haven't done much with Istio. I don't know whether the limitation there is something to do with the mTLS portion of it, or if it's something about where service mesh is being deployed. I do know that user workload monitoring right now doesn't let you create alerts inside any of the default OpenShift namespaces, and I'm guessing service mesh is being deployed inside something like openshift-service-mesh. And that's kind of just two different teams working on different priorities — so to get the communication sorted out there, an RFE may be the best option. My only concern with federated Prometheus is that I know performance is a really big issue — that's kind of the main reason they created Thanos in the first place. Trying to set up a Prometheus that scrapes from other Prometheuses — because it's a pull method as opposed to a push — it's really, really difficult to scale that top-level Prometheus so that it can actually handle it all in memory.
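For reference, federation — the workaround mentioned in the question — is just a scrape job on a top-level Prometheus that pulls selected series from other instances' /federate endpoints. A sketch, with placeholder target addresses and job selectors, and with the scaling caveats just described very much applying:

```yaml
scrape_configs:
  - job_name: federate
    honor_labels: true             # keep labels from the source Prometheus
    metrics_path: /federate
    params:
      'match[]':
        - '{job="example-app"}'    # pull only selected series, never everything
    static_configs:
      - targets:
          - prometheus-a.example.com:9090   # placeholder downstream instances
          - prometheus-b.example.com:9090
```

The `match[]` selector is the lever that keeps this workable at all: federating every series from every downstream instance is exactly how the top-level Prometheus falls over.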
So you can certainly give it a shot. I personally haven't had a lot of success with trying to do a production-ready federated Prometheus — it requires a lot of tweaking, and then very careful measuring of what you think your load is going to be, to ensure that you don't get kind of a cascading crash effect. So I'm sorry I don't have a better answer on that — it is a complex situation. I think maybe go back to your service mesh point of contact and see; you may be ahead of the pack, and you may be the one opening the RFE. Otherwise, there may be somebody who's already noticed that some of these components need extra monitoring but don't have the ability to do so yet. Yeah — and I'll ask around on the BU side as well, see if we can find any additional information from product management or engineering. So, a question from — I think that's going to be Poacher: is there a way to get container metrics from a pod with multiple containers? That's a good question. I think what you could do is: you're setting your pod up, you have multiple containers in there — you could have each one of them pushing out metrics on a different endpoint, and then set up a service monitor for each port, essentially, on that pod. And that way you can scrape from multiple containers. I haven't actually tried that before; usually I try, as much as possible, to stick with the sort of one-pod-one-behavior type thing, so I don't normally have a need to scrape metrics from multiple containers within a pod. Usually, if I've got a pod with multiple containers, there's one main container and then some supporting ones whose behavior stays internal. So I think it would be possible, but I haven't tried it personally, so I don't want to commit to it necessarily — but my guess would be: try to expose multiple ports on that pod and have the containers output their metrics to each one of those ports individually.
And then you would set up a ServiceMonitor or PodMonitor to scrape each one of those separately, and then you would actually be able to capture the instances that way. So I think I just threw out a bunch of terms that we maybe didn't cover. Within the OpenShift documentation for managing metrics, there's a section on setting up metrics collection for user-defined projects, and that's where ServiceMonitors and PodMonitors come in. Really, what they're doing is pointing user workload monitoring at a specific service — in this case, like the service for this example app they have up here: they're setting it up on a port, then you set up a ServiceMonitor, and all it does is say, for that same application label, go and grab the metrics that are being exposed on this specific port. So if you defined multiple ports within your service — or within your application, if you're doing a PodMonitor instead — and you have each one of the containers pushing out on a different port, I think you should be able to then set up a ServiceMonitor entry for each port being exposed there; the label would be the same, but obviously your port name would be different. And then you should be able to capture from multiple ones. Hopefully that helps. And then Daniel also has another question, so — we've got about five minutes left, and I think we do have a hard stop at noon today, right? Yes, I think Turbonomic is on next. Yes — a new IBM acquisition, Turbonomic. So I want to make sure we don't have to do an abrupt close. There are three questions that I'm aware of. One is configuring long-term storage — I'll be sure to include that in the blog post, and we'll also briefly talk about it next week, Daniel, if you're able to tune in. For the Mattermost example, I will dig up where that's at — there is one — and I'll put that in the blog post as well, and we can talk about it next week.
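The multi-port approach being described could be sketched as a single ServiceMonitor with one endpoint entry per container's metrics port — all of these names and ports are hypothetical, and, as noted above, this pattern wasn't verified on stream:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app-monitor       # hypothetical name
  namespace: my-app
spec:
  selector:
    matchLabels:
      app: example-app            # matches the Service's labels
  endpoints:
    - port: main-metrics          # named port for the main container
      interval: 30s
    - port: sidecar-metrics       # named port for the sidecar container
      interval: 30s
```

Each entry in `endpoints` becomes its own scrape target, so the two containers show up as separate series distinguished by their port/endpoint labels.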
So, Brian, maybe you might be back on a little bit. Awesome. Yeah — and then, Daniel: can Alertmanager send alerts to separate email addresses based on the labels of the namespace in which the alert occurs? That's a good question. I think it's going to depend on how you define your alert — you can definitely apply individual labels. Well, actually — every alert that comes out of Alertmanager is going to come with a namespace label on it already, I think, and then within your Alertmanager configuration — do I have an example? Let's see here... here we go. So this is an example — this is actually sort of related to the Mattermost one... hold on, I don't have the alert in question. Anyway, there is a possibility to set up dynamic endpoints for alerts within your Alertmanager configuration. So within this route here, you can see, for routing to a Slack channel in this case, setting up the receiver — it has this webhook URL here. You could set it up to pull from a label there. I think there's another Robust Perception post about this — and by the way, if you're trying to figure out a specific use case for Alertmanager or Prometheus, Robust Perception is a great blog for that — and here it is: using labels to direct email notifications. So we can link this; I can send this blog post to you guys and have you push it out via the chat, as well as later on in the blog post. But you can see here: you can pull labels out of an alert that is reaching Alertmanager and then specify them in the receiver configuration, so you would probably want to do the same sort of thing here. And then I think it would just be .Labels and then the namespace — I'd have to go and look and see where the namespace label actually comes up, whether it's in the group labels or on the individual alert.
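A rough sketch of what that could look like in an Alertmanager config, pulling the alert's namespace label into the receiver's to field via a template — the domain and the mailbox-naming convention here are assumptions, and, as just discussed, you'd want to verify whether the label is available in the group labels or only on the individual alert:

```yaml
route:
  receiver: namespace-email
  group_by: ['namespace']        # group so all alerts in a group share the label
receivers:
  - name: namespace-email
    email_configs:
      # Works only if team mailboxes are named after namespaces (an assumption)
      - to: '{{ .GroupLabels.namespace }}-team@example.com'
        send_resolved: true
```

Grouping by namespace is what makes `.GroupLabels.namespace` safe to template: every alert in the notification is guaranteed to carry the same value.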
So you should be able to do the same sort of thing as this in order to specify them in there. The only thing that might be difficult is if you're not mapping the value of the namespace label directly to the email address it's sending to — you might need to do some sort of translation inside the receiver, to say: for namespace one, send the email to this address; for namespace two, send the email to that address. You might have to do that logic inside the receiver. Or, if you can name your namespaces based on whatever the email is — or vice versa, if you set up an email account that is named for the namespace — then you don't need any of that translation; you can just directly put the value of the namespace label into the to field and send it that way. So it's possible; if you're trying to use pre-existing emails and pre-existing namespaces whose values don't match up, you might need more complex receiver logic, but it is definitely feasible. All right — one-minute warning, or 45 seconds, as it were. Yeah. So, I know that we had some outstanding questions we didn't get to — if you saw me looking down, I was taking notes on what to follow up on. So, like I said, I will follow up in the blog post on Friday, and we'll also follow up in the opening segment of the show next week — which, I understand, Chris, you will not be with us for. I will not be here next week; Langdon will be filling the role of Chris Short. Yep. Thank you, everyone. Thank you especially, Brian — really appreciate you joining. You've been tremendously helpful, and I have learned a huge amount this particular episode, more than usual.
So, as I like to say, OpenShift is like a fractal: every time you learn something, it turns into a whole bunch of other things, and every time you learn one of those, it turns into a whole bunch of new things. So if you have any other questions, any other topics you would like to see covered or have us address, please reach out — andrew.sullivan@redhat.com, or practicalAndrew, as is my Twitch username, here on Twitter; you're welcome to reach out to me at any time. I'll throw Chris under the bus as well: short@redhat.com. And aside from that, thank you so much, everybody, for joining us today, and please stay tuned to learn more about Turbonomic. Thank you — I'll be switching over right now. Thank you for tuning in, everyone.