Good morning, good afternoon, good evening. Welcome to another episode of Cloud Tech Thursdays. I am Chris Short, producer, host, and showrunner extraordinaire for this thing we call Red Hat Live Streaming. I'm joined today by Amy, Josh, and special guest Leif Madsen to talk about STF and how to monitor your OpenStack cluster. Amy, why don't you introduce yourself, talk a little bit more about the topic, hand it off to others, that kind of thing.

All righty, hi, my name is Amy Marrich. I am the OpenStack community person here at Red Hat, and we're joined today, as already mentioned, by Josh Berkus, who is the Kubernetes community person. And our guest today is Leif Madsen, who works on STF, which is also known as InfraWatch upstream. Leif, do you want to go ahead and introduce yourself and the project?

Sure, my name's Leif Madsen. I'm the Cloud Service Telemetry Architect at Red Hat, and Service Telemetry Framework is a project that I've been working on for three-plus years. The idea is that we install a set of microservice applications on top of OpenShift, and we monitor our infrastructure as a service, our OpenStack. So today I'll just be going through the architecture of STF, some links to where the open source project components are available for anyone to make use of, and I'll go through some dashboards and some live environment stuff, and I'm happy to answer any questions as we go. Great, perfect. All right, I will just share my screen and we'll get presenting here.

All right, so I'll just place the links at the beginning here, and I can get those over to the show hosts after as well. Basically, github.com/infrawatch is the upstream location for all the source code that I'm going to be going through today, and the rendered documentation, which is also written in the upstream source there, is at infrawatch.github.io/documentation.

So just a quick overview: Service Telemetry Framework basically receives monitoring data from OpenStack or third-party nodes and is a central location for storage, viewing of dashboards, and alerting for your system. We make use of CollectD for collecting the metrics and events for the infrastructure components; the OpenStack-aware metrics and events come from Ceilometer. We also support multiple clouds going into the same monitoring infrastructure. We also provide availability monitoring, such as container statistics for CPU and memory, and API health checks for the various OpenStack API interfaces, so Glance, Neutron, things like that. There are integrated Ceph metrics with the collectd-ceph plugin, so if you happen to be running Ceph within your OpenStack infrastructure, we can also collect that information using that plugin. We can send SNMP traps using Alertmanager; we make use of the Prometheus SNMP webhook implementation for that. We make use of various storage backends that are provided via OpenShift, and we've tested with OCS and things like that just to make sure that's all working, and our visualization is done with Grafana. All of this is operator-driven using the Operator Lifecycle Manager within OpenShift.

So this is the high-level overview of the architecture. On the left here, we have our actual OpenStack deployment, and we make use of CollectD and Ceilometer for collecting the events and the metrics.
We then make use of AMQ Interconnect, which is also the Apache Qpid Dispatch Router upstream, and that's an AMQP 1.0 protocol-based message bus; we use it without brokers or anything in order to just stream the telemetry and the event information. So that's our transport protocol coming across the backend. We then make use of the Smart Gateway, which is actually made up of two components, the SG core and the SG bridge, which I can get into later. But basically that's our middleware that sits in between the data storage and the transport layers. It takes the information from the bus; for metrics, it provides a scrape endpoint for Prometheus to scrape so it can collect the metrics for all your nodes, and for events, it will write that into Elasticsearch currently.

So, let me cue in here. CollectD has been around for a while, and particularly based on my experience with it, although it may have evolved since then, it predates having any sort of standards around what its messages look like. Why CollectD? How has it been useful as sort of the central collection point?

Yeah, mostly CollectD because the intent, when the framework was originally being developed, was to make use of something that was, one, small and fast. It didn't have a lot of overhead, and it can run on things like other network devices; CollectD can actually be compiled and run on some routers, some switches, some infrastructure components. So that was the main reason for that. It also just had a lot of plugins that were good for infrastructure monitoring, so it had a lot of the information that we required. And the messaging thing hasn't really been too much of a problem for us. When it comes in, we're basically collecting that data with our Smart Gateway anyway, and we're exposing that data by the plugin instance, by the type instance, and there are various labels and fields that we make use of. That's actually been quite consistent across the plugins, so we haven't really had any issues with having to do any crazy manipulation or anything like that.

The other thing was the events: we make use of the VES format. The VES format is like an encapsulation, a VNF event standard or something like that; I can't remember exactly what VES stands for, unfortunately, but we're making use of that for our events. So it has another encapsulation layer on it as well, so that when the events come in from CollectD, they're also somewhat standardized.

You didn't consider trying to push this upstream so that the collector was actually using standard formats? So everything we're doing is from CollectD upstream. All of the VES formats and everything, all of that is from the CollectD upstream project, yeah. Okay, sorry, I misunderstood, I thought you were doing that in the gateway. Yeah, no, no, we just decapsulate the messages. The VES format is actually coming from, and has been programmed as part of, the event plugins that we're making use of inside of CollectD. Okay, thanks.

All right, so we'll just kind of go through, I only have a handful of slides and then we'll just get into the live demo stuff. But the intent here is to provide a near real-time event and performance monitoring system. We can collect various pieces of information from various events and telemetry systems; we're making use of CollectD and Ceilometer currently, which is our collection layer. The distribution or transport layer is making use of AMQ, so that's the AMQP 1.0 protocol.
Inside of OpenStack, the message bus in use is RabbitMQ, and that's the AMQP 0-9-1 protocol. So they're actually different protocols, but it's a similar idea: you transport the information across the bus and get it into a central location. The big thing that this provides us is kind of a push-gateway model for telemetry for Prometheus. So instead of going and scraping all 500 or 600 nodes of your system, you actually just end up scraping the one Smart Gateway. All of the data is collected and sent across the bus, and then that's exposed as a single scrape endpoint, effectively on the local network, for Prometheus. So we get everything collected to the central location and then we just have a single scrape endpoint; there's no real need to do any discovery or anything like that. You just basically start sending the data over and it becomes exposed for Prometheus to scrape. Events are obviously a push model anyway, so we collect those, send them across the bus just so we have a single transport, and then we write that into our event storage backend, which is Elasticsearch.

So here's a little bit more of a blown-up view, same kind of idea. We have various CollectD plugins that we can make use of. A lot of the reason for CollectD is that it also has a lot of NFV-specific things, so for telco backends, overlay networks, things like that, we have a lot of information here that we can make use of. From an OpenStack perspective, we also make use of rsyslog, and we're in the process of potentially getting our logs across the bus as well. We're doing a bunch of load testing right now to determine whether that's feasible, so we're testing at 100,000 up to four million logs and making sure that we're having as close to 100% delivery of those messages as possible. That work is ongoing, so at some point logs may also show up here in the single transport layer.

And then, yeah, it just comes into the bus here, and the Smart Gateway is basically the middleware. From a third-party integration perspective, because of the way that we're doing the transport layer and using the message bus in a distributed manner, both the Smart Gateway and other systems can connect to that same bus. So the Smart Gateway can collect that data and store it for you, and other systems can also listen for that data and react to it. Part of the reason this is set up this way is to allow for closed-loop remediation. What you could do is have a process living alongside your OpenStack system, listening to the local message bus, able to react to something and do something. Let's say an error or a warning showed up that says I need to go and, like, restart a service, for example, right? That service can listen and react to it without actually affecting the data storage, which can happen much further down the line. You don't necessarily always want to be reacting to the information after it's been stored, because that can take quite a long time, right? That's the long loop, which we call the northbound loop: going all the way up into the storage domain, sitting there for several cycles to determine that something is actually wrong, then reacting to it, sending an alert, and then going and acting on it, and that action may be done by a human. But if you wanted, you could actually have a remediation system that could react to that.
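As an aside on the single-scrape-endpoint model described above: with the Prometheus Operator, pointing Prometheus at one target is typically a single ServiceMonitor object. Here is a minimal sketch of that idea; this is illustrative rather than STF's actual wiring, and the Service label and port name are hypothetical:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: smart-gateway            # hypothetical name
  namespace: service-telemetry
spec:
  # Select the one Service fronting the Smart Gateway's scrape
  # endpoint; all the OpenStack nodes feed the bus behind it.
  selector:
    matchLabels:
      app: smart-gateway         # assumed Service label
  endpoints:
    - port: metrics              # assumed port name on the Service
      interval: 10s              # matches the scrape interval shown later
```

However many hundreds of nodes feed the bus, Prometheus itself only ever sees the one Smart Gateway target.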
So, coming back to the loops: we have these various loops within the system. We have a very tight, closed loop here, a very small loop. Then you have a longer loop that can come up into the actual storage domain, and then an even longer loop where everything is in the storage domain and you have some prediction based on several samples or some period of time. That's the reason this is laid out this way: to allow for that really fast reaction while also allowing for the longer-term predictions of what's happening in the system.

And you'd better know what you're doing, because it's very easy to code a race condition. Yes. So you definitely have to make sure that if you're going to do closed-loop remediation, your system is able to understand that something happened, react to it, make a change, and then it should technically also report back into the system in order to clear the condition, to say that I have resolved it or it's being resolved.

The system that we have set up also allows for multiple clouds. So you may have several different data centers, or you may have one data center that has various small clouds that are broken out, maybe on a tenant basis or for specific features, or maybe you just have various small, medium, and large clouds. Instead of having multiple different monitoring systems, you can actually centralize that. Again, we use the transport; when that transport comes in, we have various groups of Smart Gateways, one per cloud, and then that can all go into the same storage backend, or different storage backends if you actually want to configure it that way.

Excuse me. So these are the various components. STF is actually made up of a bunch of different components; I've been talking about the storage domain, our middleware, our transport layers, and our collection layers. On the OpenStack side of things, I mentioned CollectD and Ceilometer are the data collectors. We're making use of the Apache Qpid Dispatch Router for our transport layer; in the diagram here, that's the AMQ Interconnect Operator, and that operator is what manages the lifecycle of the AMQ workload. We also make use of a certificate manager for creation of certificates. We use the Elasticsearch and Prometheus operators in order to deploy and manage the lifecycle of the data storage components. We use the Grafana Operator for managing the actual Grafana deployment. Then we have the Smart Gateway Operator, which is what manages the deployment, standup, and configuration of the various Smart Gateway components.

And then finally we have the Service Telemetry Operator, which is what I call an umbrella operator. It is the thing that you install, and when you want an STF instance, it goes out and creates various objects inside of OpenShift. Those objects are then reacted on by the different operators listed here, and each of those operators then goes and manages its components. So STF just says, I need an Elasticsearch, or I need a Prometheus, and it will request that. The operator that has all the operational knowledge of how to stand up a Prometheus, an Elasticsearch, a QDR system, actually manages the standup of all of that; STF just makes the requests. And when all of those components are up and running, you now have an STF instance, basically.
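For readers following along, here is a trimmed-down sketch of the kind of ServiceTelemetry object the umbrella operator reacts to, along the lines of what Leif pages through in the demo below. The field names follow the upstream STF documentation, but the values (cloud name, subscription addresses, storage strategy) are illustrative rather than copied from Leif's environment:

```yaml
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  alerting:
    alertmanager:
      storage:
        strategy: persistent        # ask for persistent Alertmanager storage
  backends:
    events:
      elasticsearch:
        enabled: true               # events land in Elasticsearch
    metrics:
      prometheus:
        enabled: true
        scrapeInterval: 10s         # scrape the Smart Gateway every 10s
        storage:
          strategy: persistent
  graphing:
    enabled: true                   # stand up Grafana via its operator
  clouds:
    - name: cloud1                  # one entry per monitored cloud
      metrics:
        collectors:
          - collectorType: collectd
            subscriptionAddress: collectd/telemetry   # illustrative bus address
      events:
        collectors:
          - collectorType: collectd
            subscriptionAddress: collectd/notify      # illustrative bus address
```

Creating this one object is what kicks off the chain of requests to the Prometheus, Elasticsearch, Grafana, and Smart Gateway operators described above; a multi-cloud setup would just add more entries under clouds.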
So, Leif, is this the page or view of what we would see on the OpenShift or OKD side? Yes. If you were running this in OKD or OpenShift, you would see this as the installed operators. Once you've gone through the installation process, following the documentation for STF, this is what you would see inside the Operator Lifecycle Manager page. You can also do it from the console, and I can actually show you what that might look like. I have this on the proper page... which I have moved somewhere else. Oh, there it is. Let's try that again. There we go.

So if I get to the proper endpoint here: oc get servicetelemetry... oc get stf, sorry, I think I just typed it wrong. So, oc get stf default. This is what you would actually create as part of an STF instance. You basically say, inside of alerting, you ask for an Alertmanager and you request its storage backend, and then you can have a receiver, like for SNMP traps and things like that. Here are the backends that define our events backend: we basically say Elasticsearch is enabled and we're going to use persistent storage. For our metrics, we say we're going to use Prometheus; that's been enabled with a scrape interval of 10 seconds, and then we've again set our storage backend and how long we're going to retain the data. And then the various clouds that we're going to monitor: these are our collectors. Within the clouds setup, we're saying we have events and metrics collectors that we're going to listen for, which is the subscription address inside of the Qpid Dispatch Router, and what the collector type for that Smart Gateway is. We can have a list of various clouds; if I had a multi-cloud setup, I would have another entry down here. I've defined this one as COMP04, which is just a short form for the cloud configuration.

I can even have overrides. So this is an example of a Grafana manifest, and I'm doing an override because I needed to change the base image. So for the different objects that STF can manage, I can override them if I need to. And again, graphing is enabled, and this just sets some information for it: whether I have high availability, the transports, things like that. So that's the object.

Then if I do oc get prometheus, these are objects that the Service Telemetry Operator actually requested. So oc get prometheus default, and you have another object here; this is the object that the Prometheus Operator reacts to. Again, that just took the information from my ServiceTelemetry object saying I wanted a Prometheus storage backend, created this Prometheus object, which the Prometheus Operator reacted on, and that resulted in a Prometheus instance for me. And then it created the different storage backends and things like that; here's the persistent volume claim that was requested as part of standing up my storage backend.

So this is a picture of the layout of the routers. These routers are the Qpid Dispatch Routers that are collecting the data. You can see this is basically STF; this is what's running on OKD. And then each of these routers is running on each of the nodes inside of my OpenStack environment. Each of these routers runs locally, and the local clients connect to them. The local clients in this case are basically CollectD for the infrastructure metrics across all of the nodes.
And then these are our controllers at the top, and you can see that I actually have two different clients connected: one of these is CollectD and one of these is Ceilometer. The way Ceilometer works on OpenStack is that there are compute agents that run on the non-controller endpoints; that information is sent across the RabbitMQ bus to the Ceilometer agents, and then the Ceilometer agents, via oslo.messaging, can send that information across the AMQP 1.0 connection, which is the Qpid Dispatch Router that we're looking at here. That information is then centralized back to STF.

This is just a picture of one of the dashboards, and I'll go through the various dashboards I've been working on this week. But this is the result of all of that information. I've got the various APIs inside of OpenStack that I'm checking, and I've created a dashboard here that has various bits of information. So I can see how much CPU the Horizon service is making use of and how much memory it's using, and the same with Nova and whatever other endpoints I actually care about monitoring. So I have these prerecorded demos, but first I'm happy to take any questions, and then I can get into the live environment stuff if that's interesting to folks.

So I actually have kind of a big question, which is: this is a very multi-layered system, right? With a lot of components. What does degraded mode look like if the problem you're having with your cluster or clusters is actually affecting the ability of this full stack to operate?

So, worst case scenario, you basically monitor to say, I'm not getting any monitoring, and I send an alert saying, your monitoring system is not getting any information from the cloud that you're monitoring, right? The networking could have gone out, services could be overwhelmed, the memory could have run out on the system. But ideally, I would have enough information leading up to that event that even when I got the alarm saying your monitoring system is offline, I could answer: why is it offline? Well, I can see that the memory usage of Neutron or some service was climbing and overwhelmed the controllers, and now the controllers are basically offline and can't send any information. Now, if any of the networking is up and running, even if my controllers go offline, I should actually still be getting information about my computes or my storage backends, because it's not centralized through the control plane; all of the systems are running their own collectors and their own Qpid Dispatch Routers for the transport. So if you have networking, you will still get data. If, worst case scenario, you just don't get any data, that is definitely something you'd want to alarm on.

What I can also do is make use of things in Prometheus to do prediction. So I can say, I've been watching memory usage going up, or network utilization going up, over the last hour, and at the current rate it's predicted that three hours from now you will have overwhelmed the system. Ideally you react to things faster that way, without having to hit the worst case scenario where the system's offline and you have to react to it after the fact.

And it's kind of an advantage versus the OpenStack Telemetry project, in that the telemetry itself, except for Ceilometer, isn't running on OpenStack.
So your monitoring is now on another cluster, therefore it's not actually affecting the system itself, which I think is a great improvement over what we had before. Well, maybe it's on another cluster; if you're running that OpenShift cluster on top of the same OpenStack cluster, it's on the same cluster. Yeah, we actually don't recommend that. Our documentation will say you should not run your monitoring platform on top of the thing you're monitoring. Because again, say I have that catastrophic network outage: now the thing that's running on top of it can't actually notify me that it's out, right? The only other way around that is to have another monitoring system that checks whether your monitoring system is up and running and alerts you when it goes away, which is kind of silly, right? Though it's not totally impossible if you have something really small running somewhere. But I believe a lot in the idea of an infrastructure cluster, where you deploy a very small cluster specifically for running things like monitoring; maybe you run your undercloud, your OpenShift installer, your ACM, whatever systems you want for managing and deploying your clouds. I believe those should be in a separate cluster anyway. So that's the idea here: you don't run your monitoring on top of the thing you're monitoring, in an ideal world anyway. Yeah, that makes sense.

Okay, so, I don't know, let's look at some dashboards, because dashboards are cool. So, hopefully this is back up and running. This is my Service Telemetry deployment, and these are the various routes for how to get to the systems. I can look at my Prometheus; I have, you know, Kibana running; I have my Interconnect, things like that. These are the alerts that I've built in. Of course, that didn't log in for me; we'll come back to that, I've already seen that part anyway. But these are the various dashboards I've got put together recently, assuming that they will load. Yes.

Leif, while you're clicking through there, how hard is it for someone to make their own dashboard?

It's pretty straightforward. What I like to do is go to Prometheus first, and I will go and find something that I want to monitor, right? So I can click through this list. You can see this list, right? Yes, we can. Sometimes it doesn't show up on screen shares. So let's just say I want to check memory. I'll do something like this and say I'm going to query for memory, and let's say I wanted to see the free memory, so we can just do a search for, like, type_instance equals free, and I can execute that. That shows me the amount of free memory across my systems. So then maybe what I would do is take this, go into my dashboards, and go home. If I can get back to the main screen here... oh, I probably have to log in here for a second. There we go. Now I can actually create stuff. So let's create a new dashboard. Here's a new panel. Here's that query that I created; there's my graph. Say I just want to see the host name: I can add that. I can change the unit; the data, I think, is in bytes, and that seems about right. I hit apply, and I have a new dashboard. Obviously I can set the dashboard name, the panel name, whatever the case may be. But that's pretty much it. And then usually what I end up doing, once I've done that, is save the dashboard.
Then I will export it, so that saves it to a file and gives me a JSON document. And then what I end up doing is oc get grafanadashboards, and these are all the dashboards that I've created. The nice thing is that when you load these in, they're managed for you as objects inside of OpenShift, so I don't have to manually import them every time; if I restart Grafana, or it does an update or whatever, these dashboards will all be automatically loaded back in.

So, oc edit, and let's just look at a GrafanaDashboard, the STF dashboard 1.3. I just have to wrap the JSON in this basic header. Which API? It's the integreatly API. What kind is it? It's a GrafanaDashboard. There's a little bit of metadata; this stuff's actually all created for you, so you wouldn't even have this in here. I just name it, say which namespace it's part of, and then basically that's it. Then I have the spec, where I set the JSON blob. This pipe character just means multi-line, so everything after that that's indented, and this is literally exactly what I just exported out of that file. So if I quit out of that after making modifications, then I would do oc create -f new-dashboard.yaml, for example, right? And that would create this dashboard for me.

So if I go to github.com/infrawatch/dashboards, this is actually where our dashboards live in GitHub. These are the dashboards here, and it would be oc create -f the STF dashboard, for example. Once I do that, it would automatically be created for me inside of the dashboard list. In fact, if I'm going to be really brave here: oc delete -f the STF dashboard, and then do a review here. This is the one that I created; you can see my STF dashboard is no longer here. But if I go back to my console, oc create -f the STF dashboard, and then I refresh this page, there's my STF view that wasn't in there before. And now there's my dashboard, and it's making use of all the data that I've already previously collected, and you can see all the information about your STF backend.

So you could totally GitOps this whole thing, then. You could actually tie creating new dashboards to deploying a new resource. Yes. That's part of the reason I make use of OKD for this: it's really easy to manage these components. Instead of me going and having to write a whole bunch of stuff to manage and deploy all of this after the fact, I make use of the operator model to do that. I can show you a little bit of how that works for the Service Telemetry Operator. This is just Ansible in here. There's a lot of boilerplate for creating the actual operator itself, but ultimately it's just Ansible, and part of this is all of these playbooks that I've created: there's one for, you know, the Alertmanager, the certificates, the clouds, the Elasticsearch, the Grafana, things like that, right? And all that's doing, in fact, I will load it in an editor so we have some colors, is it looks up a template, sets some defaults, and then creates an instance of Grafana using the Kubernetes module inside of Ansible. That takes the object in that template, loads it into OKD, and then the Grafana Operator reacts to that, resulting in the creation of the Grafana instance.
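To recap the dashboard GitOps flow demonstrated a moment ago, the wrapper around the exported JSON is roughly the following; the metadata name is hypothetical, and the JSON body stands in for whatever you exported from Grafana:

```yaml
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: free-memory-dashboard       # hypothetical name
  namespace: service-telemetry
spec:
  # The pipe starts a multi-line block scalar: paste the exported
  # dashboard JSON verbatim underneath it, indented.
  json: |
    {
      "title": "Free memory by host",
      "panels": []
    }
```

Loading it is then just oc create -f free-memory-dashboard.yaml, and deleting and re-creating it works exactly as shown in the demo.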
Back on the Ansible side, there are other things it needs, like looking up data sources; what this part is doing is also creating the data sources. So, oc get grafanadatasources, I think that's the object name. There are the default data sources: default-datasources.yaml. That creates the data sources inside of my Grafana instance, which are defined here under data sources. So ES Ceilometer, ES CollectD, STF Prometheus, these are all created for me as part of an STF deployment. I don't create any of this; this all happens automatically just by enabling dashboards inside of the ServiceTelemetry object. So anything you want to add, you just add it to the Service Telemetry Operator, and then that can go off and work with other operators to actually deploy the components that you might need or want. Again, there are overrides that you can also pass into the ServiceTelemetry object; like that Grafana manifest, for example, where I did an override of Grafana but still had access to the data sources that were automatically generated for me and created as part of the service telemetry deployment.

Our documentation is all auto-generated as well. This is our source, in AsciiDoc. Anytime the InfraWatch documentation is changed and merges into the main branch here, this will update and you will get the changes in the rendered documentation. So all of that is auto-generated, and you can see it covers our upstream here: the open source deployments, OpenShift by way of OKD, suggestions for different backends you can use for testing, and things like that. And then it just goes through how to create the object. So when I deploy an STF, I literally just do this: copy and paste that, copy and paste that, copy and paste that, and just keep going. Once I get to the end of this, I will have everything that you just saw there, in terms of the operators that are deployed, the Grafana instance that exists, all of that.

And then all you have to do is add your alerts. There's a file that we provide with alerts that you can use for what to monitor on an OpenStack system; you can customize that and add whatever you want. Same with the dashboards: you can manage your own dashboards, create one like I just did there, export it, wrap it in a little bit of boilerplate at the top, and now you've got basically a GitOps model, something we check in to GitHub that everyone can make use of. If there's a change or we need to make a modification, you just go change it, submit it, and then you can make use of it immediately. Cool, yeah, I can see that being part of the workflow.

One of the other things I wanted to ask you about: a couple of times, like when we were talking about degraded mode, you mentioned predictions. Is there any kind of a predictions feature in STF to say, hey, you're going to run out of memory in your cluster at this point? Yeah, so that's actually part of Prometheus. There are a lot of different functions you can make use of in Prometheus; this would be like predict_linear. I'm not going to try and create something on the fly, but you basically make use of these various things inside of Prometheus. So there are lots of functions.
There's sum, so just adding things up, predict_linear, and then various things you use when you write your alerts that say: when something reaches this threshold, send a warning, and when it reaches a further threshold, send something like a critical notification, right? So you can layer these when you send alerts, to say this is just a warning, we're getting up toward the critical area, and then when you surpass that critical area, you get your alarms that say, okay, you really need to react to something right now.

One of the other things that we have for the eventing, just in the virtual machine view: whenever we have events happening, we can actually overlay them on the dashboard. That's what these little annotations are. These ones just happen to show up as an example, but it's telling us that our virtual machine is still active, basically, inside of this project. If I change to a different OpenStack project, this one doesn't have any VMs in it; this one has two. And then we can even see that down here: these are the instances living inside of that project. So I think that's pretty much everything I have to share right now.

Okay, I've got a question, because you mentioned monitoring... not monitoring, alerting. How do you alert out of STF? Oh, we're just making use of the Alertmanager that comes from Prometheus, basically. You just load your alert rules in, and I can actually show you that really quick, too. Because I know it kind of distracted you off of your actual demo, and I apologize for that. Because I haven't seen this since you showed me STF 1.0, so it's been a while.

Yeah, so... rules. Here we go. So here are the rules, the OpenStack rules. This is just loaded into the console: oc get, I can't remember what it's called, I think it's prometheusrules. (We're back on the documentation, just FYI. Oh, because it's not sharing the right thing. You don't see the console, right? No, we do. Now you see the console, is that right? Sort of, we just saw it, now we have a gray screen. Yeah, hold on, it's just sharing the wrong thing. Okay. Oh, I know what I did, one second. You're going to see Chris for one second. Do you see my terminal? See terminal. Okay, perfect.)

I think that's... yeah, this is only three hours ago. So this is the file that we created: the OpenStack rules, and these are the expressions that we've created, plus any alarms or alerts that you want to create. So if we sit in this position for 10 minutes, then the label basically becomes severity warning. That's how you create those, and that results in these rules being loaded in. So we can see, here's, you know, load-midterm, which shows the alarms. These alarms are what will show up on, I believe, the infrastructure node view, if there are any recent alerts or whatever. These need to be tuned for the environment, obviously; the fact that I'm seeing a whole bunch of current alerts and recent alerts just means that it's flapping, because those queries are too aggressive for this particular environment, since this is a demo environment and it's always heavily overloaded. But that shows that the alarms can show up here, and that's what results in the alarms that can show up on the dashboard.
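As a concrete, hypothetical example of the rules being shown here, a PrometheusRule object combining the free-memory query from the dashboard demo with the predict_linear function mentioned earlier might look like the following; the alert name and the exact metric and label names are assumptions, not copied from Leif's rules file:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openstack-rules             # hypothetical name
  namespace: service-telemetry
spec:
  groups:
    - name: openstack.rules
      rules:
        - alert: MemoryExhaustionPredicted     # hypothetical alert
          # Extrapolate the one-hour trend of free memory three hours
          # ahead; firing means we expect it to hit zero by then.
          # Metric/label names assume the type_instance="free" filter
          # used in the dashboard demo.
          expr: predict_linear(collectd_memory{type_instance="free"}[1h], 3 * 3600) < 0
          for: 10m                  # sit in this position for 10 minutes
          labels:
            severity: warning
```

Rules like these are what produce the alarms that show up on the dashboard.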
Those alarms can then be sent to Alertmanager, and Alertmanager can be configured with various receivers, and a receiver can go to a webhook or whatever the case may be. If you make use of the SNMP trap functionality, then we have a little bit of middleware that sits in and listens for the webhooks being sent from an Alertmanager receiver, and it can convert that and send it to a system that accepts SNMP traps. But otherwise you just set up Alertmanager just like you would for any other system. STF is just making use of those existing components; it's not doing anything magical. You just have to create the alarms, or the queries that result in alarms, anyway.

So it's possible to plug third-party services into this somewhere? Like, for example, if somebody wants to use PagerDuty for alert management. Yeah, Alertmanager actually supports PagerDuty, so that would be a receiver that you would create inside of Alertmanager. There are only a few receivers that Alertmanager supports natively, and PagerDuty happens to be one of them. Generally, if you want to interact with other types of third-party systems for sending alerts or warnings or whatever the case may be, you consume a webhook and then convert and send it to whatever endpoint you want. But some of the built-in ones are email, Slack, a few different ones, and PagerDuty happens to be one of them as well. Yeah, and I'm just thinking that managing who should get an alert is its own thing. Yeah, and not something you would want in this tool, right? No, exactly.

That's actually part of Alertmanager. I don't actually have it working right now; I didn't set it up quite right when I was deploying this, because it's actually disabled for some reason. But this is the route: you set the route here, and then you can group by various jobs, you can determine how long you're supposed to wait, how often you're checking, things like that. And then you would have a receiver. Mine's set to null because I don't have any receiver set up, but you would set a receiver here that might be PagerDuty, might be an email system, whatever the case may be, and that receiver is what ultimately results in the delivery of the alarms. It will also do the deduplication, suppression, and things like that. With Alertmanager, you can run two or three of them for high availability, and it will determine, when it gets an alarm, whether it's already been sent to a receiver by one of the Alertmanagers, so you won't get it multiple times, for example.

So one other thing I want to ask about is application telemetry. If I want to feed the telemetry from an application running on OpenStack into this, how does that work? Via SNMP, via CollectD, either/or? Yeah, so really STF is designed as an infrastructure monitoring tool; it's not really meant to be the application monitoring tool. So you could either run something inside of your virtual machine that sends data to another system that's designed specifically for application monitoring, or you can match the same pattern, where your virtual machine actually runs, say, a CollectD or something like that. Now, for the virtual machines themselves, if you're just trying to get CPU, memory, I/O, things like that, you don't need to run anything inside the VM; the collectd virt plugin deals with that for you.
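Circling back to the receiver configuration for a moment: a minimal Alertmanager config that swaps Leif's null receiver for a PagerDuty one might look like this sketch; the receiver name and integration key are placeholders:

```yaml
# Alertmanager configuration sketch; values are illustrative.
route:
  group_by: ['job']
  group_wait: 30s          # how long to wait before the first notification
  group_interval: 5m       # how often to check a group for new alarms
  repeat_interval: 4h      # how often to re-send a still-firing alarm
  receiver: pagerduty-oncall
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: '<your-pagerduty-integration-key>'   # placeholder
```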
So, on the virt plugin: it talks to libvirt and gets information about all of the virtual machines. If you're just trying to monitor the virtual machines themselves, you don't actually need to run anything inside of the virtual machine; that's already dealt with for you by the collectd virt plugin. So memory of the virtual machines is basically covered.

You mentioned earlier that it could be reactive. So, building on Josh's question about monitoring an application: can we do something like checking that the Apache on our VM is up and running, to determine whether our VM is up and our application is running, and then restart it? Can we do that through STF? Yeah, but you have to send the data. Like I said, STF is not meant to monitor the workloads running inside of the virtual machines; it's meant to be the infrastructure monitoring system for an administrator of a cloud platform. If, as an administrator, you wanted to allow the workloads inside to also send information, you would just have to run the QDR, or be able to point your CollectD at the QDR, and then you would make use of whichever CollectD plugins you want. So for your example, I want to know that the Apache running inside of a virtual machine is still active: I'd make use of the Apache plugin inside of CollectD, send that into the system, and then basically monitor for that. But that's kind of out of scope, while technically possible, because you just have to send the data. Once the data is sent and transported, then you just make use of it, right? But that is at a different layer; that is not the scope of this monitoring system.

But is an OOM, an out-of-memory, within the scope? Because we can monitor the memory in the virtual machine. Yes, I can monitor the virtual machine itself, and I'll know how much memory was allocated. You can see down here how much total memory, how much is unused, how much is usable, how much is available to me, all of that kind of stuff. So I can make use of the libvirt stats to determine that a virtual machine that was given 16 gigs of memory is approaching its theoretical maximum, right? So that's the infrastructure method of monitoring your application: if your application is too much for your VM. Yeah, it's not monitoring the application itself, but it's monitoring the virtual machine that the application is running on top of.

I mean, I will say, as a former DBA, the separation of infrastructure monitoring from application monitoring is not a decision I've ever understood. Because, for example, if I'm running a database and the queries start being inexplicably slow, frequently the reason is that the machine they're on is running out of resources, and it really seems like you want that telemetry unified rather than in two separate systems. Yeah, like I said, you just have to run CollectD inside of the virtual machine, right? That is a decision for the infrastructure administrator, to determine whether that data is appropriate for their system and what they're monitoring. But if you just run CollectD inside of the virtual machine, then there's absolutely no reason you can't collect data from inside of the virtual machine; you just have to configure it that way, right? The deployment of the data collectors and the transport inside of the virtual machine is definitely out of scope of the lifecycle management system for your infrastructure.
In this case, TripleO, right? I'm running TripleO, which is what I'm using to deploy the data collection and transport; what I'm doing inside of the virtual machine is outside the scope of that system. But if you happen to have a virtual machine image that you've created and uploaded to your data store, and the VMs launched from it are set up to automatically run a data collector and a transport system no matter what, and you configure it in such a way that it can connect, there's absolutely no reason you couldn't centralize all that data with this system.

Wait, why do you need to run a transport system in the VM? Well, you need to connect it to a QDR somewhere. You could have the VM attach to the QDR running on the host, but then you have to make sure that your network routing allows the virtual machine to connect to the host that is running the VM, and that might be a security issue. So you may need to set up another QDR, and every node is already running a QDR, which is very light, right? Those QDR routers run in edge mode; it's all ephemeral, there's no data storage requirement or anything like that. You just run that really small little system and then connect it back to the central location, right? Right, so you'd do the same with an application, yeah. Yeah, in theory you could do it that way. Now, obviously it just depends: if you want to use the local QDR for your virtual machines, you have to decide whether the security requirements of your system allow virtual machines to connect to services running on the host. It's a bit of a southbound connection that might not make a lot of sense for some people. Or it just connects centrally all the way back: if CollectD is running the AMQP1 plugin, that plugin could just connect to the interior routers running in STF directly, too, right? They may not need a local transport anyway; they may just connect all the way back to the central location.

And there are so many different ways you can monitor, and everyone is very passionate about what they think is best. That's why I was venturing that if you had an issue, knowing the CPU was high or the memory was low could point you in a direction, but at the same time it is nice to know that things are responding, whether it's in the same system or not: connecting to the database, are you getting a good response time; connecting to your website, is it up? You can get some of that predictive behavior on whether it's about to go down, based on seeing the CPU going up on the VM itself or on the host itself, and then you know you need to do something before you even get to the point where you're down.

Yeah, and there are obviously tenant implications and things like that. If you have multiple customers or tenants inside of your infrastructure, STF is not designed to be a multi-tenant system, right? All of your data is collected in one location; it's not separated out so that I can say customer ABC can only see the information collected by their VMs. That's not in scope for what we're doing.
So if you are the only tenant, you're running the cloud for the purposes of, say, running VNFs, and all of the workloads running on top of that infrastructure are specifically in support of your administrators, and you don't need that multi-tenant separation at the data store level, then yeah, run your virtual machine, run your VNF inside of it, and have your data collector inside of it that feeds into your main monitoring infrastructure, you know, like STF. But if you need very fine-grained controls over who can see what, then STF is probably not the appropriate solution.

It seems like, actually, if that was your situation, you could go the other way, right? If you have an application monitoring system, you could expose select data from STF to that system. Because if I'm in charge of the application, if I'm the DBA, I don't care how I get the correlated information between what the CPU is doing and what the database is doing; I just care that I can get them together. So you could go the other way if you were doing multi-tenant, right? You could say, hey, these kinds of events, or these queries from STF, are going to be available to the tenants.

Yeah, so Prometheus is not multi-tenant aware, right? I don't have separate logins where I can say, you get this subset of data; that's not how Prometheus works. But what you could do, if you really wanted to, is break things out into separate data stores. You could effectively create two ServiceTelemetry objects in the same namespace, and then have a cloud configuration for just that tenant's application. When they send their data back, it would be listening on a different transport address, right? So you'd have a different set of Smart Gateways on a different address space inside of the actual transport medium, and the applications would only send to, say, application/dba or something, right? Then those Smart Gateways would listen to that topic on the transport and put it in its own data store, and then you could create routes just for that person, and you could use OKD's login system to control who can access and log into the various routes or whatever, too. So it just requires a little more setup than what you get out of the box; it takes more than the 10 minutes it takes to set up STF normally, but in theory it is totally doable.

Now, I know we're running short on time, but there's one thing I wanted to ask: if someone wanted to get involved with InfraWatch, do you hold any meetings? Is there a contributor guide or anything? No, we haven't really had that so much. It's pretty much just, we monitor the GitHub; you can open issues there. You can always find us on IRC; I think we have a service-telemetry channel, I want to say on OFTC. But generally, just go to the GitHub, open an issue, and we're pretty responsive there. Otherwise, we haven't had a lot of folks outside of our team really making use of it, so we haven't had the need for a guide or anything like that. But obviously, if we started getting lots of contributions, that'd be a great problem to have. I'm looking at one issue that is actually tagged good first issue; so if anyone wants to get involved, look under service-telemetry-operator, and there is an issue marked as good first issue.
Leif, do you have any parting words for us? No, I think, I mean, it's definitely a little bit of a different system, and it's not built on top of the OpenStack services running directly inside of OpenStack, but the idea behind it is to allow information to come in from various different systems; that's why the architecture looks the way it does. And if anyone's interested in running it, feel free to reach out, and I'd be happy to help anyone work through anything they need. Great, thank you so much, Leif. Josh, you had some parting words?

Yeah, so thank you, thank you, Leif. Thank you everybody for attending, or for watching this later on YouTube. Cloud Tech Thursdays will be taking a brief hiatus and will be returning in approximately one month, on August 17th. The reason it's August 17th is that we're actually becoming Cloud Tech Tuesdays. We wanted to move this to an earlier time slot in order to be more friendly to European viewers; they've been complaining that this is way too late in the day for them to attend. So, yep, you'll see us in four weeks minus two days as Cloud Tech Tuesdays, where we will be meeting with the Kubernetes 1.22 release team to talk about how the release went, what's in the release, and all of those other things. So see you then on CTT, which we still are: same show, same acronym, different day, different time.

Do we have a UTC time? Does anyone know it offhand? UTC time, I believe, is 1300. Yeah, I want to say, because it's at 10, right? 10 a.m. Eastern. Yeah, so it's 1400 UTC and 10 a.m. EDT. But you'll see more in various places. Yes, absolutely, we'll be tweeting that out quite a bit. So yeah, thanks everybody. Any parting words before we go here? Nope, we're good. Thank you, everyone. Thanks so much, Leif. Thanks for having me. See you folks.