Good morning, everyone. This is the Monasca deep dive session: monitoring as a service at scale. I'm Roland Hochmuth. I work for HP, responsible for monitoring of our public cloud and for building up monitoring for the Helion distribution. To my right is Sandy Walsh; Sandy is a developer at Rackspace. And we also have Tong Li sitting up here; he's a developer from IBM.

So today's agenda: we're going to go over a general problem statement for monitoring. Then we'll get into what Monasca is and its architecture. We'll go a little deeper into metrics, Sandy will take us through a much deeper dive into the events/StackTach pipeline and the integration we have going on, and Tong will cover some of the microservices work we're doing. We'll give a short update on where we are with anomaly detection, the current status of the project, and some performance numbers we've been gathering over the past couple of weeks, plus our next steps. If everything works out well, we'll do a demo and follow up with a Q&A session.

So monitoring: there are a lot of problems with monitoring today. Some of this is talked about in the industry; there's a great conference called Monitorama where they discuss it. I want to go through our view of the problems we're trying to solve with this project. One problem is that monitoring as a service isn't really out there yet as an open source project. Most of the operational monitoring projects don't have a multi-tenant model, for example, so that's not well addressed in the industry right now. Performance, scale, and data retention are huge problems today. In our public cloud we're generating 25,000 to 30,000 metrics per second, and most monitoring solutions that we know of, pretty much all of them, really can't scale that high. You see this in the industry, where lots of web companies have developed their own proprietary monitoring solutions. There are multiple uses of the data: SLA calculations, business analytics, and root cause analysis all come into play. It's more than just operational monitoring, where you might keep track of the last day or two; after a big incident occurs, you might want to go back and look at it several days later. Elasticity and dynamic runtime configurability are critical when you're running a cloud at scale. You have to manage your metrics and alarms today, and a lot of systems, for example, make you create alarms manually for every resource you want to monitor. Spammy alerts and alert fatigue are very big problems: getting thousands and thousands of alerts a week, which is very common for NOC and IT organizations, just leads to a lot of alarms being ignored. Real-time event streaming is another problem. Extensibility: many monitoring solutions are extensible, but only in minor ways; you can't extend the internal processing pipeline itself. And many of them, although they have some kind of API (Nagios has an HTTP API, Zabbix has an API), aren't really built around fully first-class APIs. Also, most monitoring systems today are focused on one or two problems. They might do internal operational monitoring, but they don't do the external, customer-facing monitoring-as-a-service type of thing. Or they focus, like Nagios does, on health-status alerting, but they don't do metrics processing. So finally, the last thing is just cryptic data.
If you look at these legacy monitoring solutions, you're forced to take a metric you might want to describe and force-fit it into something that has maybe a name, a host, and a host group, rather than being able to describe the metric with the region, the zone, the resource ID, et cetera, the way you'd really like to.

So what is Monasca? Monasca is a monitoring-as-a-service solution based on a first-class REST API. Multi-tenancy is done via Keystone authentication, and this really helps lead to a self-service model for monitoring: the monitoring group doesn't need to be involved every time you want to add another thing you'd like to monitor. It's highly performant, scalable, fault tolerant, and capable of big data retention. We can take those 25,000 metrics per second that we're receiving in our public cloud today and store them for retention periods of 13 months or greater, which can be really important for SLA calculations, business continuity requirements, et cetera. It supports metrics storage, retrieval, and statistics. It has a built-in thresholding engine and a notification system. Real-time event stream processing is in progress; Sandy's going to cover that. It's open source, and it's built completely on open source technologies such as Kafka, which is a highly scalable, fault tolerant, durable message queue. Kafka was developed and obviously used by LinkedIn, and there are many, many companies in the industry whose messaging systems are using it today. Apache Storm is another technology we're based on; it's a computational engine, and we use it in our threshold engine. We support several time series databases: InfluxDB is in today, and Tong at IBM is working on supporting Elasticsearch. Our solution consolidates multiple monitoring requirements into a single system. When we developed our public cloud, we had Nagios, we had collectd, and we had a customer-facing monitoring-as-a-service solution. We looked at that and said, wow, that's a lot of systems, how did we get here? We took a step back and said, well, we think we can do this with one system, and that's basically what we're doing with Monasca. It's extensible, based on a microservices, message bus-like architecture.

This is the one picture that kind of summarizes the overall architecture of Monasca. The first-class REST API is the big blue bar on the top; all the interaction in our system goes through the REST API. I'll talk about metrics first, then I'll talk about events. In the upper right there, we've got the Monasca agent, our own Python agent, and it posts metrics to the REST API. From there, the metrics end up in our message queue, which is Kafka. What I'm not showing here is the deployment architecture: all of these components, like Kafka, can be scaled out across a cluster, and it uses a consensus-based algorithm, so if any node fails you can recover from that. OK, so metrics end up in our message queue. Kafka is a durable system, so they're all stored there until they're consumed by one of these components. I'm going to go down the metrics pipeline first, starting with the persister. The persister consumes from Kafka and stores into our metrics, events, and alarms database. That database is InfluxDB today, or it will be Elasticsearch in the future.
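For intuition, the persister is conceptually just a small consume-and-write loop. Here's a minimal sketch, assuming the kafka-python and influxdb Python packages and a simplified metric envelope; it's illustrative only, not the actual Monasca persister code.

```python
# Minimal sketch of a persister-style loop (illustrative, not the real Monasca
# persister): consume metric envelopes from Kafka and write them to InfluxDB.
import json

from kafka import KafkaConsumer          # kafka-python package (assumed)
from influxdb import InfluxDBClient      # influxdb package (assumed)

consumer = KafkaConsumer('metrics', bootstrap_servers='localhost:9092')
db = InfluxDBClient(host='localhost', port=8086, database='mon')

for message in consumer:
    envelope = json.loads(message.value.decode('utf-8'))
    metric = envelope['metric']                    # envelope layout simplified here
    db.write_points([{
        'measurement': metric['name'],             # e.g. cpu.user_perc
        'tags': metric.get('dimensions', {}),      # hostname, region, zone, ...
        'time': metric['timestamp'],               # epoch milliseconds
        'fields': {'value': float(metric['value'])},
    }], time_precision='ms')
```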
We also support Vertica today, which is a commercial analytics database from HP. So that's the job of the persister, and if you wanted to add support for Cassandra or something else, that could probably be done fairly easily; we did the port to InfluxDB in about three weeks. OK, so the persister stores the data. The threshold engine consumes metrics from the queue and does your traditional alarm threshold calculations on them. If a threshold is exceeded, it publishes an alarm state transition event back to the queue on another topic. Alarm state transition events are consumed by the notification engine, and if an alarm state transition event matches a notification method that's defined, it sends out a notification, like an email, to somebody. Today we support email; again, that system could easily be extended with other things like PagerDuty or opening a ticket in a ticketing system.

OK, so we have another database over here that we call the config database. We were naming-challenged at the time, but it kind of stuck. The config database stores things like our alarm definitions and our notification methods, things that are read and used by these other components. So you store something like an alarm definition via the REST API into the config database, and the other components read that information out later on. The other things that we have: a Horizon dashboard integration, and we're integrated with Grafana. Those all operate via the REST API; they're not going into the database directly. We also have a Python monasca client, which is similar to the other OpenStack Python command line clients.

OK, so I left out a few components; I'm going to cover the events part next. Way up to the right again, we have events, and they're similar to metrics: they're posted to the REST API, they end up getting published to the message queue, and they're consumed by a number of components as well. The persister doesn't support this today, so we don't store those events into our metrics and alarms database yet; that'll be added. The transform engine consumes an event, transforms it in some way, perhaps reducing it to something that's easier to operate on, and then publishes that event back to Kafka. So it reads events in, transforms them, and publishes events back out. The events engine then consumes the transformed event and processes it in some way. Sandy's going to cover this, so I won't get into details, but the output of events processing can include things like more metrics or events. So if we go down this path, we can take the events engine, process a bunch of events, and that can ultimately result in a metric, which then ends up in our pipeline later on, getting processed by the threshold engine and the persister. We also have an anomaly prediction engine that's in progress, and I'll show some information on that a little later.

So, more details on the metrics pipeline. We have a REST API for creating and querying metrics and getting statistics on them. The slide shows a really small example of how that works: it's a dictionary with a name, dimensions as key-value pairs, et cetera. And it's flexible; you can have those dimensions be anything you want. I'm just showing hostname, region, zone, and service, but you can have resource ID, cloud tier, device, an HTTP URL if it's an active check, et cetera. We also support creating alarm definitions, which is somewhat unique in Monasca: we create alarm definitions, not alarms.
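As a rough illustration of those two calls, here is what posting a metric and creating an alarm definition against the REST API might look like with plain `requests`. The endpoint host, port, and token are placeholders, and the field values are made up for the example; in practice you would authenticate through Keystone and would typically use the python-monascaclient instead.

```python
# Rough sketch of talking to the Monasca REST API directly with `requests`.
# Host, port, and token are placeholders; not a definitive client implementation.
import time

import requests

API = 'http://monasca-api.example.com:8070/v2.0'   # placeholder endpoint
HEADERS = {'X-Auth-Token': 'KEYSTONE_TOKEN', 'Content-Type': 'application/json'}

# Post a metric: a name plus free-form dimension key/value pairs.
metric = {
    'name': 'http_status',
    'dimensions': {'hostname': 'devstack', 'region': 'region-a', 'service': 'nova-api'},
    'timestamp': int(time.time() * 1000),    # epoch milliseconds
    'value': 0.0,                            # e.g. 0 = check passed, 1 = failed
}
requests.post(API + '/metrics', json=metric, headers=HEADERS)

# Create an alarm definition (a template, not an alarm): it matches every metric
# named http_status, grouped per hostname, and alarms get created automatically
# as matching metrics show up.
alarm_definition = {
    'name': 'HTTP status check',
    'expression': 'max(http_status) > 0',    # the compound expression grammar
    'match_by': ['hostname'],
    'severity': 'HIGH',                      # low, medium, high, or critical
}
requests.post(API + '/alarm-definitions', json=alarm_definition, headers=HEADERS)
```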
When you create an alarm definition, it acts like a template. As metrics arrive later on in the system, we match them against the alarm definition, and if there's a match and we're not already processing that metric, we create an alarm for it at that point in time. What this allows you to do: say one of your metrics is an HTTP status, and it's being reported for a whole bunch of services. You create one alarm definition, and automatically, as those services come online and new ones are found and discovered by our threshold engine, we create alarms for them. You don't have to do that manually later on, which otherwise involves a lot of configuration. It has a simple compound expression grammar, shown there. We have actions associated with alarms. There are three states of an alarm: OK, Alarm, and Undetermined. When an alarm transitions, we have actions, which result in notifications being sent; if you associate a notification, like sending an email, with the Alarm state, we'll send it when we transition into that state. When we create these alarm definitions, we can give them a severity: low, medium, high, or critical. Alarms can be queried and deleted, and we also keep track of the entire alarm state history, so you can go back and look at the alarms that have occurred and do analytics and root cause analysis on them later. Notification methods are a way to create addresses and then associate those addresses, like an email address, with an alarm.

We have an agent, a Python monitoring agent. It supports your usual system metrics and service metrics, for services such as RabbitMQ, MySQL, Kafka, and many others. We do application metrics too: there's a built-in StatsD daemon, and we also have a library for supporting our dimensions concept. We support VM metrics. We do active checks, things like HTTP status checks and system up/down checks. It can run any Nagios plugin, and it's extensible and pluggable. All the other services above, RabbitMQ, MySQL, et cetera, have been added using the extension mechanism, and if you want, you can add others; we'd love people to do that. We have a Horizon dashboard that's integrated, and there's a very good time series dashboard out there called Grafana that we've integrated with as well; I'll be demoing that later. OK, so Sandy is going to cover events.

Thanks, Roland. Good morning, everyone. Good turnout for this early on the third day. Awesome, thanks. So I'm going to talk a little bit about StackTach version 3. As Roland alluded, this is our event pipeline engine. For those of you who are familiar with earlier versions of StackTach, we started it off as a debugging tool, really, for OpenStack. Most of the services within OpenStack can publish notifications onto one of the queues, and you'll see that in this diagram here: almost all the services can publish these things out. Then it's the responsibility of some downstream system to consume that data. Notifications, for those of you who aren't familiar with them, are big, nasty, nested JSON data structures. They're awesome. They have a lot of cool information in them: the who, the what, the when, the where, the why of what's going on in your system. So it's different from log files, which just sort of tell you everything, and it's different from metrics, which are things like "CPU is at 70%". This gives you a lot of really interesting data for auditing, billing, debugging, just some great information.
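For reference, here is roughly what one of these notifications looks like, heavily abbreviated; real payloads carry far more fields and nesting than this illustrative sketch shows.

```python
# An abbreviated, illustrative OpenStack notification. Real payloads are much
# larger and more deeply nested; only a few representative fields are shown.
notification = {
    'event_type': 'compute.instance.create.end',
    'publisher_id': 'compute.host-001',
    'priority': 'INFO',
    'timestamp': '2014-11-07 08:15:30.000000',
    'payload': {
        'instance_id': '4d7a3d8a-...',        # truncated for readability
        'tenant_id': 'a1b2c3d4-...',
        'instance_type': 'm1.small',
        'state': 'active',
        'launched_at': '2014-11-07 08:15:29.000000',
        # ... plus dozens more fields, including nested image and network data
    },
}
```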
So we want to be able to consume that. The problem we see is that a lot of people who want to consume this information end up rolling their own solution: they make their own consumer to read from the queue and store these events, which are probably about 10K each. In a large production system we run into about 100,000 events a day, and that's a lot of data to store, so as Roland talked about, archiving is difficult. As you horizontally scale, you've got ordering concerns for the time the events are being processed; you want to keep those in order, and that's a difficult situation. OpenStack doesn't do a great job of defining the structure of these notifications, and that's something we're working on at this Kilo summit, so understanding the schemas for these things is a little tricky right now; hopefully it'll get better. And then when you want to process those streams, having your downstream workers coordinate that data and fetch it is pretty expensive. It usually results in a lot of batch processing: people do big queries against the database, pull out the events they're interested in, and then do some work on them, and that's expensive.

So StackTach version 3 is a complete rewrite. We don't have the funky UI that we had in the previous versions, but we've got a really solid engine and a very generic architecture, and it's built to scale. We've got Winchester, which is our actual stream engine that takes these events and works on them as a real-time stream. Stack Distiller takes that big JSON payload and compresses it down into something that's manageable, a flat key-value dictionary, so you can work on it a lot more easily; it gives you a unified view of the world. Then we've got other things like date expression processing. Shoebox is a way of archiving the raw notifications; I'll talk about that in a second. Quincy, Quince, and Klugman are named after forensic scientists, the people who look at dead bodies, which is usually what you're doing when you're looking at events: what happened, why did this thing die? That's the UI side of it, and Klugman is our command line tool and client library. And then we have Yagi. Yagi is the thing that consumes from the queues. So if you want to build a solution, one way we do it is we just fire up a whole bunch of Yagi workers that start pulling these events off the queues and stuffing them into StackTach version 3. The first step of that process, like I said, is that we take that big notification and distill it; that's what the Stack Distiller library does, and we get this nice manageable event that we can work with. Everything has been scrubbed, unified, and put in the proper format, and bad events are discarded. Then typically what happens is a notification comes in from Yagi on one side of the pipeline and we throw it out to Shoebox, which stuffs it into Swift or some other storage if you want; that's a great way of doing long-term archiving for all these big events. On the other side, we distill it down into what we call an event and feed it into Winchester, our stream engine, and this, like I say, can be horizontally scaled. We store these streams as they come in in MySQL, and then eventually we take those streams and hand them off to a handler chain to get some work done on them.
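Before moving on, the "event" that Winchester works on after distillation is nothing more than a flat dictionary of scrubbed key/value pairs. The field names below are illustrative of what Stack Distiller might emit for the notification shown earlier, not an exact schema.

```python
# A distilled event: one flat dictionary per notification (field names illustrative).
import datetime

event = {
    'event_type': 'compute.instance.create.end',
    'when': datetime.datetime(2014, 11, 7, 8, 15, 30),
    'request_id': 'req-9f2c...',               # truncated
    'instance_id': '4d7a3d8a-...',
    'tenant_id': 'a1b2c3d4-...',
    'instance_type': 'm1.small',
    'state': 'active',
    'launched_at': datetime.datetime(2014, 11, 7, 8, 15, 29),
}
```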
So for our Monasca integration, what Roland's team has done is create the Monasca events API; Tong has rewritten the API from Java into Python, so we've got a Python API that takes these notifications and puts them onto a Kafka queue. Then the Monasca transform engine uses Stack Distiller: it takes the notification off the Kafka queue, reduces it down to an event, and puts it back on the queue, and then there are all these Monasca event workers that take those and feed them into Winchester. So you can see how you can use StackTach on its own, or you can take it and integrate it into your own environment and just choose the pieces you want.

So the stream processing, this is really where the magic happens. You have an event, and you might have some key you're interested in, like request ID. A request ID is a unique identifier for every operation started on the system: create a new instance, delete an instance. We have a YAML grammar that you can define these streams with. In this case we're going to distinguish on request ID, and what happens is new streams get created as unique request IDs come in. Those are like buckets where all the events with that request ID get stored, sorted, and managed. So, like I mentioned, as these events come in they get sorted by request ID and grouped, and then we can trigger on them. We can order them by time, so it's like a jitter buffer in TCP: you get the events back in order, even though they're coming from multiple distributed services. We've got a really rich grammar for it, and it's all stored in MySQL, so your existing DB admins can do all the work with this; they don't have to learn a whole lot of new technology. What happens is you trigger on these streams based on a couple of things. It can be some unique series of events that came in, where you say, yep, that's the end of that operation, I'm going to do some work on it. Or it can be an expiry time, where you say, OK, I haven't heard from this stream in an hour, let's see what's going on with it. Then you feed it into a pipeline, and that's where you actually do the work on the events. You'll be handed an ordered set of events to work on; no database queries required, you just get the actual events you're interested in. And we have things for doing usage, for overbilling and underbilling situations, and performance stuff. You'll see in the demo that we can talk back to Monasca, and we can publish out to StatsD, for example. So you can generate new events from these pipelines and you can generate metrics. It's like a snake eating its tail: you've got events generating metrics, and metrics generating events as alarms, et cetera. And in the situation of, say, an expiry, you can do things like issue trouble tickets and whatnot. So: idempotent processing for the pipeline, horizontally scaled, three-phase commit on the pipeline so you're not duplicating effort. It's a pretty rich system.
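The stream definitions themselves live in YAML configuration; the Python literal below only sketches their rough shape based on the description above. The field names are approximate and chosen for illustration, not the exact Winchester schema.

```python
# Rough shape of a Winchester-style stream (trigger) definition. The real thing
# is YAML config; field names here are approximate, for illustration only.
stream_definition = {
    'name': 'instance_create_timing',
    'distinguished_by': ['request_id'],            # one stream (bucket) per request_id
    'match_criteria': [{'event_type': 'compute.instance.*'}],
    'fire_criteria': [{'event_type': 'compute.instance.create.end'}],
    'expiration': '$last + 1h',                    # give up on streams that go quiet
    'fire_pipeline': 'create_timing_pipeline',     # handlers get the ordered events
    'expire_pipeline': 'stalled_request_pipeline', # e.g. open a trouble ticket
}
```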
I'm going to get Tong up here just for a second to talk about the microservices architecture, which is really the thing that lets us make this magic happen and integrate easily with other systems.

Thanks, Sandy. So on Monday, Jonathan Bryce, in his keynote speech, summarized OpenStack in one word: choice. That's really awesome; OpenStack provides choice to people. Now, when we designed Monasca we didn't have that word in mind, but we know monitoring as a service means different things to many people. It's not just events or metrics. Monitoring can produce a lot of things, not just JSON: it could be images, it could be video. So how do we design a system that can handle all of that? We introduced this microservices idea into our architecture, which allows us to deal not only with metrics and events, but also with other data you might have, say audit data like CADF, or images. How do you handle that? We want to leave the option to the developers to add APIs that handle their data, whether that's selecting the data or something else; they know their data best, so they can write code that handles it precisely. With that, you can add your own API, look at the third box there, and enable it, so your system, even though it's still Monasca, can handle the data you have. We call this the microservices approach, and it basically becomes a configuration issue; you don't have to rework the entire Monasca system. I think this little change enables Monasca to do a lot of great things. If you have questions, you can stay behind and we can talk more. Thanks.

Cool, thanks, Tong. So if you want to learn more about StackTach: I want to get to the demo quickly, so I'm just going to jump ahead, but it's all on StackForge, so it should support everyone's licensing and keep the legal department happy. We've got a dev environment called Sandbox; you don't even need to have OpenStack running if you want to play with it. And there's a YouTube playlist with a bunch of screencasts covering the gory details of how this thing works. We'll make these slides available. Thanks. Oh, and there are some cool StackTach stickers afterwards if you want one.

OK, the anomaly engine. In that last slide you saw, there was the anomaly engine way off on the side there. So what does this thing do? It implements real-time streaming anomaly detection. It's basically a consumer of metrics and a publisher of metrics, very simple. What happens inside the anomaly engine is very complex, but the integration into Monasca was very simple. We don't have an anomaly API yet, but once we work out the microservices component, we'll have an anomaly API for controlling it; today it can operate without any interaction. OK, so we support two algorithms. One comes from a company called Numenta; they've been around about 10 years. Jeff Hawkins, I believe his name is, the inventor of the Palm Pilot of all things, started the company a while ago, and he's been in this area for several years. They actually have a service out there called Grok that works with AWS today, and they've made their algorithms completely open source, so whatever they're using in Grok, I'm using today within this Monasca anomaly engine. That falls under the area of neural-computing-style algorithms. The other test I'm doing is Kolmogorov-Smirnov, which sounds like a drink I had last night; it's a two-sample, non-parametric, goodness-of-fit test, and I'll show that in a second. So the anomaly engine consumes metrics, does a lot of calculations, and publishes metrics back. That's pretty cool. What does it publish back? It can publish metrics in the form of a predicted value and an anomaly score, where the anomaly score is the probability that you've actually detected an anomaly.
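In spirit, the anomaly engine is just another consumer and producer on the metrics topic. Here is a heavily simplified sketch, with the actual scoring (NuPIC, or the Kolmogorov-Smirnov test) replaced by a placeholder function; it illustrates the integration pattern, not the real engine.

```python
# Heavily simplified sketch of the anomaly engine's consume/score/publish loop.
# Real scoring (NuPIC, Kolmogorov-Smirnov) is replaced by a placeholder here.
import json

from kafka import KafkaConsumer, KafkaProducer    # kafka-python package (assumed)

def anomaly_score(metric):
    # Placeholder: a real implementation feeds the value into a model trained on
    # the metric's history and returns the probability that the point is anomalous.
    return 0.0

consumer = KafkaConsumer('metrics', bootstrap_servers='localhost:9092')
producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         value_serializer=lambda d: json.dumps(d).encode('utf-8'))

for message in consumer:
    metric = json.loads(message.value.decode('utf-8'))['metric']
    score = anomaly_score(metric)
    # Publish the score back as just another metric, so the threshold engine,
    # persister, and dashboards can treat it like anything else.
    producer.send('metrics', {'metric': {
        'name': metric['name'] + '.anomaly_score',
        'dimensions': metric.get('dimensions', {}),
        'timestamp': metric['timestamp'],
        'value': score,
    }})
```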
And so you can now use that metric in your threshold engine and say: tell me about all the anomalies you're seeing and send me an email when you find something anomalous, so I don't need to look at data all night long. OK, so this slide here: when I was trying to prepare for this, I thought, what am I going to show these guys? I've got all this data. So I let it run overnight on Thursday, and then I decided to run it again Friday night, so this is Friday night's picture; it was Halloween night. Anyway, if you look at the top, that's the actual value in green and the predicted value in yellow, which the NuPIC algorithm is producing. You can see, I don't know exactly when that was, around 12 o'clock at night, something happened and we bumped up from around 30% utilization to 35%; it looks like a big jump there, but you can't see the numbers, and it's really not. So we went from 30 to 35, and for a while there NuPIC predicted anomalies, at the same time as the jump, and we stayed anomalous with NuPIC until we got to here, where things dropped back down, and then NuPIC essentially says, I've seen this before, so that's not anomalous. The bottom one is the anomaly score for the Kolmogorov-Smirnov algorithm, and you can see right here it predicted some values. When I first looked at it I thought, oh wow, I can't show that to everybody, there's nothing there, that's not right. But it is right: if you look up there, a little bit before that point you can see the variance is a little higher; we're starting to spike a little in that region. That top graph is the CPU utilization for a devstack VM running on my MacBook; that same VM is running right now. So something happened at around 12 o'clock at night, it did it two nights in a row, it lasted for a couple of hours, and then it dropped back out. This is just showing the anomaly scores, the probability of an anomaly. Pretty cool stuff, and you can use it in lots of ways to uncover things.

OK, so the current status of the project. Monasca and StackTach v3 are open source. We're not an OpenStack incubated project; we would like to target that, and it depends on other people's interest, which we're getting a lot of, so maybe that's in the cards for us. We support metrics, alarm definitions, alarms, and notifications completely today, and that's ready for production. Who's working on it? Three big companies, HP, Rackspace, and IBM, are primarily working on it today, and we've got a lot of interest, so maybe we'll be adding more. Who's deploying it? HP: we're not deployed yet in our public cloud; we have it in our test environments, and it's also going to get integrated into our Helion distribution. Time Warner Cable is doing a lot of work with us, a lot of DevOps-style work, and giving us feedback, and Workday has also taken a fairly serious look at it; the rumor is that they're going to be deploying it in a couple of months.

OK, so performance. Everyone asks, what's your performance like? This is just metric inserts per second. You know the quote: there are lies, damned lies, and statistics. It's really hard to do this and give you really good data, but this is just the raw insert performance; I don't have query numbers to report today.
My test deployment environment here is three HP ProLiant SL390 Gen7 servers, and we have InfluxDB running in a clustered deployment on them. Our total end-to-end performance, including storage, so basically the number of metrics we can handle per second going into our REST API and ending up in InfluxDB, is somewhere between 25,000 and 30,000, depending on how many metrics you're sending per HTTP request and what time of day it is. The Monasca API itself, a single API instance, can handle 50,000. I didn't spend a lot of time tuning that; we could probably go much higher than 50,000, but this is what we measured this past week, and 50,000 is already fast enough. If you're deploying on a three-node cluster, that means you can do 150,000 per second, which is well beyond what we can do with InfluxDB right this minute. But if you really do need more database performance, Vertica is supported; Vertica is what we have in our public cloud, and I can tell you for sure it's capable of hundreds of thousands of metrics per second. Elasticsearch is being looked at, and Cassandra might be another one we look at adding in the future.

OK, so next steps. Events are in progress, anomaly detection is in progress, and we're formalizing the microservices architecture, which means defining the message formats. There aren't too many messages in the system: there are metrics, there are events, and there are domain events, things like "an alarm was created." A Python port is in progress; a number of those components up there were originally developed in Java. The persister has been ported to Python. All components right now are Python except for the API and the threshold engine: the API is, I'll say, 75% ported to Python, while the Java API is out there, 100% functional, and what we're using today. The threshold engine, which is written using Apache Storm, is completely in Java right now, but we'll look into doing that port after we're done. OK, so the call to action: we're looking for contributors. There are obviously lots of ways to contribute to this project; see any one of us if you'd like to do that. For more info, we're out there on StackForge, and we have Launchpad, a wiki, and IRC. We also have a pretty cool Monasca development environment, which I'm going to show here. If anyone has downloaded it in the past and run into difficulties running it: we used to use something called Berkshelf, but we've removed all our Chef dependencies and we're now using Ansible, so it should be much more reliable.

OK, a demo. So, as I was saying, we're integrated into Horizon; I'm going to supersize that. All right, so basically we have this panel here. This is just our main overview panel: we've got OpenStack services at the top and then servers at the bottom. You can see Nova API, Cinder API, Glance, Swift, and monitoring. Currently we're all green, and when I click on this it's probably going to ask me to log out. But we can take a look at, say, the Nova API. Yes, timed out. OK, we're back. So we've got the Nova API here and we'll click on that. What can we do in here? We can look at the history of the Nova API, and we can see we had a bunch of alarms; last night I was playing with this, stopping Nova API and starting it again. I can see it's green right now. What I'll do is come into my devstack VM. This is our Vagrant development environment; we've got two VMs running, one that runs the Monasca services and one that's running devstack.
On the devstack VM we have our Monasca agent running, sending metrics to Monasca, and we also have that same agent running on the Monasca VM, sending metrics about itself. So: sudo stop nova-api, I hope I remembered that right. Yes. OK, that's going to take a second to turn red, and we'll come back to it.

OK, so I mentioned that in our system we create alarm definitions, not alarms; an alarm definition is like a template for something. Let's go through that process. We'll create an alarm definition and call it "disk space usage." We'll pick the average, and these are all the metrics in the system right now; we've got quite a few in here, but we'll look at disk space utilization percent. Now, right here where it says matching metrics: these are all the metrics in the system that match this alarm definition, that our threshold engine knows about right now. If we were to bring another system online, when that system starts sending metrics it will match the definition and we'll automatically create alarms for it. What you can see in here, which you probably can't because it's a little small, but I can: all the devices are in the first column, and then we have the hostname. We've got two hosts, devstack and mini-mon, and the first device is /dev/sda1; the second one is also /dev/sda1, for the two different hosts. So that's what's going on there. We're going to end up creating alarms for about 20 individual metrics with this particular alarm definition. We'll give it a threshold of, say, 97%, make it critical, assign a notification to me, not Sandy or Tong, and go ahead and create it. So that's out there and created. What happens when you create this definition is that you end up with alarms, lots of alarms that get created. These are some that were in there earlier, and the ones I just created are just showing up now; they're all gray because they're in an undetermined state, since there's a moving-window average calculation being done here. OK, so that's alarm definitions and alarms. Notifications are pretty simple: you've got email, and I won't go through the process of creating one here; my slides will be online, and there are some screenshots in there if you want to look.

OK, so I mentioned the Grafana integration. Oh, and our Nova API is down, by the way: I stopped the service, so it turned red. Pretty cool, I love it when that works. All right, so this is our Grafana integration, and the main thing I want to show you here: this is our main panel, and we've got our service API health here. When the API is up and running it's a zero, and when it's not running it reports a one, which you can see way over here. That wasn't what I expected it to do; I think we're running out of space. Oh well, we'll go back to the six-hour view and see what that does. All right, I won't show you that. OK, so I wanted to show you CPU, disk, database, and network all being reported here; you can create these custom panels. What I want to do is show you what we ended up doing with events processing. OK, give me a second, Grafana, all right. So we have events being sent into this system right now, and what happens is they go into the events processing pipeline: they get sent to our REST API, they get put onto a Kafka queue, and they get consumed by our transform engine. They're OpenStack notifications, and these ones are synthetically generated.
I have a background process running here sending a bunch of events right now. So an event goes into the queue and gets consumed by the transform engine, which uses the StackTach v3 distiller to transform it into something smaller; then it goes back onto the queue, and our events engine reads it. The events engine is bucketizing those events based on the instance ID. When you create an instance, you end up with something like 24 events; Sandy knows the exact number. So the stream gets created when we first see that instance ID, we start bucketizing, and it completes, it fires, when we hit the compute.instance.create.end event. Then we do a calculation: in this case, we calculate the time between the start and the end. Maybe that's two minutes, I don't know what it is in real life. In my examples here I've got seven synthetically generated flavors, and my flavor 1 creates take one second, flavor 2 two seconds, flavor 3 three seconds, and so on up to seven; I made it very simple to remember. So we do this calculation, then we stick the result back into the Monasca Kafka queue, and now I can visualize it just like any other metric in the system. You can see flavor ID one, two, three, four. OK, that's kind of cool. Now, what if I have some SLA and I need to guarantee that my VMs are starting in less than two minutes? Well, I can go ahead and alarm on that, and I already created that alarm definition earlier: compute instance create time. And there it is: if the average compute instance create time is greater than four seconds, then I want it to be a high severity alarm. In real life four seconds is too small, but I'm just trying to show you synthetically what's going on here.

So, I've run out of time. That's the end of the demo; I can show you lots more if you're interested, so grab any one of us and we can take you through it. I don't know if there's time for even a single question, but go ahead and shoot. And I'm going to wind down now, because I want the next folks coming up here to get enough time. That's it for the presentation. Thank you for showing up, and thank you.