Hi everyone. Welcome to this session. I'm Franck Baudin, OpenStack Product Manager for NFV at Red Hat. Hi, I'm Pradeep Kilambi, I'm an Engineering Manager at Red Hat.

So here is the agenda for today. We will be covering some requirements around gathering metrics at scale and storing them. We'll talk about a solution approach that we are looking at, based on our experience with the legacy stack that we have been using. Then we'll dive into a detailed architecture of the new approach we are talking about today, using the Prometheus Operator. Then we'll go a little bit deeper into the configuration and deployment, and then Franck will talk about the roadmap. So with that, I'll hand it over to Franck to talk about the requirements. Thanks.

So first, what do we want to achieve? The goal is to monitor our OpenStack cluster at scale, with a fault-detection time of a few hundred milliseconds, and ideally even less, for telco and also for enterprise. And we want to have a well-defined API at multiple levels; this is very important to handle this requirement. And obviously we need a time series database to store these metrics and events. And this time series database has to handle the scale and the pace — a lot of events per second — so it has to scale, and it has to be extensible to multi-cloud as well.

So now let's have a look at what we historically had in OpenStack when we talk about time series as a service. This is telemetry; these are the telemetry components. On the left, you have the compute node with an agent which is pushing data to the controller. And on the controller you have other agents polling the OpenStack APIs to get metrics, and all of these events and metrics go into a pipeline with Panko, Gnocchi, and an alarm notifier. So this is the existing solution, which is still working well. And the key element here is Gnocchi, which is a time series as a service, which is storing the metrics; this is the point where all the metrics are coming in.

And we tried to use this existing component to monitor at scale. We pushed the number of metrics per second up to the point where it was breaking, and we have to say that this point was not nearly enough for the expectations on the first slide. Just to give you an idea, a typical polling interval for Ceilometer is 10 minutes, so we expect new metrics every 10 minutes. We were talking about 100 milliseconds, so I'll let you do the math.

And the bottlenecks: we pushed as far as we could, we profiled and debugged, and we identified the bottleneck as Gnocchi, the time series as a service. Also, on the controller node, the load was not acceptable — the controller was at 100%. At the same time, telemetry is gathering data which is mainly used for chargeback, focused on the VMs. And we want to monitor the infrastructure; we want to monitor all of the nodes, controller and compute. And for that, there is a tool named collectd, which is collecting a lot of metrics on the compute node — CPU load, temperature, a lot of things. And this is exactly what people want.

So our first idea was: we have Gnocchi, time series as a service, and we have the metrics that we want with collectd. Okay, let's connect collectd to Gnocchi. So we implemented a collectd front end for Gnocchi, and collectd was pushing data into Gnocchi. And what happened is that it did not work very well, because the bottleneck was, again, Gnocchi. It was not keeping up with the speed, and Gnocchi was the bottleneck.
And we worked very hard to make it scale and debug it, and we pushed the limit further. But still, that was not nearly enough. Also, Gnocchi is an API for time series which is standard in OpenStack, but you don't have a lot of applications that consume it and build on top of it. So that was another issue, but that one is functional.

So, just to recap, and also to give some perspective: the Ceilometer project is one of the core projects of OpenStack, from the beginning. And the Ceilometer API does not exist anymore; it is now replaced by Gnocchi. We have separated out Panko, the event API, and that API is now deprecated. And also, infrastructure monitoring is minimal with telemetry. So at the end of the day, the metrics that we want are coming from collectd. This is how we have crafted our new solution, and I want to make one very positive point here: telemetry has been designed for chargeback, it works well for chargeback, it still works well for chargeback at scale, and 10 minutes for chargeback is more than enough.

So, the solution. Remember, we are happy with collectd. So guess what: we have built a solution stack based on collectd. And the key thing about this solution is that it is a stack where you can cut where you want, depending on the solution that you want to implement, because you don't have to take it all. From a company perspective, we will support you if you cut at any level, but you have to start from the bottom. So at the bottom we have collectd, which is gathering all of the events and metrics, which we are going to distribute over AMQP, and this is going to be stored not in Gnocchi but in Prometheus. And Pradeep will give a lot of details about how it works.

Okay, great. Thanks, Franck. So as Franck described, we wanted to take a layered approach and provide multiple access points to the data. For that, we have three levels of API that we expose. The first is at the collectd plugin level itself: once the collectd plugins are enabled and gathering the data, you can redirect the collectd output to wherever you want. The second level is taking the data from collectd onto a message bus. As he said, we are using AMQ, and it uses the AMQP 1.0 protocol. Once the data is on the message bus, you have access at that point as well: you can subscribe to the bus and gather the data from there. Or you can go further and wait until the data gets all the way to the storage. In this case we use Prometheus, and once the data is in Prometheus, you can use PromQL or any of the query mechanisms and get the data from there. So the point here is that we are giving you access at multiple levels; that way you're not waiting for the data to come all the way to one point and overloading that particular endpoint. And to re-emphasize what Franck has already mentioned, Ceilometer and Gnocchi will continue to be there for chargeback and tenant metering purposes.

So this is what we call — this is basically the name we came up with until we come up with a better one — the service assurance framework architecture. The service assurance framework is the total architecture, end to end. The main pieces of the puzzle here are collectd, as we mentioned; AMQ, which is a very scalable messaging layer; the Prometheus Operator for time series data storage; and then we continue to support Ceilometer and Gnocchi for chargeback.
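To make that second, bus-level access point concrete, here is a minimal sketch of a subscriber, assuming the python-qpid-proton library; the router URL and the "collectd/telemetry" address are assumptions and have to match how your dispatch routers and the collectd amqp1 plugin are actually configured.

```python
# A minimal sketch of subscribing to the AMQP 1.0 bus with python-qpid-proton.
# The router URL and the "collectd/telemetry" address are assumptions; adjust
# them to match your dispatch router and collectd amqp1 configuration.
from proton.handlers import MessagingHandler
from proton.reactor import Container


class MetricsSubscriber(MessagingHandler):
    def __init__(self, url, address):
        super(MetricsSubscriber, self).__init__()
        self.url = url
        self.address = address

    def on_start(self, event):
        # Connect to a dispatch router and attach a receiver to the address
        conn = event.container.connect(self.url)
        event.container.create_receiver(conn, self.address)

    def on_message(self, event):
        # Each message body carries a batch of collectd samples
        print(event.message.body)


if __name__ == "__main__":
    Container(MetricsSubscriber("amqp://qdr.example.com:5672",
                                "collectd/telemetry")).run()
```

Because the routers form a mesh, a consumer like this can attach to any router in the network; it does not have to sit next to the nodes producing the data.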
So let's look at the architecture here. Think of this from the bottom to the top. The bottom is your client, which is your OpenStack cloud. You could have your controllers, computes, Ceph nodes, maybe your own Prometheus or OpenShift running on top of OpenStack, and you can also have application components. We have collectd plugins that we support that span from performance monitoring all the way to the application level — we'll talk a little bit more about the plugins in a bit. So this essentially gives you the idea: we gather the data through the data collector, in our case collectd. The data is sent up to the messaging layer. And once it's there, we have multiple access points: we can either send the data to the Prometheus Operator management cluster, which is what we call it — essentially Kubernetes, aka OpenShift, with Prometheus, Grafana, and Elasticsearch running on top — or you can skip that entirely and have third-party applications integrate directly with the bus and get the data from there. For example, you can pull the data directly off the bus without even having to worry about the cluster.

So let's dig deeper into each of the components. collectd: as most of you are aware, collectd is a very mature project. It has a very big community, and we have been involved with the collectd community as well as with the OPNFV Barometer project. We have contributors contributing to these projects while at the same time doing reviews and getting our plugins accepted. And we have been working with our partners, like Intel, on pushing to get a lot of plugins accepted upstream. We use collectd 5.8, which is the latest version of collectd, which has a bunch of plugins enabled, along with the set of plugins we mention here that are part of the Barometer project. And of course, collectd is natively supported through OSP Director, aka upstream TripleO, where you can use our templates to deploy collectd.

So here's a quick overview of OSP 13 — this is our previous version; we just announced OSP 14 two days ago. OSP 13 has a bunch of collectd plugins pre-configured, so if you enable collectd through the director, you already get all of these enabled by default. And we also package a bunch of other plugins that you can configure and enable through the heat templates. But to begin with, we have a pre-set, pre-canned set of plugins that we already enable.
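As an illustration of the first, plugin-level access point mentioned earlier — including application-level metrics — here is a tiny sketch of a collectd Python read plugin. The collectd module only exists when the file is loaded by collectd's python plugin, and the plugin and metric names are made up for the example; anything dispatched this way follows the same path as the built-in plugins, including whatever write plugin collectd is configured with.

```python
# An illustrative collectd Python read plugin that dispatches one
# application-level gauge. The "collectd" module is only available when this
# file is loaded by collectd's python plugin; the plugin and metric names
# here are made up for the example.
import collectd


def read_queue_depth():
    # In a real plugin this value would come from the application being watched
    depth = 42
    metric = collectd.Values(
        plugin="myapp",               # hypothetical plugin name
        type="gauge",
        type_instance="queue_depth",  # hypothetical metric name
    )
    metric.dispatch(values=[depth])


collectd.register_read(read_queue_depth)
```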
Okay, so jumping onto the next piece of the puzzle: as you have seen, there's a big bus sitting there, and it has a lot of other components associated with it. This is AMQ 7 Interconnect, which is our Red Hat messaging product, and we have an upstream for that as well. The advantage is that it forms a mesh network. What we run is the Qpid Dispatch Router, which comes from the upstream Apache Qpid project, and there are two types of dispatch routers, you could say. One is the edge router, which you run on the OpenStack nodes. These are very lightweight, extremely scalable, very high-performance, and they don't add much overhead. So that is what we run on the OpenStack nodes. And then we have something called a core QDR, which is essentially the main router that takes all the messages from the edge routers and routes them appropriately. The advantage of this is that it uses a shortest-path algorithm to pass a message from client to server. And it covers all the buzzwords like high availability: it has a really nice way of retrying, so the edge router keeps trying its connections to the core until it gets one, and if a core goes down, it tries another connection. So it is very resilient. And not to mention it's stateless and very efficient end to end.

So let's get to the star of the talk, Prometheus. I guess most of you are aware of it, but here is a quick intro. Prometheus is an open source monitoring tool that most of you will have heard of; if you have not, it's an open source monitoring tool mainly geared towards metrics — it doesn't do logging, it doesn't do tracing, it's just for metrics. It uses a pull-based mechanism, so it does scraping through HTTP GET. If you look at this picture: you have a Prometheus server, it uses HTTP to go to your target — in this case a target could be your node — it gathers the data and stores it in Prometheus. It has a really nice query language that you can use to query and do visualizations; you can use the Prometheus web UI for visualization, or you can use Grafana and create dashboards. Another nice thing is that it has a multi-dimensional data model, which makes querying a lot more flexible. And of course it has alerting and rules, so we use Alertmanager, which we'll talk about in a bit.

So this is all great, but one of the things about Prometheus is that the configuration can be very complex. It's not something that's very easy to deploy, then redeploy, then maintain. For that, there is this concept called an operator. It's an older concept, but the framework was built by CoreOS. Essentially, an operator is a software abstraction: it takes the operational business logic and encapsulates it, and it takes care of managing your application. It's mainly built for Kubernetes applications; it manages the lifecycle and installation of your Kubernetes apps, and it does that through custom resource definitions. The picture here I grabbed from the CoreOS website, but you can see the flow. Essentially this is the goal of an operator: it takes an expected state and an observed state, and it goes through a sequence where it observes the state, does its analysis, and then acts on it.

So for example, let's jump into Prometheus. In the Prometheus case, let's say we want to abstract the business logic and create an operator. What the Prometheus Operator does is this: let's say I create a custom resource and say I want to run two instances of Prometheus with version 3.10. Okay, great, so it ensures that you have that running. Now, six months later, I want to upgrade my Prometheus to 3.11. Instead of you going and doing all of that by hand, you just update the custom resource and say I want to use version 3.11 and make sure I have three instances running.
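What that desired-state change can look like, as a hedged sketch: the snippet below patches a Prometheus custom resource through the Kubernetes Python client and lets the operator reconcile the running pods toward it. The namespace ("monitoring"), the object name ("k8s") and the version string are assumptions for illustration; in practice you might just as well edit the custom resource's YAML manifest directly.

```python
# A hedged sketch of declaring a new desired state for a Prometheus custom
# resource with the Kubernetes Python client; the operator then reconciles
# the running pods toward it. Namespace, object name and version string are
# assumptions for illustration.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run in-cluster
custom = client.CustomObjectsApi()

desired = {"spec": {"replicas": 3, "version": "v2.3.1"}}  # illustrative values

custom.patch_namespaced_custom_object(
    group="monitoring.coreos.com",  # API group served by the Prometheus Operator
    version="v1",
    namespace="monitoring",
    plural="prometheuses",
    name="k8s",
    body=desired,
)
```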
So if it notices that it has two pods running and they're on an older version, it's going to spin up another pod and bump your version to 3.11. So it makes all of this very easy: you don't have to sit and muck with your configuration files, and those can get really long. That's basically the advantage of the Prometheus Operator: it preserves the configurability while abstracting out the complex configuration.

Along with that, we have a few other components in this service assurance architecture. Prometheus is great for metrics, but we need some way to handle events — and we have a lot of events coming in as well. So we use Elasticsearch for events and logging, and we do it through an ELK stack that sits next to the Prometheus cluster. The advantage of using Prometheus here is that we can take the Elasticsearch events and forward them to the Prometheus Alertmanager, and once Alertmanager generates the alerts, it can send them back to the Qpid dispatch router — I'll show you how all of that works in a little bit.

And that is a good segue: this works through the Smart Gateway. Now, the data that we receive from collectd is obviously in the collectd format. Prometheus has its own model — it's a time series database, it works through labels — so you need some way to translate the data from the collectd format to the Prometheus format. So we built something called a Smart Gateway which handles that, for events and metrics. What it does is take the data, convert it to that format, and expose it through an HTTP server, and then Prometheus can come and scrape the data from that server. It also helps with relaying the alarms, which I'll get to in a bit. And finally, Grafana: it's optional, but you can run Grafana to visualize your data.

So this is a little bit more in-depth, deeper into the server side. Let's go from this side: you have collectd, and the dots here are the QDRs — these are the OpenStack nodes on that end. Now the data is coming onto the AMQ bus, which is the line below. And once the data is there, you have two dispatch routers, QDR A and QDR B. In this case we are running two instances of Prometheus; if you have three instances of Prometheus, you'll have three of these — just think of it that way. Now, for each of the Prometheus instances, we have a separate Smart Gateway. The way the Smart Gateway works is, as you see, we have multiple components: once we gather the metrics, they are picked up by the metric listener, then cached, then we do the conversion, and they are exported through the metric exporter. This exposes an HTTP API, and Prometheus comes and scrapes the data from this API periodically, every 10 or 15 seconds. The next piece here is events. For events, we have something called an event listener: what the event listener does is wrap the data and then work with the Elasticsearch client to send the data over to Elasticsearch.
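To make that metric path — listener, cache, conversion, exporter, scrape — a bit more concrete, here is a rough Python sketch of the translate-and-expose idea. This is not the Smart Gateway itself, just the shape of it: the metric name, labels, port, and the simplified collectd-style payload are all assumptions for illustration.

```python
# A rough sketch of the translate-and-expose idea: map a collectd-style sample
# onto a labelled Prometheus metric and serve it over HTTP for scraping. This
# is not the Smart Gateway itself; metric name, labels, port and the
# simplified payload shape are assumptions for illustration.
import json
import time

from prometheus_client import Gauge, start_http_server

CPU_GAUGE = Gauge("collectd_cpu_percent", "CPU usage reported by collectd",
                  ["host", "cpu", "kind"])


def handle_sample(payload):
    # A real gateway would receive these payloads from the QDR; the field
    # names below are a simplified stand-in for collectd's JSON output
    sample = json.loads(payload)
    CPU_GAUGE.labels(host=sample["host"],
                     cpu=sample["plugin_instance"],
                     kind=sample["type_instance"]).set(sample["values"][0])


if __name__ == "__main__":
    start_http_server(8081)  # Prometheus scrapes http://<host>:8081/metrics
    handle_sample(json.dumps({"host": "compute-0", "plugin_instance": "0",
                              "type_instance": "user", "values": [3.2]}))
    while True:
        time.sleep(15)
```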
And then you can see a few more clients here: the Prometheus Alertmanager client, an HTTP client, and an alert publisher. What's happening here is that the Alertmanager client handles the alerts. Alert rules are defined, and once the event data comes into Alertmanager, it sends the resulting alert out via HTTP, and the alert publisher publishes it back to the QDR — and now it's back on the bus. So it's essentially a loop, all the way from the event coming in: we process the event, an alert is created, the alert goes through Alertmanager, and then we publish the alert back onto the bus. So that's kind of a very high-level overview. And you can see that once the data is in Prometheus, you can query it and use Grafana to do graphing and analysis and all that. So that's the architecture.
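And since querying is the third access level, here is what that last step can look like — a minimal sketch against the Prometheus HTTP API, assuming the server URL and reusing the illustrative collectd_cpu_percent metric name from the exporter sketch above; substitute whatever names your gateway actually exposes.

```python
# A minimal sketch of the query-level access point: asking the Prometheus
# HTTP API for a PromQL result. Server URL and metric name are assumptions.
import requests

PROMETHEUS = "http://prometheus.example.com:9090"

resp = requests.get(PROMETHEUS + "/api/v1/query",
                    params={"query": "avg by (host) (collectd_cpu_percent)"})
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    timestamp, value = result["value"]
    print(result["metric"].get("host", "?"), value)
```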
Now let's get into the configuration and deployment of this whole thing. It can be quite complex if you want to do it all yourself with a bunch of Ansible scripts, so what we do is use TripleO for the integration. On the client side — and when I say client side, the client side here is your OpenStack cloud — you have an OpenStack cloud with controllers, computes, Ceph nodes and whatnot running. Every node in your OpenStack cloud will have a collectd agent and a Qpid dispatch router running, and this is configured and integrated through TripleO. We have containers built through Kolla for both collectd and the Qpid dispatch router, so the orchestration grabs the container and runs it on each of these hosts. And then, through the environment template, when you're installing these, you can configure which dispatch router each collectd should point to in order to send its data. All of that happens through that template, and I'll show you in the next slide.

So here is an example of how we enable collectd and the QDR on the OpenStack nodes. On the TripleO side, you create an environment template and enable two resources — these are resource registry entries, and we pre-ship the QDR and collectd YAMLs for the Docker containers — so you enable those two. And then you can set some parameters. In this case, what I'm saying is that I need to set the collectd connection type to amqp1; previously we used the same mechanism except the connection type was gnocchi. So it's very easy to configure: in this case we're using amqp1, and then you can set the instances and the notifications and all that. The other piece here is the parameters: we need to tell it where the QDRs are and what their ports and IP addresses are, so that's what you configure here.

So now we've created two YAMLs. As you see, we have an overcloud deploy. When you're deploying your overcloud — and this happens on the undercloud; if you're familiar with TripleO, there is an undercloud, which is your management node, and an overcloud, which is your actual OpenStack cloud — we pass those two environment files and then do an overcloud deploy. So it runs these two services on top of each of the nodes.

Now, on the server side: as I said, the server side is a Kubernetes cluster — an OpenShift cluster running on three bare-metal nodes. So remember that we created an OpenStack cloud, which is one heat stack. Then, also using TripleO, we are going to deploy the server side, which is an OpenShift cluster running alongside. It's not on top of OpenStack; it's running next to OpenStack as a separate heat stack. And then we do the bootstrapping and run the Prometheus Operator, Grafana, and all that on top.

So here is an example of how you would deploy your server-side stack. You use the same overcloud deploy command, but I'm basically enabling OpenShift. This tells it what version of OpenShift, and what we use, for now, is openshift-ansible — that might change in the future — so we use openshift-ansible to drive the OpenShift cluster deployment. And then we orchestrate running the services on top: Prometheus, Elasticsearch, and the QDR.

So here is the post-deployment overview. This is how it will look — this is my setup, it might look a little different in yours. I have the OpenStack nodes at the top, and then another cluster, as you see, which is the telemetry cluster: three physical nodes, and on top of that I have OpenShift running, and on top of that you have the applications, which are Prometheus and the rest.

So this is a global, holistic view of the end to end. On one side, you see the OpenStack cloud that you have, and each of the nodes has collectd and the QDRs running. The data is sent to the AMQ bus, as you see. And then on the other side, you have the server side, which is your Prometheus Operator cluster. You can see it goes through the QDR, you have the Smart Gateway, and the data flows in from there. This is an overview with a little more detail on how the data moves: we have a Proton client, which is responsible for taking the data in and out of collectd and sending it through the messaging layer.

So, a little bit about scale. As Franck mentioned, our main bottleneck was Gnocchi — the storage and the I/O there. Mainly what we noticed was that once Gnocchi has a lot of data coming in, the CPU pegs quite a bit. So we did some scale testing, mainly with Prometheus. This is an overview of the hardware that we used, and the data that we were looking at is the raw metrics, the time series, and the alerting rules that we created. We wanted to push it as far as we could, so our target was 4,000 hosts with hundreds of metrics each — so that's like millions of metrics per second — and Prometheus was able to keep up pretty well. With four CPUs — this is an aggregation — all the way up to 4,000 hosts, we were still barely using 50% of the CPU. So it was doing very well, and we were very happy with the performance.

So with that, I'll hand it over to Franck to talk about the roadmap.

Yeah, thanks, Pradeep. So, a little bit of explanation: beside the name of the upstream release, you have our product number — 13 for Queens, then Rocky, then Stein. Rocky has been delivered upstream, and we announced 14 this week; our next release, Stein-based, will be 15. Everything you have seen is on master, meaning it will be in Stein, so in Stein this whole thing will be GA. In 14, you have almost everything except the ability to install Prometheus from TripleO, so you have to install Prometheus with an Ansible playbook — this is what we call tech preview, but it is almost all there. And something that we do at Red Hat: because our 13 release is what we call a long-life release, we are going to backport these features, because this is what our customers use.
Most of them will use 13 — they use 10 today, and after 13 they're going to use 16. Very few will go to 14 and 15 in production, because of this long-life aspect. And now that we have in 15 — and will have back in 13 — the ability to deploy a great monitoring tool for one cluster, we need to go multi-cluster. So this is what we are going to work on next.

And what do we have at the bottom? You have, let's say, a larger central Prometheus cluster that can monitor, on the right, a local OpenStack deployment plus remote deployments — that's what we call edge deployments. Have you heard about edge this week? So how do you monitor your edge, your thousands of edge sites? Okay, this is how you do it. And then you have two kinds of edge clusters. You have the small ones where you don't want to put Prometheus — this is the one in the middle — which you can monitor remotely. This is something we are going to work on, and we'll make sure it flies well with whatever kind of network you have in the middle; we are going to characterize the latency and packet drop up to the point where it no longer works, because there is a limit to everything. And then you're going to have federation of Prometheus, because you're going to have clusters where you will want to deploy a local Prometheus to monitor locally, and then aggregate with Prometheus federation. So you have a central monitoring point, which is what people want instead of having thousands of them.

Yeah, and another thing we are looking into is how we shrink this architecture. This is great for large deployments, but we also want to think about how to make it work for a smaller cloud, if they don't want to invest in too much hardware. So that is also something that we are looking into. Thank you. And we have time for questions, if any.

Yeah, this is something that we don't provide today. Yeah. Yeah, I mean, definitely that is a concern that we have with Prometheus, and we are looking into alternate solutions where we could use Prometheus alongside another technology like Thanos, to see if we can use it for long-term purposes. But I agree with you; that is something that we did not forget. Yeah. And what we don't provide, partners can provide and build on top of the API. And the key thing about this solution is that if you don't like the AMQP bus, you can just rely on collectd, and if you don't like Prometheus, you can just plug in at the AMQP level. So this is a pluggable solution, and the top of the pluggable stack today is Prometheus.

I'm just wondering if you could clarify on the AMQP bus: is it completely brokerless? So you only have the Qpid routers, or is there a broker somewhere that I kind of missed on the slides? Yeah, it is brokerless from a metrics standpoint. For events, we do use the Artemis broker to actually send the events over. So for events, we do use the Artemis broker.

So the solution presented is for monitoring infrastructure, right? And for tenant monitoring, monitoring as a service, should I still use Gnocchi? Is that the message I should take from one of the early slides? So just to repeat your question: can we use this architecture to monitor infrastructure and tenants as well? Yeah. So what do you recommend for tenant monitoring, if I want it? So you mean tenant monitoring as in OpenStack, like the services? Is that what you're talking about? Which tenant are you referring to here?
I would like to expose my metrics to users — that's what I'm saying. I would like to expose monitoring as a service. Sure. Okay. Yeah. So expose metrics of applications, for example, to the users? Yeah. That is something that we could do as part of the Smart Gateway. We already have some logic that we use to convert the data format from collectd to the Prometheus format, and Prometheus has this concept of labels that we can leverage to add things like the tenant information, the UUID and all that. We do not have that support today, but that is something that we are looking into, even from the OpenStack standpoint: for the service-level metrics that we gather from OpenStack, we will need to make sure they are somehow queryable from the Prometheus side. So the place where that logic would live is the Smart Gateway. But yeah, that is something we are currently working on, so hopefully by OSP 15 it should be in there. All right. Thanks.

Yeah. Thanks for the talk; I thought it was very interesting. The architecture of having a central message queue where metrics are published and further processes subscribe to those metrics is quite similar to Monasca. I don't know if you're familiar with that, but I wanted to know if you considered using it. So I'm not familiar enough with Monasca to comment on that, but the decisions we made here were based on the direction we wanted to go with respect to OpenShift and Prometheus. And we knew that the AMQ message routing that we have, which uses the AMQP 1.0 protocol, scales well. So this puzzle that we put together seemed like a good direction, and we just went with that. So yeah, I cannot comment on Monasca because I don't know enough about it. Yeah. Okay. Thank you.

Thank you for the talk, it was amazing. So this solution basically gives me all the alerts, metrics and logs, right? Everything in one piece. Yeah. So we use Elasticsearch for events and logging, and Prometheus for metrics. So yeah, we handle all three. Thank you. Thank you very much.