Hello everyone, my name is Iury Gregory Melo Ferreira, I'm a software engineer at Red Hat, and today I'll be presenting how to keep an eye on your bare metal infrastructure with the Ironic Prometheus Exporter.

What we will discuss today: how a new project was born, the workflow, configuration, a demo, advantages and limitations, the future of the exporter, useful links, and questions and answers.

How a new project was born: our use case is focused on hardware monitoring, because no matter the size of your infrastructure, you need to be aware of the health of your physical nodes. For example, if you have high temperatures in the processors of different physical nodes, you can have unexpected shutdowns, and of course you don't want that to happen in your infrastructure. If your node count is low, you can just go to your data center, find the machine, turn it on again, and figure out why it happened. But when you start to have a lot of nodes, you don't want to waste time trying to figure out which machine had a problem and why it stopped working; you want to be monitoring every physical node you have: temperature, power consumption, memory, voltage, and maybe other resources of these physical nodes.

Going back in time, in an earlier release Ironic introduced sensor data collection. The goal was to be able to provide IPMI data to Ceilometer, so you would be able to use the data for alarms or to trigger actions in your infrastructure. The operator would know what to do, since all the information would be in Ceilometer; you could have an interface with Grafana, for example, showing the metrics for temperature, fan speed, and voltage, so you can keep an eye on things, and if something is wrong you can just take action and not waste time, because time is money.

The advantage of choosing Prometheus is related to the data model it provides, which is focused on time series.
For each metric you have a time series of data points, each with a value and a timestamp, so for each metric you have the information over time, and you can create different alerts or have an integration with Grafana that will show the metrics there. Something that is also interesting is that the collection Prometheus does is via a pull model, so you just need to provide an endpoint where Prometheus will be scraping all the data. It doesn't matter if the endpoint is empty or has data; if you don't have anything, it will just be an empty response to what is basically a GET request.

The metrics Prometheus consumes are normally in what is called the Prometheus text format, and this format was also standardized by the OpenMetrics project, an initiative to provide a standardized pattern that can be used by different monitoring tools. In this format, you have a first line with HELP, the metric name, and a description; then a line with TYPE, the metric name, and its type; and after this you have one entry per line, each representing a different metric. A metric is identified by the metric name and an optional set of labels, each with a key and a value, followed by the value that was collected when monitoring your system. So, for example, you can have the temperature of your processor, and if you have more than one processor in the machine, you probably have one sensor in each providing its temperature. You would then have two entries, and one of the labels would be the sensor ID identifying which sensor is providing the temperature.

IPE is the Ironic Prometheus Exporter. It was introduced during the OpenStack Train release. So how does it work? How can we magically get the data and have it available in Prometheus? What is the whole workflow we expect to happen?
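As a concrete illustration of the text format I just described, an entry for a two-processor machine could look like this (the metric and label names here are illustrative, not necessarily the exporter's exact ones):

```
# HELP baremetal_temp_celsius Temperature of a sensor in Celsius
# TYPE baremetal_temp_celsius gauge
baremetal_temp_celsius{node_name="node-1",sensor_id="temp1"} 44.0
baremetal_temp_celsius{node_name="node-1",sensor_id="temp2"} 41.0
```

Same metric name, two entries, distinguished only by the sensor_id label.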
This is how your infrastructure would normally look, as a minimal example. You would have a controller node where the Ironic services are running, for example the Ironic API, Ironic Conductor, and Ironic Inspector, and also the Ironic Prometheus Exporter application. It includes a Flask application that will be used by Prometheus to collect all the data. Then you have your data center with a lot of physical nodes, and Ironic is managing the whole data center.

The first thing that happens is that the Ironic Conductor will spawn tasks for each node to collect the sensor data. When that's done, the data is sent back to the Ironic Conductor. Then, since the oslo.messaging notifier driver for the Ironic Prometheus Exporter is set in the Ironic configuration, the exporter driver will handle all the information sent to it: it parses all the sensor data into the Prometheus format and stores the data on the disk of the controller. In step four, Prometheus pulls the metrics by doing requests to the Flask application of the Ironic Prometheus Exporter.

You also need to know that steps one, two, and three are basically independent of step four; it's not synchronous. You have the whole cycle of Ironic requesting the sensor data, gathering it, and sending it to the Ironic Prometheus Exporter, which filters all the data and creates the file with the metrics for each node. Then you have Prometheus running in a different place, requesting data every 15 seconds, 10 seconds, or whatever your configuration says, so it's nothing difficult.

Now, the configuration of the Ironic Prometheus Exporter. There are two sections you need to configure to have the Ironic Prometheus Exporter working.
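The parse-and-store step of the workflow above (step three) can be sketched roughly like this. This is only a minimal illustration of the idea, not the exporter's actual code; the metric naming, label names, and file naming are assumptions:

```python
import os


def to_prometheus(node_name, metric, unit, readings):
    """Render sensor readings for one metric as Prometheus text format."""
    name = f"baremetal_{metric}_{unit}"
    lines = [f"# HELP {name} {metric} reading in {unit}",
             f"# TYPE {name} gauge"]
    # One line per sensor: same metric name, distinguished by labels.
    for sensor_id, value in sorted(readings.items()):
        lines.append(
            f'{name}{{node_name="{node_name}",sensor_id="{sensor_id}"}} {value}')
    return "\n".join(lines) + "\n"


def write_node_file(location, node_name, payload):
    """Store the rendered metrics on disk, one file per node, so the
    Flask application can later serve them to Prometheus."""
    path = os.path.join(location, f"{node_name}-metrics")
    with open(path, "w") as f:
        f.write(payload)
    return path
```

Prometheus itself never sees any of this machinery; it only GETs whatever the Flask application reads back from the configured location.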
First of all, the conductor section: you need to set send sensor data to true, otherwise you won't have any data, and there is the sensor data interval, which is 600 seconds by default, so you would wait 10 minutes to get new metrics for each node. Be aware that if you use a very low value you may cause problems for the BMC, because Ironic will be doing requests very often, and this is not good. There is also one optional setting: sensor data for undeployed nodes, which is false by default. If you have a node that is only enrolled in Ironic and you want to gather information about it, you can set this to true and you will be collecting metrics from all the nodes; they don't have to be in the active state.

The second part is the oslo messaging notifications section: you need to set the driver to the Prometheus exporter and the transport URL to fake, and there is a required parameter called location. The location you provide is the directory where all the data for the nodes Ironic has available will be stored.

So now we will have a demo; first I will explain my setup for it.
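Putting those two sections together, the relevant part of ironic.conf looks roughly like this (the location path is just an example):

```ini
[conductor]
# Enable sensor data collection; without this there are no metrics.
send_sensor_data = true
# Interval between collections, in seconds (default 600).
send_sensor_data_interval = 600
# Optional: also collect from nodes that are not in the active state.
send_sensor_data_for_undeployed_nodes = true

[oslo_messaging_notifications]
driver = prometheus_exporter
transport_url = fake://
# Required: directory where the per-node metric files are stored.
location = /var/lib/ironic-prometheus-exporter/data
```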
Basically, I had a machine where I installed Ironic and the Ironic Prometheus Exporter using Bifrost, and then I have two different machines to be used as Ironic nodes: a Dell PowerEdge R640, where I'm using the IPMI driver, and an HPE ProLiant DL380 Gen10, which I'm using with Redfish. I also have a Prometheus server and an Alertmanager running on my notebook as containers. So let's start the demo.

Here I can show you that we have Ironic installed on this machine: the Ironic API, Ironic Conductor, Inspector, and the Prometheus exporter; by default the exporter runs on port 9608. Here I can show that I have two nodes available. The first one, as I've explained, is the PowerEdge R640; the machine is powered on and already active. The second one is the HPE DL380; it is in the enroll state but powered on, which means that without the setting for undeployed nodes I wouldn't be collecting data for this machine. Now I can show you that I'm using the IPMI driver for the Dell machine, along with some other information we have about it, and if we look at the HPE machine, we'll see that it's using Redfish.

Next, we will check the Ironic configuration to see if everything is okay. As I've explained before, first we go to the conductor section, and you can see that send sensor data and sensor data for undeployed nodes are both true, and the data interval I've set to 90 seconds, only because this is a demo; you shouldn't do this in production. Then we can go to the oslo messaging notifications section (I made a typo, sorry), and you can see that we are using the Prometheus exporter driver and the fake transport URL, and for the location we are using /var/lib/ironic-prometheus-exporter/data.

So let's see what we already have in this directory. We can see two files, one for each node we have in Ironic. In this part I'm showing the number of lines: the HPE file has 93 lines and the Dell one has 158. The reason for this is that the metrics available will depend on the machine you have, so the hardware vendor and also
the driver you are using on it. I can just show what the data looks like for one of the machines, and as you can see you have HELP, the name of the metric, and a bit of description, and then you have TYPE and the name of the metric. As I've explained before, a metric is identified by the metric name and the set of labels, so if one of them is different you will have more than one entry; here you have temp1 and temp2 for the sensor label, so you have the temperature metric for each processor, basically.

Now I'm going to show that I have the containers for the Prometheus server and Alertmanager on my machine. As you can see, there are two containers running; Alertmanager is on port 9093 and the Prometheus server on 9090. If we go to the Prometheus interface, you can create different graphs for the metrics you have, and here I'm showing the fan RPM metric, so you have an idea of the speed of each fan in your machine. You can hover over the lines and see the value and all the label information about the metric, so you can understand what it represents. You can also select just a few lines if you are interested in specific sensors, to compare them instead of seeing everything at once, and you can keep adding different graphs.

In this one I have the bare metal power status metric, which indicates whether the power supply of a machine is okay or there is some failure; a value of 0 indicates that it's okay. You can't see that much, but the lower part has two lines and the top one also has two lines, because there are two sensors in each machine representing the power supplies. So one machine is showing a failure in the power supply, and we had an alert for that, which I'm going to show you now. If we click on the Alerts tab in the Prometheus server, we can see the power supply failure alert with two active; the reason for two is that there are two sensors providing metrics that have triggered the alert, and you have a description and
a summary. If we go to Alertmanager, you will see the two alerts, and if we click on the info button we can see a better message description: the health status of the power supply for the bare metal DL380 indicates a failure. We can also add more information to this message, because you can define how the message will look; I could add, for example, the sensor ID to make it easier for the operator to read. You can also set things up to receive an email, so you don't always have to be watching the interface to see what's going on.

Here I'm going to show what the Prometheus configuration of the container looks like. In the last part you need to add the job name and provide its information: for example, the scrape interval, which is how long Prometheus waits before requesting that endpoint again; the timeout; the endpoint for the metrics, which is normally /metrics; and the address and port where Prometheus will be getting this information.

Now let's talk a little bit about the advantages and limitations of the Ironic Prometheus Exporter. As an advantage, I can tell you it's very easy to install and configure; as you can see, there isn't much information you need to provide. It's basically vendor-agnostic: it doesn't matter if you have an HPE machine, a Dell machine, Fujitsu, Supermicro, you can use any type of hardware. The data collection is non-intrusive: you don't have to install anything on the physical node to get the information you need. Of course, it will depend on the information you can get from the BMC, so that puts a limit on the metrics you can have. And there is, of course, the integration with Prometheus, and Prometheus in turn provides integration with Grafana and many other tools.

As for limitations, so far we have only tested with machines using the IPMI and
Redfish drivers. We need to do some research when using iDRAC, iLO, or any other hardware type that Ironic supports, to see what metrics we can collect, and then update the Ironic Prometheus Exporter; contributions are always welcome. So far we only support gauge metrics; this means that, for example, you can't have a histogram. Sometimes you have metrics that would suit a histogram, but you can't create that with IPE at the moment. And as I've shown before, the set of metrics depends on the hardware type: with IPMI on Dell we get one set of metrics, but with IPMI on HPE, for example, the number of metrics may be different, and the same goes for Redfish with Dell or HPE and so on. It's also possible to use virtualization, but only with Redfish; IPMI doesn't support collecting sensor metrics from virtual machines.

So let's talk about the future of IPE, what you can expect. As I said, support for more drivers: we will try to get some machines and do some investigation, but if you have a machine, you can test it out, see what sensor data you can get from it, and we can create a new parser in the Ironic Prometheus Exporter to make the metrics available so you can use them for your data center. Metric enhancement for Redfish: so far Redfish only has support for temperature and some status of the power supplies, which is not much, but we will try to improve that. We are planning to add support for introspection data. You may be thinking, "okay, but normally I only do introspection once on a machine"; in that case it's not interesting, but there are cases where people want to introspect the hardware from time to time, so it would be interesting to have that information, and maybe we'll see if there is value in parsing the data and deriving some metrics from it. We also plan on adding support to generate the alarm definitions that you can pass to Prometheus.
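To tie the demo pieces together, the Prometheus scrape job and an alert definition like the power supply failure one could look roughly like this. The hostname, file names, the rule expression, and the assumption that 0 means a healthy power supply are all illustrative, not taken from the exporter:

```yaml
# prometheus.yml (scrape side)
scrape_configs:
  - job_name: ironic_prometheus_exporter
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /metrics
    static_configs:
      - targets: ['controller.example.com:9608']
rule_files:
  - baremetal_rules.yml
---
# baremetal_rules.yml (alerting side)
groups:
  - name: baremetal
    rules:
      - alert: PowerSupplyFailure
        # Assumes the power status metric reports 0 when healthy.
        expr: baremetal_power_status != 0
        for: 2m
        annotations:
          description: >
            The health status of the power supply for bare metal node
            {{ $labels.node_name }} indicates a failure.
```

With two power supply sensors per machine, a real failure fires two alerts at once, which is exactly what we saw in the demo.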
Some useful links: we have the upstream documentation, which in my opinion is in good shape, so if you have any problem when trying to use it, just go to IRC and talk to us. Bifrost now has support for installing the Ironic Prometheus Exporter, so it's very useful if you just want to give it a try. Metal3 also ships the Ironic Prometheus Exporter in its containers, already running, and some people already use that.

And here we can go to questions and answers. Thank you very much for your time. If you want to get in touch, come chat with us on Freenode in the #openstack-ironic channel; my nickname is iurygregory. Or feel free to send an email to the openstack-discuss list with [ironic] in the subject. Thank you.