Welcome, everyone, to this presentation about how to extend Monasca on the data-gathering side, plus some reporting capabilities. I'm Stefano Canepa. I'm from Italy, so I beg your pardon for my poor English. I live and work in Golui as a cloud solution engineer, doing things with OpenStack, obviously, and with Monasca sometimes, most of the time working on monitoring and logging. And I'm Don Walsh. I'm from Golui, and basically everything he said, only slightly less so.

OK. I know some people in the room already know what Monasca is. So if it's OK, we're going to present a little bit of what Monasca is; if everyone knows what it is, we can skip this part. Anyway, show of hands... So we do need to explain Monasca in detail. OK, we can go.

OK, so Monasca is the monitoring for OpenStack. It's an OpenStack project, and it's built to be completely scalable, high-performance, and fault-tolerant. Being an OpenStack project, it's multi-tenant too. It's based on a microservices architecture, which means that all the services communicate through a central message queue. For all the information, you can go to monasca.io, which is the website for the project. The Monasca team is looking to introduce new features and changes, so we're setting up a survey about your monitoring needs, what you're using, et cetera. If you go to the shortened URL, it would be a great help for us for the future.

The architecture of Monasca is more or less this one; maybe we're missing some pieces. At the center is the message queue, which is Kafka. The biggest piece is the Monasca API, which is the connection to the outside world: it's where clients connect to Monasca to get data or to act on Monasca, to configure things, et cetera. Then there is the Monasca persister. The persister writes all the measurements, all the points, everything that needs to be recorded, to a time-series database. That's a large amount of data, so it's not stored in the configuration database; the configuration database is the usual OpenStack-provided database, the same one all the other projects use. Then there's the Monasca threshold engine, which processes metrics and measurements and determines the alarm states. The notification engine is where, when an alarm changes state, the notification happens. The notification engine can notify the external world in different ways: there are some default plugins, but since it has a plugin architecture, you can write your own plugin and extend the notification engine. The Monasca agent, at the top, is where all the collection is done, so all the data originates from the Monasca agent. Then there are the log APIs, which are not in this diagram and are an enhancement to integrate logging into Monasca. The UI is integrated into Horizon and also uses Grafana to show users the data: the Horizon UI provides interaction with alarm definitions and all the other things, while graphs are provided through Grafana. And then there is a Monasca Kibana plugin that integrates Kibana with Keystone authentication, so that Kibana can be run by a tenant and show only what is related to that tenant, not everything on the system.

Let me introduce some terminology. In Monasca, a metric is what we need to monitor. Dimensions are key-value properties inside the metric; every dimension value helps identify what the metric belongs to: a hostname, a role, a service, or something like that. Then there are events: events are both things generated by Monasca and things Monasca consumes from the OpenStack infrastructure. Measurements are points in time of a metric; a metric is like a class, and a measurement is a single instance of it. Alarm definitions are not alarms: they are the rules that determine whether something that happened on the system is an alarm. And then there is the alarm state. An alarm can have three states: OK, ALARM, and UNDETERMINED. UNDETERMINED looks a little bit strange at first, but think about a VM monitored by Monasca. If the VM is down because it was turned off, that's not an error, so the alarm goes UNDETERMINED. But if the VM is supposed to be running and it's down, that's an error. So the transition is different depending on whether the VM broke down or was simply turned off.
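To make the terminology concrete, here is a rough sketch of a single measurement in the JSON shape the Monasca API accepts; the metric name and dimension values are invented for illustration, and value_meta is optional:

    {
      "name": "storage.cluster.free_space_mb",
      "dimensions": {
        "hostname": "vsa-cluster-01",
        "service": "storage"
      },
      "timestamp": 1420070400000,
      "value": 524288.0,
      "value_meta": {"detail": "free space reported by the cluster"}
    }

Every distinct combination of name and dimensions is its own metric; posting the same name and dimensions again with a new timestamp and value adds another measurement of that metric.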
So, that was an introduction to what Monasca is; now, what did we do to extend it? We started from customer requirements. We're a consulting group, and so we have customers asking us for changes. The first case we're showing you is how to extend Monasca to collect data from a storage cluster, in this case a StoreVirtual VSA, which the customer uses as a Cinder backend. The customer asked us to have more data inside Monasca than what the standard Cinder check provides. Then the same customer asked us to get the alarms into their own alerting system, so we used SNMP traps to integrate with it. We were also asked to generate periodical reports about the status of the OpenStack cloud, which we create by asking Monasca for the data. And then, in reality, two different customers asked us more or less the same thing: integrate Monasca into a Nagios-compatible monitoring system. Then I think it's your part.

OK, the part that I'm going to talk about is that little green block up on the top right there. It's the actual integration with the agent, to provide the monitoring for this storage cluster that we wanted to talk to. Now, in this case we're talking about a storage system, but it could be applied to anything you really want to monitor. So the things we had to do: we had to figure out what data we needed, we had to figure out how to get it, and we had to figure out how to push it into Monasca. Now, the obvious candidate for this is the agent, which is installed on anything that needs to be monitored or that does the monitoring, and because it can be extended with plugins, it's the obvious candidate here. The things we needed, just for the sake of example, were things like the number of viable replicas, whether there are unrecoverable errors, how much free space there was (kind of a no-brainer, really), performance data, stuff like that.

So, speaking about the Monasca agent: like I said, it needs to be installed on anything that either monitors or needs to be monitored. It collects data from two sources. One source is anything that understands StatsD. That's kind of outside the scope of this talk, so we're not going to talk about it in detail right now.
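Just so the StatsD side isn't completely opaque: StatsD is a plain UDP line protocol, so anything that can send a UDP packet can feed the agent's StatsD listener. A minimal sketch, assuming the listener runs on the usual default of localhost:8125 (check your agent configuration):

    import socket

    # StatsD wire format: <metric.name>:<value>|<type>  (g = gauge, c = counter)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(b"myapp.active_sessions:42|g", ("127.0.0.1", 8125))
    sock.close()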
The other source is what they call checks. These are just plugins. Monasca comes with quite a lot of them out of the box, so it's already very flexible, but you can also add your own custom checks. These checks come in a couple of parts, and I'm going to talk about that in a little more detail. As with all of the other components here, these are just Python scripts, so they're pretty easy to extend.

So, a plugin can come in two pieces: the detection part and the check part. The check is the one that always exists, insofar as it is what actually reaches out to the thing being monitored, queries it, and gathers the information for Monasca. A plugin can also optionally have a detection component, which is what essentially configures the check. The detection part can go off and perform initial testing, or just verify that your configuration is OK, or whatever you need, and then it writes a configuration file that the check plugin will use. In terms of when they're actually run: as you can see there, the detection is only run at startup or configuration time. So it runs when the Monasca agent starts, when you reconfigure the agent on a particular node, or when you explicitly tell the agent to configure this particular plugin. The check is then run on a regular basis by a component of the agent called the collector; I'll show you that in a moment.

OK, so in the particular case of our StoreVirtual VSA cluster, we had two different APIs we could speak to. One was HTTP, and one was essentially SSHing into the cluster and requesting information, which we could then get back as XML and parse. We couldn't quite get everything we wanted from one or the other alone, so we needed to combine both. The problem we found was that what we could get from the REST API we could get quickly, and it didn't put any particular load on the system, but it missed some key performance data, which in a way is really the most important part of this. For that, we had to reach out to the so-called SAN/iQ interface and grab information from it, but every request there was slow and demanding on the system, so we needed to be careful about it. In the beginning, we thought we could just roll these checks into one plugin and run it repeatedly, but the problem we found was that we were essentially loading down the cluster with too many repeated requests. The solution we came up with was to split it into two. We could then individually schedule these two check plugins, so that the REST one could run relatively frequently, and the SSH flavor could run on a much slower basis, because, to be honest, that data wasn't needed at the same interval, the same regularity, as what was coming from the REST interface. One of the advantages of the Monasca agent is that you can individually schedule each check: you can set how long to wait between passes and how many passes it should gather before it pushes them off to Monasca, so it's quite useful that way.

So the approach we used was to write a single detection plugin, which read from a master configuration file. Its job was essentially to check that the clusters we'd specified existed, and then to write the configuration files for the two check plugins. So like I said, a single detection plugin is very handy.
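To show the shape of a check plugin, here is a minimal sketch, not our actual StoreVirtual code: the metric name, the instance fields, and the _fetch_free_space_mb helper are invented, and the exact AgentCheck import path may vary slightly between agent versions. A custom check subclasses the agent's AgentCheck class, implements check(), and reports values through methods like self.gauge:

    from monasca_agent.collector.checks import AgentCheck

    class StoreVirtualRestCheck(AgentCheck):
        """Sketch of a check plugin: check() is called once per configured instance."""

        def check(self, instance):
            # 'instance' is one entry from the instances list in this
            # plugin's YAML configuration file (host, credentials, etc.).
            host = instance['host']
            free_mb = self._fetch_free_space_mb(instance)

            # Report a measurement; dimensions identify which cluster it belongs to.
            self.gauge('storage.cluster.free_space_mb', free_mb,
                       dimensions={'hostname': host, 'service': 'storage'})

        def _fetch_free_space_mb(self, instance):
            # A real plugin would call the cluster's REST API here.
            raise NotImplementedError

The file drops into the agent's checks directory (more on the file layout below), alongside a matching YAML configuration file that tells the collector to run it.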
Now, normally, if you're monitoring something with the Monasca agent, you put the agent on the thing you want to monitor. That way, if the thing fails, the agent fails, and it's pretty obvious that something has gone wrong. Unfortunately for us, we weren't in a position to put the Monasca agent on the storage cluster, so we had to improvise and just connect from outside. This brings an extra point of failure, obviously, because the thing doing the monitoring can fail, or the thing being monitored can fail, or both, and we need to be able to figure out which. So, in order to cheat that a little bit, we ran multiple instances. In our two customer cases we had a cluster of monitoring nodes, which we could use for this purpose. So we essentially installed a copy of this plugin on all three of the monitoring nodes in the cluster, and cheated ever so slightly, because ZooKeeper is a component that's used by Monasca anyway, and it already has its own leader-election process. So what we basically did was ask it nicely: are you the leader of this cluster? Yes? OK, if that's the case, run the plugin. If not, skip and wait. That way, we got an essentially redundant cluster for very little cost. It's not an ideal solution, and it's something we want to replace with something a little bit smarter, but for the purposes of our tests it worked just fine. The other thing, as I said: because the two different APIs had different performance overheads, we were able to tune and tweak the configuration in terms of how frequently they were called, in order not to load down the cluster with too many requests but still get information at a usable frequency for the customer.

Right, this is just a very quick walkthrough, a timeline more than anything else, of how this process runs. First, the agent starts up. Second, the detection plugins get run; these can optionally generate configuration files, and as I'll explain in a minute, the configuration files have two important components: they configure the collector, which is what runs the checks, and they configure the checks themselves. The check plugins will then go off and gather whatever information they need, and that all gets collected by the collector, which sounds a bit obvious. Then, on a scheduled interval, the collector pushes that information out to the forwarder, which packages it up and pushes it to the Monasca API so that Monasca as a whole can process it. The API is marked in a little box here because this part of the diagram could belong to a node somewhere beyond, a separate node from the monitoring we're actually doing here; it's just to indicate that separation.

OK, all right, just a quick run through where things are, if you want to go do this yourself. On the nodes themselves, /etc/monasca/agent is where all the agent files live. You can see the path there for the master configuration file and for the folder that contains all the config files these detection plugins generate. It's one per plugin, and it's important that the names match, otherwise Monasca won't pick them up. The only other comment I'd make is that there are two folders there, the checks and detect folders, and that's where the actual plugins live. You just drop your Python files in there, and when you either reconfigure or restart the agent, it'll reach in and start up the necessary plugins as it needs to.

So, I mentioned that the configuration files have two kinds of jobs, and those jobs are to configure the collector and to actually configure the plugins. The configuration files are YAML, and you've basically got two things. You've got a dictionary called init_config which, as you can see there, tells the collector how to run the plugin, in other words how frequently, and how often it should relay the information it has packaged up and stored. The other is a list of instances. Each instance in our case is a storage cluster, but it could be whatever you want to monitor; you provide a dictionary for each of these items that describes what it is, how to get to it, any login credentials, any parameters, whatever else you need in order to perform the checks.
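A minimal sketch of what one of these plugin configuration files might look like; the instance keys are whatever your check expects, and the collect_period scheduling knob in init_config is an assumption to verify against your agent version:

    init_config:
        # How the collector should schedule this plugin.
        collect_period: 300

    instances:
        # One dictionary per thing to monitor; keys are whatever the check reads.
        - name: vsa-cluster-01
          host: 10.0.0.5
          username: monitor
          password: secret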
Now, I think I hand back at this point. We're talking about notification forwarding, Stefano?

OK. So now we switch to the notification engine. We started working with an old version of Monasca, where the notification engine was not plugin-based. So what we were forced to do was work around that and use the webhook; at the time, it was the only flexible way to get data out of Monasca. In practice, the notification engine at the time had three ways of sending notifications: email, which was the one most frequently set up, then the webhook, and then PagerDuty. The webhook interface is basically a POST with some JSON data that is forwarded to a URL configured in the notification engine. So what we developed was a webhook service, the part in the center of this picture. This webhook service is an independent service running on the same control-plane cluster, running in HA using the same infrastructure the control plane uses, HAProxy in our case. This service receives the JSON data, parses it, selects the fields that the customer needs to be present in the SNMP trap, and then generates an SNMP trap. Everything was developed using Twisted, the event-driven Python networking library. The biggest reason is that I knew it from before, and the code to manage all of this comes out really compact. And we provided the customer with a MIB, the management information base for SNMP, so that the SNMP trap receiver can load this file, understand what is in the SNMP trap, and translate it into something reasonable for the user.

This is working; the only drawback of this architecture is that, being two different systems, the notification engine running on its own and the webhook service running on its own, they can get disconnected from each other: the webhook service can go down, and the notification engine can be unable to connect. So we were forced to configure the notification engine to periodically resend all the notifications. This resulted in a bit of a flood of identical traps that the SNMP trap receiver was forced to filter. So this solution works, but it's not ideal.

The new way, if it appears... OK, yes. It's better, because the notification engine itself takes care of retransmission using the message queue: if something goes wrong in sending a notification, the notification engine keeps the data in a queue of things that need to be resent, and the plugin is rescheduled again and again until the notification is really delivered, or until a configurable number of retries is reached, if I remember correctly.
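To give an idea of the shape of that webhook service, here is a minimal Twisted sketch of a receiver for the notification engine's JSON POSTs. The port and the forward_as_snmp_trap helper are invented for illustration; the real service also ran in HA behind HAProxy and built a proper SNMP trap from the selected fields:

    import json

    from twisted.internet import reactor
    from twisted.web import resource, server

    class AlarmWebhook(resource.Resource):
        """Accepts the JSON alarm notifications POSTed by Monasca."""
        isLeaf = True

        def render_POST(self, request):
            alarm = json.loads(request.content.read())
            # A real service would map fields such as the alarm name, state,
            # and severity into an SNMP trap here and send it on.
            forward_as_snmp_trap(alarm)
            return b"{}"

    def forward_as_snmp_trap(alarm):
        # Placeholder for the SNMP trap generation (e.g. with pysnmp).
        print("would send a trap for:", alarm.get("alarm_name"))

    reactor.listenTCP(8765, server.Site(AlarmWebhook()))
    reactor.run()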
And so, in this way, the SNMP receiver can avoid creating filters, you're sure the notifications are sent just once, and you're sure that they are sent. OK. In the meanwhile, I developed the notification plugin too, and it's a little bit different, because more information is available: instead of just mapping the customer's requested fields to the SNMP trap, the MIB now maps everything that is available in the notification data to an SNMP trap, which I think is a better solution.

Then, as we said, we were asked to create periodical reports about the status of the cloud. In this case, we interface with Monasca using the Monasca APIs, and this is what we had to do: take as input a start date, an end date, and the data to be collected. We defined some aggregations of data, agreeing with the customer on the sets of data that needed to be collected, like disk usage or compute outages or something like that. And we developed a Python script that, given these three inputs, collects measurements using the Monasca API, stores them in temporary storage, does its own processing, and then outputs a PDF file with some graphical representations. As a side effect, the temporary storage is a comma-separated-values file that is left to the users for further processing, if they want to import it into a spreadsheet and do whatever other calculations they like.

The integration with Nagios is again done using the Monasca API, because again we need measurements; in this case, we need measurements more or less in real time. So while the statistical report ran periodically, asking the APIs for measurements, this is still a periodical run of a Nagios plugin, but the Nagios plugin runs really frequently, gets the measurements of the collected data through the Monasca API, and, with a little bit of calculation about timestamps, evaluates just the last measurement of a given metric definition. If you're not familiar with Nagios: Nagios checks are literally scripts, anything executable, written in your preferred language, and at the end they need to output a status line with a summary, the measurement and its unit, and performance data at the end, and return the alarm level. What we did was develop two different scripts with an identical interface, where the command line requires a specification of the metric and the dimensions to be collected. So there is quite a lot of configuration, because the customer needs to run exactly the same Nagios check with different command lines. And we developed this in two ways: using the python-monascaclient (I think there is one more dash than needed in that name), which provides a Python library, and also by accessing the REST API directly. We developed that one in Python too, but the second way can be done in any language. The difference in the amount of code is that with the raw REST API you need to write more or less double the code to manage everything, because you need to query the endpoint, get the data back, and parse all of it yourself; it's really faster and easier to use the python-monascaclient. And obviously, with the raw API, you have to manage your own Keystone tokens and everything else; that's an extra step in the process.
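As a sketch of the direct-REST flavor of such a Nagios check: the endpoint, token, metric name, and thresholds here are all illustrative (a real check takes them from the command line and manages its own Keystone token), and the exact response shape should be verified against your Monasca API version. It queries the measurements API, looks at the newest value, and exits with the standard Nagios codes:

    import sys
    import requests

    # Illustrative values; a real check takes these from the command line.
    MONASCA = "http://monasca-api:8070"
    TOKEN = "a-valid-keystone-token"
    NAME = "storage.cluster.free_space_mb"
    WARN, CRIT = 100000.0, 10000.0

    resp = requests.get(
        MONASCA + "/v2.0/metrics/measurements",
        headers={"X-Auth-Token": TOKEN},
        params={"name": NAME,
                "start_time": "2017-01-01T00:00:00Z",
                "merge_metrics": "true"})
    resp.raise_for_status()

    elements = resp.json().get("elements", [])
    if not elements or not elements[0]["measurements"]:
        print("UNKNOWN - no measurements for %s" % NAME)
        sys.exit(3)

    # Each measurement is [timestamp, value, value_meta]; take the newest.
    value = elements[0]["measurements"][-1][1]
    if value <= CRIT:
        print("CRITICAL - %s = %s MB | %s=%s" % (NAME, value, NAME, value))
        sys.exit(2)
    elif value <= WARN:
        print("WARNING - %s = %s MB | %s=%s" % (NAME, value, NAME, value))
        sys.exit(1)
    print("OK - %s = %s MB | %s=%s" % (NAME, value, NAME, value))
    sys.exit(0)

The exit code is what Nagios reads as the alarm level, and the text after the pipe is the performance data.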
OK. So we are more or less at the end of our presentation. And I think... OK.

I suppose, just as a quick overview in case anyone fell asleep: why is this relevant and useful? We talked about four things here. First of all, the agent plugins: as written up front, they extend what Monasca can do in terms of what it can monitor, or how much it can monitor of those things. The forwarder and other notification systems essentially allow you to inform more systems when Monasca has discovered that something has happened that you want to know about. The reporting angle gives you direct access to what your monitoring system thinks is going on, as opposed to some third-party readout later; you can get your information directly from the horse's mouth, so to speak. And the last part, the Nagios integration: considering that Nagios and its clones are everywhere, it's kind of important that we can at least get data out of Monasca and into Nagios, so that people's existing solutions can work well with it. It's all just to make Monasca more useful in general. So I think that's about it, really. Thanks very much. Any questions?

Hi, regarding your integration with Nagios: is the data collected using the Monasca agent, or is it through the Nagios scripts?

The data is... OK, there is a Nagios script. We developed a Nagios check that uses the Python client as a library. So you can import the Monasca client and then use the provided methods to access Monasca from your Python script.

OK, all right, that's good.

That's the idea.

So you're not gathering any telemetry with Ceilometer, it's just done with these checks? And the scripts, is Kafka the message bus there?

No, we just use Monasca and the data available in Monasca. This is especially because the Hewlett Packard Enterprise OpenStack distribution uses Ceilosca, the Monasca interface to Ceilometer, which does not allow alarms to be notified outside. It just provides Ceilometer with statistical data that can then be used for usage reporting and for billing. Alarms and events are not forwarded to Ceilometer in that implementation, which is Ceilosca.

OK. And on some admin channels on IRC, I've noticed people talking about taking data from Monasca and building ML kinds of analytics on it. Is that where you're extending this to?

No, Monasca analytics is quite a bit more complex than the report we created. That is real analytics on Monasca, with rules and algorithms running on the data collected by Monasca; it's not just a report with sums and minimums.

That doesn't seem to be an OpenStack community project right now. Am I wrong about that, or is it going to become something available?

I actually don't know, because I'm not following this part; I'm not directly involved in it, so... anyway.

Good talk, thanks.