Hello, everyone. We're going to cover the Monasca one year later session. With me today I've got a couple of my colleagues: Witek Bedyk from Fujitsu Enabling Software Technologies out of Munich, Germany, and Shinya Kawabata from NEC, based out of Japan. And I'm Roland Hochmuth with Hewlett Packard Enterprise in Fort Collins, Colorado.

We won't be covering a lot of overview or architecture of Monasca today. The agenda really is to talk about what's happened in the past year, so it's a little bit more of an advanced session. If you're not already familiar with Monasca, there are a lot of previous sessions that discuss that, and if you want to know more, please grab one of us after today's session as well. We'll be focused today on going through what we've done with metrics, alarms, notification methods, some UI improvements, enhancements to the Monasca agent, logging, as well as a few other miscellaneous topics.

It's about a year ago that Monasca made it into the OpenStack Big Tent. Since then, in addition to doing a lot of development, we've also gone from single-vendor status to diverse-affiliation status, which means we're recognized for having multiple vendors working on the Monasca project; it's no longer a single-vendor solution. Some statistics I'd just like to cover at a high level, taken from Stackalytics: 31 organizations involved, 97 contributors, 1,075 commits, 4,080 reviews, and a little over 215,000 lines of code.

We're going to have a section that covers each main area within Monasca. This is the overall architecture slide that we show when we talk about Monasca. Highlighted in red here is the Monasca API, and I'll be going through some of the feature enhancements in that area. Most of the enhancements around our API have been about improving performance, mainly for the sake of interactivity from UIs.

The first improvement to talk about is the metrics group-by capability. The problem was that, up until this change, you could only request one unique metric per measurements or statistics query. And when you're displaying metrics in a UI, you're often displaying dashboards that have multiple metrics in them. We knew about this problem for a while, but it started to become more prevalent at Charter Communications, where they had per-tenant dashboards. These tenants had, let's say, 10 to 20 VMs, and they wanted to display a summary view of all the relevant metrics for their VMs. So they might have a graph of CPU utilization for all their VMs, and within that graph there might be 10 to 20 metrics. Previously, that required a single request per metric, and that was leading to some performance concerns. Our goal was to reduce that from one request per metric to one request per graph. We added this group-by capability, and it's about a 20x improvement in performance.

At the command line, the way it looks is the part highlighted in red, group_by with the asterisk there; that's our wildcard for grouping by all metrics. We're in the process of being able to group by specific dimensions in Monasca, so you could say group by service or, probably more relevant in this example, group by device. This query is for the disk space used percent metric for the last minute; that's what the minus one is.
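If you'd rather exercise this from code than from the CLI, here's a minimal sketch of the same group-by query against the REST API. The endpoint URL, the token placeholder, and the exact metric name disk.space_used_perc are assumptions for illustration, not something taken from the slides:

```python
import datetime
import requests

MONASCA_API = "http://127.0.0.1:8070/v2.0"      # hypothetical DevStack endpoint
HEADERS = {"X-Auth-Token": "<keystone-token>"}  # Keystone auth omitted

# Last minute of measurements, one time series per unique metric (group_by=*).
start = (datetime.datetime.utcnow()
         - datetime.timedelta(minutes=1)).strftime("%Y-%m-%dT%H:%M:%SZ")
resp = requests.get(f"{MONASCA_API}/metrics/measurements",
                    headers=HEADERS,
                    params={"name": "disk.space_used_perc",
                            "start_time": start,
                            "group_by": "*"})
resp.raise_for_status()

# One element per unique metric (here, one per "device" dimension value),
# each carrying its own list of [timestamp, value, value_meta] rows.
for series in resp.json()["elements"]:
    print(series["dimensions"], len(series["measurements"]))
```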
And the output is showing, in our DevStack environment, that there are three metrics. They all have the same name, disk space used percent, but they have different devices: one is vagrant, one is vagrant cache, and one is sda. And each one is returning three measurements per metric. So if you think about this in a Grafana dashboard, you can now query once, get multiple time series arrays back, one for each thing, and then display that. That's how that fits.

Okay, so then there's this new metric names resource. When you're navigating metrics, what you normally want to do is go from metric name to dimension name to dimension values, where dimensions are a dictionary of name-value pairs in Monasca. The problem was users wanted to do this, and we had one API that would allow it: the get metrics API. That would certainly tell you all the metrics in the system, but you could have millions of metrics in the system, and then you'd have to do client-side filtering to figure out what the unique or distinct metric names were. This new API allows you to say: just give me the metric names. That was the goal, just give me the metric names in a sorted and paginated list, and that's the API we added. There's a small example of it; the part highlighted in red is the command-line parameter we'd use to get that. We're just showing a partial list here, from CPU frequency up to CPU system percent, but there are probably a few hundred in this list.

Okay, so the next change to our API was to address the other half of this: navigating the dimension names and values. If you have a metric named CPU user percent, it could have dimension names of hostname or service, or resource ID in the case of a VM, and there are, I don't know, 10, 20, 30 common dimension names that you would find in your system. Again, what we had before was just this metrics API, give me all the metrics, and then you could figure out the dimension names, but that was turning out to be the bottleneck in our system. So the goal was to improve that, and the solution for both is up there on the right: we added a dimension names API and a dimension names values API. This next example just shows the dimension names query being used. The part in red is the dimension names list with a metric name of CPU user percent. This is in a DevStack environment and there are only two dimension names in this list, hostname and service; not a very exciting example, but it's simple. I'll show a small code sketch of these queries in a moment.

Okay, moving on from metrics, the next problem that we had was displaying all the alarms in your system. If you wanted to display summary views of the total number of alarms you had, the number in the alarm state and the okay state, as well as their severities, you had to query and get all the alarms in the system and then do client-side processing again to group and order them, and that was proving to be another bottleneck. So the goal was to improve that, and the solution was to add this alarms count resource. It will tell you the count of alarms, and you can filter that based on the alarm names, the state or severity, or metric names and metric dimensions, and you can group them; a common way to group them is by the dimension of service, service being like Nova or Neutron. So this is an example of the command line of doing that.
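As promised, here's a sketch of both the drill-down queries and the new count resource as plain REST GETs. The endpoint, the metric name cpu.user_perc, and the exact response field names are assumptions based on the resources as described in this session:

```python
import requests

MONASCA_API = "http://127.0.0.1:8070/v2.0"      # hypothetical DevStack endpoint
HEADERS = {"X-Auth-Token": "<keystone-token>"}

def get_json(path, **params):
    """GET one Monasca resource and return the decoded JSON body."""
    resp = requests.get(f"{MONASCA_API}/{path}", headers=HEADERS, params=params)
    resp.raise_for_status()
    return resp.json()

# Drill down: distinct metric names -> dimension names -> dimension values.
names = [e["name"] for e in get_json("metrics/names")["elements"]]
dim_names = [e["dimension_name"]
             for e in get_json("metrics/dimensions/names",
                               metric_name="cpu.user_perc")["elements"]]
dim_values = [e["dimension_value"]
              for e in get_json("metrics/dimensions/names/values",
                                metric_name="cpu.user_perc",
                                dimension_name="hostname")["elements"]]
print(names[:3], dim_names, dim_values)

# Summary view: alarm counts grouped server-side instead of fetching every
# alarm and counting client-side; returns columns plus rows of counts.
summary = get_json("alarms/count", group_by="dimension_value,state")
print(summary["columns"], summary["counts"])
```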
I've got a simple alarm count with group by on there, and we're grouping by dimension value and state. In this example there are only five rows in this table being returned. There's one alarm for block storage, it's in the alarm state; one alarm for compute, also in the alarm state. Note that monitoring has 100 alarms, but they're all in the okay state, as opposed to all the other services, which are alarmed. Sorry, Nova and Neutron teams, that was me.

Okay, so how would this ultimately manifest itself in a larger view? This is just an example from our product from HPE, the Helion Ops Console, and you can see we've got some summary views of the alarms: compute has one in the alarm state out of a total of 35, and storage also has one in the alarm state out of a total of 142. Telemetry in this case has just 126 total and none of them are alarmed. That's what that last API would be used for.

All right, let's come to the next component, the threshold engine. The Monasca threshold engine holds the state of all the alarms in the system; the metrics which come into it are evaluated, and if a threshold is exceeded, the alarm state transition message is sent back to Kafka.

Okay, so the first problem we had was that the original aggregation functions in Monasca, like average, minimum, maximum, all work on the evaluation period, which is 60 seconds by default. You always need to wait for a full evaluation period for the alarm to transition, even though the measurements are already there showing that the state of the measured metric has changed. Especially for binary metrics, where you have two states, a zero and a one, like health checks and API status, you would want the alarm to transition immediately, as soon as the new value is there. So we came up with this idea of the new function last, which actually is not an aggregation function; it just takes the last value from the measurements in the system to evaluate the alarm. Here's a short example of how it works: we create the alarm definition using the last function, then send a metric to trigger the alarm, and when we list the alarms, the alarm is almost immediately in the alarm state. We measured that the latency between measuring the value and the notification from the system is approximately 350 microseconds.

The next problem we focused on was metrics generated from logs. They are different in nature from traditional metrics, because traditional metrics come from the agent and the measurements are taken regularly. So the measurement is always there, or, if the system is not there, the measurements are not available and the state of the Monasca alarm is then undetermined. In the case of logs it's different; they're more like event-like metrics. If we have errors in the logs, we get the metrics; if there are no errors, we don't get any metrics, but we don't want the alarm to go to the undetermined state. If there are no errors, we want the alarm to stay in the okay state. So we came up with the concept of deterministic alarms. The only two states possible for these alarms are okay, which is the default, and alarm, if the threshold is exceeded. And here's an example to show this behavior: we create two different alarm definitions, one deterministic and one non-deterministic, then send a metric to trigger the alarms. At first, we see both of the alarms in the alarm state. Then we just wait a couple of minutes, and after listing the alarms again, if there are no new metrics which would trigger the alarm, we get okay for the deterministic alarm and undetermined for the non-deterministic one, that is, the traditional alarm.
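As a sketch of what those definitions could look like through the API: the endpoint and the metric names (api.health_status, log.error.count) are hypothetical, and the placement of the deterministic keyword inside the function follows the expression syntax as described in this session:

```python
import requests

MONASCA_API = "http://127.0.0.1:8070/v2.0"      # hypothetical endpoint
HEADERS = {"X-Auth-Token": "<keystone-token>"}

def create_alarm_definition(name, expression):
    resp = requests.post(f"{MONASCA_API}/alarm-definitions", headers=HEADERS,
                         json={"name": name, "expression": expression})
    resp.raise_for_status()
    return resp.json()

# "last" is not an aggregation: the alarm transitions as soon as the newest
# measurement crosses the threshold, with no wait for an evaluation period.
create_alarm_definition("api-down", "last(api.health_status) > 0")

# A deterministic alarm only moves between OK and ALARM: with no new
# measurements it stays in (or returns to) OK instead of UNDETERMINED,
# which suits sparse, event-like metrics generated from logs.
create_alarm_definition("log-errors",
                        "count(log.error.count, deterministic) > 0")

# The equivalent traditional alarm goes UNDETERMINED once the event-like
# measurements stop arriving.
create_alarm_definition("log-errors-traditional",
                        "count(log.error.count) > 0")
```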
Right, let's come to the next component, monasca-notification. As the name says, it is responsible for sending notifications to the user when an alarm changes state.

The first new feature I want to talk about is Heat autoscaling. Heat autoscaling was originally implemented for integration with Ceilometer, and Ceilometer works differently from Monasca: Ceilometer notifies continuously, all the time, while the threshold is exceeded, whereas Monasca sends the notification only once, when the state transition occurs. And this single-shot notification is not sufficient for Heat autoscaling; Heat does not respond properly to it. To address this, we came up with the idea of periodic notifications. When you define the notification, you can give an additional parameter to say: I want this notification to be repeated as long as the alarm is in the alarm state. There are some limitations on this. The first one is that it's only for webhook notifications, and the second is that the period can only be 60 seconds at the moment. That's totally okay for Heat autoscaling, but for other use cases it could be a problem. So here is just a simple example; you can see this period argument, and with this you get the periodic notification. And here is a nice way to test webhook notifications; it's really useful when trying this out. I'll show a small code sketch of these notification features at the end of this part.

The next feature is notification engine plugins. The notification methods until now were hardcoded; three of them were available: email, PagerDuty, and webhook. But you would definitely like more possibilities. To allow this, we wanted to make notifications pluggable, so that different developers can work independently on implementing plugins, and so that operators can configure which notification types are available in the system. So we developed this pluggable mechanism, and also three notification plugins, for Slack, HipChat, and JIRA. Here you can see the new resource for listing notification types, and I've got a screenshot here of the Slack notification, thanks to Shinya. And now, Shinya, your turn.

I'd like to show you the new Horizon improvements. Like other OpenStack projects, Monasca has a Horizon UI, and in addition, Grafana and Kibana are integrated. This slide lists the bigger improvements to the Monasca UI; I'll explain the new features one by one on the next few slides.

First, compound alarm expressions and a match-by field with auto-complete. Compound alarm expressions are great: they let you join several expressions with Boolean operators, so you can specify a complex expression without knowing the expression syntax, for example, CPU usage is high and memory usage is high. The match-by parameter is useful but a little bit difficult to understand; it is used to determine unique alarms. To make it easier to fill in, we changed the match-by field to auto-complete.

Next, support for setting notification methods on all alarm actions. By default, a notification is always sent whenever the alarm state changes, for example OK to alarm, or alarm to OK. But you may not need a notification when the transition is alarm to OK. If so, you can now select with a checkbox which transitions trigger a notification.

Next, translation support. The Horizon Monasca UI plugin supports translation; the same as other Horizon plugins, it is integrated with Zanata.
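Here's the promised sketch, pulling together the periodic webhook and the per-transition actions described above. The endpoint, token, metric name, and the Heat signal URL are all assumed for illustration:

```python
import requests

MONASCA_API = "http://127.0.0.1:8070/v2.0"      # hypothetical endpoint
HEADERS = {"X-Auth-Token": "<keystone-token>"}

# A webhook that repeats every 60 seconds for as long as the alarm stays in
# the ALARM state. Per the limitations above: WEBHOOK only, period of 60 only.
resp = requests.post(f"{MONASCA_API}/notification-methods", headers=HEADERS,
                     json={"name": "scale-out",
                           "type": "WEBHOOK",
                           "address": "http://heat.example.com:8000/signal",
                           "period": 60})
resp.raise_for_status()
webhook_id = resp.json()["id"]

# Attach it only to alarm_actions: Heat gets signalled repeatedly while the
# alarm is firing, but nothing is sent on the ALARM -> OK transition.
resp = requests.post(f"{MONASCA_API}/alarm-definitions", headers=HEADERS,
                     json={"name": "cpu-high",
                           "expression": "avg(cpu.user_perc) > 80",
                           "alarm_actions": [webhook_id],
                           "ok_actions": []})
resp.raise_for_status()
```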
If you are interested in translation to your language, please visit Zanata. No programming knowledge or development environment is necessary; you just need to sign up with Zanata. Zanata is a very sophisticated translation system, so you can easily translate the words.

Next, Grafana 3 support. The Grafana integration is amazing: you can see metrics visually, and you can easily understand the trends and spikes in your metrics. Charter Communications has contributed greatly to the Grafana integration. The Monasca data source and Keystone authentication with Grafana 3 are available in the Newton release. This image shows an example of a Grafana 3 dashboard being used by Charter Communications in their production environment. All the built-in graph types are available for visualizing Monasca metrics.

Next, I'd like to give you a metrics database status update. The metrics database is an important piece of Monasca: all metrics are stored in it. Currently, Monasca has two choices, InfluxDB and Vertica. InfluxDB's clustering became closed source as of version 0.12; if you want clustering, you can buy a license. HPE Vertica is the best choice now: the community edition is free, you can build a cluster of up to three nodes, and it can store up to one terabyte. If you have more than one terabyte of data, you can buy a production license. The community edition of Vertica can apply to many cases; it supports compression, so you can actually store more than one terabyte of raw data. But we think we have to support a completely free choice, so we are planning to develop a new metrics database which has free clustering and no capacity limit. We hope we can introduce the new metrics database in the next release, and we have a design session for the metrics database at the summit on Thursday morning.

So I'm going to talk about some changes that have been made to the Monasca agent, just going back to this architecture diagram, shown in the upper right here. We do have a Python Monasca agent. It's optional, of course; you can use the HTTP API completely without the agent, but the usual case is to deploy the agent as well.

One thing that we started to see as a problem was that the sheer amount of metrics we were collecting in the agent was proving to be another bottleneck. We were collecting metrics for many VMs, or many APIs, or Open vSwitch, and that collection was serialized to a large extent. So we had to parallelize it, and we're using the Python multiprocessing module to start all our collectors in parallel, which removes that serialization bottleneck. In addition, we're able to specify different collection intervals for each collector. So if you want to go faster in some cases than the default 30 seconds we use, you can do that. Normally, though, you want to go slower: for example, for capacity-type metrics in your system, metrics that don't need updates every 30 seconds, you can specify something like a 10-minute interval, which is what we typically use for some of those capacity-type metrics. There's a small sketch of this approach coming up.

We've also multi-threaded some of the collection plugins internally; some of them, even after being parallelized at the collector level, still take a long time. An example would be HTTP status checks: if an API is down, it might take a long time to respond, and if you're trying to do several hundred checks and they're serialized, that could be a problem. So that's all parallelized now. We've done that in some other plugins as well.
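To make the parallel-collector idea concrete, here is a small sketch of the approach using Python's multiprocessing module. This is not the actual monasca-agent code, just an illustration with made-up collector functions and intervals:

```python
import multiprocessing
import time

def collect_cpu():
    print("cpu metrics collected")          # hypothetical fast collector

def collect_disk_capacity():
    print("capacity metrics collected")     # hypothetical slow collector

def run_collector(collect, interval):
    """Run one collector forever on its own schedule."""
    while True:
        started = time.time()
        collect()
        # Sleep out the remainder of the interval; slow collectors no longer
        # delay fast ones, because each runs in its own process.
        time.sleep(max(0.0, interval - (time.time() - started)))

if __name__ == "__main__":
    collectors = [(collect_cpu, 30),             # default 30 s interval
                  (collect_disk_capacity, 600)]  # capacity metrics: 10 min
    procs = [multiprocessing.Process(target=run_collector, args=c, daemon=True)
             for c in collectors]
    for p in procs:
        p.start()
    time.sleep(65)   # let the sketch run for a bit, then exit
```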
We've enhanced some of the VM checks. One that's new is a ping check. We've always had a check that uses libvirt for getting the status of a VM; of course, that's not completely thorough. A VM can be up and running, but the OS could have crashed or panicked. The libvirt check won't tell you that, but the ping check will. We've added Windows support, this is more in progress actually, and along with that Hyper-V support, which was the motivation for the Windows support. We've added support for Open vSwitch. SolidFire has added support. And we've been spending a little more time looking at things like Docker and Kubernetes, and Mesos isn't on here, but we've been looking at those technologies and adding support for them in the agent.

All right. That's you? Oh, that's me. I thought I had a chance to leave. All right, let me go back here. Okay, so I'll talk to you about the two other new components in Monasca, on the same architecture slide: we have a transform engine and an analytics engine. The transform engine consumes metrics just like the persister or the threshold engine, does some processing on them, and publishes metrics back to our message queue, Kafka. The analytics engine consumes metrics or alarm state transition events and does processing with them.

Okay, so let me talk about the transform engine. This is a new microservice in Monasca. Its use cases right now have been mainly focused on aggregating metrics for the purposes of calculating object storage capacity metrics, compute host capacity metrics, or VM capacity, and there's more to come. What do I mean by that? If you take metrics from a number of compute nodes in your environment, or a number of Swift nodes, and you want to roll them up into some single value, like calculating the total amount of CPU used per tenant, or calculating the total number of object accesses or bytes read or bytes received in Swift per tenant, you can do that. The per-tenant part is the new capability here, and it's not restricted just to per tenant; the language is pretty flexible, so you can specify your own aggregations. We've been focused on these capacity-type metrics right now. Currently, our metrics are aggregated and published every hour. Actually, the aggregations are done semi-continuously, more frequently than every hour, but they're only published once an hour. This is currently deployed in our newly released Helion OpenStack 4.0, so its status is production as of today. There's the repository where you can find additional information on it, and there's also a very nice wiki that describes it in more detail.

Monasca Analytics: unlike the transform engine, the analytics engine is really under development, so don't try to take this and use it today. But I would like to talk about it just to let you know what's going on there. This is primarily there to enable more advanced algorithms and do more advanced analytics. The focus is currently on anomaly detection, and on reducing alarm fatigue via alarm clustering algorithms: taking unsupervised machine learning algorithms specifically, clustering alarms together to group them into something like a single incident, and then exposing that to the operator instead of each alarm that occurs. So instead of having hundreds or thousands of alarms, if they are all similar, hopefully you can group them together and present that to the user. There are a couple of algorithms that we've been using: one-class support vector machines, and LiNGAM, linear non-Gaussian acyclic models, I believe; I can't remember that one every time I see it. But okay, there is a repo out there; please take a look at that if you're interested in learning more.
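Since the analytics engine itself is still under development, the following is only an illustration of the first technique named, a one-class support vector machine flagging anomalous metric windows, using scikit-learn rather than anything from the monasca-analytics repo:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Train on "normal" behavior only: each row is a hypothetical window of
# three metric values, e.g. [cpu, mem, io].
rng = np.random.default_rng(0)
normal = rng.normal(50.0, 5.0, size=(200, 3))
model = OneClassSVM(nu=0.05, gamma="scale").fit(normal)

# Score new windows: +1 means consistent with the training data,
# -1 means anomalous and worth surfacing to the operator.
new_windows = np.array([[51.0, 49.0, 52.0],    # looks normal
                        [95.0, 97.0, 99.0]])   # clearly anomalous
print(model.predict(new_windows))              # expect [ 1 -1]
```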
All right, turning it over to you, Witek. Thank you. So apart from monitoring, Monasca also does logging. As you can see, the architecture is very similar to the monitoring architecture: we have the agents, which push the logs to the API, then we have the message queue and the components which process the logs. The new and interesting one is log-metrics. It generates metrics based on the log entries, for example for error messages, and these metrics can then be evaluated by the threshold engine, which you can use for alerting. So it's a basic implementation of complex event processing: combining different data sources and creating alerting on top of them. As the database we use Elasticsearch, and as the dashboard, Kibana. In Kibana we have our own plugin for authentication, and we are working on a multi-tenancy plugin for Kibana.

So this is the list of improvements during the last two release cycles. We have completely reworked the API and achieved significant performance improvements with that. We also updated the agents; we have Beaver and Logstash agents working with the new API. Then there is the new log-metrics component, which allows you to create alarms on logs. And the work in progress is multi-tenancy. The log API is deployed in the HPE Helion OpenStack distribution, and logging support is also included in Fujitsu's product, ServerView Cloud Monitoring Manager.

The deployments of Monasca: here's a list of the most important ones we are aware of. The first one is definitely Charter Communications, which uses Monasca for monitoring their production private cloud, with two data centers and up to 700 compute nodes. They're really testing Monasca in real-life conditions and bringing up many issues which we then work on, so it's really great. Then there is FIWARE Lab; it is a multi-region OpenStack-based cloud here in Europe and also in South America. They use the Monasca and Ceilometer agents to monitor their OpenStack, and they gave a presentation last time, in Austin. Then of course the HPE Helion OpenStack distribution, and, as I said, Fujitsu's product as well. NEC plans to add Monasca to their cloud solution product menu.

Integrations with other projects: I mentioned Heat autoscaling already; the enhanced periodic notifications allow the integration with Heat, and it's completely functional. Then we have the integration with Ceilometer: the monasca-ceilometer project publishes the Ceilometer measurements to the Monasca API and uses Monasca as the data storage for Ceilometer data. Then we have work in progress with a couple of projects like Congress, Vitrage, and Watcher. There is an integration with BroadView, which adds support for physical switch monitoring of Broadcom switches. And lately there is an effort to develop an Ansible installer as part of the OpenStack-Ansible project, which I hope will develop well.

Third-party tool integrations: we have the log agents, Logstash and Beaver; here are the references to the pending pull requests to the upstream repositories.
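Since those log agents all talk to the same reworked log API, here's a sketch of what shipping a batch of log entries might look like. The port, endpoint path, and payload shape are assumptions based on the description in this session, not a verified contract:

```python
import requests

LOG_API = "http://127.0.0.1:5607/v3.0"          # hypothetical DevStack endpoint
HEADERS = {"X-Auth-Token": "<keystone-token>"}

# Ship a small batch of log entries: global dimensions apply to every entry,
# per-entry dimensions refine them. A component like log-metrics could then
# turn the ERROR entry into a metric for the threshold engine to alarm on.
payload = {
    "dimensions": {"hostname": "compute-01"},
    "logs": [
        {"message": "nova-compute started",
         "dimensions": {"service": "compute"}},
        {"message": "ERROR: connection to rabbitmq lost",
         "dimensions": {"service": "compute", "log_level": "ERROR"}},
    ],
}
resp = requests.post(f"{LOG_API}/logs", headers=HEADERS, json=payload)
resp.raise_for_status()
```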
We also have the integration with Heapster: there is a Monasca sink which sends the Kubernetes metrics to the Monasca API. And there are the Grafana and Kibana integrations for the dashboards for metrics and logs.

So we are coming to the end of the presentation. We would like to thank everyone who has contributed to the project and who has helped us. The list is for sure not complete; we apologize if someone is missing. We would like to thank everyone. There are two more dedicated Monasca sessions tomorrow. The first one is the Monasca Bootcamp, which is a hands-on workshop. Good luck, guys. So please visit; you will be able to play with a ready environment, install Monasca, and try it out. And the second one is about monitoring Kubernetes with Monasca. Thank you. That's all. So we have four more minutes, I think. So if there are any questions... I don't know where the mic is.

So the question was: have we rewritten the threshold engine in Python? The answer is no, we haven't done that yet, and we're not currently working on that. As some of you might know, many of the components started out written in Java. All of them are available in Python today, except for the threshold engine. And the problem there is that it depends on one of these stream processing technologies, and the one we currently use is Apache Storm, but the support for Python in Storm isn't very good. We do have some new components that are using Apache Spark Streaming, and so we'll have to evaluate that in the future.

Yes, I would say so. Yeah, so the question was: does switching to Python mean potentially switching the streaming technology that we're currently using? And although I haven't done a real evaluation of the Python support in Apache Storm, everything leads me to believe that, yes, we would have to do that. And so that's why it's taking a while.

Yeah, so the question was related to the threshold engine again. Monasca is really good at creating alarms when VMs are created; that's the way alarm definitions work in Monasca, they're kind of templates for creating alarms. So we create these alarms, but then later on, when the VM is destroyed, the alarms transition to the undetermined state. Do we have anything to resolve that? Currently, what we do at HPE is deploy our own scripts as daemons that delete the alarms for VMs that have been deleted (there's a small sketch of that idea at the end). That's similar to what Charter Communications has done, and I think most people are going down that path. I don't have any plans right now to address that with something that's architected or designed into Monasca; we would love to do that, we just haven't had the resources to address that particular area yet. But was the question actually about deleting the alarms, or about alerting when the machine goes down? Yeah, one of the ways we wanted to address that was adding support for events in Monasca, but we haven't done that yet.

Any other questions? I think we're pretty much out of time anyway. All right, well, thank you everyone for attending. Grab any one of us if you need or want to follow up on Monasca-related questions. Thank you.
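As a postscript to that last question, here is a minimal, hypothetical sketch of the kind of cleanup daemon described: it deletes alarms whose metrics only reference VMs that no longer exist. The helper returning live VM ids is a stub you would back with the Nova API, and the endpoint and field names are assumptions:

```python
import requests

MONASCA_API = "http://127.0.0.1:8070/v2.0"      # hypothetical endpoint
HEADERS = {"X-Auth-Token": "<keystone-token>"}

def existing_server_ids():
    # Stub: in a real daemon this set would come from the Nova API,
    # e.g. via python-novaclient's server listing.
    return {"11111111-2222-3333-4444-555555555555"}

live = existing_server_ids()
resp = requests.get(f"{MONASCA_API}/alarms", headers=HEADERS)
resp.raise_for_status()

for alarm in resp.json()["elements"]:
    # Collect the VM ids this alarm's metrics refer to, if any.
    resource_ids = {m["dimensions"].get("resource_id")
                    for m in alarm["metrics"]} - {None}
    # Delete alarms that only reference VMs which no longer exist.
    if resource_ids and not resource_ids & live:
        requests.delete(f"{MONASCA_API}/alarms/{alarm['id']}",
                        headers=HEADERS).raise_for_status()
```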