Okay, so this is the Monasca Bootcamp. I guess this is one of the first sessions of the OpenStack conference. Who went to the keynote this morning, by the way? Yeah, how was it? Awesome.

So this bootcamp is really a hands-on lab, so I'm not going to be doing a whole lot of overview on Monasca today. In terms of what it is and the architecture, I'll do a couple of introductory slides, and then we'll do more of the coding. This is all driven by an IPython or Jupyter notebook, if people are familiar with that. You can go to that GitHub address there; there's a repo, Monasca Bootcamp, and then there's an IPython file. We're not going to be doing a hands-on session where I expect you to do a lot of coding. I'm going to be the hands today; I'm going to drive you through this whole thing. There's a lot of material. I really encourage you to take it offline: get DevStack and the Monasca plugin installed and then run through this offline. If you run into any issues, please ask questions in our IRC room, or, you know, there's a couple of contacts there.

So I'm Roland Hochmuth. I work for Hewlett Packard Enterprise. There's my email address. My colleague and co-worker Michael Hoppal is also on the presentation. He's not here today, but he actually did all the submission work for the paper and worked on the notebook with me. So this will really be an experiment. I've never done this big of a deal with IPython before, but we'll give it a try.

There are also a number of folks attending the summit here. I don't know, if I say your name, stand up. We've got Deklan Dieterly, who works with me, also at Hewlett Packard; Brad Klein and Ryan Bak from TWC.
I guess they're over there. Sid Logan from Broadcom has been doing a lot of work on network monitoring of Broadcom switches in a project called Broadview. Witek Bedyk from Fujitsu: I'll be doing a paper session later today on logging using Monasca. Unfortunately, he's stuck in Chicago, so he's going to be a day late, but he should be here, hopefully arriving at one o'clock, or else I'll be doing the whole presentation on my own, which will be an experience. Okay. So we've got Koji Nakazono from Fujitsu, Shinya Kawabata from NEC, Fabio Giannetti from Cisco, and if I missed your name, I'm sorry. I tried to remember all this off the top of my head, but hopefully I covered most of the folks. They're all here at the summit. If you've got questions and want to follow up with folks, please feel free to reach out to them. They're often in the IRC room as well and can help you there.

Okay. So as I mentioned, this is an IPython Jupyter notebook that you can run on your own. It's at that GitHub site that I listed earlier. After you install IPython, you'll want to install DevStack using the Monasca DevStack plugin, and if you follow that link, you'll see some directions on doing that. It takes around half an hour or so to actually get it all installed and up and running. That's why I didn't want to do this today, so I thought I'd just drive you through the notebook. And then there's a bunch of Python libraries I'm using in this presentation that are really independent of Monasca, but I use them for displaying graphs, querying things, and other stuff. We'll cover that real quickly. So this is the agenda.
I'll do a quick architecture overview of Monasca, just so you're familiar with it at that level. By the way, who is familiar with Monasca? Okay. So I'll be covering that architecture overview very briefly. There are presentations from previous summits; I encourage you to take a look at the videos if you want more information on that. Also, if you really want a more in-depth session while you're out here in Austin, send me an email and we can get some time together and go through it.

Okay, so the bulk of the session is really on the API and CLI. I don't know if we'll cover too much of the agent hands-on. We'll get a little bit into, if you're a developer, how to get started and where to go, current status and what's next, and then the grand finale will be the Horizon and Grafana 2 demo. And I really do have to be careful on time here, because there's a lot of material. The notebook is a lot bigger than an hour and a half, I think. I've never done the presentation, so we'll find out.

Okay, so what is Monasca? Monasca is monitoring as a service at scale. My friend Deklan Dieterly over there thought of this incredible name, which he loves to remind me about. So basically it's monitoring as a service for OpenStack, and it's based on a first-class RESTful API. By first class, I mean all the operations that you would do for the monitoring service are done via the API. In normal monitoring systems there are usually proprietary protocols or other protocols; ours is based on our HTTP RESTful API. It's highly performant, scalable, and fault-tolerant.
We spend an awful lot of time trying to do that. It supports authentication via the Keystone OpenStack service, and then multi-tenancy: everything in the service is scoped to a tenant or project ID. We store metrics that are usually generated by a monitoring agent (we have our own optional Python monitoring agent), and then you can query those measurements and statistics back out of the system and display them or look at them somehow.

There are multiple ways to slice and dice the metrics and alarms based on filters and sorting options. I don't think that read as nicely when I was creating that slide. So, using our metric naming mechanisms, a metric name and dimensions, which we'll get to, there are multiple ways to slice and dice the data, like by region or hostname, and that's all very configurable.

You can create alarms and receive notifications on those alarms. So you can do status and health monitoring. For example, if you want to monitor HTTP status, we have a plugin to do that, and in that case the metric would be binary data, zero or one, and then you can create an alarm on that which will tell you if your API or your service is down. Also, everything's a metric in terms of the monitoring. So if you want to monitor CPU, you can create an alarm off of a threshold, like CPU utilization greater than 80%, and then alarm on that and have it send you a notification. Notifications we'll get into later, but they can be email, PagerDuty, or webhooks.

It's a very extensible platform. It's designed around, in part, a technology called Kafka. Kafka is a highly performant, scalable, distributed, durable message-queuing technology that originally came out of LinkedIn. Over the past year or two it's really been adopted by the big data community, and so we use that within Monasca.
That's an excellent technology. What that allows you to do is tap into our message queue if you've got additional components that you want to add, or, via the REST API of course, you can query the API and add your own products externally to the system.

And it's designed to consolidate monitoring for both the DevOps and the monitoring-as-a-service use cases. When a lot of people think about monitoring, they're thinking about monitoring their physical infrastructure, using Nagios or Ganglia or Zabbix or any one of the number of tools out there for doing that. Monasca is monitoring as a service, and since it's a multi-tenant system, we use it for both operational monitoring and monitoring-as-a-service use cases, monitoring as a service being things like Amazon AWS CloudWatch or Datadog. They're here at the conference, by the way, I believe presenting. There are a lot of monitoring-as-a-service companies out there; New Relic is another one, Librato. Basically the way they operate is you have an HTTP endpoint that you can publish metrics to, query to get data out, create alarms in some cases, and do other analysis. That's what Monasca is also about.

It turns out that if you can do monitoring as a service, you can also do your operational monitoring, and that's a really good thing, because if you have to deploy one system to do the internal monitoring of your physical infrastructure and a separate system to do monitoring as a service, that's two systems, not one. I did DevOps for the HP public cloud when that was still around, and we had three systems there. We had the monitoring service, we had Nagios, and we had our metrics processing system. That was three systems. That was a lot of operational overhead. Now we can do that.
We're basically at one, and that's in part why this system is here today.

Okay, so this is the overall architecture for Monasca. We're not going to spend a lot of time on this today, but I did want to show it to you. So we have our REST API; it's the horizontal bar in the middle there. Up in the upper right is our agent. That's optional. You can use other agents as well, or send data directly from your application via the API. Data is posted to our API and then it's published to the blue box in the middle, Kafka, which is our central component. I like to think of this as a microservices message bus architecture. These three components down there, the persister, the threshold engine, and the notification engine, consume messages off of that queue and they also publish messages back.

So the persister is basically consuming all the metrics off of the message queue and sending them to the metrics and alarms database. Our metrics and alarms database today is either InfluxDB, which is an open-source database, or Vertica, which is a proprietary database from HP. We've been looking at supporting other databases as well. So that's the job of the persister. It also stores our alarm history. The threshold engine creates alarms, but the alarm history is also stored into the database.
So we have all that history of all our alarms. You can go look at that later on and see what happened, and do root cause analysis after an incident has occurred.

Okay, so the threshold engine. That is written using a technology called Apache Storm. Apache Storm is a computational engine where you describe a graph. They call them bolts and spouts, but basically they're inputs, outputs, and computational nodes, and you describe this topology and you can do really cool stuff with it. It was developed by Twitter for the real-time streaming analytics that they do on all the tweets that are being generated. We use that for evaluating our alarms. When we receive metrics, we look at each metric and say, okay, is this above a threshold or below a threshold? We do moving-window average calculations on all that data, or other statistics, and then if the state has transitioned, we output an alarm state transition event, and that's what's actually stored in our database to the right, the metrics and alarms database.

Okay, so the threshold engine then publishes this alarm state transition event, and that goes to the notification engine, on the lower left there. The notification engine receives those messages, consumes them, looks at them, and decides what to do with them. If that alarm state transition happens to match a notification, like an email or PagerDuty notification, then it'll send that out. That engine also handles the case where the message isn't successfully sent. So if you do an email and your SMTP server is down, then it will put it on a retry queue using Kafka, and then it'll try again a minute or two later.

The lower box on the left there is our config database. That can be MySQL or PostgreSQL. That's basically storing all the configuration information for the system, for example, all the alarm definitions and alarms
that are in the system, as well as the notification methods. So that's not the streaming content. That's a much smaller amount of content, and the difference between the two is that this content is being updated. It's being created; it goes through a CRUD lifecycle, so you create it, read it, update it, and delete it. It's not this big data stream with a lot of velocity to it. It's much more transactional in nature. Whereas our metrics and alarms database, that's the high-speed streaming metrics content that's flowing into this thing, which you want to be able to query and get information out of really fast.

Okay, let's see what else. We have integration with Horizon, up in the upper left there. From Horizon you can visualize, well, you can interact and do all the CRUD operations on the API. Not shown in the diagram here is that we also support Grafana 2. That was developed at Time Warner Cable and recently added. We've supported Grafana 1 for about two years now, but Grafana 2 was added a few months ago. And just like every other project within OpenStack, we have our own Python Monasca client, which you can use to query the API and do things with, and that client also has a built-in Python library that you can import into your Python code and use to interact with the API as well. I'll be showing lots of examples of that if I ever get through my architecture slide.

Okay, so that is, at a high level, the architecture, and I'll cover a few more things. There's a picture, one picture anyway, of Horizon. I'll do a demo of that later. You can see on the left there the monitoring panel, and there are these four subpanels: overview, alarm definitions, alarms, and notifications, those being the primary resources in the API. You can interact with that and create alarm definitions, read alarms, create notification methods, et cetera.

That's a screenshot of Grafana 2. Not a terribly exciting screenshot, but I just wanted to throw that at you and show you what you can do. Grafana 2 is
showing the CPU user percent here over, I don't know, half an hour or an hour's worth of data. This isn't completely integrated into Grafana 2 yet, and I'm not sure it will be, but the last update that I saw from the project lead on the Grafana project a few days ago is that they will be adding this into Grafana 3.0. There are some pull requests up there, and if you're interested in using that code you'll have to do a little bit of work on your own. That is documented, and Ryan Bak is in the audience, and he's the guy to talk to about that if you are experiencing problems.

One of the other things that we're doing in Monasca is logging as a service, so we are in the process of adding a logging API. Currently we have an API that you can post log messages to. There is a repo out there, there's a spec, and there's a presentation later today at 5:15 that'll go into a lot of detail on that. So if you're interested in that topic, that'd be worth attending. That was also covered at the Tokyo Summit, so we have some updates on that.

Okay, there are also integrations with a lot of other projects. You know, we're still a relatively new project, but we have integrations that have been going on. So Ceilometer is out there, and one of the things Ceilometer does is collect metrics for OpenStack resources. We've integrated with that, so you can send the metrics that are collected into Monasca and store them there. We've also built Monasca as a storage driver into Ceilometer, so basically the Ceilometer API can sit on top of Monasca and query all that Ceilometer data that's been stored in there. That's in a very good state, and it will be going into production here this week; HP will have some announcements on that. So that work is basically being used by HP in their Helion distribution, and there are others also that are looking at
that. There was a presentation at Tokyo if you want to find out more about that, and there's a nice repo.

Heat auto-scaling: we also are supporting that today. There was a presentation at Tokyo, and there's a presentation, well, it's not purely on Monasca really, but I think there'll be some coverage of what's going on around Heat and, to some extent, Monasca auto-scaling at the Austin summit too. But the Tokyo presentation was really good, and you can check out that video if you want.

I mentioned this earlier, but Broadview is a new project that is up in the OpenStack organization. There are actually three repos there. What that project allows you to do is physical network switch monitoring, and we're looking at doing other things with it, like Neutron monitoring and OVS-DPDK monitoring, or virtual switch monitoring, et cetera. We're going to have some more discussions on that on Wednesday morning at around 11 o'clock in one of the Monasca sessions. Sid Logan, who's right there, can tell you more about his project if you are interested in it.

And there's a session on Congress, "Enforcing application SLAs with Congress and Monasca." That's using Congress to do policy management, but using Monasca to trigger alarms and tell you when certain thresholds have been exceeded, and then Congress takes over from there and applies their policy engine framework to take actions.

Vitrage is a new project. There's a couple of sessions, and we're starting to look at some integrations there. We'll be talking a little bit more with them while they're here. And Neutron, as I mentioned: we'll be talking to that team on Thursday. Networking is a big area of discussion. Okay, how am I doing on time?
Terrible. All right, so this page here imports some libraries that are used throughout this presentation. You've got to initialize the Keystone and Monasca clients, so obviously supply your Keystone URL, project name, username, and password, and then initialize the Keystone and Monasca clients. Not terribly interesting, so I'll skip over it. We use Plotly for graphing in this IPython notebook, and we use Spur for running remote commands. Plotly is a really cool graphing library. There are other ones that are used in the IPython community, like Matplotlib, but I like Plotly.

Okay, using the API. Finally, we're there. I hope I don't bore any of you to death here by going through all this. We're going to go through this API. We'll go through pretty much all the resources and have little demos of them. So those are the resources: we've got a versions resource, metrics, measurements, statistics, a metrics names resource, notification methods, alarm definitions, alarms, an alarms count resource, and alarm state history.

But before we do that, let's review a couple of common concepts that are important. Roles are used to control access to the API. There are three roles in the system: a user, an agent, and a delegate role. The user role allows client access to all the CRUD operations on the API. The agent role allows you to post metrics only to the API. Obviously the agent could be deployed on lots of systems in your environment, so, just to reduce the security threat, or rather the security attack vector, that was the word, the agent can only post to the API.
It can't query or create things in the API. And then there's a delegate role. The delegate role is used by our agent to publish metrics from one account to another, and we use that in our physical infrastructure monitoring when we're actually monitoring VMs or OpenStack resources. If you have a VM that you're monitoring, our agent is running on our physical hardware and we're getting metrics for that VM, which can belong to many tenants. If you had 40 VMs running on a physical compute node, those VMs could be owned by multiple tenants, so we can publish the metrics to those tenants as well, and the tenant can access metrics about their VMs.

Okay, and then pagination. Pagination is really important when you have lots of data to deal with; you need to page through it. We have a limit in our API today of returning 10,000 elements. If you want to set that limit yourself and deploy with a larger paging limit, you can do that, but it's meant primarily to prevent the API or the database from getting extremely large queries and failing. So the limit, as I said, is 10,000, and when you get a response back you should check whether it has 10,000 elements, or, if you supplied the limit as a query parameter, whether the count is equal to that value. If it is, then you should start to page through the results, and there are offset and limit parameters for that. This is a technique that's used in other OpenStack APIs; using the offset and limit parameters you can advance through the elements in your result set.

So here is an example. First, let's get some measurements using a limit of six. Right here, we've done a command line. I'll be doing a little bit more of this later, but we've done a command line.
We've queried for CPU user percent and we've got six measurements that were returned. Okay, so now let's go page through this. If we do this again, the same query with a limit of two, we get those two values, and what we're going to use as the offset in the next query is the timestamp of the second, that is, the last, element in that query. So if you look at that timestamp there, that will become the offset on the next query. Ooh. That wasn't what I expected. Well, you can't see it, but it's there. Sorry about that; the screen resolution changes caused a couple of problems here. So anyway, if you're interested in pagination, there's a wonderful example for you to go through.

Okay, so metrics. This is one of our resources, and you can GET and POST to metrics. A metric consists of a name, which is a string, and there are a couple of conventions that we normally follow when naming things. Usually it starts out with a group, like cpu or mysql or rabbitmq, and it's all lowercase. Then we have a dot, and a dot delimits groups. Then we have some additional information after that dot, like user_perc, or, if the group was kafka, consumer_lag. It might also have a suffix on it with some units; if you were looking at network in-bytes or out-bytes, we'd have something like _bytes.

Dimensions are really important. Dimensions are a dictionary of key-value pairs, and dimensions in combination with the metric name help you uniquely identify a metric. So if you've got some metric, say CPU user percent, but that's reported for a thousand hosts, you would typically have a dimension called hostname, and the value of that key would be hostname one, hostname two, hostname three, and so on. We have conventions, of course, that we follow for dimensions, but you're welcome to create your own dimensions with the API and come up with whatever naming conventions you want. We use things like region, zone, mount point, device;
there's a bunch of these dimensions out there.

So you have a timestamp; that's a part of a metric. You have a value; the value is a float. Then there's this other thing called value meta that looks like dimensions, but it isn't. It is a dictionary of key-value pairs, but value meta isn't there to uniquely identify your metric. It's there to tell you additional information about your metric. For example, our agent allows you to run Nagios plugins, and when a Nagios plugin runs, it reports a status (OK, warning, critical, or unknown) and a message. That message has useful information that you might want to supply with your metrics, so you can now go look at your metrics and look at the value meta. So if you were monitoring a service and that service went down, and you were doing that via an HTTP status metric, you could provide the status code returned. It could be a 400 or 500 error, and a message might be in that metric as well. And for various requests you can also supply a tenant ID, which allows you to do the delegation I talked about.

There's a slide here on dimensions, which I've kind of described already. Dimensions are really important because this is what you're going to use to slice and dice your data. You're going to say, well, give me the metrics for that host, or give me the metrics for that region. So you can use them to limit the data to what you're really interested in looking at. And those are some other examples. I left off resource ID, being the ID of the OpenStack resource that you're dealing with; that's supplied in the case of VMs.

Okay, so this is an example of a metrics request body. The name of this metric is (we should have an underscore there) http_status, and we've got some dimensions. It has a URL in this case, a region, a zone, and a service. So this would be maybe testing the Nova API.
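The metric structure described here (name, dimensions, timestamp, value, optional value meta) can be sketched in plain Python. This is a minimal illustration, not the Monasca client: the helper names `build_metric` and `metric_identity`, the dimension values, the URL, and the timestamps are all hypothetical, and the JSON key `value_meta` is assumed from the spoken "value meta".

```python
import json
import time


def build_metric(name, dimensions, value, value_meta=None, timestamp_ms=None):
    """Assemble one metric as a plain dict (hypothetical helper)."""
    if timestamp_ms is None:
        # Timestamps are in milliseconds since the epoch.
        timestamp_ms = int(time.time() * 1000)
    metric = {
        "name": name,
        "dimensions": dict(dimensions),
        "timestamp": timestamp_ms,
        "value": float(value),  # the value is a float
    }
    if value_meta:
        # Extra context only; it does not identify the metric.
        metric["value_meta"] = dict(value_meta)
    return metric


def metric_identity(metric):
    """A metric is identified by its name plus its dimensions."""
    return (metric["name"], tuple(sorted(metric["dimensions"].items())))


# Hypothetical dimension values for an HTTP status check.
dims = {
    "url": "http://203.0.113.10/v2.0",
    "region": "region-a",
    "zone": "zone-1",
    "service": "compute",
}

# Value 1 means "down" for this check, value 0 means "up".
down = build_metric("http_status", dims, 1,
                    value_meta={"status_code": "500",
                                "msg": "Internal server error"},
                    timestamp_ms=1461600000000)
up = build_metric("http_status", dims, 0, timestamp_ms=1461600060000)

print(json.dumps(down, indent=2, sort_keys=True))
```

Both samples share one identity because the name and dimensions match; only the value, the timestamp, and the value meta differ.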
I doubt my port number is correct in this example. There's a timestamp in milliseconds, a value of one, and then the value meta, with a status code of 500, which is internal server error. A value of zero would mean that the API is up and okay; a value of one would mean that the API is down.

Okay, so that's the command line for metrics, and I doubt people can read all that, so I won't go through it all. All right, so this is just an example of creating a metric. This is all done in IPython, so we have the exclamation point, which allows me to run command-line commands in an IPython notebook. So it's bang monasca metric-create --dimensions, and then region equals US Central, state equals Texas, city equals Austin, and (I lost some of my scrolling there) it has a value of something like one. Okay, and then I run a query on that metric. Oh, the name of the metric was openstack dot hands-on status. So I successfully created that metric, and now you can look at what's in the database, and it's stored a value. Let me check my time.

Okay, so now we can go list metrics. That's done on this endpoint, GET /v2.0/metrics, and you can specify as parameters a metric name and its dimensions. The start time, end time, offset, and limit parameters are also available. Just a note: if you ever deploy this and you've got millions of VMs in your environment, you need to start limiting the range and the number of metrics that are returned. Usually you want to limit that to some time range, because you're only interested in looking at metrics for the past one-hour or 24-hour period. That's really important, because after a few months of running you'll have tens of millions of metrics in your system if you have a large deployment. Of course, it's less important if your deployment is relatively small.

That is the command for getting help on metrics, and that's an example of some metrics that are in the system. So there's the command, metric-list. I'm limiting it to 10 metrics in
this case, and in the left column here we've got the names of our metrics, and in the right column we have the dimensions associated with each metric. This is pretty small; this is done using just the DevStack deploy, so there are really not too many metrics in the system right now. But in any large deploy, it'd be very common to have tens of thousands for your physical infrastructure, and for all your VMs that are continually changing and being added and deleted, you could easily have millions.

So this is just another example of using the metric-list command to filter down to just the metric name cpu.user_perc with the dimension hostname equals devstack. And then we can also create a function to get metrics in Python, and this line here is the line that's actually invoking the, oops, Monasca client and getting those metrics and returning them. And we'll use that.

Okay, so metrics names. If every time you interacted with Grafana you had to query and get all the metrics in the system, that would take a lot of time to get that result set. But often, when you're going through and looking for a metric, you're saying, I'm interested in looking at something in particular, right? And the number of unique metric names in your system might be hundreds. Now obviously there are lots of dimensions associated with them. So there might be a metric name called CPU user percent and another one called system percent. You get the idea; there might be hundreds of those. But the number of those same metrics for the entire deploy might be huge. So the names resource allows you to do queries on just the names. It's very useful in the Horizon or Grafana dashboard, where we're just going to limit it and say, I'm interested in just the names. And that is an example of that query.
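The two ideas in this passage, filtering a metric list by name and dimensions, and reducing it to distinct names, can be sketched over an in-memory list. This is a stand-in for illustration, not the Monasca client or API: the functions `filter_metrics` and `distinct_names` and the sample data are hypothetical.

```python
def filter_metrics(metrics, name=None, dimensions=None):
    """Keep metrics matching the name and every given dimension."""
    wanted = dimensions or {}
    return [
        m for m in metrics
        if (name is None or m["name"] == name)
        and all(m["dimensions"].get(k) == v for k, v in wanted.items())
    ]


def distinct_names(metrics):
    """Return the unique metric names, sorted."""
    return sorted({m["name"] for m in metrics})


# Hypothetical sample: the same names reported by several hosts.
sample_metrics = [
    {"name": "cpu.user_perc", "dimensions": {"hostname": "devstack"}},
    {"name": "cpu.user_perc", "dimensions": {"hostname": "host2"}},
    {"name": "cpu.system_perc", "dimensions": {"hostname": "devstack"}},
    {"name": "cpu.system_perc", "dimensions": {"hostname": "host2"}},
    {"name": "http_status", "dimensions": {"hostname": "devstack",
                                           "service": "compute"}},
]

on_devstack = filter_metrics(sample_metrics, name="cpu.user_perc",
                             dimensions={"hostname": "devstack"})
print(on_devstack)
print(distinct_names(sample_metrics))
```

The list of names stays short even as the number of name and dimension combinations grows with the number of hosts, which is why a dashboard would ask for the names first.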
Well, that's the help for it, and that is an example. In the previous example, the column on the left was the names and the column on the right was the dimensions. This is just the distinct elements that are in the column on the left, the metric names.

Okay, so the measurements resource. You write all these metrics into the system, and you're going to want to look at them later on. The measurements resource allows you to query them. So you can say GET /v2.0/metrics/measurements. You can supply the name and dimensions as query parameters. The start and end time, offset, and limit can also be supplied to limit the range of data that you're interested in. If you're getting more than 10,000 elements back, you're going to need to page through them, so you'll need to use the offset and limit parameters.

Currently, you can only get one array of measurements back. So if you're doing a query and you didn't filter down to a single unique metric, let's say you're running a query that says give me CPU user percent with no hostname dimension supplied, then you can use the merge_metrics flag to merge all that data into a single array. That's really important. If you really are interested in each unique metric independently, then you'll end up having to run a separate query for each one. However, there is a review in progress that'll allow you to get multiple metrics in a single request, and that'll be a significant performance improvement.

Okay, so there's an example of listing the measurements. In this case, we've specified a dimension of hostname equals devstack, and we've limited it to five measurements. The name is not showing up here, but it's cpu.user_perc, and we get five values back. Okay.
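The offset-and-limit paging described in the talk can be sketched as a loop that keeps fetching while a page comes back full, using the timestamp of the last element as the next offset. The sketch runs against an in-memory stand-in rather than a real API: `fetch_all`, `make_fake_backend`, and the sample timestamps are hypothetical, and real responses may be shaped differently.

```python
def fetch_all(fetch_page, limit):
    """Collect every measurement by paging with offset and limit."""
    results = []
    offset = None
    while True:
        page = fetch_page(offset=offset, limit=limit)
        results.extend(page)
        # A page shorter than the limit means there is nothing left.
        if len(page) < limit:
            return results
        # Otherwise continue after the last element's timestamp.
        offset = page[-1][0]


def make_fake_backend(measurements):
    """In-memory stand-in for a measurements query (an assumption)."""
    def fetch_page(offset, limit):
        remaining = [m for m in measurements
                     if offset is None or m[0] > offset]
        return remaining[:limit]
    return fetch_page


# Six hypothetical (timestamp, value) measurements, ten seconds apart.
measurements = [("2016-04-25T15:00:%02dZ" % s, float(s))
                for s in range(0, 60, 10)]

backend = make_fake_backend(measurements)
first_page = backend(offset=None, limit=2)
second_page = backend(offset=first_page[-1][0], limit=2)
everything = fetch_all(backend, limit=2)
print(first_page, second_page, len(everything))
```

When the total is an exact multiple of the limit, the loop issues one extra request that returns an empty page and then stops.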
This is just creating a function called get_measurements, and we're using the Monasca client on this line to get the measurements back and return them, with some additional coding around that.

Okay, and I don't know if you guys are familiar with pandas, but pandas is the Python data analysis library. I use pandas in this notebook to create data frames and then display them using Plotly. So that function there sets that up so we can convert measurements into a data frame.

Okay, so on this line we're just using those two functions that we just created. We get some metrics; we're getting two metrics in our case, CPU user percent and CPU system percent, and then we plot them using Plotly. First we convert to a data frame, and then the command DataFrame.iplot uses Plotly. It actually uses Cufflinks, which uses Plotly. So right there, we're displaying CPU user and system percent. The red is user percent. That's kind of cool. One of the things that I'm interested in doing is online, sort of real-time, data science on your monitoring or on your cloud.

Okay, so the statistics command is similar to measurements; you can get data back that you put into the system. You've got names and dimensions again, but additionally you have the statistics parameter, which is currently limited to average, min, max, sum, or count, and then you can specify a period. A period would be the time interval over which you're interested in applying that statistic. Five minutes would be a good value. It's done in seconds, so 300 seconds would be five minutes. You can do it over an hour; you can do it over 24 hours. And then you can specify a start and end time, and also an offset and limit. So you can basically say, I'm interested in the average CPU utilization over a 24-hour period in, like, five-minute periods. And, similar to the measurements query,
It only returns one list; you can use the merge_metrics flag here too. And that same review that I mentioned earlier will allow you to do multiple metrics in a single query. So that is the command line for running that, and that's a function to get statistics, similar to the previous one, that converts to a data frame. Now we'll go ahead and run this command. It's not terribly exciting, but that is a Plotly graph of CPU. It's multiple metrics in this case: there's user percent, idle percent, stolen percent, and others.

Okay, and using Plotly it's very easy to create box plots. In this case the middle line of the box is the median, then I think the upper line is the upper quartile, the lower line is the lower quartile, and the top and bottom bars are the min and max. I think that's what they're doing in this case. This could be really useful if you wanted to understand the CPU load across your entire infrastructure and which systems were anomalous, in the sense that you could quickly say, go ahead and return me all the cpu.user_perc for my entire infrastructure, and then display a box plot on that and see which systems have high means or low means. Basically, do visual analytics on your infrastructure. In this case, though, I'm not doing that.
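The numbers behind a box plot like the one described above can be sketched with the standard library alone. This is an illustration under assumptions, not what Plotly computes internally: the sample values and the hostnames `foo` and `bar` are made up, and the quartiles use Python's default method, which may differ slightly from the plotting library's.

```python
import statistics


def box_summary(values):
    """Five-number summary plus mean for one host's measurements."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values), "q1": q1, "median": median, "q3": q3,
        "max": max(values), "mean": statistics.fmean(values),
    }


# Made-up cpu.user_perc samples keyed by hostname.
cpu_by_host = {
    "foo": [4.0, 5.0, 5.0, 6.0, 6.0, 7.0],
    "bar": [70.0, 75.0, 80.0, 85.0, 90.0, 95.0],
}
summaries = {host: box_summary(vals) for host, vals in cpu_by_host.items()}

# The host whose typical load is highest stands out immediately.
busiest = max(summaries, key=lambda host: summaries[host]["median"])
for host, summary in summaries.items():
    print(host, summary)
```

Across a large infrastructure, comparing these summaries side by side is the same visual check the box plot provides.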
I don't have a big infrastructure that I have access to here for this demo, but I am displaying these CPU user, idle, stolen, and other percentages, and you can see that the idle percent, which is in blue, is much higher than all the other percentages. Since there's nothing running on my system, that is normal.

Okay, so notification methods. Within Monasca you can connect notification methods to alarms. You do that by creating a notification method with a name, type, and address. Once you create that notification method you can query it, modify it, or delete it. We support email, PagerDuty, and webhooks. So that's the command line for creating a notification method; obviously you specify those three things I just mentioned. This is an example of creating an email notification. We've got the name of the notification method here, the type is email, and the address is john.doe at domain. You can also do PagerDuty. Unfortunately the VM that I'm running in can't get onto the internet. I was going to have it call me and do a really cool demo, but that's not possible. But I did have the screenshot here. If you're familiar with PagerDuty, you can grab the integration key out of that and, on this line, create a notification. It's running off the screen here, but the name of this notification method is Monasca boot camp PagerDuty notification, and then that value would be the type. Oh, it showed up.
Let's see if the next one shows up. PagerDuty, and then the address would be the integration key. You can list your notification methods after you've created them.

Okay, so that brings us to alarm definitions. There are operations for creating alarm definitions. One thing that's different in Monasca is that you don't create alarms directly. What you create are alarm definitions, and you can view alarm definitions as templates: as metrics arrive in the system, they are matched against the metric name and dimensions that you've added to your alarm definition. If there's a match and an alarm hasn't been created already, it'll create one, and then as new metrics arrive they'll go to that alarm. Basically what this allows you to do is create one alarm definition in my environment and generalize over lots of systems or resources that I'm monitoring very easily. So if I wanted to monitor my physical infrastructure and get a warning message when a metric value was greater than 80 percent, I could create a single alarm definition for doing that. In that case I'd have many, many metrics showing up, all with the name cpu.user_perc, but the hostname dimension would be different.

There's an additional value here called match_by, which I need to describe as well. You see match_by hostname. You can also think of it as a group-by operation. In fact, that might have been a better name for it. As these metrics arrive, say I get a cpu.user_perc metric where hostname is foo in one case and hostname is bar in the other case. If I say match_by hostname, I'll end up with multiple alarms being created.
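The match_by behavior just described can be sketched as a plain group-by. This is an illustrative model under assumptions, not the threshold engine's implementation: the function name `group_into_alarms` and the sample metrics are hypothetical.

```python
def group_into_alarms(metrics, metric_name, match_by=()):
    """Group matching metrics into alarms, one per unique match_by value."""
    alarms = {}
    for metric in metrics:
        if metric["name"] != metric_name:
            continue  # this definition does not match the metric
        # The alarm key is the tuple of values for the match_by dimensions.
        key = tuple(metric["dimensions"].get(dim) for dim in match_by)
        alarms.setdefault(key, []).append(metric)
    return alarms


# Made-up incoming metrics.
metrics = [
    {"name": "cpu.user_perc", "dimensions": {"hostname": "foo"}, "value": 12.0},
    {"name": "cpu.user_perc", "dimensions": {"hostname": "bar"}, "value": 91.0},
    {"name": "cpu.user_perc", "dimensions": {"hostname": "foo"}, "value": 15.0},
    {"name": "mem.free_mb", "dimensions": {"hostname": "foo"}, "value": 512.0},
]

per_host = group_into_alarms(metrics, "cpu.user_perc", match_by=("hostname",))
single = group_into_alarms(metrics, "cpu.user_perc")
print(len(per_host), "alarms with match_by hostname")
print(len(single), "alarm without match_by")
```

With match_by set to hostname, foo and bar each get their own alarm; with no match_by, every matching metric feeds one alarm.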
There's a simple grammar for creating alarm expressions. That's an example of a compound alarm there. We're taking the average of cpu.user_perc greater than 85, or the average of disk read ops with a dimension of device equals vda. In that second sub-expression the period is 120 seconds, and if that value is greater than a thousand then we'll alarm on that.

There are three states of an alarm in Monasca. There's the OK state, there's the alarm state, and then there's the undetermined state. OK and alarm are pretty much self-explanatory. OK means I'm within my threshold. Alarm means I'm outside of my threshold, above it or below it, whatever the case is. Undetermined means I had an alarm that was created, but there are no metrics being sent to it anymore. We call that the undetermined state. This can happen for a lot of reasons. You might have had a physical host that you took down. If you had a bunch of metrics like cpu.user_perc being sent from a system and then that system crashed, you wouldn't be getting metrics anymore.
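The three states described above can be sketched for a single sub-expression of the form "average over the period is greater than a threshold". This is a simplified model under assumptions, not the threshold engine's code: the function name `evaluate` is hypothetical, and compound expressions are left out.

```python
def evaluate(values, threshold):
    """Return the state for one 'avg(metric) > threshold' sub-expression."""
    if not values:
        # No measurements arrived during the period, e.g. the host went down.
        return "UNDETERMINED"
    average = sum(values) / len(values)
    return "ALARM" if average > threshold else "OK"


print(evaluate([70.0, 75.0, 80.0], 85))   # within the threshold
print(evaluate([90.0, 95.0, 88.0], 85))   # above the threshold
print(evaluate([], 85))                   # metrics stopped arriving
```

The empty list models the crashed-host case: there is nothing to compare against the threshold, so the alarm is neither OK nor in alarm.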
So we use that undetermined state to help tell you that it's not that there's an alarm, but that you're not receiving metrics for it anymore. There are other reasons, like VMs. There's a high churn rate in systems that use OpenStack, with VMs being created and destroyed, especially if you have autoscaling enabled. Every time a VM gets created, metrics get created, and then when the VM goes away the alarm, if you have an alarm set, would go to the undetermined state.

In addition, there are four user-assigned severities with your alarm definition: low, medium, high, and critical. So when you create an alarm definition, let's just say you wanted a warning alarm on CPU utilization greater than 80. You would specify the severity as low. And then let's say you wanted a high or critical alarm if the CPU was greater than 95. Then you would assign that, and you'd create two alarm definitions in that case.

Okay, so that's an example of creating a simple alarm. There's the expression. 60 is the period and 80 is the value that we're comparing against. You can't quite see it, for some reason it didn't display correctly, but as you go to the right you can look at that. The compound alarm expression doesn't show up, so we won't go through that.

So I kind of described this match_by thing. It's kind of hard to grasp when you first encounter it. Basically, as I described, as metrics come into the system you can group them. You can say group by and specify some dimension. You can group them by region or hostname or any dimension that would be a part of a metric. I'm skipping over some stuff here because I described some of that, and we're coming up on an hour.

Okay, so now you've created your alarm definition, and hopefully you've got alarms that are being automatically created.
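The warning and critical definitions described above could be expressed as two request bodies along these lines. This is a sketch under assumptions: the definition names and the helper `alarm_definition` are hypothetical, and the bodies are only printed, not sent to any API.

```python
import json


def alarm_definition(name, expression, severity, match_by=("hostname",),
                     alarm_actions=()):
    """Build the body for creating one alarm definition (hypothetical helper)."""
    return {
        "name": name,
        "expression": expression,
        "severity": severity,                  # LOW, MEDIUM, HIGH or CRITICAL
        "match_by": list(match_by),            # one alarm per unique hostname
        "alarm_actions": list(alarm_actions),  # notification method IDs
    }


# 60 is the period in seconds; 80 and 95 are the thresholds from the talk.
warning = alarm_definition(
    "cpu warning", "avg(cpu.user_perc, 60) > 80", "LOW")
critical = alarm_definition(
    "cpu critical", "avg(cpu.user_perc, 60) > 95", "CRITICAL")

print(json.dumps([warning, critical], indent=2))
```

Keeping the two thresholds in separate definitions is what lets each one carry its own severity and its own notification methods.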
That's what would normally happen. So what you want to do is look at those alarms and query them, and you can do that. You can get the list of all the alarms, and if there are thousands you might have to page through them. You can also GET, PUT, PATCH, or DELETE a specific alarm by ID. And there are several query parameters, like the alarm definition ID, the metric name, metric dimensions, and state. You can also sort these. You're going to have tens of thousands of alarms in your system if you have a large deployment, and if you're interested in displaying them you can do server-side filtering and sorting on them. We support a number of options for doing that, and that's really important, because if you want your user interface to be responsive you can't just query all of the alarms back into the browser and operate on them there. You need to do that server-side and get it in the format that you want. So there are a lot of options on alarms and alarm definitions, but I'll just give you the overview. In this example here, we're listing all the alarms and sorting by severity. It's really hard to read here, but somewhere in here is my state column, right there, and I guess everything is in the undetermined state. So I've got to work on that little example, but anyway.

Okay, alarm counts. Alarm counts allow you to get the number of alarms in various states or severities, or grouped by various names or dimensions. Why would that be useful?
My next slide will show that. This is an example of the Helion Ops Console. I'm not trying to do an advertisement here; this is just the only diagram I had. What this is showing is that compute, so this would be the Nova service here, has one alarm in a critical state, zero in warning, zero in unknown, and 35 total. That's where you really need the alarm counts resource to get that information, because if you didn't have a server-side API for doing that, you'd have to query and get all those alarms and then do this calculation yourself. So that's where that resource is important. That's the command line, and that's just running it on the command line, filtering on service equals monitoring and grouping by state and dimension name. I guess I didn't have any alarms in the system when I ran this a few minutes ago, so sorry about that.

Okay, so alarm history. I hope we're getting to the end of the resources here. The alarm history allows you to go back and look at the states of all these alarms. Some people think of an alarm as the moment when it actually occurs. The way we think about it, an alarm is a persistent object in the system. It's a resource, and that alarm undergoes state transitions. It can go from OK to alarm and back again, and also to the undetermined state. What the alarm history allows you to do is query the states of all your alarms. This is really useful if you have an alarm that's flapping, for example; it might be going up and down. You can look at that history very quickly. If you want to do root cause analysis on something after the fact, you can do that; all of the history is there. It's kept indefinitely unless you prune your database. What else? As an alarm occurs, if you're an operations guy, you can go back and look and say, what's up with that system, right?
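One way to use the alarm history for the flapping case mentioned above is to count state transitions per alarm. This is a minimal sketch under assumptions: the history entries are made up, the field names `alarm_id`, `old_state`, and `new_state` are assumed for illustration, and the cutoff of three transitions is arbitrary.

```python
from collections import Counter


def transition_counts(history):
    """Count state transitions per alarm ID."""
    return Counter(
        entry["alarm_id"] for entry in history
        if entry["old_state"] != entry["new_state"])


def flapping(history, min_transitions=3):
    """Return the alarm IDs that changed state at least min_transitions times."""
    counts = transition_counts(history)
    return sorted(a for a, n in counts.items() if n >= min_transitions)


# Made-up history: alarm a1 goes up and down, a2 changes once.
history = [
    {"alarm_id": "a1", "old_state": "OK", "new_state": "ALARM"},
    {"alarm_id": "a1", "old_state": "ALARM", "new_state": "OK"},
    {"alarm_id": "a2", "old_state": "UNDETERMINED", "new_state": "OK"},
    {"alarm_id": "a1", "old_state": "OK", "new_state": "ALARM"},
    {"alarm_id": "a1", "old_state": "ALARM", "new_state": "OK"},
]
print(flapping(history))
```

In practice the entries would also be limited to a time window, so that an alarm that changed state a few times over many months is not reported as flapping.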
Maybe you have a system that's got a hard drive that's going bad or something; you can look at that. So there are lots of ways to use the alarm history. I wanted to show graphing an alarm history along with metrics. I didn't get to doing that in my demo today, but something that'd be kind of cool is to show your metric and then plot, as data points, the points where your alarms transitioned, with the metrics simultaneously. Okay, so, oops, I kind of skipped over here. I probably don't have an alarm history to show you because I didn't have any alarms.

Okay, so we've gone through the whole API now, which is pretty good. I skipped over a bunch of stuff because I didn't want to bore you with every single command in there. If you want to go run all the stuff, you have access to my IPython notebook and can look at the examples.

I want to change the topic now to talk about the agent a little bit. With Monasca, in addition to all those components that were listed on the architecture slide, there's also a Python monitoring agent. We use this within the Helion deployment, Time Warner Cable is using it, and others are using it. Let's see, I think my next slide talks about what it monitors. Yeah. All right, so let me get through this slide. Monasca is a push model on metrics, so we push metrics into the API. Usually the agent is installed on the system that you're monitoring. We also do what we call active checks, borrowing the nomenclature from Nagios. Active checks are where we run the agent on one system and it's actually monitoring another system. It might be monitoring whether the system is up, or it might be monitoring the API to see if the API is up. We call those active checks, and you can do that with our agent as well. It has a pluggable architecture, so you can plug in various plugins to monitor things like services, like MySQL or RabbitMQ.
We have some plugins now for monitoring OpenStack services like Swift. And with the agent there's also a way to use the setup script. We wrote a lot of the setup stuff when we were building Helion, and as I found out, it doesn't quite work with DevStack. So sorry about that. You might need to do a little bit of work if you're trying to get this working on your own systems. But anyway, it's there, and it's probably not too difficult to get that to work with DevStack, and we'll work on that.

So what are some of the things that the agent can monitor? I mentioned CPU, memory, network, file system, a lot of your usual system metrics. We can do service metrics like RabbitMQ, MySQL, Kafka, and many others. For application metrics it has a built-in statsd daemon, and we extended that statsd capability with support for dimensions. Dimensions are one of the central concepts in Monasca, so we needed to extend the statsd capabilities with them. We can monitor VMs, and so we get a lot of VM metrics like CPU, memory, and networking metrics. But we also do host-alive checks on VMs, or ping checks. The host-alive check actually uses a ping check. It's a little bit better, because when you do a ping against a system it actually involves the kernel. Sorry, in the case of the VM it's called a host status check. With a status check using libvirt you can't always tell that the VM is running, because the kernel could have panicked. With a ping check you can. A variety of active checks can be run, like HTTP status checks and system up/down checks.

We support Nagios and check_mk as plugins. So if you have a Nagios deployment already done and you've got Nagios plugins written, and this is something from Time Warner Cable, you can basically deploy Monasca and start using those Nagios plugins right away. Now, Nagios plugins report
OK, warning, critical, or undetermined states, so we turn those into metrics with a value of zero, one, two, or three, and then you'll have to create alarm definitions for them to actually trigger when things go to the warning or critical states. But if you're interested in moving from Nagios to Monasca, that makes the transition much, much easier. And that's extensible; you can add other plugins to that.

All right, how am I doing on time? Okay, so let me get through a few more slides. I'm not going to go through the agent example here. I spent a lot of time on it, though, but I'm not going to do it.

Okay, so if you're a developer sitting in the audience right now, we'd love to have you working on Monasca. We do have a number of companies and organizations involved in the project. This next section will just give you a few pointers on how to get started with the project, and feel free to follow up with me on things. Monasca started out with more of a Java code base, and then we've been doing a lot of porting to Python. Depending on the component that you're interested in working with, it might be purely Python, or there might be both Python and Java versions of the code available. There's a Monasca DevStack plugin, and that's what I'm running on my system right now, and that's also used by the OpenStack CI system. That's basically similar to any other OpenStack project. We've got a lot of unit tests and we have around 150 Tempest tests.

Okay, so I mentioned that Monasca is a microservices, message-bus architecture, and so we have a lot of repos. Most OpenStack projects have a couple of repos. They have, like, a nova repository and then a python-novaclient repository.
We are a little different in Monasca, because we have all those repos there. We have one for the API, one for the persister, one for the threshold engine, the notification engine, the agent, the statsd library, the Horizon component monasca-ui, we also have our python-monascaclient, TWC has added puppet-monasca, and then we also have our new monasca-log-api, which is only Python. The direction for the project is to head more and more towards Python, so most of our new development for new components is Python only. But we've got these other components that have both Java and Python. That's something to be aware of. If you don't know Java and you want to be involved in the project, there are certain components that you could work with. Of course, if you want to work on the full project, then you might need to learn Java. We try to keep those code bases in sync. We are very careful about that on the project going forward. Today all Tempest tests pass 100% with either the Java or the Python components, and we try to maintain full compatibility between them. In fact, if you look at the architecture slide, for every one of those components, if a Java and a Python component exist, you can use one or the other, and that's regularly tested that way as well. So you can have a Java API with a Python persister, or the other direction, or you can go all Python or all Java. The notification engine is purely Python, the python-monascaclient is purely Python, and the log API is all Python, so you can stay purely in Python.

Okay, I mentioned this DevStack plugin.
That's where you would go to install the DevStack plugin. The best way to get started on that is to clone monasca-api, cd into the devstack directory, and if you use Vagrant, type vagrant up, and that will install the whole thing. The Tempest tests are in the monasca_tempest_tests directory in that same monasca-api repository, and they're also run in our OpenStack CI process, as I just said.

Okay, so there are a couple of ways to get in touch with the Monasca team. I just want to make sure I've got enough time to show the Horizon and Grafana demo, and I think I will. Yeah, so those are the ways to get in touch with us, and IRC. We have regular weekly meetings on Wednesday at 1500 UTC, and then we have our own openstack-monasca room. If you're trying to get hold of me it's hit or miss, but there are other people in there, and they'll get me if I'm not in there.

Okay, so what's new in the project in the last couple of months, if you've been watching the project at all. We have added a lot of enhancements for filtering and sorting resources that return arrays. That was really important for tools like Ops Console, where you're trying to do summary and overview pages, or you're trying to page through lists without having to bring the entire database into your browser. Multiple metrics is in progress. That'll allow you to query and say, give me cpu.user_perc, and you'll get back a list for every single host in your environment. Sporadic metrics is something that's in progress that allows you to support event-based metrics. Most metrics you think of as being periodic. They're sent every 30 seconds or a minute, whatever your collection interval is, and that's configurable. But you may have something that's event-based, like something going up or down. In our case, where this came up was when we were trying to integrate with the logging system and count or emit metrics for each log message. So obviously errors
in your log files, hopefully there aren't many of them. But when they do occur they're event-based, and they're usually due to some problem. So the direction that we're going with the logging system is to emit a metric directly into Kafka when an error occurs, or something else that you're interested in in a log file. You can use Logstash for parsing your log messages and then emitting metrics back into Monasca, and from there you can threshold on values. I can say, hey, if I get an error message I want an alarm to go off and then send a notification.

Periodic notifications is something in progress, and that's really focused on better enablement of autoscaling with Heat. A periodic notification is a notification that will be sent periodically, like every minute. Today our notifications are one shot. When the alarm fires we send a notification, and we don't send any more. So periodic notifications will be primarily used with webhooks to notify Heat that the alarm is still in the alarm state, and then Heat can make decisions. That's how the Heat architecture works around that.

So, some things that are happening with the project right now. One thing that's happening, there's a blueprint that was just put up, is that we're adding what we call the Monasca transform and aggregation engine. If you recall the architecture slide, we had those three microservices: one for the persister, one for the threshold engine, and one for the notification engine. The transform and aggregation engine will be at that same level in the architecture diagram. What it'll do is read incoming metrics and aggregate them together. So let's say we had metrics from Swift across all the Swift proxy servers in my environment, and I wanted to create an aggregate metric that is the number of bytes sent or received from Swift by project or tenant ID.
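The Swift example just described, summing per-proxy metrics by project or tenant, amounts to a group-and-sum. The sketch below is only an illustration of that idea and says nothing about how the planned engine is built: the metric name `swift.bytes_sent`, the dimension names, and the sample values are all hypothetical.

```python
from collections import defaultdict


def aggregate_by(metrics, dimension):
    """Sum metric values grouped by one dimension."""
    totals = defaultdict(float)
    for metric in metrics:
        totals[metric["dimensions"][dimension]] += metric["value"]
    return dict(totals)


# Made-up per-proxy metrics; the name and dimensions are assumptions.
metrics = [
    {"name": "swift.bytes_sent", "value": 100.0,
     "dimensions": {"hostname": "proxy1", "tenant_id": "tenant-a"}},
    {"name": "swift.bytes_sent", "value": 250.0,
     "dimensions": {"hostname": "proxy2", "tenant_id": "tenant-a"}},
    {"name": "swift.bytes_sent", "value": 50.0,
     "dimensions": {"hostname": "proxy1", "tenant_id": "tenant-b"}},
]
totals = aggregate_by(metrics, "tenant_id")
print(totals)
```

The hostname dimension disappears in the result: the aggregate describes each tenant across all proxy servers rather than any one server.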
So I want to do a grouping operation on a tenant or project ID. The aggregation engine will receive all those metrics, and then, using its built-in analytics capabilities, it'll go ahead and sum or total that metric by whatever you're grouping by. In this example that'd be a project ID. So look for that if you're interested, and it's a great time to get involved, I guess, with that.

Monasca analytics is something experimental. A couple of years ago I talked about doing anomaly detection with Monasca. I didn't make much progress on the anomaly engine, but we have some folks now that are interested in doing this, as well as what's called alarm clustering or correlation. In a system where you have many, many alarms going off, you can suffer alarm fatigue, right? That's the typical problem for an operator. So let's say a system goes down. All the alarms about that system would potentially trigger and go to the undetermined state. What you'd like to do is cluster things together temporally, as well as based on other identifiers in the metric, like its dimensions, for example a hostname. So that's the other focus of that project. It's kind of experimental, but we have some folks from HP and Fujitsu that are interested in this area, and we're hoping to put more time and energy into that over the next few months.

And then Monasca events. This has been on the list for a long time with the project. Hopefully we're going to get to this in the next few months. It's basically adding complex event processing. The canonical example that I've used in the past is OpenStack notifications, like VM life cycle events, getting sent into the system, and then, if a VM end event occurs, triggering on that and deleting something, like an alarm associated with it, or doing some other action. Some other areas we're talking about later this week:
Retention periods in Monasca, compression at the API level, and more performance. Performance is continually looked at within our project, so we always talk about how to make things faster. It usually involves optimizing your database queries, and potentially adding, just like we have the metrics resource, a names resource. Network monitoring is a big area that's starting to come up, and how would you use Monasca to do that? Obviously we have the Broadview project, but we're going to be talking to the Neutron team later this week, and hopefully we'll come up with some good ideas, and at the next summit maybe we'll be presenting them.

Just an advertisement here: if you're really interested in Monasca, and these would be developer-focused sessions, we have sessions from 9 to around 2:30 on Wednesday, if you're interested in attending. That's my recruiting.

Okay, so now to Horizon. As in that screenshot earlier, we have the monitoring panel, and I can go to the overview page. There's not a whole lot of excitement happening here right now. I'll show you that in a second. What this monitoring page is showing you is what we've got. You can go to Grafana from here. This first row here is for the services in my system. We're currently showing that the monitoring service is experiencing a problem.
That's why it's yellow. And then, as I was going through that whole IPython notebook, I ended up creating some metrics that had a hostname of foo and a hostname of bar, but the real host in the system is devstack. So the first row is grouped by services, and the second row is grouped by servers.

Okay, so what we're going to end up doing here: there are actually metrics being sent in this system right now, but the problem is I don't have an alarm definition created for them yet. The metrics that are being sent into the system are called http_status. So we're going to create an alarm definition called service status, and we're going to say, if the max of the metric... so these are all the metric names that are in the system right now that I'm scrolling through, and we're going to select http_status. And we're going to say, if that value is greater than zero, that'll be my threshold to alarm on.

Now this next section here says matching metrics. Earlier I was talking about match_by, or grouping metrics by some dimension. What the system is showing me right now is all of the unique metrics that happen to match this alarm definition. So there's service equals glance, cinder, neutron, swift, ceilometer, and nova. Those are metrics that are being sent to the system right now for each one of those services, and they're all metrics that we know about in the system. What I want to do is create a separate alarm for each one of those services. I've got two choices here. I can create one alarm for everything, right? Or I can create an alarm for each service independently. So this line right here, which is already pre-populated with URL, hostname, and component, I'm just going to strip that off, and it's going to say match_by service. I can supply a description, status of service. I'm not being very creative today. And severity, we're going to give this a critical. My service is down.
That's critical. And notifications: it's not going to work whether I select the PagerDuty one or the email, because I couldn't enter the passcode into my VM, so it's not really on the internet. Neither one's going to work, but we'll select the Monasca boot camp PagerDuty one anyway. That was actually created in the IPython notebook. So it successfully created that alarm definition at that point.

So how many alarms do you think were created? Let's go over here now. Well, let's go back to the overview page. We'll wait for it. Wait for it. Oh, there they are. Okay, so it takes a little second there, but now we've got this row populated. Metrics are sent every 30 seconds, and then there's the built-in evaluation period that our threshold engine uses, so it's not immediate. You can see that many of them are in this gray state right now. That means they're in the undetermined state, and it'll take about another minute for them to be in the OK state.

So basically that's the model around how alarm definitions are used as services are deployed. Usually for your physical infrastructure you don't have a huge churn rate; you're not constantly bringing services up or down, and hopefully they're staying up all the time. But when you're dealing with OpenStack resources it's a little bit different. So this does apply to your physical infrastructure, but it's even more applicable to OpenStack resources. Okay, so in this case Nova turned red here. The reason why it turned red is that I'm synthetically sending these metrics into the system right now. I gave it a value of one, and that's why it's red. All the other services are going to turn green here, and we can come back to that. Okay, so that's alarm definitions. You can look at each alarm.
Let me make sure I get to Grafana. Okay, so the alarm definition was called service status. You can click on that and look at its description. You can edit the alarm definition, or you can look at all the alarms that were created by listing them. So here are all the alarms that are in the system. Many are still in this undetermined state. Actually, they were created earlier. But we have the new ones that I just created. So you can page through that and look at all the alarms.

Then the other panel in here is the notification methods. If we wanted to create a new notification method for John Doe... can't spell. Okay, email, that'd be John Doe at domain. Okay, there it is. So I created that notification method. That doesn't generate notifications until I actually associate it with an alarm definition. So if I go back to my alarm definitions, I can look at the service status one. I can say I really want John Doe to be notified when the service status alarm goes off, and I can select John Doe from there and save that away.

Okay, so that's what we've got built into Horizon. If I go back to the overview page right now, okay, everything is green except for Nova, which is what I expected. In my case I have a little Python program that's looping through and generating synthetic metrics here. If we click on that we can drill down on Nova and say, what's wrong with Nova? And we can show the history of that. There's really not a very big history, because I ended up clearing my history before I did this IPython notebook. Had this value been changing over time, we would see a lot more alarm transitions, but let's just see what's shown here. So we have a timestamp. That's when the alarm transitioned.
We have the old state, the new state, the dimensions associated with that alarm, and the reasons for it. I'm not going to correct that alarm right now; that'll take a little too much time. I did want to show you Grafana and how that's integrated with this.

All right, so from this overview page I can hit the Grafana button, and I'm going to log in as the user mini-mon. In Grafana there's something called data sources. The first thing, if you're using our DevStack plugin, is that you'll have to enable Monasca as a data source. I already did that, and there's a screenshot of that in the IPython notebook. But here's what you're doing. You're saying, okay, the name for this data source is Monasca, the type of it is Monasca, I'm going to make it the default, the URL... That's interesting. Yeah, it's John too. I thought it was PagerDuty, actually. Every time I run through this IPython notebook PagerDuty calls me. It can't be; I thought we were off the network. I said that like ten times already. But that was from New Jersey, so somebody's trying to get me.

All right, so that's using Keystone authentication, and we can test that, and everything's working. Cool. So you'll have to do that manual step, and then you can go back to the dashboard here and start populating it. I'm going to create a new dashboard, and I'm going to add a graph panel to this, and I'm just going to go with the measurements. This first selector here is for which function I want to apply to the metric. It can be none, or it can be average, min, max, sum, or count.
So I'm going to say none, and then which metric I'm interested in. If I click in there, I can scroll through the metrics, so I'm going to select CPU user percent. Well, I'm not automatically updating yet, but now I've refreshed that, so that's the CPU user percent. You can see this is over the last six hours. This time here was probably when I was in my hotel room updating the slides and then walked on over, so the system was down, or something like that. Anyway, that's why it looks funky like that. I can go in here and say I'm interested in the data for the last hour, and I can also, oops, say I want to automatically refresh. Every 30 seconds will be good.

Okay, so that's the CPU utilization. You can see it's hovering at around 5%. What I sometimes do here is go in and, I'll go ahead and say I'm going to kill that. All right. While true. Okay, all right, so I've put a spin loop on that, and we can go back and look at CPU utilization and see if I was telling you the truth. I don't want this to be a magic show.

Okay, so now we can come in here again and add another metric panel. Oops, that's not working right. I don't think it liked what I did with going back into full screen or zooming or something. What happened here? Let's try it once more. We'll create a whole new panel. We'll take the average this time. I don't know, HTTP status. And I'll pick another one: Kafka consumer lag. Okay, it's not what I expected. All right, well, I think my demo kind of messed up here. Oh, no, okay, we're fine. It's just not what I expected to see.

All right. So in the top panel now we've got CPU utilization, and then there's this bottom panel.
We've got Kafka consumer lag, and we could spend a bit more time cleaning that up and showing exactly what you want, but that's kind of the Grafana integration that's available right now with Monasca.

Oh, one other thing. Just to show you the last five minutes of CPU utilization: after I created that spin loop, our CPU jumped up, right, a couple of minutes ago.

And that's it for the presentation. Hunt me down this week if you want to know more about the project. Wednesday morning, another advertisement, we'll be doing some developer sessions on that. If there are any quick questions that people have, we've got like a minute or two left.

So... right. Yes. Right. Yeah, so the Grafana plugin, or really the Monasca data source for Grafana, is interacting with the Monasca API. It's not going to the database. Oh, thank you everyone for attending.
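
For anyone running through this offline, here is a rough sketch of the kind of request that answer implies: a query to the Monasca API for aggregated measurements, authenticated with a Keystone token, rather than a read from the database. It only builds the request and does not send it. The endpoint address, the token placeholder, the helper name build_statistics_request, and the metric name cpu.user_perc are assumptions for illustration and are not taken from the session; the request the Grafana data source actually sends may differ.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode
from urllib.request import Request

# Assumed values for illustration only; adjust for your own DevStack install.
MONASCA_URL = "http://127.0.0.1:8070/v2.0"    # hypothetical Monasca API endpoint
KEYSTONE_TOKEN = "replace-with-a-real-token"  # placeholder, not a real token


def build_statistics_request(metric, function, minutes, period_seconds, now=None):
    """Build (but do not send) a GET request for aggregated measurements.

    `function` is one of the aggregations named in the demo: avg, min,
    max, sum, or count. `minutes` is how far back the time window starts.
    """
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(minutes=minutes)
    query = urlencode({
        "name": metric,
        "statistics": function,
        "start_time": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "period": period_seconds,
    })
    return Request(
        f"{MONASCA_URL}/metrics/statistics?{query}",
        headers={"X-Auth-Token": KEYSTONE_TOKEN, "Accept": "application/json"},
    )


# Average CPU user percent over the last hour, in 30-second buckets.
request = build_statistics_request("cpu.user_perc", "avg", minutes=60, period_seconds=30)
print(request.full_url)
```

Passing the result to urllib.request.urlopen against a running install, with a real token, would return the JSON that a graph panel could plot.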