Ready to go. Okay. Hi, everybody. My name is Dan Dyer. For those of you who were looking at the original schedule, you may be wondering where Matt is, or maybe you're Matt's fan club. Matt wasn't able to attend, so I jumped in and took his place for part of the presentation. I'm a software architect in the HP Helion group, so you get to see me try to pretend I'm a product manager, because that was Matt's job. I'll give you my best imitation of that to start with and talk a little bit about the motivation, and then, following up after me, Roland, who's also in the audience and is the lead for the Monasca project, will talk in more detail about what's going on.

So what are we trying to do here? Everybody needs to know what's going on in their cloud. You need to be able to measure that, you need to analyze what's happening in the cloud, you need to be able to store the results of that, and then you want to see what's really going on so you can draw some conclusions. That's really what we're trying to do with Monasca, in a very flexible, extensible way, so that we can build out something that people can operate their clouds with, because we all know that it's a bit of a challenge sometimes to operate OpenStack clouds.

Agenda, real quickly.
We'll do some overview, talk about some use cases from people who've already deployed OpenStack in production environments, and then talk a little bit about next steps and where we're going with Monasca.

So first of all, let's talk about the monitoring challenges. There are a lot of tools out there. I've been working in monitoring off and on for 25 years, and we all know there are probably as many tools out there as there are people that use them. They're all over the map in terms of capabilities. But if you compare traditional tools with the cloud environment and where we're moving, some of the things that are coming to the forefront in the way we manage and deploy clouds start to make those other tools maybe not quite the best fit.

First of all, scale is increasing. Compare a typical enterprise environment to a cloud environment, where in a large enterprise or a public cloud you've got potentially thousands of compute nodes, volumes, and things like that, and then potentially hundreds of thousands of virtual machines or containers. That's just not the kind of scale that a lot of systems know how to deal with at this point.

Second, there are a lot more moving parts and things are a lot more complicated. Just in the last year or so containers have taken off. A lot of enterprises are just getting used to VMs, and now we're throwing containers into the mix, so there are a lot of different types of resources to deal with. The virtualized and shared resources that you have in the cloud add another layer of complexity when you're trying to figure out what's really going on.

The apps themselves are distributed. Where I used to have monolithic apps that you built vertically, now you're distributing those. You start looking at a cloud-native app architecture where the pieces are moving around, all over the network, potentially all over multiple networks, to give you the fault tolerance and resilience you need. Once again, that's very good for the reliability and the capabilities of the app, but really hard to keep track of and monitor.

And then, more and more, you're expected to do more with less, so automate everything. What happens when you do automation? Well, you don't always know exactly what's going on: unintended side effects, things like that. So once again, you need something that can keep up with the level of automation so you know what's going on in your environment.

And because we're talking about virtualized resources, there's a lot more transiency in a cloud environment than there was in our old typical enterprise environments: VMs going up and down, workloads moving around, being optimized and changed on the fly, things scaling up and down automatically. That puts a lot of load on the monitoring system to figure out what's really going on. What's an error? What's actually intended to happen? When are you checking things? Did we scale up or down because we had to, or because there was a problem?

And all of that needs a lot of data. We need to be able to react quickly, we need fine-grained sampling, and we need to do that over a large number of different things that we're monitoring, so we're generating a lot of data.

So those are the challenges. There are lots more, but those are the big ones, and they drove a lot of how we designed Monasca and what we have been working on.

So, a process discussion here. You're your typical admin. Everything's good.
You're running along nice and happy, and something blows up. The first thing you've got to do is know that something blew up, and hopefully you know that before all your customers do, so that they're not all pissed off at you. In HP Helion we target private cloud, so we're probably going to be in an enterprise that already has some processes and capabilities to manage this with their more traditional data center stuff, or even with their virtualized things. They probably have some kind of incident management system. So once you've detected the problem and done some basic triage, you go into an incident management system, potentially some job ticketing, and identify that there's a problem and somebody needs to fix it, and it gets routed through your normal support process. The poor engineer who has to go figure out what's going on now has to say not only what's wrong, but how important is this relative to all the other stuff I have going on? So they need to do a little bit of impact analysis, prioritize it, get it set up right, and then finally, hopefully, solve the problem and get back into that nice happy state.

What we're trying to show in the diagram on the right is that OpenStack is not an island. It has to be able to work with everything else that's in the customer's environment. Monasca is not the only thing they're going to be using to monitor systems. Everybody's going to have a job ticket system, and they'll probably have some massive dashboard that they're looking at. So you've got to be able to fit into whatever processes and capabilities they already have, so that they're not expected to just rip everything out and put something new in.

So how does Monasca help with this?
The feature set, the capabilities, and the architecture that we built out were focused around the problem set we've been talking about.

First of all, it's got a scale-out architecture. We'll talk about this a little more, but essentially your monitoring system has to be able to grow with your cloud. And from our perspective, it has to grow in a way that lets you have everything in one place, so that as you're processing that data you have a holistic view of what's going on. If you can grow but you end up with little silos of what's happening in your environment, then it's going to be tough to figure out what's going on.

Second, because people expect very quick response, expect to have things up and running, and you're running a lot of automation and things are sensitive to timing, you really need high-resolution metrics. You've got to be able to pick the state up quickly and process it quickly. And obviously, if you're going to depend on this and it's your mission-critical infrastructure, it's got to be HA.

One of the things that a lot of traditional tools don't do very well is multi-tenancy. Operators are always the first line of defense in an operating environment, and they're the first ones to take an interest in the monitoring tools. But in a DevOps kind of environment, the DevOps people are going to be maintaining the in-cloud workloads, and they also need to be able to see this. You may not necessarily want your DevOps guy to see the underlying infrastructure of your system, or everything that's going on under there, especially in a public cloud kind of scenario. So having multi-tenancy, so that I can filter who sees what but make the capability available to different categories of consumers, people running VMs versus people running infrastructure applications, is a pretty powerful capability that really hasn't been done very well in the past.

Anything that we do also, as I mentioned, has to fit into the rest of the environment, so we've got to be able to do integration. As an example, we've done a variety of integrations with some existing HP software products, and we've done integration with PagerDuty. Job ticketing systems and those kinds of things are going to want the information that the monitoring tool has, and they're going to want to process it, so you've got to be able to integrate with them.

And then you have to be able to extend that, because of what's cool today and what we just did. You don't want to go ripping everything out so that you can throw in a new thing next week that does something new. You want to be able to incrementally build on your capabilities. So we have an architecture that can form the basis of this. It's very flexible in deployment and capabilities, but it can also be extended, and through the community people are starting to build on top of it. If you saw the logging presentation that was just done by Fujitsu, a lot of that work was done totally independently, and we pulled it together so that it fits very well, because we had some nice design patterns and some integrations. That gives you the idea: if you're doing things custom just for your environment, you can do that, or if you want to build something so that the community can use it, all those extensibility points are there.

And then finally, a lot of the systems out there are kind of static. If anybody's tried to configure Icinga, for example, it's not exactly easy to go into Icinga and change that thing on the fly. With Monasca, all of the configurability and all the capabilities can be driven through an API.
You can dynamically control the behavior of the system, so that your automation can go in and quickly change things if you want to. You don't have to go restarting services, loading config files in places, things like that. So that's the rationale and overview. Now I'm going to hand this over to Roland to talk in more deep technical detail.

Hello, everyone. I'm Roland Hochmuth, the tech lead on the Monasca project. I'm going to go through the system architecture of Monasca first, and when I get to the next slide, I'll tell you what's next, I guess. Okay, so let's see. Very zen-like, open mind.

Okay, so up in the upper right here we have the system being monitored. We typically have an agent deployed on that system that's monitoring things like system metrics: CPU utilization, networking, et cetera. It can also monitor services like RabbitMQ, Apache, MySQL. That agent publishes metrics to our Monasca API, which is a REST API. From the API the metrics are published to our Kafka message queue. We use Apache Kafka as our message queue; most people in the OpenStack community are familiar with RabbitMQ. Kafka was developed by LinkedIn. It's a highly performant, scalable, fault-tolerant, and durable message queue. It can handle millions of messages per second in a completely durable way.

So those metrics end up in our message queue, and then we have several components here. My diagram is only showing three components today; when I've shown this in the past, I've often had four or five. The first component in this diagram is our persister. It consumes metrics from our message queue and publishes them to our metrics and alarms database in the lower right-hand corner. There are actually two databases in our system. The one on the bottom right is for all of our high-volume data, which is metrics. It could be events. It could be log messages.
Here, it's metrics. The next component, in the middle there, is the threshold engine. The threshold engine is built on Apache Storm, so this is also a highly distributed component. What it does is evaluate whether metrics have exceeded some thresholds, which are user-configurable. You can specify the alarms. We support compound alarms made up of many alarm sub-expressions, and each sub-expression is whether some value of a metric is greater than, less than, or equal to something. If that threshold is exceeded, then we publish an alarm state transition back to our message queue. The notification engine, not the lower left but above the database there, will then look at those alarm state transitions and evaluate whether it should send an email or a PagerDuty alert. It could also send events to other software systems via webhooks.

The lower right is our config database. That's MySQL. That's where we store all of our configuration information. Some people are probably wondering, why do you have two databases in the system? Isn't one enough? Some databases are really good at storing lots of data, write once, read many times. Those are analytics-type databases, and in this diagram that would be InfluxDB. We also support a database called Vertica from HP, and we're working on support for Cassandra. The lower right database is typically MySQL, and also Postgres; Fujitsu added support for Postgres recently. It stores information like all the alarms and notifications that you've created in the system.

In the upper left there is Horizon. We have a monitoring dashboard that we've developed for Horizon that does all the create, read, update, and delete operations via our API, and we have a Python client.

So, the advantages. I'm not going to go through all the advantages of the architecture today, just some of the more important ones. This is a microservices message-bus architecture.
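To make the compound alarm idea described above more concrete, here is a minimal Python sketch. It is not Monasca's threshold engine (the talk says that is built on Apache Storm). The names `SubExpression`, `evaluate_alarm`, and `state_transition`, the `and`/`or` combination rule, the metric names, and the `OK`/`ALARM` state labels are all assumptions made for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Comparison operators a sub-expression may use (illustrative subset).
_OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass(frozen=True)
class SubExpression:
    """One clause of a compound alarm, e.g. cpu.user_perc > 90 (hypothetical type)."""
    metric_name: str
    op: str
    threshold: float

    def is_exceeded(self, latest_values: Dict[str, float]) -> bool:
        # Compare the latest value of this metric against the threshold.
        return _OPERATORS[self.op](latest_values[self.metric_name], self.threshold)


def evaluate_alarm(sub_expressions: List[SubExpression],
                   combine: str,
                   latest_values: Dict[str, float]) -> str:
    """Combine the sub-expression results with 'and' or 'or' into one alarm state."""
    if combine not in ("and", "or"):
        raise ValueError("combine must be 'and' or 'or'")
    results = [sub.is_exceeded(latest_values) for sub in sub_expressions]
    exceeded = all(results) if combine == "and" else any(results)
    return "ALARM" if exceeded else "OK"


def state_transition(alarm_name: str, old_state: str,
                     new_state: str) -> Optional[Dict[str, str]]:
    """Return a transition message only when the state actually changes."""
    if old_state == new_state:
        return None
    return {"alarm": alarm_name, "old_state": old_state, "new_state": new_state}
```

In the architecture from the talk, a transition like this would be published back to the message queue for the notification engine to act on; in this sketch it is simply a returned dictionary.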
Microservices message bus is what we like to call this. Microservices are relatively new in the industry, the last couple of years. As you can see, we have small components. They can be deployed autonomously. They communicate over a network, in this case over a message queue, via well-defined APIs. You can deploy Monasca with all those components, or you can even add additional ones. So that's a little bit about microservices.

The system supports load balancing, scalability, and system maintenance. If the amount of data that you're sending into the system is exceeding your current capacity, you can easily add more components and scale out horizontally. Also, what's really important when you're developing a monitoring system is that you have a requirement for 100% uptime, or fairly close to it. What you can do in this system is take the database offline if you want, deploy a new one, and your queue will have stored all that data. Hopefully it doesn't take too long for you to deploy the database. Then you can enable the database and start sending data into it. Or if components fail, the data will end up being stored in the queue, and later on when the components come back online they will catch up. With Kafka's durability and the time to live on messages within the queue, you can typically support days if you'd like, so you could take systems down for days at a time. So high availability and durability ensure no data loss.

The system is extensible, and this is really important. We mentioned a little bit of this already, but it's easy to add new components. Within HP, we've had several monitoring solutions that we've developed over the years. One is called HP Operations Manager, OMi, and there's a connector for that. I'm not trying to give a plug for HP here; I'm just trying to point out that it's easy to do these integrations. There's another tool at HP that we have done a proof of concept with, called Ops Analytics. You can think of Ops Analytics as a little more similar to something like Splunk. Something that came up when Time Warner Cable was deploying Monasca is that they wanted to do multi-site replication of their data. That was very easy for them to do, because they just enabled another component to send the data to a database that was off-site. Anomaly detection: I showed this last year at the Paris summit, but you could add in more components, and we have had some proofs of concept of anomaly detection being done.

Okay, so the threshold engine. The threshold engine is a real-time, in-memory, streaming threshold engine. That means that as the threshold engine consumes metrics from the queue, it keeps them in memory for the entire time frame or window that it requires those metrics for. It doesn't go out and query another API or database when it needs to update those thresholds. Updating thresholds once a minute is our default today. You can send metrics into the system much faster than once a minute, but we evaluate thresholds on one-minute intervals, and that's all kept in memory. When we no longer need a metric, it just drops out of the window, and we don't have to query databases or other APIs. That's based on Apache Storm.

This is just an example of a metric, posting to the v2 metrics endpoint. A metric is composed of a name, in this case CPU user percent.
We have dimensions. Dimensions are whatever values you'd like to have in there, but in this example they're the hostname, a region, a zone, and the service. There's a timestamp in milliseconds, a value, which is a float, and then this thing called value meta, which we added to better support Nagios. We have compatibility with Nagios: if you have a bunch of Nagios plugins deployed already, you can use our agent to run those plugins and convert the status codes into metrics, with a value of zero, one, two, or three for OK, warning, critical, or unknown. What's also important is the message that goes along with that, and that's what the value meta is for. I'm showing two things in this value meta: a status code, in this case an HTTP 400 status code, and a message of "internal server error". Again, it's up to you what values you'd like to store there.

Okay, typical deployment scenario. Within HP, focused on Helion, we deploy on three nodes. In this example it's symmetric: all the components are deployed across all three nodes, which I'm showing as availability zones in this diagram, and then there's a load balancer with VIPs. So it's a fairly common pattern for deploying. If any one of those nodes goes down, the system stays up and running, and when that node comes back online you can have it rejoin the cluster. We've done a lot of testing with that, so we know that it works, and that's the guy to talk to right there.

All right, so our agent. That's a Python agent. It's optional, but typically you would use our agent when running Monasca. It collects large amounts of data, like system metrics, service metrics, or application metrics. We have a built-in statsd daemon.
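Going back to the metric format and the Nagios status-code mapping described above, the following is a minimal Python sketch of how such a metric might be assembled. The field names (`name`, `dimensions`, `timestamp`, `value`, `value_meta`) follow the wording of the talk, but the exact wire format, the helper names `build_metric` and `nagios_result_to_metric`, and the `state` and `msg` keys inside `value_meta` are assumptions for illustration, not the agent's actual code.

```python
import json
import time
from typing import Any, Dict, Optional

# Nagios plugin exit codes and their meanings, as listed in the talk.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def build_metric(name: str,
                 dimensions: Dict[str, str],
                 value: float,
                 value_meta: Optional[Dict[str, Any]] = None,
                 timestamp_ms: Optional[int] = None) -> Dict[str, Any]:
    """Assemble one metric: name, dimensions, millisecond timestamp, float value."""
    metric: Dict[str, Any] = {
        "name": name,
        "dimensions": dict(dimensions),
        # Timestamps are in milliseconds since the epoch.
        "timestamp": int(time.time() * 1000) if timestamp_ms is None else timestamp_ms,
        "value": float(value),
    }
    if value_meta:
        # Free-form extra context; the keys are up to the sender.
        metric["value_meta"] = dict(value_meta)
    return metric


def nagios_result_to_metric(check_name: str,
                            exit_code: int,
                            message: str,
                            dimensions: Dict[str, str],
                            timestamp_ms: Optional[int] = None) -> Dict[str, Any]:
    """Turn a Nagios plugin result into a metric whose value is the exit code."""
    if exit_code not in NAGIOS_STATES:
        raise ValueError(f"unexpected Nagios exit code: {exit_code}")
    meta = {"state": NAGIOS_STATES[exit_code], "msg": message}
    return build_metric(check_name, dimensions, exit_code, meta, timestamp_ms)


if __name__ == "__main__":
    example = nagios_result_to_metric(
        "check_http", 2, "Internal server error",
        {"hostname": "host1", "service": "api"})
    print(json.dumps(example, indent=2))
```

Keeping the numeric status in `value` means it can be thresholded like any other metric, while the human-readable message travels alongside it in `value_meta`.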
With that statsd daemon, if you want to import a statsd library (we also have our own statsd library) and start sending application metrics by instrumenting your application, you can do that. We support VM metrics via libvirt. We do active checks as well: HTTP status checks, system up/down checks. Typically you don't want to do those checks on the system that you're monitoring. You want to do them from another system, and maybe from multiple systems if you're trying to monitor multiple paths to an API endpoint, for example. We can run any Nagios plugin or check_mk, and it's very extensible.

This is the agent architecture. The important thing here is that it has several components as well. We have a collector, which is the component that goes off and gets things like system or service metrics. It usually runs every 30 seconds, and that's configurable. It sends to a forwarder, which buffers for a small period of time to amortize the cost of the HTTP request. The forwarder will authenticate with Keystone, cache that token, buffer metrics, and send to the API. Around every seven seconds is the default that we use. And we're showing the statsd daemon down there as well, with an application sending to it.

We have a UI, a Horizon dashboard. Actually, I don't think I have any screenshots of the dashboard here. I'm sorry about that. It supports the basic create, read, update, and delete operations via the API. You can create alarms within the UI, you can look at the state of alarms, you can look at the history, and you can get the overall view and health of your services. And then there's a very cool time series dashboard called Grafana that's out there. We've supported that for over a year now. It provides visualization of metrics, and there is a port to Grafana 2.0 in progress by Time Warner Cable. That code is posted at that link.
It was just posted yesterday, I believe.

Okay, so production use cases. Next I'm going to talk about Monasca at Time Warner Cable. Time Warner Cable is one of our closest partners. They have around 200 physical nodes in their infrastructure, all being monitored with Monasca today, at around 3,000 metrics per second. The agent is deployed on all their nodes, doing all the physical infrastructure monitoring. I mentioned the Nagios part. They had Nagios deployed already in their environment, and what they're in the process of doing is switching over to using solely Monasca. So they basically enabled all those Nagios plugins within our Monasca agent.

A big part of the goal with Monasca was to consolidate systems. I worked on public cloud for HP, and we deployed three systems for monitoring there: we had Nagios, we had monitoring as a service, and we had an internal metrics processing system. One of the goals of Monasca was to replace those three systems with one. I didn't mention that earlier, but the consolidation aspect was very important to us. We had a fairly large team as a monitoring group, around 15 people, so roughly four on each service. Now with just four people we can handle all of that. We replaced three systems with essentially one group.

Okay, next thing: self-service. TWC is also doing monitoring as a service, so the VMs and other resources are being monitored. I think this is in beta right now internally, and they're hopefully going to move to Grafana 2.0, since they're the ones working on that port. Of those two people in that diagram, one is Brad Klein. He works on Monasca and he's responsible for deploying it there. The guy in the back is David Medberry, who also works at Time Warner Cable.

Okay, so this is just one of their dashboards.
This is a Grafana dashboard, and it's showing their overall service resource utilization: Nova resource count, Neutron resource count, Cinder resource count, and various other aspects. That's a dashboard that they developed. And this is a dashboard that they have for tenant or project monitoring, so this is more of what the tenant users would see.

We've done a lot of analysis on Monasca over the past year. We've been doing analysis from day one, but most recently we've done analysis within our Helion distribution, so this analysis is really focused on what we're doing with Helion. We have a three-node shared control plane that's fully clustered, and we run Monasca and all the OpenStack services there. In this environment we had a hundred compute nodes with around 40 VMs per host, for 4,000 total VMs. That translated into 4,600 metrics per second being sent to the Monasca API. There are thousands of alarms, and we also simulate a load of automatically creating VMs and tearing them down, because VM churn is really common and creating VMs is one of the more expensive operations. It adds an extra load on the system, so we want to test that as well. We also had logging and Ceilometer deployed on that same control plane.

So the key findings in this case. Monasca is stable; that's good news. All components performed within our tolerance levels. The main takeaway is that Monasca only used three CPU cores. The servers we're running on had 48 cores total. Hyperthreading was enabled, so we're counting cores times two there. So we used three out of 48, which is roughly 6% of the total utilization on the control plane, and around six gigabytes of memory. We can scale much higher than those numbers, I can assure you, but this is what I wanted to report on here.

Okay, so what's next for Monasca? Yesterday there was a talk by Fabio Giannetti from Cisco.
He talked about Ceilosca, which is a combination of Ceilometer and Monasca. One thing that Ceilometer does very well is data collection for OpenStack resources, and it will send samples or events into a system. Going from left to right here, we're showing the Ceilometer system sending samples or events into the Ceilometer agent, and there's a box there, the Monasca publisher. Within Ceilometer there's the multi-publisher interface, and the Monasca publisher is an implementation of that. It publishes to our Monasca API, where we store the data as metrics. What we've also done is develop a storage driver for the Ceilometer v2 API. That's on the right, the Monasca driver. When you query Ceilometer, it'll go through the Monasca driver, so essentially we're using Monasca as a database. The performance numbers there are really good, and I encourage you to take a look at them: we're able to send data around three times faster than the current MongoDB-based system, and for queries what we've been measuring is up to 18 times faster. There's a lot of information and all the details in the presentation that was done yesterday, so I just wanted to expose you to that.

Okay, events as a service is something that's also in progress. Let me mention one thing first: the code for Ceilosca is available in a repo, so if people want to start using it, it's available.

Okay, so events as a service is in progress. One of our use cases is that we want to get OpenStack notifications. My canonical example is calculating the elapsed time between when a VM is first created and when it is active. Let's just say the average is two minutes. That's a good metric to have in your system. It tells you when things are starting to go wrong: if that value starts going up to three or four minutes, your system may be misconfigured, networking could be bad, et cetera.
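The elapsed-time example above can be sketched in a few lines of Python. This is only an illustration of the idea, not the events engine the talk describes as in progress. The event dictionaries (with `instance_id`, `event_type`, and a `timestamp` in seconds), the helper name `boot_time_metrics`, the metric name `vm.create_time_sec`, and the choice to pair the Nova `compute.instance.create.start` and `compute.instance.create.end` notifications are all assumptions.

```python
from typing import Any, Dict, Iterable, List

# Assumed pairing of notifications that bracket VM creation.
CREATE_START = "compute.instance.create.start"
CREATE_END = "compute.instance.create.end"


def boot_time_metrics(events: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Pair start/end events per instance and emit one elapsed-time metric each."""
    started: Dict[str, float] = {}   # instance_id -> start timestamp (seconds)
    metrics: List[Dict[str, Any]] = []
    for event in events:
        instance_id = event["instance_id"]
        if event["event_type"] == CREATE_START:
            started[instance_id] = event["timestamp"]
        elif event["event_type"] == CREATE_END and instance_id in started:
            elapsed = event["timestamp"] - started.pop(instance_id)
            metrics.append({
                "name": "vm.create_time_sec",            # hypothetical metric name
                "dimensions": {"instance_id": instance_id},
                "timestamp": int(event["timestamp"] * 1000),  # milliseconds
                "value": float(elapsed),
            })
        # End events with no matching start are ignored in this sketch.
    return metrics
```

Once the elapsed time exists as an ordinary metric, an ordinary threshold alarm (for example, on the average rising from two minutes toward three or four) covers the scenario described in the talk.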
So, events as a service. We want to get those events and do complex event processing on them. We'll filter events, we'll transform them, we'll group them, and then we'll do processing when various criteria occur. There are two components within Monasca in addition to the API there: the events transform engine and the events engine itself, which does the complex stream processing.

Logging as a service was just discussed by Fujitsu right before this presentation. One of the big use cases we're trying to target is basically to take logs and have them ultimately result in metrics. For example, I can count the number of errors in my log file over some time period and have that result in a metric. There are three components there: a log API, a log transformer, and a log persister. You should see a similar pattern as I go through this: metrics, events, logging. Very similar components, similar things happening.

Okay, so this is the data flow. We've got logs way on the left, going into our message queue, which is essentially a message bus. We do some log processing on them and transform them, and then they can end up as events in our events engine, which will create metrics. From there they can go to our threshold engine, which we can alarm on, and alarms can then result in notifications. So this is the big picture of where we're headed. It isn't all available today. For those people who are developers and want to get involved, this is trying to show the vision for the project. Right now the metrics stuff is in great shape, and events and logging are in progress.

All right, putting it all together. This is just another view of how this all looks, to show you more of an architectural view versus a data flow view. Going left to right again.
We have our log system, events in the middle, and metrics, just emphasizing similar technology and similar architecture and design patterns. But more important than architecture and design patterns is the overall vision of being able to take logs and have them result in metrics.

Okay, project status. I mentioned Time Warner Cable. They were one of the early adopters of Monasca, and we've worked very closely with them over the past six months, so it's in production there. HP has integrated Monasca in a product called CloudSystem. That was released the end of October, early September: CloudSystem 9.0 was released, and we're the monitoring solution for CloudSystem. We are also the monitoring solution for Helion now, so in the big announcement yesterday from HP about Helion, we were in that. This is an open source community; I'm talking about these things to give examples.

There is a community around Monasca consisting of those six companies. Fujitsu has obviously done great work with the logging side of things, and they're also looking at deploying Monasca within the products that they are building out. Cisco is looking at that as well. So lots of activities are going on, and obviously we'd like to grow that community.

Also, status for the project itself. We're targeting to be in the Big Tent in November, so hopefully in a couple more weeks we will be up for review again and the technical committee will approve us. That would be awesome if that happened. One of our final criteria prior to that review is that Monasca be in the OpenStack CI system, and that's in progress. We have a DevStack integration for Monasca: there's a Monasca plugin for DevStack, so you can run DevStack and basically get that up and running. And we have Tempest tests that have been developed.
So with those two pieces we've started the integration into the CI system, and that's happening right now.

So thank you, everyone. I think we have a few minutes, so I guess we can take some questions. Right, we have, I think, four minutes. Questions, anyone? There's a mic right there. One question, maybe two.

I have a big question. I'm wondering about the relationship between Monasca and Ceilometer. I listened to the presentation yesterday about Ceilometer and Monasca. You replaced the Ceilometer storage driver with Monasca, and the Ceilometer API can call the Monasca API to get the data. Ceilometer also has the metric alarm function, and Monasca has this too. So I'm wondering if Monasca and Ceilometer can do some merge or something.

Yeah, so thank you. They are two separate projects, and there are areas of overlap. Alarming is one example. Right now their alarm engine isn't capable of the performance and scale that we're driving. I won't go a lot into the differences there, but our API is different and the way we do things is a little bit different, and there are good justifications for that. So we see Ceilometer as more of a telemetry system that acquires data and then feeds that data into our system, and then we have the very scalable and performant monitoring system.

Thank you. Have you had any ideas about integration with oslo.messaging, so that it could write data directly to Monasca?

We talked about it, but we haven't really thought about it too much.

How about Apache Storm, or rather Apache Spark, integration?
Yeah, that's a topic that's coming up for us. Our threshold engine, that's typically where we think about this. We use Storm today, and that is our only component that is still in Java. We'd like to port that to Python as well, and Spark is an interesting one because Spark supports Python very well.

Yeah, PySpark.

Yeah, and they support a streaming engine, right, Spark Streaming. So we have to look at it further, understand it, and see if that makes sense. In the simpler deployments, where maybe you only have 10 or 20 systems, you don't need Spark. So we'd like to be able to go from a system that doesn't involve some of these more complex frameworks and use the same code in those frameworks when you finally do scale. We think that's possible, but we still need to write the code to see how it'll translate.

Okay. So you say you have millisecond resolution on your timestamps.

Yeah.

Are you targeting real-time or near-real-time data analysis, something that could be enabled with Spark? And if so, do you see that having to go through a RESTful API before you get to the message queue is causing a delay in being able to process that data in near real time? If the data went directly into Kafka, you could pull that data off very, very quickly instead of having to go through an extra hop. Have you noticed that it's adding a time lag there?

Yeah, so we do a little bit of buffering to amortize that HTTP overhead, so we do add a delay. We haven't really targeted sub-second analysis with no delays. You could theoretically go directly to the Kafka queue if you had a need to do that. If you're deploying Monasca and you had something internal that you want to monitor, you could do that. Right, the message formats themselves.
Our schemas are published in our wiki. So if you want to understand what we publish to Kafka in terms of metrics and other events, like alarm state transitions, you could look at that and then write your own consumer. And if you want to send that data into Spark, you could do that pretty easily.

Okay. Thank you.

Okay, I think our time is basically up. If people have more questions, get with me or Dan. We have a few of the Fujitsu guys back there as well, and Martin Raderis. We're going to do a session on Monasca at 4:40 in the Sakura Tower in room S3. That's just a get-together for the Monasca folks, but everyone can attend. I'm hoping that all of you don't attend, because I don't think we have a very big room. It's mainly more focused on the developers, right? So people who want to get involved in the project, or want to meet us and talk to us and ask more questions, are welcome to do that. I'll be around tomorrow as well, as are the other folks, if anyone wants to get together with us. We should be able to do that. Thank you very much.
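
Editor's sketch, following up on the answer about writing your own consumer for the metrics published to Kafka. This is a minimal illustration of the consumer-side logic only, under stated assumptions: the envelope layout (a "metric" object with "name", "dimensions", "timestamp", and "value") is a hypothetical stand-in, not the schema from the project wiki, and the function names are invented for this example. The Kafka client itself is left out; the functions take any iterable of raw message payloads.

```python
import json


def parse_metric_message(raw: bytes) -> dict:
    """Decode one raw message payload into a flat record.

    ASSUMPTION: the field names below are illustrative only. Check the
    schemas published on the project wiki before relying on them.
    """
    envelope = json.loads(raw.decode("utf-8"))
    metric = envelope["metric"]
    return {
        "name": metric["name"],
        "dimensions": metric.get("dimensions", {}),
        "timestamp_ms": metric["timestamp"],
        "value": float(metric["value"]),
    }


def latest_by_series(messages) -> dict:
    """Keep the most recent record per (name, dimensions) series.

    `messages` is any iterable of raw payloads, for example the values
    yielded by a Kafka consumer in a real deployment.
    """
    latest = {}
    for raw in messages:
        record = parse_metric_message(raw)
        # Sort dimensions so the key does not depend on dict ordering.
        key = (record["name"], tuple(sorted(record["dimensions"].items())))
        current = latest.get(key)
        if current is None or record["timestamp_ms"] >= current["timestamp_ms"]:
            latest[key] = record
    return latest


def make_sample(hostname: str, timestamp_ms: int, value: float) -> bytes:
    """Build a hypothetical payload for trying the functions locally."""
    return json.dumps({
        "metric": {
            "name": "cpu.idle_perc",
            "dimensions": {"hostname": hostname},
            "timestamp": timestamp_ms,
            "value": value,
        }
    }).encode("utf-8")


if __name__ == "__main__":
    payloads = [
        make_sample("node-1", 1000, 90.0),
        make_sample("node-1", 2000, 85.5),
        make_sample("node-2", 1500, 70.0),
    ]
    for key, record in latest_by_series(payloads).items():
        print(key, record["value"])
```

In a real deployment the iterable would come from whichever Kafka client library you use, and forwarding the records into Spark would replace the in-memory dictionary here.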