All right, hello everyone, and welcome. Our talk is called "The Anatomy of an Action," and we're going to talk about how you can use events in OpenStack to debug and monitor your system. My name is Ladik; I'm a contributor to Nova. And my name is Gordon Chung; I'm a contributor to Ceilometer. We're both engineers working on OpenStack.

I guess I'll start off. OpenStack is a pretty amazing thing: you can do pretty much anything you want with it. You can create virtual machines to do parallel computing and tackle huge problems. You can provision test and development machines at the snap of a finger, without having to wait for your IT department to build you a physical machine. You can store your data in one place in the world and access it from a completely different place in the world. And you can do all this with just a few clicks of a button, which is pretty cool.

But OpenStack, like all good ideas, is wonderful when it works; in execution, sometimes not so much. When you use OpenStack, you might see various things such as this. When you delete a snapshot you just created, it might tell you that you can't do it, and it won't tell you why; you just can't do it. Or it might tell you that you're not allowed to delete it, even though you have all the permissions to do so, because sometimes OpenStack knows better than you do. Also, sometimes when you create an instance, it might fail and tell you to try again. That's fine, because you didn't want to do it now anyway; you wanted to do it later. So you just try again later, and it seems pretty legitimate that you try again later. It gives you that false glimmer of hope that it might actually work the second time around, but it won't, because it's actually a very real error. So it leaves you wondering what's happening. Most of the errors it gives you don't really give you any good information about what went wrong, and when it does give you good information about what went wrong,
it's sometimes wrong itself: the error it tells you is irrelevant to what actually happened. So if you're lucky, you might actually find the error on some random host, in some random log, for some random service. For us that's okay, though, because we all have time to debug the entire system that way.

So, to summarize: debugging OpenStack is very hard. Every action you do in OpenStack actually has multiple steps, and any one of those steps can fail. On top of that, a lot of the steps are asynchronous calls, which can lead to various timing issues that can be very hard to track down. Also, because it's a distributed system, the failure might happen on some random node, in some random place in the world, and finding it is very difficult. And while there are solutions out there to gather logs from different systems, in the end debugging log files is difficult: there's very little schema or standardization across logs, and finding errors that way is not the easiest thing in the world. So one thing we will emphasize in this talk is how understanding the context and flow of the events in your system can help you debug your OpenStack environment.

All right, let's go through one of the most common use cases: creating an instance. This is basically a simplification of what happens when you create an instance. It all starts with the user sending a request to the API, which then goes to the conductor, then to the scheduler to find the most suitable host to start the instance on, then to the compute manager, where it tries to build the resources. It will then call Neutron to create the networks, it may call Cinder to create storage, and eventually we should have a running VM. That's all great, but this may fail at pretty much every step of the way: it may fail here, or here, or here. Fortunately, for most of the steps Nova will send a notification that eventually gets emitted to the queue, and it also will
send a notification when something fails. Pretty much all of the services emit some set of notifications, and notifications usually revolve around CRUD-type operations. So now that we have completed our task of creating an instance, we have notifications from all the components stored in the queue. The next step is to consume those messages.

Ceilometer is the telemetry project in OpenStack. There are various components in Ceilometer: there's a polling aspect, there's a notification-handling aspect, and there's storage and alarming. For this part specifically, we'll talk about the notification handling. Basically, what the notification agent does is consume messages off the queue using oslo.messaging, so it supports RabbitMQ, Qpid, and ZeroMQ. Whenever a message gets emitted from one of the services, such as Nova or Cinder, Ceilometer picks it up and processes it. Ceilometer's original purpose was to take the notification and build meters, that is, measurable data, from specific attributes contained within the payload, but there's more it can do. What Ceilometer actually does is build an event from the notifications that are emitted by the different services. So, looping back to what we had for the original create-instance use case, the scenario looks like this.
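To make the consuming step concrete, here's a minimal sketch of a notification-handling endpoint in the style that oslo.messaging expects (an object with `info` and `error` methods). The class name and payload contents are hypothetical; the real Ceilometer agent wires endpoints like this into an oslo.messaging notification listener attached to the broker.

```python
# Hypothetical sketch of a notification-handling endpoint, modeled on the
# callback shape oslo.messaging uses. In a real deployment this would be
# registered with a NotificationListener connected to RabbitMQ/Qpid/ZeroMQ;
# here we just call the methods directly to show the data flow.

class InstanceNotificationEndpoint:
    """Collects compute notifications pulled off the message queue."""

    def __init__(self):
        self.received = []

    def info(self, ctxt, publisher_id, event_type, payload, metadata):
        # Called for info-priority notifications (e.g. *.create.start/end).
        self.received.append((event_type, payload))

    def error(self, ctxt, publisher_id, event_type, payload, metadata):
        # Error-priority notifications carry the failure details.
        self.received.append((event_type, payload))


# Simulate two notifications from the create-instance flow.
endpoint = InstanceNotificationEndpoint()
endpoint.info({}, "compute.host1", "compute.instance.create.start",
              {"instance_id": "abc-123", "state": "building"}, {})
endpoint.error({}, "compute.host1", "compute.instance.create.error",
               {"instance_id": "abc-123", "message": "No valid host"}, {})

event_types = [e for e, _ in endpoint.received]
```

The point is simply that every step of the action, success or failure, arrives as one of these callbacks, ready to be turned into events.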
So you have incoming notifications from the various steps in the create-instance action, and what Ceilometer does is listen to all those notifications and build a single event, and one or more meters, from them.

So what is an event in Ceilometer? It was originally implemented in Icehouse as part of the StackTach integration efforts, and in Kilo we finished off most of the functionality. Basically, an event represents the state of an object in an OpenStack service at any given point in time, and it's built from the information we get from the info- and error-level notifications that Nova and the rest of the services emit. Lastly, what a Ceilometer event does is normalize the messages we receive. A lot of the payloads we get from the services either have no schema or have differing schemas, and what Ceilometer can do is take the attributes from the various notifications and remap them to common names. So if you have a resource ID in a different location across different notification messages, it can remap them into a single place, or it can enforce a data type on attributes that have inconsistent data types.

So the Ceilometer event model looks like this. Each event contains three key attributes: there's a message ID, which uniquely identifies a notification message; there's an event type, to classify what kind of event it was; and there's a timestamp of when the event happened. There are additional attributes that are optional. There are traits, which are basically queryable, indexed attributes; this relates to the remapping that was mentioned earlier. And there's also a raw attribute, which is pretty much the full payload dump of the entire notification message. It's mostly used for auditing use cases, where you would need to keep data around for a year or so. The raw data is actually unindexed, and from a
Ceilometer API point of view, it's not queryable.

So what Ceilometer does with events is pass them all through a pipeline, and through this pipeline you can configure different actions to be applied to each event. If you want a certain set of actions applied to event types A, B, and C, you can push them into one pipeline, and if you want another set of actions applied to event types X, Y, and Z, you can push those into another pipeline. We also support the ability to publish to multiple targets: you can write them to a database, a file, a queue, or an HTTP target. Some of the databases we support are MongoDB, SQL databases, and Elasticsearch, and we'll go into more detail on Elasticsearch later.

All right. When we're debugging a system, it's usually better to understand the context and the flow in which the error happened. We can, of course, go through the multiple logs that OpenStack produces; however, these are going to be spread across multiple servers. We can, of course, centralize the logs; however, even then there will be a lot of noise in them, and it will be very hard to find the exact error and the flow it was in. With events, the data we get is normalized, and we can clearly see the error and its flow. One event, of course, doesn't provide much; collectively, however, they are meaningful. With that said, events are not coming to replace the logs in any way, but they can provide a quick indication of where the problem is.

So how do we tie the events emitted by all the services into Ceilometer? It of course depends on the use case.
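Here's a toy sketch of the pipeline idea just described: each pipeline matches a set of event types (with wildcards, as in Ceilometer's pipeline definitions) and hands matching events to its publishers. The publisher functions and the first-match-wins routing are simplifications for illustration, not Ceilometer's actual implementation.

```python
# Toy event-pipeline router: wildcard event-type matching plus multiple
# publisher targets. The publishers below are stand-ins for the real
# database/file/queue/HTTP targets.

import fnmatch

published = {"db": [], "http": []}

def db_publisher(event):
    published["db"].append(event["event_type"])

def http_publisher(event):
    published["http"].append(event["event_type"])

# Two pipelines: compute events go to both targets, everything else to db.
pipelines = [
    {"events": ["compute.instance.*"], "publishers": [db_publisher, http_publisher]},
    {"events": ["*"], "publishers": [db_publisher]},
]

def process(event):
    for pipe in pipelines:
        if any(fnmatch.fnmatch(event["event_type"], pat) for pat in pipe["events"]):
            for publish in pipe["publishers"]:
                publish(event)
            break  # first matching pipeline wins in this sketch

process({"event_type": "compute.instance.create.end"})
process({"event_type": "volume.create.end"})
```

Real deployments express the same routing declaratively in a pipeline definition file rather than in code, but the matching behavior is the same idea.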
We can have different views with events, and we wanted to show events in a way that would help us debug a system. So now I'll show you some screenshots of what we did with the Elasticsearch backend in Ceilometer and the events that OpenStack emits.

Just as a little background, if you don't know what Elasticsearch is: it's a document-oriented, schema-free database. It's built on top of Apache Lucene, which is designed for full-text searching, and that's pretty useful when it comes to events and handling error messages. It's a distributed, highly available, real-time database, so there's a bunch of keywords there. It also comes with Kibana, which is a GUI interface that allows you to query the database.

So this is a screenshot of Kibana, the GUI interface. I previously mentioned that you couldn't query raw data in Ceilometer, but one of the unique things about using Elasticsearch is that it actually indexes all the fields for you. So you can query anything you want; you can explore the data as you please using Kibana's query language. At the top right, you'll notice there's a "last 60 days" filter; you can filter your data based on the time range you want to research. So if you know roughly when the error happened, you can drill into a certain time frame, and you don't have to query the entire database; you can filter on absolute and relative time ranges. It's a very neat tool for inspecting the various information that the notifications actually give you.

So, for this view, once we had collected the data in Ceilometer and sent it to Elasticsearch, we ran a query using the Kibana interface.
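For reference, the kind of Elasticsearch query behind such a view might look like the sketch below. The field names (`traits.resource_id`, `generated`) are assumptions about how the events were indexed; adapt them to your own mapping. With the official Python client, you would pass this dict as the body of a search call.

```python
# Sketch of an Elasticsearch query: all events tied to one resource,
# restricted to a relative time range like Kibana's time picker, sorted
# so the flow of the action reads in order. Field names are assumptions.

def failed_instance_query(resource_id, since="now-60d"):
    return {
        "query": {
            "bool": {
                "filter": [
                    # All events tied to the failed instance.
                    {"term": {"traits.resource_id": resource_id}},
                    # Only the chosen time range ("last 60 days" by default).
                    {"range": {"generated": {"gte": since}}},
                ]
            }
        },
        "sort": [{"generated": {"order": "asc"}}],  # show the flow in order
    }

q = failed_instance_query("abc-123")
```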
We queried the resource ID of the instance that failed, and it gives you all the events and notifications that are related to it. By default it gives you a dump of all the related events and all the related attributes of those events, but what we did is dig into those events and pick out certain attributes that would be useful in a view for debugging your system. What we found was that the timestamp, the priority of the notification, the event type itself, and any error messages in the payload were the important pieces for creating a view that helps users debug their environment.

So this is the view that we created in Kibana. We can actually see here that events are coming from different services: from Nova, of course, and here from Neutron, where the port is mentioned. We can also see the errors that are being presented to the user, such as the "no valid host available" error, and, of course, the error that is actually happening on the host, which is completely unrelated, because it failed because the image itself was broken. So you can see not just the error, but the entire process, and where exactly in the entire process it failed.

Now, on to Horizon. A lot of users are using Horizon, and we should have pretty much the same functionality in Horizon as well, not just in Kibana. Thanks to our colleague George Peristerakis, who built this view for us.
It took him a few hours to prototype it, and we have it in Horizon now. In Ceilometer we index everything on the key traits; Horizon, however, as Gordon said, needs to query the Ceilometer API directly and get all the events related to a specific resource, and then it shows them when we drill down on an instance here. So, similar to Kibana, we can see all the events ordered by time, we can see how the flow progresses, and we can see the errors as well. We can see at which step it failed, and we can also see where the events are coming from: from which service and from which component.

Unfortunately, Nova is the only component that publishes notifications for errors. The errors that we've seen coming from volumes, for example, didn't have any events tied to them. So, as a first step, it would be really great if all the core services sent notifications for errors; this would improve the event flow and therefore provide better context. Another point is that we should have a schema. Currently, the data in notifications is chaotic, and the different components use different keys to describe pretty much the same thing. In Ceilometer we can remap to consistent schemas, but ideally we shouldn't have to. Since we now have a viable view in Horizon, we can use the event definition file we have in Ceilometer to remap things; we can map different traits to feed the Horizon view, and we're planning to check in this file as well. In the future, we should be able to expand this functionality in Horizon to other resources, not just instances.

From a Ceilometer point of view, some of the things we've been talking about over the past few design sessions are adding alarming on events. So when, say, the status of an instance changes, you might want to trigger an action immediately based on what you
receive in Ceilometer, and that's something we've been discussing in the Ceilometer design sessions this week. There have also been talks about adding the ability to build metrics from events. Say you're provisioning an instance and it takes longer than 90 seconds: you might want to know how long some of your provisioning actions take, and because the events in Ceilometer are pushed through pipelines, that's actually something we can viably do.

So that's our talk. I hope that gives you a view of what you can do with some of the events that OpenStack currently emits, and what we can do going forward. Hope you enjoyed it; questions? We've been told that you should ask questions over there.

Q: What's the performance like of using Elasticsearch instead of MongoDB?

A: I haven't done a benchmark on that, so I don't have good performance numbers, but querying on MongoDB, as soon as you get to any kind of size, is just awful. I believe there are multiple companies that have Elasticsearch running at a prodigious scale; but then, that said, there are a lot of companies running Mongo at scale too. One of the good things about Elasticsearch is that it kind of forces you to avoid open-ended queries: you have to drill down into specific time ranges, so your queries do tend to be more performant because of these forced restrictions.

Q: What are the big differences between Juno and Kilo in the implementation of Ceilometer? Can we use most of what you described in Juno?

A: Elasticsearch support was added in Kilo, so you won't be able to use Elasticsearch, but the basic functionality of events is already available in Juno. There's a little more flexibility coded in for Kilo, but the basic premise of it is there.

Two more short questions.
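As a sketch of that metrics-from-events idea mentioned a moment ago, here's how one might pair `*.create.start` and `*.create.end` events to measure provisioning time. The event dicts follow the event model described earlier; the timestamps and IDs are made up.

```python
# Derive a metric from events: pair each instance's create.start and
# create.end events and measure how long provisioning took. Event shapes
# mirror the Ceilometer event model; the sample data is invented.

from datetime import datetime

def provisioning_seconds(events):
    """Map resource_id -> seconds between create.start and create.end."""
    starts, durations = {}, {}
    for ev in sorted(events, key=lambda e: e["generated"]):
        rid = ev["traits"]["resource_id"]
        ts = datetime.fromisoformat(ev["generated"])
        if ev["event_type"] == "compute.instance.create.start":
            starts[rid] = ts
        elif ev["event_type"] == "compute.instance.create.end" and rid in starts:
            durations[rid] = (ts - starts.pop(rid)).total_seconds()
    return durations

events = [
    {"event_type": "compute.instance.create.start",
     "generated": "2015-05-20T10:00:00", "traits": {"resource_id": "abc-123"}},
    {"event_type": "compute.instance.create.end",
     "generated": "2015-05-20T10:02:15", "traits": {"resource_id": "abc-123"}},
]
durations = provisioning_seconds(events)
```

A function like this could sit at the end of an event pipeline and emit a duration sample, or raise an alert when a duration crosses a threshold such as the 90 seconds mentioned above.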
Q: The request ID: is that inserted automatically by Ceilometer, or who takes care of inserting it so that you can correlate events related to the same action?

A: The data that you saw, none of it was generated by Ceilometer; it's just everything we received from the various notifications that the services emit. So the request ID might be there, as we showed, but again, there are inconsistent schemas across everything, so it might not be. All right.

Q: Okay, and final question. Have you guys played with feeding this into an event stream processing engine, like Riemann or Apache Storm or anything like that?

A: We haven't tried that, but with the publishers in Ceilometer you can publish to different sorts of targets, not just the database, and we do have a Kafka publisher, so you could hook up Storm directly to that publisher. I believe some people are doing that, not for events, but for metrics.

Q: Thanks. Thanks for the presentation. Have we tried to combine the events with a log aggregation tool, and can we correlate the aggregated logs with some kind of event ID? Have we tried, or can we try?

A: Have we made experiments in expanding the data mining to include logs? Because a lot of the projects that do not generate notifications may generate logs. I think it could be a good supplementary thing; we haven't tried that. One of the good things about Elasticsearch is that it's just a document store, so you can pretty much dump anything you want in there.

Cool, I think we're good. Thanks for coming out, guys.