All right, let's get started. Can everyone hear me fine? I guess so, all right. For some of you, this might be the first session, so to the late risers, welcome to OpenStack Summit. It's beautiful out here in Vancouver, and we'll start off with this operations-focused talk on analyzing the telemetry in your OpenStack environments. I'll describe what the problem is, then we'll get into how we set up the experimental evaluation of our OpenStack environment and what kind of insights we found. This is the team behind it; even though myself and a couple of my colleagues are presenting, there's a big team that has worked on all of this, and there are a number of open source projects this team is involved in that I'll show you as we talk.

All right, so with that, I'm going to kick off this presentation. The problem that we're dealing with is the digital exhaust that comes out of your OpenStack environment, and it comes from a lot of different places. It comes from your infrastructure. It comes from the VMs and containers running on it. It could be an internal environment where there's more openness in your network, or external environments where your networks are restricted from each other. It could come from connected devices that might be interacting in your environment. It could come from the application stacks themselves. So there are a lot of diverse and growing sources, and as an operator you might be familiar with this problem: they generate a lot of data about themselves. That data is information such as alerts, events, logs, and metrics, and all of it is being spewed out continuously. It's a high-volume, high-velocity kind of problem, so it is a big data problem. And as an operator, the issue is: how do you make sense of all of this? How do you figure out when things are going bad? How do you fix things when they become bad? That's the problem we're trying to tackle.

Now, I just want to orient ourselves and say that while we'll present some specific insights we found in this OpenStack environment we're analyzing, what we'll really be talking about is how to fish, not the fish itself. We did find specific insights, specific configuration issues, and specific optimizations you can do by analyzing logs and metrics, and it's all grounded in data science, but the point is how to do this rather than exactly what we found, because our environment was very specific.

So we set up a hackathon for an internal product with about 40 individuals, in roughly six to eight teams. It was a very simple setup. Part of the reason we made it simple was so that we could understand what's going on, relate what we did with what we found, and create insights out of that. It's a simple two-compute-host setup with a controller, and then a bunch of VMs related to each other through subnets. The green circles that you see at the edges are the VMs; five of them are on their own subnet, they are all connected to the external network, and each cluster represents a separate team in the hackathon, so there was a product being run on each cluster, as I mentioned. So this is our environment; your environment is undoubtedly more complex.
This environment generated a lot of data — over four million log lines, I think, and a proportional amount of metrics. So there's a lot of data to analyze, though it might differ in your environment. One thing we grappled with as we looked at all of this data was: okay, we've got to analyze the CPU data, disk data, network data and so forth, and there are metrics management systems you can download as open source, such as Graphite, or visualization tools like Grafana. We can do that, and we can visualize logs separately; there are many tools, both commercial and open source, for that. But you need to bring it all together and analyze it in conjunction, because if you want to solve a problem you have to solve it holistically, with all the data you have about that problem. First of all, given a problem, how do you know what data relates to it? And once you have that data, how do you bring it together and actually get insights out of all of these different sources? That's the problem we were trying to grapple with. We can look at metrics in silos and logs in silos — how do we bring it all together? So we came up with our own system to do this kind of combined analysis of metrics and logs. We call it Zeus, and it has data science capabilities, visualization tools, and statistical analysis tools. We are bringing all of this together to solve the problem I just described.

I won't be talking for much longer. My colleagues, who did a lot of the analysis, are going to speak, but they'll be using a couple of terms very often in their discussion — clustering and correlation — so I just want to give you a description of what those terms mean. Imagine a sea of logs coming from your entire cluster, all brought together in one place. If you try to look at this stream of logs, it's huge, so how can we give you coarser information that refers to the same thing? Essentially, how do we cluster similar logs together — how do we group logs that mean the same thing? We used text-based analysis, a machine learning kind of tool for text-based clustering, and we created groups of logs that mean essentially the same thing. These groups are what we call clusters.

Now we can analyze these clusters, and the next thing we did was correlation on them. How did we do correlation? Well, take a cluster — the log entries inside it have slightly different text but pretty much mean the same thing — and ask: when do they occur in time? If I create a time histogram out of those logs — basically, how frequently those log entries occur per minute — I get a metric stream out of it. So we created metric streams out of different streams of logs, and then we used those streams to analyze correlation, meaning: how do two different streams vary in time with respect to each other? Are they related, in that they vary together? We can do this with logs, and we can do this with metrics as well.
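(The talk doesn't show the implementation behind Zeus, but a minimal sketch of this kind of text-based log clustering could look like the following; the choice of TF-IDF plus DBSCAN and the sample log lines are purely illustrative, not necessarily what the team used.)

```python
# Minimal sketch of text-based log clustering (illustrative only; the talk
# does not specify the exact algorithm or libraries used by Zeus).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

log_lines = [
    "nova.compute.resource_tracker Free ram (MB): 14336",
    "nova.compute.resource_tracker Free ram (MB): 12288",
    "neutron.agent.dhcp Unable to reload_allocations dhcp for subnet",
    "neutron.agent.dhcp Unable to reload_allocations dhcp for network",
    "keystone.token Failed to validate token",
]

# Turn each log line into a TF-IDF vector so that "similar text" becomes
# "nearby vectors", then group nearby vectors into clusters.
vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_\.]+")
X = vectorizer.fit_transform(log_lines)

# DBSCAN with cosine distance: lines whose wording is close end up in the
# same cluster; eps controls how similar two lines must be.
labels = DBSCAN(eps=0.5, min_samples=1, metric="cosine").fit_predict(X)

for label, line in zip(labels, log_lines):
    print(label, line)
```

Lines whose wording is nearly identical end up with the same label; each label is one of the "clusters" the speakers refer to, and its per-minute occurrence count is what gets turned into a metric stream.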
So with that, I'll hand off to my colleagues. Again, we found a lot of interesting insights from all of this analysis, but the key thing is the mechanism with which you do the analysis, not the specific results themselves — although you might find the specific results interesting too. And all of this that we have done, all these goodies that you see here, we are starting to open source as well. The couple of tools that you saw previously — this tool and this tool — we're starting to open source; this one is called AVOS. For the other tool, we're starting to bring its machine learning aspects into an open source project that we recently launched called Cognitive. So we're creating this in the community, and you'll soon see these kinds of capabilities that you can start using on your own systems. All right, so I'll hand it off to Sarvesh.

Hello everyone, I'm Sarvesh, and I'm going to talk about how we can make sense of the logs. If you are a cluster admin, you see tons of logs, and you're thinking: I need something I can make sense of, something that gives me real insight out of these logs. So we did some data science to create groups of logs and, from each group, find the logs which represent the whole group.

Initially we did keyword-based matching: all the logs which have the same keywords were grouped together, and then we found something very weird. This was only the first step, and already we found something weird. We saw that the two biggest groups — the ones in the range of 180K occurrences — were actually warning logs being generated by OVS. When we dug deeper into it, we found that when we were installing the cluster, we were supposed to put the first interface as the management interface and the second interface as the data interface, but we did it the other way around. It was only then that we realized this, so the analysis helped us right there to see that something was wrong with the cluster, and we rectified it. The other big clusters are actually HTTP calls, which we expect, because all the services that talk to each other make HTTP GET and POST calls — they send requests and get responses.

Then we realized that, beyond keyword-based matching, there might be tons of logs occurring because of different services acting at the same time — they might be caused by the same user activity and therefore be correlated with each other. So we correlated the logs: we correlated those logs which do not share the same keywords but which might be the result of the same service or the same user activity. In this correlation matrix, we see a number of cells that stand out, and those are the logs which are highly correlated with each other. So let's find out what they are. When we analyzed these logs, we found that Nova has a resource tracker which checks the resources on all the compute nodes at a fixed period — it looks for free RAM, free CPU, free disk. These logs are supposed to occur at the same period, at the same frequency, even though they are on different machines, so they should be correlated with each other — and by this analysis we found that, yes, they are correlated with a very high degree.
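(As a rough sketch of the log-to-log correlation step just described — bucketing each cluster's occurrences into per-minute counts and correlating the resulting series — something like the following would do; the timestamps and cluster names are invented for illustration.)

```python
# Sketch: turn clustered log occurrences into per-minute count series and
# correlate them. Timestamps and cluster IDs below are invented.
import pandas as pd

events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-05-18 10:00:05", "2015-05-18 10:00:40", "2015-05-18 10:01:10",
        "2015-05-18 10:00:12", "2015-05-18 10:01:20", "2015-05-18 10:02:02",
    ]),
    "cluster": ["resource_tracker", "resource_tracker", "resource_tracker",
                "http_get", "http_get", "http_get"],
})

# One column per log cluster, one row per minute, value = how many log
# lines from that cluster fell into that minute.
counts = (events
          .groupby([pd.Grouper(key="timestamp", freq="1min"), "cluster"])
          .size()
          .unstack(fill_value=0))

# Pearson correlation between every pair of cluster count series.
corr = counts.corr()
print(corr.round(2))

# Pairs with a high correlation (say > 0.8) are candidates for "these logs
# are driven by the same activity", like the Nova resource tracker example.
```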
Next, let's talk about VM creation. When we create a VM, there's a security group involved, which is essentially the set of rules defining which IPs are allowed to access the VMs. So when a VM is created, it first checks with the security group whether this IP is allowed or not. The first call made by Neutron is this check, and that call should correlate with the HTTP requests that are made. Here we find that, even though these calls might be happening on different hosts — the controller might hold the security group while the calls to check it are made from the compute host — they still correlate. That justifies what I said before: the calls might happen on different machines, they might be different calls by different services, but they can be caused by the same thing. This is also correlated with the log which is just a success message saying that the device has been configured.

Then, as we were going through this, we found something very weird again. We saw that Glance, which is supposed to create logs only when we are creating images or snapshots or creating VMs, was actually creating logs even after we were done with VM creation. If a cluster admin looked at that, he would think: oh, something is wrong with the cluster, something is making these HTTP calls, something is seriously wrong. But when we dug deeper, because of this analysis, we found that there is a background Glance daemon running which is making these HTTP GET calls through the API gateways, and the logs were simply the result of that. So using this analysis we can avoid the bad conclusions we might otherwise draw about the OpenStack cluster.

Okay, a lot of people do log analysis, and some people do metric analysis. When they are doing metric analysis, if they see some activity — some spike or some change — they might consider it an anomaly. They might think something is going wrong with the cluster, but you can never be sure unless you are also looking at the logs, at the events that are going on. For that you need to correlate the logs with the metrics, and the way we correlated them is by converting the logs into metrics — by creating a time series, taking their frequency over a period. When we correlated them, we found these interesting correlation matrices, and in the log-metric correlation we found that there are some logs which are highly correlated to some metrics, and the same the other way around — excuse me, yes? Okay, so correlation is a degree of matched movement: how two metrics change together over time. It might be positive — when metric A is increasing, metric B is also increasing — and if metric A is increasing while B is decreasing, that is a negative correlation. It's a comparison of the degree of change, of movement. Does that answer your question? Okay.

So, as I was saying, we can correlate the logs with metrics. If someone sees the first three metrics here, he'll see that a lot of interface and network activity is going on, and think something might be wrong with the cluster. But if you correlate them with the logs, you'll see that this is because of a Neutron activity — this log is generated when a network is created. So what's going on here is not an anomaly; it's just a network being created.
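(A minimal sketch of the log-to-metric correlation just described, assuming one series comes from a log cluster bucketed per minute and the other from a metrics store; the data is synthetic and the variable names are made up.)

```python
# Sketch: correlate a log-derived count series with a system metric.
# Both series below are synthetic; in practice one side comes from the log
# clusters bucketed per minute, the other from your metrics store.
import numpy as np
import pandas as pd

idx = pd.date_range("2015-05-18 10:00", periods=60, freq="1min")

# e.g. per-minute count of "network created" Neutron log entries
neutron_net_create_logs = pd.Series(np.random.poisson(2, size=60), index=idx)

# e.g. controller interface throughput sampled per minute, loosely driven
# by the same activity plus noise
interface_traffic = pd.Series(
    50 * neutron_net_create_logs.to_numpy() + np.random.normal(0, 20, 60),
    index=idx,
)

# Pearson correlation: close to +1 means the metric moves with the logs,
# close to -1 means it moves the opposite way (like the libvirt example
# below, where activity shifts from the controller to a compute node).
r = neutron_net_create_logs.corr(interface_traffic)
print(f"log/metric correlation: {r:.2f}")
```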
Similarly, when a VM is launched or deleted, a Keystone call is made to check the authentication of the user or the tenant. That explains this log, and how it correlates with disk usage — how it increases and decreases — and similarly with the CPU usage.

Now, one interesting thing is that on the first compute node we were getting a lot of libvirt logs about image checking, and they were negatively correlated with CPU and disk elsewhere. Then we realized what was happening: when the controller issues a request, a lot of CPU and disk activity happens on the controller, but as soon as that user action is handed over to the compute nodes, the CPU and disk activity transfers there as well. That explains the negative correlation between these activities here.

Okay, now as a cluster admin, you see tons of metrics, and you have to think about which metrics you should monitor — which ones actually characterize your cluster — because a lot of them are similar metrics you can skip. You should consider just the ones that actually show you the cluster health. Here you can see that in the first group, a lot of controller CPU metrics are correlated. So what the admin can do is put just one of these highly correlated metrics on his dashboard, and he'll see the same picture — and similarly with disk. This helps the admin reduce the amount of data he has to monitor.

So in this section I talked about how an admin has tons of logs and tons of metrics, how he can reduce the amount of those logs and metrics, and how he can correlate the logs with the metrics. Now my colleague is going to talk about how we can correlate the error logs and find out what is wrong with the cluster. Thank you.

Hello. So, Sarvesh just talked about generalized analysis of logs and metrics and finding correlations; let's get into something very specific — analysis of error logs. We often find ourselves in situations where we have to look at an error log report that contains millions of errors, and we get confused because the scale is too much. We can try analyzing these things with our bare eyes and look for patterns and repetitions, but when it scales into the millions it becomes very tough to do that. So we tried to find certain insights, some of which we'll present here, that can help you analyze error logs in particular in a better way.

The first thing we started with was keyword-based similarity and grouping: we grouped together the error logs which are similar. What we got out of it was which error logs are most frequent. Frequency analysis may be a simple thing, but when something scales up — when we are looking at millions of logs — there's a good chance there are redundant logs that just create confusion without adding information; they repeat at periodic intervals, signify the same thing, and don't add to our understanding. When we did the frequency analysis on our experimental setup, we found that two of the error logs appeared an order of magnitude more often than the other error logs, and we got curious: what's happening? Why is this one at something like 3,000 occurrences when the next biggest is around 300?
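(A small sketch of the kind of frequency and time-distribution check that answers this question — count occurrences per error cluster, then look at how the dominant cluster is spread over the days; the cluster names and timestamps below are invented.)

```python
# Sketch: frequency analysis over error-log clusters, then the per-day
# time distribution of the most frequent one. Data here is invented.
import pandas as pd

errors = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-05-11 09:00", "2015-05-11 13:00", "2015-05-12 09:05",
        "2015-05-12 13:10", "2015-05-13 09:02", "2015-05-13 13:01",
        "2015-05-12 15:30",
    ]),
    "cluster": ["cinder_api_error"] * 6 + ["nova_base_file_error"],
})

# Step 1: which error clusters dominate by raw count?
print(errors["cluster"].value_counts())

# Step 2: for the dominant cluster, does it occur roughly the same number
# of times every day (periodic noise) or is it tied to one incident?
dominant = errors[errors["cluster"] == "cinder_api_error"]
daily = dominant.set_index("timestamp").resample("1D").size()
print(daily)
```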
So we looked into the frequency over time — the time distribution you can see here. We saw that these two logs were occurring almost daily, periodically, roughly the same number of times each day. We got curious about what was happening, dug deeper into the problem, and found that there was an issue with the Cinder server setup, and it was continuously generating these messages and adding to my millions of logs. When it happens continuously like that, it doesn't add any information; it just adds logs and, frankly, gets annoying.

The next log message we looked at was this "unknown base file" error that Nova produces. You might know that Nova has a setting which clears the image cache after a pre-configured time. If you set that time too low, it often causes a cache miss, and then you get this error saying "unknown base file". This was the error we were facing, and it was happening quite a number of times. It was not causing any operational issues in the cluster, except that it increased time because of the cache misses, so we just went and increased the cache clearance interval. These were simple things, but they helped us avoid confusion, remove redundancies, and take care of the system in a better way.

The next thing we did was co-occurrence analysis. When you're looking at such a huge set of logs, you tend to analyze them one by one: the first log statement, then the second, then the third. What we often miss is that it's not always the case that different error messages reflect different things. There might be a bunch of logs that reflect the same issue, the same underlying problem — like the example here. I'm taking a very simple example to keep the discussion simple. It often happens that you pick a flavor whose disk size is smaller than the image you want to install on it. A bunch of error messages get produced because of it, and those error messages are basically reflecting the same underlying problem in different ways. They don't add any extra information; they just add lines and lines of logs. So we tried to find out what happens together, so that we can use that fact to reduce the number of issues we actually need to take care of.

Once we knew what happens together, we tried to find the order in which it occurs — the order of propagation: what started at what step, and how something went from step A to step B to step C, basically reflecting the same error, but in what order. The usefulness of this is that once we know the order in which the errors occur, we can go directly to the root. We can go to step one and try to solve it at that place, which makes the system easier and lighter, so that the error does not propagate; we can end it right there and send a direct message to the user. So here, continuing with the last example, we found that the propagation of the flavor disk issue started with the compute manager, then went to the filter scheduler, and then to the driver.
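(A sketch of the co-occurrence and ordering idea: group error lines that fall within a short window of each other, then order each group by timestamp to see where the problem started. The 15-second window matches the figure quoted later in the Q&A; the error messages themselves are illustrative.)

```python
# Sketch: co-occurrence and propagation order of error clusters within a
# short time window. All timestamps and messages below are illustrative.
from datetime import datetime, timedelta

errors = [
    (datetime(2015, 5, 18, 10, 0, 1), "compute_manager: flavor disk too small"),
    (datetime(2015, 5, 18, 10, 0, 3), "filter_scheduler: no valid host found"),
    (datetime(2015, 5, 18, 10, 0, 5), "driver: instance build aborted"),
    (datetime(2015, 5, 18, 10, 5, 0), "keystone: token not found"),
]

WINDOW = timedelta(seconds=15)

# Group errors that fall within WINDOW of the first error in the group
# (errors are assumed to be sorted by time).
groups, current = [], [errors[0]]
for ts, msg in errors[1:]:
    if ts - current[0][0] <= WINDOW:
        current.append((ts, msg))
    else:
        groups.append(current)
        current = [(ts, msg)]
groups.append(current)

# Within each co-occurring group, the earliest entry is the likely root;
# the rest show the order in which the same problem propagated.
for group in groups:
    ordered = sorted(group)
    print("root:", ordered[0][1])
    for ts, msg in ordered[1:]:
        print("  then:", msg)
```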
So someone can go directly to the starting point and solve it right there, rather than loading the system with so many processes. The other good thing about this: as OpenStack grows — and it is continuously growing, we are adding more services to it and the interdependencies become more complex — there might be errors which propagate from Cinder to Nova and from Nova to Neutron. When we do error propagation analysis, we can find out what is causing what — something happened in Nova which affected Cinder — so this interdependency of errors can be helpful to us.

After doing this, we went a level up: we tried to find the context in which an error occurred. We expanded the scope and tried to relate error logs with the normal logs, the info logs, because it's basically the info logs that define the context of what was happening when an error occurred. We tried to find out which normal logs appear before any given error, and if that happens with a good frequency, we can plausibly say that this is something which gives rise to the error. We did an analysis on that, and the example I'm taking here is an error message that comes up when Oslo is unable to publish a message to one of the topics. Given only this message, we wouldn't be able to find out the context in which it occurred — there can be any number of reasons that lead to a publishing failure. But when we looked at the context in which it occurred, we found that it was due to a network device issue: Oslo was not able to connect to the server, and therefore it was not able to publish the message. So the context is equally important.
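(A sketch of this context analysis, assuming error and info logs have already been clustered: for each occurrence of an error, count which info-level messages appear in a short window before it. All messages and timestamps below are illustrative.)

```python
# Sketch: "context" analysis - for every occurrence of an error cluster,
# look at which info-level log clusters appeared just before it, and count
# how often each one precedes the error. All names and data are invented.
from collections import Counter
from datetime import datetime, timedelta

info_logs = [
    (datetime(2015, 5, 18, 10, 0, 0), "ovs_agent: device eth1 link down"),
    (datetime(2015, 5, 18, 10, 0, 2), "oslo_messaging: reconnecting to server"),
    (datetime(2015, 5, 18, 10, 4, 50), "nova_api: GET /servers/detail"),
]
error_logs = [
    (datetime(2015, 5, 18, 10, 0, 4), "oslo_messaging: failed to publish message"),
    (datetime(2015, 5, 18, 10, 5, 1), "oslo_messaging: failed to publish message"),
]

LOOKBACK = timedelta(seconds=15)

context = Counter()
for err_ts, _ in error_logs:
    for info_ts, info_msg in info_logs:
        if err_ts - LOOKBACK <= info_ts < err_ts:
            context[info_msg] += 1

# Info logs that precede the error in a large fraction of its occurrences
# are good candidates for "the context in which this error happens".
for msg, count in context.most_common():
    print(f"{count}/{len(error_logs)} occurrences preceded by: {msg}")
```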
So these were the analyses we did on the errors, and that wraps up the talk. In conclusion, we tried to explore the ways in which we can apply data science — correlation analysis, log-to-log analysis, metric analysis, co-occurrence analysis — and how we can make the life of an OpenStack developer and operator easier, so that he doesn't have to worry about so many things. We just wanted to pitch the idea that using these techniques, we can definitely improve the operational efficiency of an OpenStack cluster tremendously. Thank you.

If you have a question, it would be great if you could pose it from the audience mic, because we are recording this session. Either that, or the presenter can repeat the question. Thank you.

Throughout the presentation, you pointed out information that you were able to conclude was noise. Does that imply that a bug should be filed back with the OpenStack community, so that noisy data can be ignored? Instead of the admin taking the results of your analysis and knowing what they can ignore, eliminate the noise in the first place and reclassify the error type from something like error back down to info.

So I think that's a great question. In fact, we felt that even without turning on debug logs, we have a lot of noise in our logs, and that's something we would like to take to the developer community. There is actually an entire operators group working on how to improve logging in OpenStack, so that's a valid point. In fact, we feel that this was our first step: we wanted to understand first, before going to the community and saying what we should do. And we will take it up in one of the ops sessions — there is a session on logs. Thank you.

So what you've done looks really great. The problem I think most of us with clouds have is that we can't turn debug on for most of our services, because they make so many logs that they kind of overwhelm us. But what I appreciate is your mapping of the logs to the events. If we could somehow make that more public — if you could say, hey, these logs mean these events — that would help the community out a lot, because some of us can't turn on debugging to the level that you did and correlate these logs. That would probably help the operators group tremendously, to be able to say: if I'm seeing these logs, it's one of these events — and kind of get that trace back, which you've already done the work for.

So yeah, again, that's a very valid point. In fact, I would strongly urge you to come to the developer — I mean the operator — tracks and talk about it in the logging session, and also in the ops monitoring session, which I think is on Wednesday. Is that something that you're willing to post publicly somewhere for people, though? Sure. Because the other thing is that we're still trying to go through bigger and bigger clusters and see if the patterns we've seen in this one are repeatable. And obviously you have a certain sequence of tools and software packages put together, so you'd have to say: if you're using these tools, this is what we see — and you may not be able to test all correlations, but... I think it's a great suggestion. We'll work towards that. Thanks.

Hi, very good presentation. Thank you. I'm curious about your infrastructure for analytics. Are you doing this as a sort of post-processing — take all the logs and then analyze them — or are you doing it in a streaming fashion, where you can keep up to date with the current state of the cloud? So we do both. We do some of the alerting in real time, but some of the more detailed analysis we've done not in streaming but in batch mode, using the tools that we've developed internally, which are based on open source tools like Logstash and things like that. Okay, thank you.

My question is about OpenStack Infra. At the last summit, we discussed the idea that OpenStack Infra, on every CI run — every test run on every patch set — creates logs of an OpenStack deployment, and that's a huge source of errors. OpenStack Infra has the infrastructure to publish these logs to public storage, which is accessible to everyone, but at the time, and probably right now, we don't have any analytics attached to those logs. So I wonder whether you have plans for open sourcing your things, and maybe plans to attach the OpenStack Infra sources to your system and produce some results — how do you see this collaboration? So that's an awesome suggestion. We actually hadn't thought of doing that. We obviously know about the logs in the OpenStack Infra project, but it's a great suggestion; we should try our analysis on that stream. Yeah, it will give you a lot of examples and a lot of data.

Two short questions. When you mentioned that you're going to open source a few things, can you provide some details of when and where and what? Yeah, so the visualization stuff is already open sourced. It's called AVOS, A-V-O-S. Just look at the Cisco Systems GitHub and you'll see it there.
Over the last couple of years, we've also open sourced things like Curvature, which is now going into Horizon. AVOS will eventually end up in Horizon as well, because we've just opened it up. And we've also declared that we would like to do some of the analysis work in open source under a project called Cognitive. If you look at wiki.openstack.org/Cognitive, we'll slowly start populating the wiki page with APIs and other things, so just stay tuned.

All right, and the second one is: can you provide some details about the time windows that you used for both the correlation and the sequence analysis, and how you arrived at them? Yeah, 15 seconds. Okay, and did you have to play with that? Because there's a balance to be struck. Yeah, we had to play with it a little bit; we needed to figure out the best window, we experimented, and 15 seconds was giving pretty good results for us. Thanks.

Thank you for the presentation. One question I have is: in your analysis, do you only concentrate on the logs produced by OpenStack, or have you also tried to do correlation within the guest machines, and especially the applications running on the guest machines? So for this particular analysis, we only looked at the OpenStack logs and metrics, but we've used the tool for all sorts of logs and metrics analysis, not just OpenStack. The tools are quite generic, but these insights are very specific to OpenStack. Is there any other question? All right, thank you very much for your time.