Okay. Hello, everybody. My name is John Stanford. I'm with Solinea. We're a product and services company in the OpenStack space, and today we're going to go through some high-level and low-level details of gathering and managing logs, and some mining of log data.

Before I get started, I'll give you a little background on logging. Logging has a very long history. I grew up in Wisconsin, and when I was a kid we'd see stuff like this all the time. There's an equivalent in the computing domain: the first bug ever captured in a system log of some activity. But like many things, logging has evolved. This is actually more like what I saw when I was a kid in Wisconsin, and this is more like what we see today in systems and logging. There's a good range of technologies available to you to both aggregate and process logs, and that's a giant step forward from the days of jumping from system to system, trying to find what caused an event and doing the mental correlation of events to get to the bottom of problems occurring in your environment.

We all know that complexity is king in the cloud. There are a lot of moving pieces, up and down the software stack, sideways, and up and down the hardware stack. We've got all kinds of things generating metrics and events, and a lot to keep track of for an operations staff running a production-scale cloud.

This is the basic end-to-end flow I'm going to talk about today, so the TL;DR, if you don't want to see a bunch of detail and code, is this: shipping events via rsyslog to LogStash and pushing them into Elasticsearch. At a high level, we need to do a few things in the various configuration files associated with OpenStack, we need to do a little bit of modification in rsyslog, and from there we get into the LogStash flow of events through its inputs, filters, and outputs. Then we'll talk a little bit about querying that data in Elasticsearch.

Before we get too far into this, let's talk a little bit about how logs are structured and the standards that might apply. Most of the OpenStack projects have a close affinity to the RFC 5424 syslog format, and if you believe the comments in the configuration files, by Juno the current format will be deprecated and the projects will move toward a purely RFC 5424-compliant format. We'll talk about what that is in a second. An alternative I threw up there just for fun is the NCSA/Apache Common Log Format. That's something you'll tend to see in Apache and other web server logs; it keeps track of URL calls into the front end of the web server.

As I said, project adoption varies today, but it seems like things are heading in the right direction. There are a few variants that I'll touch on as we go through here, but nothing that can't be worked around if you're using this sort of processing chain, where you can patternize these events and recognize the differences. However, when you have these patterns and a deviation from a standard, it makes it really hard to know for sure that you're capturing all the relevant data. There's always a possibility that something has slipped through the cracks, didn't get caught by your filters, and didn't get put into the right bucket of information.
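To make those two formats concrete, here's roughly what each looks like. The RFC 5424 line is adapted from the example in the RFC itself, and the Common Log Format line is the classic Apache documentation example, so neither comes from an actual OpenStack log:

```
# RFC 5424 syslog message: PRI, version, ISO timestamp, host, app-name, procid, msgid, structured data, message
<34>1 2003-10-11T22:14:15.003Z mymachine.example.com su - ID47 - 'su root' failed for lonvick on /dev/pts/8

# NCSA/Apache Common Log Format: client, identity, user, timestamp, request, status, bytes
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
```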
Getting back to those deviations: until some event occurs where you're actually looking through logs again to find some specific cause, you may not notice that something is slipping through the cracks. We try very hard to find all those things and patternize them, but I think there's a real good play in there for some anomaly detection in logs and how you process them, which opens up a whole new can of worms that I hope to focus on a lot this year as we go forward. And finally, timestamps have been the bane of my existence for the last six months at least, and probably the last 25 years of my career.

Okay, so let's get into a bit about what an OpenStack message looks like today. Most of the projects use a format that looks kind of like this. You have a date stamp; you'll notice the date stamp doesn't have a time zone, which is not great. You have the ID of the process that generated it. You have a log level; the log levels have a fairly standard implementation, although there are some audit levels in there that aren't part of the syslog spec. I think that's okay: we know what they're for, and if you process them correctly and they're used consistently, everything should be fine. You have the program that generated the log, which is really helpful, and in this case an empty ID; sometimes you'll see instance IDs in Nova, and you'll see request IDs in other projects. And then the message itself, which is often the most important part. In this case I truncated it, but it was a "connected to AMQP server" message, so I can send messages from a compute node to a controller.

Some things we want to do to be able to centralize the processing of these log messages: go into the Nova config, or the Neutron config, or any config of the OpenStack projects that support this flag, and turn on syslog by setting use_syslog to true. What happens then is the services defer logging to syslog, and if you don't do anything else, you'll end up with a log on your local system that follows the rules set up in your rsyslog.conf, most likely. In fact, everything I'll talk about is based on rsyslog on CentOS 6.5. The message you'll get is roughly this gray part down at the bottom. If you turn on shipping to a remote system, you'll end up with these three bullets up above, generated by rsyslog: the bracketed 132, which is really the priority encoding the facility and log level it was shipped with; the date, which you'll notice is even worse than the one generated by OpenStack; and the host name that generated the log event.

So now we're going to move into some other things that might help us, which I've been exploring. I was really excited to see this flag in Icehouse: use_syslog_rfc_format = true. The thing that really caught my eye was that it's going to add the app name before the message. I had no luck getting that to work, but it could be me, it could be something in my environment. I'm going to continue to try it, and I'll chase it back into the code base if I can't figure out something locally that's causing it not to work. The goal is nice because, as we'll see in an upcoming slide, the app name would be really helpful for categorizing these things, so that I can tell where things are coming from not just by the host name or the program name, but by an app name that's included in the message itself.
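As a minimal sketch, the relevant flags in a project's configuration file look something like this; the option names are the Icehouse-era ones, and the facility assignment is my lab's convention rather than anything required:

```ini
# /etc/nova/nova.conf (or neutron.conf, cinder.conf, ... any project that supports it)
[DEFAULT]
# Defer logging to the local syslog daemon
use_syslog = True

# Which syslog facility to log to; I use this later to map a facility back to a service
syslog_log_facility = LOG_LOCAL0

# Icehouse flag that should add the RFC 5424 APP-NAME before the message;
# as noted above, I couldn't get it working in my environment yet
use_syslog_rfc_format = True
```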
What I found was that when I tried to turn this on for a Nova compute node, the nova-compute service wouldn't start as long as use_syslog was also true. And when I set use_syslog to false, the message that went into the log file looked exactly like the message that went into the log file with this flag set to false. So again, it could be me, but I'll keep experimenting with it and we'll see if we can chase it down.

So now we'll get into a little bit about how we set up rsyslog to ship things. The date in the previous message was very minimalist, and if you dig into the rsyslog documentation, you'll find a couple of templates that ship with rsyslog that are really helpful for making that a little more robust. The $ActionFileDefaultTemplate and $ActionForwardDefaultTemplate directives control the format of messages that go into local files, and the format that gets baked into the message that is shipped over the wire if you're sending it over a TCP or UDP connection or forwarding it to another server. What we get out of that is a really high-resolution date stamp in a very commonly understood format. That makes things a lot easier down the chain, because we don't have to patternize this date stamp as year, month, day, hour, minute, second, and so on; we should be able to just say ISO-format date and run with it. And we get the host name as we did before. You'll notice there's a bit of redundancy; I theorize this is due to the way OpenStack logging came to be. I think that over time, if OpenStack decides to move toward using rsyslog as the main logging mechanism, and I'm not sure whether that's going to happen or not, we could reduce some of that redundancy by recognizing that rsyslog is going to inject some fields into the message that we don't have to capture as extra work. Reduce the code base, simplify everything, make everybody happy.

Those are global settings in your rsyslog.conf file. rsyslog also lets you specify sub-configuration files; in fact, it reads everything in the /etc/rsyslog.d directory by default. What we've done in our project is use that to define shipping. And if you recall back a few slides to where we talked about getting the component or app name in here, this is where I'm doing it. I'll get into why this isn't such a great thing, but it works for now. Let's first walk through what's going on here. We're saying that we want to forward all logs over TCP on port 5514 to my server in my lab, and the facilities local0 through local6 are being mapped upstream to particular OpenStack services. There are some benefits in here that are fairly obvious if you know what you're looking for. One is that if my upstream server is down, this will queue up to a gigabyte of logs locally, and when it detects that the server is back up, it will start shipping those logs. I'm not really clear on what happens to the timestamp in that case, so I'm going to make sure I take care of it upstream and take it from the message itself, which should have the timestamp from when it was generated. It might be that it's okay and gets rewritten here; I just haven't had time to explore that. So anyway, this configuration will get you shipping logs to a central system, where you can really start to process them with LogStash.
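Putting that together, the rsyslog side looks roughly like the sketch below. The server address, port, and the facility-to-service mapping are placeholders for my lab setup; the template and queue directives are standard rsyslog ones:

```
# /etc/rsyslog.conf -- use the higher-resolution, ISO-formatted templates
$ActionFileDefaultTemplate RSYSLOG_FileFormat
$ActionForwardDefaultTemplate RSYSLOG_ForwardFormat

# /etc/rsyslog.d/10-openstack-forward.conf -- ship local0 through local6 to the log server
$WorkDirectory /var/lib/rsyslog          # spool directory for the queue
$ActionQueueType LinkedList              # in-memory queue that spills to disk
$ActionQueueFileName openstack_fwd       # prefix for the spool files
$ActionQueueMaxDiskSpace 1g              # buffer up to a gigabyte if the server is down
$ActionQueueSaveOnShutdown on
$ActionResumeRetryCount -1               # keep retrying until the server comes back

# local0 = nova, local1 = glance, ... (my lab's mapping, not a standard)
local0.*;local1.*;local2.*;local3.*;local4.*;local5.*;local6.* @@logserver.example.com:5514
```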
With LogStash, there's a pattern of input processors, filters, and output processors. Can we get a sense of how many people have played around with LogStash? Oh, good, quite a few folks. So this shouldn't be too strange to a lot of you. For those of you this is new to, I'll do my best to walk through it, and we can certainly talk afterwards and go through some things in more detail if you want.

The input is very simple. This maps to taking in messages on TCP port 5514, and I'm setting the type of events coming in on that stream to syslog, so that later when I filter them, I know what kind of message it is and can process it accordingly. For the sake of seeing the end goal before I get to the filters, I'm showing you that the output is pushing to Elasticsearch; in this case it's running locally on port 9200. Because more recent versions of Elasticsearch and LogStash have trouble staying in sync, the recommendation is to use the HTTP protocol rather than going natively with the Elasticsearch plugin. That will get whatever the result of your filtration is into Elasticsearch, which then makes it really nice, as we'll see a little later, to process into something meaningful and useful.

The next big piece of LogStash is the patterns. I'm not really big on patterns; I wish the technology were advanced enough that I could just say, hey, here's a big test domain of log data, you tell me what the patterns are, but we're not quite there yet. If anybody's really interested in that as a project, let me know and maybe we'll kick something off. In any case, what we've got here is easier to read from the bottom up. What I want to do is characterize an OpenStack syslog line, and I do that with a bunch of parameterized fields that map to different patterns in the text. There are trace messages as well as regular messages, and the trace messages tend to span multiple lines, so there's a lot of work to get all these messages characterized. But the net effect is that if I match this pattern, I'm pretty confident the message came from an OpenStack source.

So we'll get into the filters now. This is the internals of one of my filters that processes an OpenStack message into the way I want to store it in Elasticsearch. The first thing I'm doing is checking that the type of this message is syslog. It could have come from another input I have in the domain, and it would just immediately fail out and maybe not get processed if I didn't handle that accordingly; this check also means we're dealing with a source that I think has a good chance of matching the pattern. I'm also referencing the patterns directory where I've stored my patterns in a file, and I'm telling it that I want to match an OpenStack syslog line, which we just saw on the previous slide. When I do that, I'm going to set a received_at timestamp for when I got this thing, and I'm also going to add a field that keeps the whole message intact. I use that a lot for debugging. I'm not sure it has value beyond that, but there have been plenty of times I've had to go back and ask, is this really doing what I think it is, and wished I had the message field with everything in it before it got processed. I tend to leave it in there as long as space isn't a huge constraint. Once I get that, and if I don't end up with a parse failure in my list of tags, I'll move on to continue filtering.
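Here's a minimal sketch of those three pieces in roughly LogStash 1.4-era syntax. The file names, the OPENSTACK_SYSLOGLINE pattern name, and the added fields are my lab's conventions (and the pattern itself isn't reproduced here), so treat them as illustrative:

```
# 01-inputs.conf -- accept rsyslog's TCP stream and tag it as syslog
input {
  tcp {
    port => 5514
    type => "syslog"
  }
}

# 34-openstack-syslog.conf (excerpt) -- match the OpenStack syslog pattern
filter {
  if [type] == "syslog" {
    grok {
      patterns_dir => "/opt/logstash/patterns"
      match        => [ "message", "%{OPENSTACK_SYSLOGLINE}" ]   # custom pattern, not shown here
      add_field    => {
        "received_at"  => "%{@timestamp}"   # when the event hit LogStash
        "full_message" => "%{message}"      # keep the raw line around for debugging
      }
    }
  }
}

# 60-outputs.conf -- push whatever survives filtering into Elasticsearch over HTTP
output {
  elasticsearch {
    host     => "localhost"
    port     => 9200
    protocol => "http"
  }
}
```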
If I do end up with a parse failure, I'll do some cleanup. I may pass the event on to something else that takes another shot at it as a different type of syslog message; it may just be something coming from another application, a piece of hardware, or some other place that I want to process a little differently. So I can clear out the grok parse failure tag and submit it back through the filter chain or down to a further filter. But in this case, let's assume we got a good parse, and we now have a syslog message with those fields.

Let's decode that priority, the little bracketed 132 we saw in the previous message. There are capabilities built into LogStash to do that natively: you just tell it what labels map to the different levels, and it creates a couple of fields that tell you what the log level and the facility are. Out of that, I get a nice human-readable version of what that 132 meant.

The next thing I'm going to do is process the date, and I'm going to take it from the message's own timestamp, because I really want to know when the event occurred, not necessarily when it arrived at the rsyslog/LogStash front door. Here I've shown the old way of doing that, from before I started playing with the new templates in rsyslog: I'm parsing out what I know the format to be for an OpenStack log message, and I'm removing my timestamp field because this puts it into a LogStash-generated field. The side effect, the negative part of all this, is that I have to be pretty confident that my whole lab environment is in the UTC time zone. It is most of the time, but then somebody builds a new system and forgets, and you end up with a skew in the data that makes it really hard to correlate events. So getting the time zone into the timestamp is obviously critical.

Moving on, the next thing we want to do is handle the syslog facility that came out of the syslog PRI. In this case I'm going to map it to the particular OpenStack components I'm capturing in my lab, based on which facility generated the log message. If you've looked at the syslog specification, you'll know there are only eight local facilities, local0 through local7, so obviously this doesn't scale, which is why I was really excited about that updated format and a little disappointed when I couldn't get it to work right off the bat. It's also very important to me to get it working because I have other projects that I want to incorporate into this pattern. For now it's okay; later I'll have to come up with something else. The net effect is that if the message doesn't match one of those syslog facilities, I really don't know what it is with respect to OpenStack, and I'm going to tag it as an unknown component; otherwise I tag the component as the one it matched.

The next thing I do is just clean up and mutate a bunch of the fields. I rename the parsed message body to message, and I rename syslog host to host, because that's a little more parsable and readable for people who may be looking through Elasticsearch or writing queries. I get rid of some fields that were just transient in my world, assuming we got to this point in the filter, and I add some tags to give myself a sense of how this event flowed through my filter chain.
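Continuing the sketch, the post-grok steps described above might look roughly like this. Again, the facility-to-component mapping, tag names, and field names are my lab's conventions, the grok pattern is assumed to have extracted the numeric priority into syslog_pri, and the timestamp format assumes the default OpenStack log format emitted in UTC:

```
# 34-openstack-syslog.conf (continued) -- a sketch of the post-grok processing
filter {
  if [type] == "syslog" and "_grokparsefailure" not in [tags] {
    # Turn the <132>-style priority into human-readable facility/severity fields
    syslog_pri { }

    # Use the timestamp from the message itself, not the arrival time.
    # This assumes the log was emitted in UTC -- the weak point noted above.
    date {
      match        => [ "timestamp", "yyyy-MM-dd HH:mm:ss.SSS", "ISO8601" ]
      remove_field => [ "timestamp" ]
    }

    # Map the syslog facility back to an OpenStack service (lab convention)
    if [syslog_facility] == "local0" {
      mutate { add_field => { "component" => "nova" } }
    } else if [syslog_facility] == "local1" {
      mutate { add_field => { "component" => "glance" } }
    } else {
      mutate { add_field => { "component" => "unknown" } }
    }

    # Tidy up field names and note how the event flowed through the chain
    mutate {
      rename  => [ "syslog_hostname", "host" ]
      add_tag => [ "openstack_syslog", "filter_34_pass" ]
    }
  }
}
```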
At the end of this, I'll know that the event completed processing, that it was an OpenStack syslog message, and that it matched filter 34. It may go on to other things; I may pass it to yet another filter, which I'll go through in a second. But the net is that this is what I get in Elasticsearch: a message I'm really happy with. It's still a broad category of message, but it's been distilled out of a sea of messages that could have contained everything generated in the environment, OpenStack included. So, a step in the right direction.

The next thing we'll do is take this and say, okay, now that we know we have an OpenStack message, can I make some sense out of it? Some things that come through the logs fit the profile of a metric. In this case it's an API call against Nova, and what I might want to do is start building some data for a chart or a view of the API's performance over time. So this pattern parses the message a little further, assuming it's already known to be an OpenStack syslog message, down into whether it was a GET, PUT, POST, or DELETE, what its URI was, and so on, effectively matching the event that gets logged in the API log for Nova. And the filtration, assuming we've already passed the original filter 34, is really straightforward: all I want to do is clean up a bunch of stuff that isn't relevant to building this chart (I could optionally leave it there), tag it as a metric, because sometimes I want to search Elasticsearch for just metrics, and record which filter it passed. Now I've got something very specific to work with: a syslog message that we know is a metric, that we know is related to an API call to Nova, that came from an IP address and has a URI. I can work with that.
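To make that concrete, here is a sketch of what such a pattern and filter might look like. The NOVA_API_CALL name, the field names, and the example log line are illustrative rather than the actual pattern from my slides, and the filter excerpt reuses the hypothetical filter_34_pass tag from the earlier sketch:

```
# patterns/nova_api -- hypothetical pattern for a Nova API request line, which looks something like:
#   10.0.2.15 "GET /v2/<tenant_id>/servers/detail HTTP/1.1" status: 200 len: 1583 time: 0.2279
NOVA_API_CALL %{IPORHOST:client_ip} "%{WORD:http_method} %{NOTSPACE:uri} HTTP/%{NUMBER:http_version}" status: %{NUMBER:response_status} len: %{NUMBER:response_length} time: %{NUMBER:response_time}
```

```
# 37-nova-api.conf (excerpt) -- only runs on events that already passed filter 34
filter {
  if "filter_34_pass" in [tags] {
    grok {
      patterns_dir   => "/opt/logstash/patterns"
      match          => [ "message", "%{NOVA_API_CALL}" ]
      tag_on_failure => [ "not_nova_api" ]   # avoid polluting [tags] with _grokparsefailure
    }
    if "not_nova_api" not in [tags] {
      mutate {
        add_tag => [ "metric", "filter_37_pass" ]
        # make the numeric fields real numbers so they can be aggregated later
        convert => [ "response_status", "integer", "response_time", "float" ]
      }
    }
  }
}
```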
So what do I do next? Next I want to start querying for these things and making some sense out of them. In this case we've got an Elasticsearch query that gets back the OpenStack logs that were put in via our first pass through filter 34. We're saying that the query must match a time range and must be the OpenStack log type, and we're also going to aggregate the events into a date histogram so I can build a nice chart of how many occurred in each time slice. In this case the time slice is 5,448 seconds; we'll see why it's that in a second. I also want to generate an entry for any slice that had no events in it, which is what min_doc_count = 0 does, and I break each slice down by log level so I can filter on those things in a little bit.

The Nova API stats query is a little more complex. What I'm doing here is matching everything in a time range for the component Nova. You can actually specify the document type on the query call itself, so I don't have it in here; I didn't need it in the other one because I constructed that search differently in code. In this case the assumption is that you're querying only the API stats. Again we're going to aggregate, based on date first, so I first get all the events that occurred in a time slice, and then I do a sub-aggregation on each of those, breaking the calls down by response status, 200 to 299, 300 to 399, 400 to 499, and 500 and up, so I can get a good sense of the failure rate of my APIs. Finally, I gather some extended statistics on these things, so I can get the min, max, average, variance, and all the other stuff you might want to incorporate into a chart. A rough sketch of both of these queries appears a little further down.

That brings me to a topic that has been near and dear to my heart since I joined Solinea. We introduced a product this week called Goldstone. I've put the definition of Goldstone, or the origin of the term, in here because everybody asks what it is. The purpose of Goldstone is to bring some of this data to light, let you see these things, and build up an operator's perspective of an OpenStack cloud. You can register for it on our website this week; it should be available next week, and it's a free download at this point. Some of the things you'll see in it are the result of all the stuff we just walked through.

This is the log filtration query in action. You can see the different log levels represented in there. I didn't show you the part of the query that does the filtering on log level, because I was trying to conserve space, but fundamentally, if you click on one of those little boxes up in the legend, it will filter out that type of log event. It will also rerun a query against Elasticsearch to refilter the table, so you can get a good view of things. And since that query was time-ranged, if you click on one of the data points in the chart, you'll re-range into that data point with a little margin on each side and go closer into the data, to navigate toward a potential problem in the environment, like that big red spike that represents a bunch of errors that occurred one day.

The other type of thing we're doing is the metric report.
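As promised, here's roughly what those two queries could look like in Elasticsearch 1.x aggregation syntax. The field names (type, syslog_severity, component, response_status, response_time), the fixed now-7d time range, and the interval value are assumptions based on my lab's schema, and the numeric fields are assumed to have been converted at index time as in the earlier filter sketch. First the log-level histogram:

```json
{
  "query": {
    "bool": {
      "must": [
        { "range": { "@timestamp": { "gte": "now-7d", "lte": "now" } } },
        { "term": { "type": "openstack_syslog" } }
      ]
    }
  },
  "aggs": {
    "events_by_time": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "5448s",
        "min_doc_count": 0
      },
      "aggs": {
        "events_by_level": {
          "terms": { "field": "syslog_severity" }
        }
      }
    }
  }
}
```

And the Nova API stats query, with the status-code breakdown and extended statistics per time slice:

```json
{
  "query": {
    "bool": {
      "must": [
        { "range": { "@timestamp": { "gte": "now-7d", "lte": "now" } } },
        { "term": { "component": "nova" } }
      ]
    }
  },
  "aggs": {
    "events_by_time": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "5448s",
        "min_doc_count": 0
      },
      "aggs": {
        "status_ranges": {
          "range": {
            "field": "response_status",
            "ranges": [
              { "from": 200, "to": 300 },
              { "from": 300, "to": 400 },
              { "from": 400, "to": 500 },
              { "from": 500 }
            ]
          }
        },
        "response_time_stats": {
          "extended_stats": { "field": "response_time" }
        }
      }
    }
  }
}
```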
The API query we were building drives the top-left chart, the Nova API response time in my lab. It's really spiky, because the tasks that generate some of the data run on my laptop, and sometimes I take the lab down for various reasons, so it's not always available; it's not a production facility. There are some other metrics that Nova presents in logs that are really nice to pull out: the spawn rate over here, so I can see how often VMs are spawning in my environment and how successful that is, and resource consumption across CPU, memory, and storage. Those things are generated in the logs when you spawn a new VM, so you can scrape them out and build an assessment over the time slice, averaging or taking the max depending on how you want to see things. So that's a little bit about what Project Goldstone is doing, and again, you'll be able to download it from www.solinea.com in the very near future.

That's it for me. My name is John Stanford, I'm the VP of Development at Solinea, and I'm happy to talk with folks. But before you go, I'd like to do something for my wife, who's been patiently waiting for me to come home: could you all wave and say hi, Yaila? Thank you very much. If there are any questions, I can take a few now, I believe; let's see how we're doing on time. Yeah, okay.

Thank you, and thanks for your presentation, it's really good. For your LogStash configuration files, do you save all your inputs in one configuration file?

No, actually what I have is a series of files in the LogStash configuration folder, and they get processed a lot like /etc/init.d: I have a 01 input, and a 34 filter, and a 37 filter, and on and on, down to outputs in the 60s and 70s, that kind of thing.

How is the Solinea tool you showed, the logging and display tool, different from Kibana?

From my perspective, Kibana gives you a great deal of functionality within a limited set of visualizations. Some of the things I want to visualize and provide in a product go beyond that visualization capability, so rather than trying to do some things in Kibana and some things over here, I chose to put them all into our product and write our own visualization library that I can extend to do what I want to do.

Thank you. You didn't think of extending Kibana for that instead of writing a product?

I did not think of extending Kibana for that. Actually, I thought about it, and I thought that if Elasticsearch wants to extend Kibana for that, they should.

As you increase the number of filters in your LogStash configuration, did you notice any severe degradation in message throughput?

Can you speak up? I'm sorry, I'm having a hard time hearing. The number of filters in your LogStash configuration, did you notice any severe degradation? If I heard you correctly, you're asking about the performance impact of more LogStash filters. I haven't seen it yet. I suspect I'll reach some point where the filters get complicated enough that I'll want to find a client-side solution to handle some of this filtering and distribute the load a little rather than keep it centralized, or expand my layer of LogStash processors to account for the load.

What numbers did you see, your transactions per second, and how many nodes in your cluster?
Most of this stuff has been running in my lab, and I've been seeing somewhere around 10,000 messages a second, which is easily enough for my lab purposes.

Can you share some of the reasoning behind choosing rsyslog-style forwarding instead of logstash-forwarder or Lumberjack?

Yeah. One, and probably most importantly, I wanted to minimize the footprint on the client side. The facility is already there, it does buffering and caching for that stuff, and it's a well-tried and tested tool. And again, most importantly, I didn't have to put another piece of software on the client side.

Sorry, I was a bit late. You mentioned that we can try it; have you already released the code? Can you share the slide or the link where we can get it?

Sorry, can we get this microphone turned up a little bit? I'm just having a difficult time hearing out of it.

I can speak loudly. I was a bit late, sorry. Did you mention that we can try it?

Yes.

Can you share the link where we can get it?

Yes, let me quickly see if I can do this. Somewhere under there will be a Goldstone page; I don't know the exact URL path, is it slash Goldstone? Right. Registration is open today, and we'll notify you when it's available; we just need to go back home after the conference, package it, and make it available for folks. What we plan to do is ship it as a VM, so you can just drop it into your management tier, do minimal configuration, set up some credentials for API access, set up your clients to ship logs back, and that should be all you really need to do.

Is there any plan for an alerting mechanism to be incorporated into this?

For alerting? That's a longer-term plan for Goldstone. This is a 1.0 release for us, and we really wanted to get just the basic data capture and some reporting in place, but the longer-term plan is to do some analysis and handling of known types of events and things like that. So, a future release.

A couple of questions. The first one is, if your OpenStack is a multi-node cluster, does your tool aggregate all of the logs and provide one dashboard, or is it on a per-node basis?

One more time, sorry? So if you have a multi-node cluster, does your tool aggregate all the logs into one and provide one dashboard? Yeah, that's the goal. You set up every node in your OpenStack cloud, the controller, the compute nodes, the volume nodes and all that, and they all ship back to the central location. The VM that we're shipping with 1.0 is just that, a VM; it's not a scaled-out solution for a giant production cloud, but the technology we've chosen should scale for that if built a little differently, which is something we'll be working on in the future.

Another question: does it help you with root cause analysis if something goes wrong? Say you have ten nodes and you try to create one VM, and you don't know where the scheduler placed it and it errors out; instead of going through ten nodes' logs to find out where exactly that VM was spawned and what the reason was?

We don't have any automated navigation to a root cause right now, but if you go to the log chart, you can quickly see a spike of events and drill in there, and that brings those events into the table as you drill further and further down, to sub-second if you need to, so you can quickly see which hosts are participating. Right now it's really just the OpenStack layer, so it's not going to tell you a lot about networks and things like that.

Can I ask one more question?
So what is the licensing for the tool, the one that you have?

Licensing is proprietary for now, but it's free to use; that's the way we've chosen to do 1.0. With 2.0 we'll re-evaluate and see if it's something we want to open source.

Thank you. In the previous session I was in, they were talking about a tool called Elastic Recheck, which is part of OpenStack. I was wondering, do you see... there seems to be a lot of overlap between them?

I don't. Are they focusing on OpenStack specifically?

On OpenStack builds, finding failures that are statistically rare but occur often enough.

Interesting. I'm not familiar with it, but I would like to know more about it; in a second we can just write down the URL. Okay, I'm going to wrap it up to let the next person get the stage, if there is such a thing; I didn't look at the schedule. But thank you all so much for your time. Enjoy the rest of your summit.