My name is Brian Colline. I work at SoftLayer on the product innovation team, where we primarily do research and development, mostly around new products but also around new technologies, including OpenStack. Today we're going to talk about metrics and logging, and a little bit of dashboard-type stuff, but more so about how you can combine metrics and logging into one central store. That makes it easier both to correlate actual log events with your metrics and to report more accurately on what's happening in your infrastructure.

In systems monitoring there are traditionally two domains. They each have their own advantages and disadvantages, and for the most part they're mutually exclusive. Logging is almost entirely unstructured: your "value", so to speak, for a single hit is a line of text. It could be syslog text, it could be app log text, it could be anything else, so you never really know for sure how to parse it. There's even some variation within syslog itself; sometimes you may not get the service name, and so on. I feel this approach is overutilized. You see a lot of people logging basic metrics, especially from periodic jobs, and they really should be doing less of that, especially considering how much time humans spend worrying about logging: how to parse it, how to archive it, how to search it. It's just not productive.

On the other side you've got metrics, which are much more structured. You usually have strongly typed values for each kind of metric, and, inherently or otherwise, there's usually a unit associated with each one, whether it's bits per second or megabytes consumed, so it's much more easily parsable. Unfortunately, not enough pieces of software are instrumented at the level they should be, so in a lot of cases we're very limited in how we can parse that data and use it intelligently in an automated fashion. Probably the best example is that if we had much more comprehensive information, we could conceivably write scripts that would act as frontline responders to warning or error events before a human actually has to get involved.

So, one step toward this: there's a whole slew of log consolidation frameworks out there. For simplicity's sake, and just as an example, we're going to stick with rsyslog today. I'll mention a couple of the others later because they're quite versatile, but in this case we've got an example application in Python that uses OpenStack-style logging. It's logging to the local0 facility in syslog, and we're throwing a simple warning message saying we can't do some operation, conceivably open a door. The next processing step that receives this message is the local syslog daemon on the machine where the app server is running. What we want to do to centralize these logs, at a bare minimum, is choose either UDP or TCP based on your requirements. If you want to make sure that every single log message absolutely gets to its destination, go TCP. If you can afford to lose a few in aggregate, go for UDP; there's a lot less overhead involved. You can see the configuration that would be involved here: this is literally all you would need in the local rsyslog configuration file, with two at signs denoting TCP and a single at sign denoting UDP.
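To make the application side concrete, here's a minimal sketch, assuming a plain Python service rather than the exact demo app from the slides; the logger name and message are just illustrative.

```python
# Minimal sketch (hypothetical service): emit a warning to syslog's local0
# facility so the local rsyslog daemon can pick it up and forward it.
import logging
from logging.handlers import SysLogHandler

logger = logging.getLogger("door.controller")   # illustrative logger name
handler = SysLogHandler(address="/dev/log",
                        facility=SysLogHandler.LOG_LOCAL0)
handler.setFormatter(
    logging.Formatter("%(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.WARNING)

# The demo in the talk logs a warning that some operation can't be performed.
logger.warning("unable to open the door")
```

On the app server itself, the rsyslog forwarding rule is then a single line along the lines of `local0.* @@loghost:514` for TCP or `local0.* @loghost:514` for UDP, depending on which facilities you want to ship.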
We're forwarding those to loghost on the standard port, 514, and on the actual loghost we've got a slightly more involved configuration file. We're explicitly loading the UDP and TCP input modules, since they're not loaded by default, and we're also telling it explicitly which ports to listen on. We're passing just the basic local system stuff to the loghost's own syslog files, and everything on local0, which is where our app logger is logging to and forwarding from, goes straight into /var/log/stack-all.log. Whether you also keep local copies really depends on the use case; you could go either way. If you want absolutely no disk interference on your app servers, then forwarding everything with no local copies might be the way to go. But if you absolutely need durability, you might choose something like Scribe, where each node stores, then forwards, and deletes only when it can.

So that covers logging, from a very basic perspective anyway. As far as metrics go, we have this wonderful project inside of OpenStack called Ceilometer, which is essentially just metering. I know a lot of people use it for billing, but it's a lot more versatile than that. There are three main pieces to Ceilometer in terms of the way it gathers or receives data, processes it, and then pushes it back out to any number of interested parties. The first is agents and collectors. Agents run on your compute nodes, your network node, wherever, and emit information in real time to the actual Ceilometer server. Collectors usually run from the Ceilometer server and periodically go out and hit other services, asking for usage information and anything else that might be relevant. The second piece is transformers, which come into play if you have to do any kind of conversion of data or units, or any other adjustment, before you actually process it. They're totally optional; you don't have to do anything with them at all, and that's actually the default behavior. The final step is publishers, and you can have any number of these. You can write your own publisher plug-ins, the same way you can write agents, collectors, and transformers. They take care of broadcasting all the processed information to other systems. You could have one that goes straight back into another AMQP cluster, or down to the file system if you really wanted to shoot yourself in the foot, or into something like Riemann, which we're going to talk about today and which is more of a time-series database.

So here's an abbreviated example of a publisher plug-in that I wrote for Riemann. As soon as everything flows through the agents, collectors, and transformers in Ceilometer, it flows into all the publishers we have in our pipeline. We receive a number of samples, and we're adhering to the standard publish_samples interface here, so we're just iterating through all the samples we've received. There's some minor abbreviation here, just for space reasons, where we convert each sample into a payload that Riemann is going to understand. As a preface, Riemann works off of protocol buffers, so we have a limited number of fields that we can send across; some of them are completely free text, others are a little more stringent.
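Since the slide with the plug-in is abbreviated, here's a rough sketch of the shape such a publisher takes, assuming the third-party bernhard Riemann client rather than hand-rolled protocol buffers; the class name, defaults, and field mapping are illustrative, not the exact plug-in from the talk.

```python
# Sketch of a Ceilometer-style publisher that pushes samples into Riemann.
# Uses the 'bernhard' Riemann client for brevity; the plug-in described in
# the talk speaks Riemann's protocol buffers directly.
import bernhard


class RiemannPublisher(object):
    def __init__(self, host="riemann.example.com", port=5555):
        # Illustrative defaults; the talk's plug-in carries its own defaults.
        self.client = bernhard.Client(host=host, port=port)

    def publish_samples(self, context, samples):
        # Ceilometer hands publishers a batch of samples; each one becomes a
        # Riemann event, with the meter name as the service and the sample
        # volume as the metric.
        for sample in samples:
            self.client.send({
                "host": sample.resource_id,
                "service": sample.name,            # e.g. "cpu_util"
                "metric": sample.volume,
                "state": "ok",
                "ttl": 86400,                      # the generous one-day TTL
                "tags": ["ceilometer"],
                "description": str(sample.resource_metadata),
            })
```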
This event dictionary just shows you all the different fields you can use with Riemann. After all that, we send each of these messages as a payload through protocol buffers from the plug-in. I omitted it here just for brevity, but similar to all the other publisher plug-ins in Ceilometer, there are some default values: the plug-in's default Riemann port is 5554 over UDP, and I went a little crazy with the default TTL, making it upwards of a day. But that's pretty much all I would need in my Ceilometer configuration for this plug-in.

The actual pipeline definition in Ceilometer is quite easy to digest, and you can have multiple instances of these within the same file, so you're not limited to just one. In this case I've bumped the interval way down from 60 seconds to 2 seconds, just for testing and to throw a serious amount of data at Riemann. You can selectively specify which meters you want to include in these payloads; in this case we just want everything, and we'll figure out later whether we want to discard any of it. We have no transformers, since there's no need. For publishers, you can see we've got a plug-in-specific URI scheme just for Riemann. There's a little bit of magic under the hood, which I'm not sure I fully understand yet, for how this gets translated into a class name, but it works. Aside from that, we've also got some plug-in-specific transport options; here I'm specifying explicitly that I want to use TCP, since Riemann supports either one.

So now that we have a plug-in and data flowing through, now what? This looks a lot more daunting than it is, because we've already covered a third of it. We've got all of our Ceilometer infrastructure in place, receiving data and optionally receiving errors if services choose to publish those errors over AMQP. We've got all of our syslog servers publishing into that one centralized host, which could be running rsyslog, Logstash, Fluentd, Flume, or any of a whole variety of others. It seems like this week I've been hearing a lot about Logstash; it seems to be the recent favorite, for good reason. But they all have the ability to plug into multiple inputs and multiple outputs, so they're all pretty versatile. In the next step here, just for demonstration's sake, I have them going into a queue in case we get backed up; we don't want Logstash or rsyslog or anything else taking the brunt of that. Once those get queued up (and I've left a blank space here intentionally that we'll come back to later), they would conceivably go straight into a Riemann server through protocol buffers. Right away, Riemann can determine, based on the filters in its configuration file, how much of that it needs to discard and how much it needs to actually process and act upon, whether those actions are alerts or firing off outside processes to try to preemptively fix a problem before a human has to.

One of the advantages is that you can actually create an entire topology of Riemann servers. If you have an extremely busy front-line Riemann server, or a whole array of them behind a load balancer or DNS round robin, each of those can still forward, as long as they have the same configuration, to other Riemann servers behind them, based on the service name or host name the log messages are coming from.
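For reference, the pipeline definition being described looks roughly like this; the exact YAML schema varies between Ceilometer releases, and the riemann:// URI scheme and transport option belong to the custom publisher plug-in, so treat this as a sketch rather than the exact file from the slide.

```yaml
---
# Sketch of a Ceilometer pipeline.yaml entry along the lines described above.
-
    name: riemann_pipeline
    interval: 2              # bumped down from 60s just for testing
    meters:
        - "*"                # take everything; filter later in Riemann
    transformers:            # none needed here
    publishers:
        - riemann://riemann.example.com:5554?transport=tcp
```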
So, for instance, let's follow the path from the master up to Riemann A here. In this case we might have received a pretty critical error, let's say from nova-compute. We obviously want to notify a human about this, so we send out an alert, and they handle it from there. The other route is a service that ends up being filtered down to Riemann B or C. Let's say, for the sake of discussion, that all of your Cinder log messages and stats are being filtered down to Riemann C. With Cinder, if you have a fairly large Cinder infrastructure, one node is not going to matter that much if it falls over, so you might feel a little better about automating its restart, or pulling information about what's happening to it, or just taking it out of a load balancer until you can look at it yourself. In this case we fire off a script that may do any of those things; it could be Python, it could be Bash, it could be whatever you want. Hopefully it finishes successfully, and if not, it's very important that the Riemann master knows about it, because you don't want to end up in an infinite loop of the same events triggering the same action over and over again without getting anywhere. This is all data you can filter on, by tag, description, host, service, just about anything you can imagine; we'll see the full list in a second, and I'll show a rough sketch of this kind of routing in a moment. In the other case, if the script fails and there's pretty much no way to recover, we can also shoot back up to where the megaphone is and actually alert a human.

Riemann, just as a little bit of history, was named after a mathematician who, by all accounts, made some pretty significant contributions to the field of mathematics, particularly in analysis and, to some degree, statistics, and who by some accounts contributed enough to make it possible to mathematically describe the theory of relativity. So obviously a pretty important name. The project itself was started a couple of years ago by a guy called Kyle Kingsbury, and you can find his contact information here; he's usually happy to talk about anything related to this. The project is only a couple of years old, but it's got about 36 contributors now. This tends to scare a lot of people off right away, but it's written in Clojure, which is basically a Lisp on top of the JVM, a Lisp variant. It's actually quite expressive and a very concise language, and we'll run through a couple of configuration file examples where you'll see, first, how the configuration file is actually a piece of code, and second, how simple it is to express some of these really advanced alerting and statistics use cases. As a good overall statement of what Riemann is as a product: it's a low-latency, transient, shared state for systems with many moving parts. I usually describe it as an in-memory time-series database, to some degree, that's just a moving window in time. It's not meant to do any kind of backing storage, but it has the ability to pass data off to other systems. For instance, you can have it push all of this data into Graphite. I believe there's at least one more backend driver, but I can't remember what it is at the moment. Graphite is pretty popular and everybody seems to like it, so you can still get that functionality out of it.
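Since the whole configuration is just Clojure, the routing and remediation flow described above might look something like the following sketch; the service names, regex, remediation script, and addresses are all assumptions, and the exact stream helpers available depend on your Riemann version.

```clojure
; Sketch: page a human for critical nova-compute errors, but try an automated
; remediation for cinder first, and only alert a human if the script fails.
(require '[clojure.java.shell :as shell])

(def email (mailer {:from "riemann@example.com"}))

(streams
  ; Critical compute errors go straight to a person.
  (where (and (service "nova-compute") (state "critical"))
    (email "oncall@example.com"))

  ; Cinder errors get one automated remediation attempt; emailing on a
  ; non-zero exit keeps us out of an endless retry loop.
  (where (and (service #"^cinder") (state "error"))
    (fn [event]
      (let [result (shell/sh "/usr/local/bin/remediate-cinder.sh"
                             (str (:host event)))]
        (when-not (zero? (:exit result))
          ((email "oncall@example.com") event))))))
```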
And you can also use Riemann's API to pull data back out of Graphite's Whisper files, so it works both ways.

So what does it actually do when it receives a piece of data? The first step, based on the configuration file you give it, is that any event that comes in (and all of these events are tracked individually in memory, in this moving window) is checked against all the filters you've defined, any thresholds you've defined for alerts, and any other calculations or whatever else you want to do. The second thing it does, for anything that passes, is perform the actions you've defined for each of those cases. The third part is that, if you've configured it to, it will send all of this event data to a backend like Graphite. I think there are plugins now for sending data straight into Librato, and possibly New Relic, though I'm not quite sure on that one; they're starting to add more and more support not just for self-hosted backends, but also for third-party hosted services that do a lot of additional analysis for you. And the final step, since it's a moving window in time, is that it eventually has to expire any events in the index that are no longer within that window. You can force something to stay in the index by explicitly specifying a TTL, so if it's something you deem a significant event, you can write a small piece of code within the config file, or even before it gets to Riemann, that defines that TTL. If you wanted a week or a month, you could retain it that long regardless of the normal expiration in the index.

These are all of the different fields you can use in your protocol buffers payload. You've got a simple integer Unix-epoch timestamp. You've got the TTL in seconds, which is more of a countdown than a timestamp. You've got the originating hostname and originating service, which really could be anything. State is just free-form text that you define, so it's warning, critical, ok, whatever you want to pick. The metric itself is the value associated with the event; it's the actual measurement of interest. Traditionally you would think of the service and the metric as the key-value pair you would use in something like StatsD or a whole slew of other systems. Tags are optional; you can attach actual tags to an event, so if you wanted to tag events as dev or prod or QA you could certainly do that, and that way you don't have to send out alerts about your dev environment. The description field is also free-form. This is where log messages would come into the same central store as your metrics, and that includes stack traces; it's not limited to one line, it's as much as you want to put in there.

So here are some example config use cases. Some of these are pulled and adapted from the official Riemann docs at riemann.io; there's a whole bunch of them in there, and you could spend a day going through all of them. The first one: if you want somebody to be alerted, but you only want that person to be alerted five times within an hour about the same problem on the same host and the same service, you can use this roll-up helper, which defines how many times they should receive that alert in X number of seconds, and then of course the action that follows if it passes.
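A minimal sketch of that roll-up, loosely adapted from the patterns in the Riemann docs; the state, grouping, and address are illustrative.

```clojure
; Sketch: let at most 5 critical alerts per hour (3600 s) through for each
; host/service pair; further matching events get rolled up together.
(def email (mailer {:from "riemann@example.com"}))

(streams
  (where (state "critical")
    (by [:host :service]
      (rollup 5 3600
        (email "oncall@example.com")))))
```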
Then, for any kind of error state whatsoever, you could have it send an email, apparently to your beeper. And in the last example here, you can calculate, for each five-second interval, similar to what StatsD does except a little bit quicker, the 50th, 95th, and 99th percentiles of hits to an API in this case, which probably gives you much better insight than a flat aggregate.

Just a few more examples here. You can also keep track of the rate of total exceptions per second across all of your apps by using this first example, for anything that comes in tagged as an exception and that actually has a metric. You can separate it out by service, so that on the fly Riemann generates its own additional event, prefixed with the service name and with "exception rate" appended, and you could then graph that metric if you wanted. The last one is probably one of the more useful ones I've found. Say you have a service that just quietly goes offline in the middle of the night. There aren't a lot of ways you would normally know about that unless you had something constantly checking whether that process was alive, whether a cron job or a long-running job. What Riemann can do, if you have services set up to check in periodically, in this case every 10 seconds, which I think is about as aggressive as Neutron gets, is let you say: if I haven't heard from this service in 10 seconds, email somebody. You can group that by host and service, so whoever you're alerting won't get alerted repeatedly about the same host and service, but if another host running the same service fails, they'll get an email about that one as well. That way they can start to get a meaningful picture before they've even gotten back to the office or to a computer to actually investigate. The reason it says "where state expired" here is that any time an event expires from the index, Riemann regenerates an identical event for the same host and service with a new timestamp, except that the state is set to "expired". So it's a really handy way of checking whether a service is actually alive. I'll show a rough sketch of these streams in a moment.

So what's next? With the research I've put into this so far, I really think it would benefit a lot of people if there were a standard set of rules that could plug into a Riemann configuration file. It is a bit daunting when you first jump into Riemann, because it's all Clojure, even the config, so I think having that would help. But it would have to include things like which services you want to monitor, what log file patterns you want to look for, and which metrics you absolutely, critically want to keep track of. That way you can spend far less time digging around in log files and more time optimizing and finding the more subtle issues, much closer to the time they actually start happening. And then also a standard way of executing preemptive actions before involving a human. I'm not terribly well versed in Clojure (there's another guy on our team who is), but if it were Python it would be easy, just a Popen call to run a process, and I'm sure there's a standard library facility in Clojure that can do the same.
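Going back to the stream examples just described, here's a hedged sketch of what the percentile, exception-rate, and dead-service streams might look like; service names, intervals, and the address are illustrative, and helper availability can vary slightly between Riemann versions.

```clojure
; Hedged sketch of the three streams described above, loosely adapted from the
; examples at riemann.io.
(def email (mailer {:from "riemann@example.com"}))

(periodically-expire 10)          ; sweep the index every 10 seconds

(let [index (index)]
  (streams
    (default :ttl 30 index)       ; short TTL so missing check-ins expire

    ; 50th/95th/99th percentile of API hits, computed over 5-second windows.
    (where (service "api req")
      (percentiles 5 [0.5 0.95 0.99] index))

    ; Per-service exception rate: for anything tagged "exception" that carries
    ; a metric, emit a synthetic "<service> exception_rate" event.
    (where (and (tagged "exception") metric)
      (by :service
        (adjust [:service str " exception_rate"]
          (rate 5 index))))

    ; Dead-service detection: expired index entries come back through the
    ; streams with :state "expired"; alert once per host/service pair.
    (where (state "expired")
      (by [:host :service]
        (throttle 1 3600
          (email "oncall@example.com"))))))
```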
The other half of this, which I researched quite a bit (I almost put the cart before the horse before I had the rest of the presentation done), is the machine learning side. Ideally, especially as you get to thousands, tens of thousands, hundreds of thousands of compute nodes, or any kind of nodes, they're all going to be generating log data and metrics, and you're eventually going to reach a critical mass where there's no way you can cost-effectively hire humans to manage it anymore. You're going to have to start training a system to look for these problems proactively. And there's ten- or twenty-year-old technology out there that already does this; it's just not widely used yet outside of things like spam filtering. In the course of my research, one project stuck out quite a bit: the CRM114 discriminator, which is a reference to Dr. Strangelove. It hasn't been very well maintained, but it actually does a very good job, and it supports a number of different algorithms: one is a simple Bayesian classifier, another is a hidden Markov model, and another is OSB. What you should be doing is capturing as much meaningful error information as you can, bundling it together, and telling a system like that: this is all bad, or at least significant, data, so learn it as something that should flag a human. That way, over time, you get better and better confidence in its answers as it classifies log messages coming through in real time. You're essentially training the computers to watch all of your systems for you, which they're going to be much better at doing. It may take a while to get there, but I think the payoff in the end is more than worth it.

Combined with the actual metrics themselves, since you'd have the log patterns and the stats in the same time-series store (and again, your choice of backend is really up to you and your needs), it becomes a lot easier to identify patterns over time, whether you do that in real time or after the fact. So if you have a whole group of machines, or a whole group of hard drives, that you bought around the same time and that all came in on the same pallet, and you start to see degradation in the machines running them, it's obvious in hindsight what the problem was: probably a shipping issue or a factory issue. You can have that kind of visibility. It's not always going to be that obvious, and seeing those patterns is really going to be key to proactively managing these ever-growing clusters, whether they're compute, network, storage, anything.

One other thing I wanted to show you: there was a video a while back that the author of Riemann posted showing its actual throughput. Riemann supports WebSockets as well as a native protocol that can do TCP and UDP; the WebSockets support is there because it has a web interface. His next-generation dashboard, which I don't think has been released yet, supports something like a thousand events per second, and this was over a year ago, so I'm sure it's progressed since then. He's basically sitting there watching what looks like 10 or 11 systems and the load average on them.
And the sheer speed you get, not just on the Riemann side but also on the WebSockets side, is actually quite amazing; it's not something you would typically see in a normal dashboard. The console that comes with Riemann itself is also quite extensible. It's a little difficult to use at first because it's driven entirely by keystrokes, but I believe that's at the top of their list of things to improve. I would definitely encourage you to check out the project and see how it might help out your own monitoring solutions. I'll also mention that I worked with another member of the community to more liberally license an existing Python wrapper around the CRM114 learner and classifier, and I've published that, along with a basic syslog listener that forwards events into Riemann, on my GitHub. I certainly plan to continue looking at this further, so if there's enough interest, and interest in participating, I would love to hear from you. And with that, are there any questions?

I'd probably say load balancing would be best. It can allegedly handle up to 14,000 events per second per core on a machine. You could probably use HAProxy; if you did that, I'd use it in TCP mode, because it sounds like if you wanted to load balance it, you would want to ensure that all your messages get there no matter what. So I'd recommend something like HAProxy, since it's lightweight and very fast.

Yes. That was more of an afterthought: depending on which of those solutions you use, you could end up with a fairly taxed centralized rsyslog server, and the queue would let you offload as much as possible to something like RabbitMQ, where it can just store the messages and move on.

Right. Oh, cool. There is an API on the Riemann side that will let you get historical data. As far as streaming it, you'd probably have to just play it back.

Yeah, there are a lot of different ways you could go. At one of our company hack nights a while back, just to see if I could do it, I wrote a StatsD-to-Riemann bridge. Of course it was in Python, so it's never going to end up taxing Riemann by any stretch of the imagination, Riemann being on the JVM, but it actually worked. It was fairly limited, just because you don't have the luxury in StatsD of all those extra fields, but it's something you could easily customize, so you could say: okay, here's the system hostname, I'm going to package this in alongside this stat and send it over to Riemann. Something like that, but at the base of it you would still just have the name of the metric and the metric value itself.

All right. Thank you very much.