My name is Brian Klein, I work at Softflare, and today we're going to look at a couple of ways you can tie together several pieces of software that work very well together, to mitigate — or at least alert to some intelligent degree about — things that are going wrong in your cluster or any kind of distributed system.

First, if you weren't in Hong Kong or haven't seen the video: I gave another, similar presentation there that is more like the first half of what we're going to cover here, and this is a more refined approach. To recap, we talked about how you might want to combine both your metrics and your logging within the same system. That's usually a major pain, because logging is very unstructured — it's never guaranteed what format it will be in, or that it will be easily parsable — while metrics and telemetry are very structured, usually with specific data types and specific units, which makes them very easy to use for automation purposes.

We also talked about how to use an open source piece of software called Riemann, which is more or less a moving-window, snapshot-in-time view of all the systems sending metrics to it. On the back end it can feed those into Graphite, you can write your own modules to send to other systems, or you can not send them anywhere at all, and it will automatically expire events as they fall out of that time window. It's really good if you have a ton of distributed systems that need some kind of centralized shared state that is very low latency and very high capacity.

Then we touched briefly on doing text classification of log messages with a piece of software that's more popular in the spam-filtering community — a somewhat older package, but it works really well — called CRM-114. And if you understand that reference, hats off to you.
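To make the Riemann side concrete — this is a sketch, not from the talk: it assumes a Riemann server on localhost:5555 and the third-party `bernhard` Python client, and the host/service names are made up — the TTL on each event is what drives the moving-window expiry:

```python
# Sketch: emit a disk-latency metric to Riemann with a TTL so it expires
# out of Riemann's moving window automatically. The Riemann address, the
# "bernhard" client library, and all host/service names are assumptions
# for illustration, not details from the talk.

def make_event(host, service, metric, ttl=10.0, tags=None):
    """Build a Riemann event as a plain dict; ttl (seconds) controls when
    Riemann expires the event from its in-memory index."""
    return {
        "host": host,
        "service": service,
        "metric": metric,
        "ttl": ttl,
        "tags": tags or [],
    }

if __name__ == "__main__":
    event = make_event("storage-node-01", "disk.sdb.await_ms", 12.5,
                       ttl=30.0, tags=["swift", "disk"])
    try:
        import bernhard  # pip install bernhard (assumed available)
        bernhard.Client(host="localhost", port=5555).send(event)
    except ImportError:
        print("bernhard not installed; event would be:", event)
```

Because the event carries its own TTL, a node that stops reporting simply ages out of the window — no explicit deregistration step is needed.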
We also talked a little about how you can combine all of that to do alerting and automation behind Riemann once it has processed events. You can look for specific patterns — specific statistical patterns that occur in your metrics or your logs as you parse through them — or, alternatively or in tandem with the alerting, take some automated action on the back end to try to fix some of the simpler, ankle-biter problems that keep recurring.

The end goal is to make our cluster, or at least the management piece of our cluster, a little bit smarter — the piece that sits there watching 24 hours a day when we can't. One place to start is reactively resolving issues that crop up again and again: the smaller issues that come up on a daily or semi-daily basis. It takes time to get automation right, it takes time to dig in, and not all of us have a lot of spare time to jump into that. Then you might progress up to proactively looking at the metrics coming in, and at the log messages that indicate degradation of specific pieces of your cluster — things that might be subtly warning you, with a specific signature or pattern, of some kind of impending failure or impending degradation.

You want to start with the low-risk stuff, obviously — you don't want to go crazy and cause your cluster to implode — so very common, very easy things that you can reproduce without a whole lot of effort. You also want to treat any kind of automation like this with a ton of testing, because these things are going to be touching your live systems while you're not there looking at them. You want to make absolutely sure that what you've reproduced is an exact copy of what's going on, and that it is
within a very high degree of confidence, always fixable in a specific way.

So how do we get there? In Hong Kong I had a slide that illustrated a sample architecture for all these different moving pieces; it took up a whole slide, and it was kind of ridiculous. But aside from all those moving pieces, there's a specific package we'll get into today that allows a lot of this to take place inside of itself. It's very self-contained, but it also has pluggability on inputs and outputs and an incredible amount of filtering capability — so obviously we're going to be talking about Logstash a little bit.

The first step in all of this is to centralize your logging, which we covered in Hong Kong as well: basically, get your syslogs and everything — at least your application-level logs — into a central rsyslog server, or into something like Logstash, which has a syslog input plugin (that's the actual nomenclature: the input). From that point on you've got a huge number of filters — I think there are two pages' worth if you go to the website and look. Just a couple of examples: you've got grok, where you try to parse and make sense of some unstructured text, and then tag it, parse out fields, or remove certain fields — whatever you want to do. You can drop an entire event. You can also mutate an event: take a log message or a metric that has come into your Logstash system or your Logstash cluster, add a tag to it, remove fields that came in via the input, and so on and so forth. You can also clone the event — maybe you want to keep the original in Logstash's back end, but also take a snapshot of it, modify it, make a more meaningful metric out of it, and keep both. One thing you want to be careful of is not to produce an endless loop, because you'll end up with piles of cloned events that way.

So the next step is to tie together some of
the classification you might want to do with logging, through Logstash. As I said, there's a ton of plugins for filtering, and you can write your own very easily with Ruby — you can do pretty much whatever you want. So you can write a custom filter to invoke the CRM executable on your system, feeding it a preset definition of how it's supposed to classify — the different algorithms you might want to use. In Hong Kong we covered, very briefly, a Bayesian filter that you have to train initially to get it smart to a certain degree of confidence: you want to feed in enough data that you don't get too many false positives, but not so little that you end up with a useless classification. Once that executable is done running — which is usually very quick — you can easily modify the event in Logstash by adding a tag to it: maybe "classification", maybe "crm", whatever you want to call it. That goes to the back end with the rest of the event. And you can use filters to build just about anything else you want — if you can write it in Ruby, you can use it as a filter.

On the back end, behind Logstash, you typically have Elasticsearch as the eventual storage for all your events, and you can define the indexes as you want — sorry, index the fields you want — to be able to search very easily, and you can really scale that piece out separately from Logstash if you need to. There's also a Riemann output plugin, for those of you who may be intrigued by Riemann — I'd encourage you to at least check it out, and I'll have the URL at the end, here on the slides. Again, that's the moving-window shared state. You can query it as well; it uses a very simple protocol-buffers protocol, so queries are very quick. And you can also use the exec output to fire off any other executable.
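Pulled together, a pipeline like the one described might look roughly like this as a Logstash configuration. This is a sketch only — the grok pattern, field and tag names, the classify.crm path, and the backend addresses are all hypothetical, and the ruby-filter event syntax shown is the old 1.x-era style:

```conf
# Sketch only: names, paths, and addresses are hypothetical.
input {
  syslog { port => 5514 }            # centralized syslog input
}

filter {
  # grok: pull structure out of unstructured text
  grok {
    match => [ "message", "%{SYSLOGTIMESTAMP:ts} %{GREEDYDATA:body}" ]
  }

  # mutate: tag the event, drop noisy fields that came in via the input
  mutate {
    add_tag      => [ "swift" ]
    remove_field => [ "facility_label" ]
  }

  # clone: keep the original, derive a second event to enrich separately
  clone { clones => [ "derived" ] }

  # custom classification: shell out to CRM-114 with a trained .crm file
  ruby {
    code => 'require "open3"
             out, _s = Open3.capture2("crm /etc/crm/classify.crm",
                                      :stdin_data => event["message"].to_s)
             event["crm_class"] = out.strip'
  }
}

output {
  elasticsearch { host => "localhost" }   # eventual storage / search
  riemann       { host => "localhost" }   # moving-window shared state
  # exec { command => "/usr/local/bin/remediate.sh" }  # fire off a fix
}
```

The clone-then-enrich step is where you'd be careful about loops: make sure the derived event can't match the filter that produced it.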
If you want Logstash to be the one responsible for executing the fixes you might have written, or the alerting — whatever you want done — exec basically just lets you call out to an executable.

As a use-case example: Swift is a very, very redundant system if you set it up correctly, but busier clusters are not without their problems, and simple things like drive failures can still take a fairly good amount of time to resolve on a human's part. Whether it's a simple I/O error or a simple accidental unmount, there's still some time involved in investigating and fixing those.

So, some examples here. First, a disk I/O error — one of the more severe ones you might run across, but not uncommon. You could try writing a script that does a sanity check once it sees the I/O error in the logs, to make sure the device is still there. If it is, try to unmount it; if you get an error, you've probably got an issue you need to run a check on. But if it looks like the drive is going to fail in some way, or is on its way, you almost certainly also want to start gradually decreasing that device's weight, so you get it almost entirely out of the ring — or out altogether — so you can have a better look at it.

Next: a new disk gets hot-plugged into a system that has a lot of capacity. If the drive is empty, then maybe after a certain amount of time — just in case you're using that disk for something else and want to explicitly do something different with it, the time window gives you that opportunity — you can automatically create the filesystem, add the device to the ring, and gradually increase the device's weight in the ring, so that you don't completely overload the cluster, or the node, as it gets replication traffic.
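The gradual reweighting in both of those examples can be sketched like this — not code from the talk; the builder file name, device id, and step policy are assumptions, and the swift-ring-builder invocations are only rendered as strings here, not executed:

```python
# Sketch: compute a gradual weight ramp for a Swift ring device and render
# the swift-ring-builder commands that would apply it. The builder file
# name, device id, and step count are hypothetical.

def weight_steps(current, target, steps):
    """Evenly spaced weights from current toward target (inclusive of the
    target), for draining a failing drive (target=0.0) or filling a fresh
    one (target = full weight)."""
    if steps < 1:
        raise ValueError("steps must be >= 1")
    delta = (target - current) / steps
    return [round(current + delta * i, 2) for i in range(1, steps + 1)]

def ring_commands(builder, device, weights):
    """Render the CLI invocations for each reweight-and-rebalance step."""
    cmds = []
    for w in weights:
        cmds.append("swift-ring-builder %s set_weight %s %s" % (builder, device, w))
        cmds.append("swift-ring-builder %s rebalance" % builder)
    return cmds

if __name__ == "__main__":
    # Drain a suspect drive from full weight to zero in four steps.
    for cmd in ring_commands("object.builder", "d42", weight_steps(100.0, 0.0, 4)):
        print(cmd)
```

In practice you'd leave real time between the steps — that's exactly what makes it easy to stop early if an alert fires mid-ramp.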
Simple things like running low on memory, or your CPU thrashing like crazy — if it's not a cluster-wide problem, and it's not something you see often, it's probably something you want to check out. Of course you can get notified via email or SMS — or, if you're still living in the 90s, you can have a beeper message sent. But along the way, for the sake of having all of this on a consistent timeline, you definitely want to make sure you still generate an event for these alerts each time, so you can correlate: here's when an alert was sent out, here's when a specific proactive or reactive action took place. Then, if you see further degradation after that, maybe there's a way you can backtrack it very simply. For instance, in the case of adjusting the weight of a device gradually over time until it's reached its full weight and filled up: if something goes wrong very early in the process, that's a very easy thing to stop, and it's also a very easy thing to detect. So you definitely want to have an event so that when you go back and visualize what happened, you have markers showing exactly when things took place.

The next part is the part everybody loves: how you visualize it. With Logstash there's usually one specific package everybody gravitates toward, and that is Kibana. It works very well both with Logstash and with a few others. These pictures I just copied from the website — those are actual Kibana graphs. You can query a date range or just watch it in real time, whatever you want to do. It provides real-time visualization, and you can build specific charts and views based on search queries — for your metrics, for your logs, whatever you want to look for — then define the graph type, and so on and so forth, so you can get pretty advanced with it. The search queries are very easy to use: if you've used a search engine at all in the last 20 years, you're going to be okay.
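Under the hood, those searches are just Lucene query strings posted to Elasticsearch. For example, a query for the remediation-marker events just mentioned — tag, host, and field names are made up here — sent to an index's `_search` endpoint, might look like:

```json
{
  "query": {
    "filtered": {
      "query": {
        "query_string": { "query": "tags:auto_remediation AND host:storage-node-01" }
      },
      "filter": {
        "range": { "@timestamp": { "gte": "now-1h" } }
      }
    }
  }
}
```

Overlaying the hits from a query like that on the same time axis as your metrics is what lets you line up "alert fired" and "action taken" markers against any degradation that follows.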
And just in case you hate Logstash, it works with Fluentd and Flume as well — I can't remember if Fluentd is the Apache one or not — so you've got alternatives there.

So where do you get all of this? Here are the URLs. As you can see, Logstash, Elasticsearch, and Kibana are all Elasticsearch products. I believe there's also a paid version of a somewhat different dashboard that's supposed to be more advanced than Kibana, if you want to go that route, but Kibana will certainly take you very far. If you need to do CRM-114, you can just install it as a simple Ubuntu or Debian package, and it's very lightweight. It's a very simple — well, it's not that simple, but it's a C program, so it's going to be very efficient — and there are plenty of wrappers around it too. I forgot to put it on the slide, but there was a fairly popular wrapper around it a while back that people used with Python; the author wasn't maintaining it anymore, and I talked him into MIT-licensing it instead of GPL for the purposes of using it for something like this. It provides a good wrapper around CRM-114. It's not very flexible in terms of intuitively defining what you want to filter and how to filter it, as far as the algorithms are concerned — those you certainly want to write on your own. You basically end up creating a file that defines, from start to finish, the instructions CRM needs in order to classify something, and the algorithms it uses.

This is actually being used — I think it started fairly recently — in the OpenStack Infrastructure project, to classify log messages during the automated tests that get run for every patch. From what I could tell it was just classifying success or failure, so it was a very simple test that probably didn't need a whole lot of training to get very accurate. But I believe they're using it now, and I believe as well — I'm not totally sure on this — that they're using Elasticsearch, possibly for storing, I guess, the signatures
or the log messages that they want to look for with signatures, for elastic-recheck — I could be totally wrong on that; I kind of got lost looking around after a while. And then of course there's Riemann, at riemann.io; there's a good intro video there if you just want to sit and watch and listen to how it works.

That being said: I started to put together a demo that illustrates the entire thing start to finish, and then realized it was going to take well over 40 minutes to cover both the talk and the demo — so there isn't a demo, unfortunately. But I am gathering together as much as I can, configuration-wise, from what I did, and I should pull it together and get it out on GitHub by the end of the week, so I'll have that address up shortly. I did want to take that time instead to do any sort of Q&A around the setup involved in this, or the flow, or how you might go about doing some decent classification or combining metrics. Any questions at this point?

[Question] Have you used any standalone products like Splunk, or maybe hosted services like Loggly? If so, how do they compare — what's the overlap? Would you recommend using them at the same time, or is it kind of one or the other?

It probably comes down to personal preference. If you want to keep everything internal, and you don't have a full-on Splunk license — or the amount of stuff you want to store would require a really large license — then you might want to go the Logstash route, or the Fluentd or Flume route. I'd say it's mostly a matter of personal preference, though. You certainly could use them at the same time; there's nothing to prevent you from doing that, especially if you're centralizing your logs into a single point where you can then fan out to a number of other systems — same thing for metrics as well.

[Question] Are you running any of this stuff in production today, and can you give some examples if you are? — Currently we're not, but I would
certainly like to get there. The classification I was able to do with CRM was fairly promising; it just requires a lot of training on the actual text, and so forth, to get it right and to find the right signatures — especially if you're looking for behaviors — so it's something that can be very time-consuming. It can also be very cluster-specific, because certain clusters run into very specific issues that are due to the way they're set up in their own environment. And then of course there are other issues — I think one of the recent Swift 1.13 fixes was for a regression from 1.12 where, if you had a disk I/O error like we talked about earlier, it would cause the object server to stop, and you'd have to go in and restart it. It got fixed, but little things like that are kind of the goal — and then, beyond that, the more advanced cases like pattern recognition and so forth, in terms of the proactive actions.

[Question] What have you thought of so far, and what's on the roadmap — how do you plan to get some of these actions integrated into OpenStack infrastructure management?

A good example would be — staying with Swift again, or maybe a Nova example — if you start to see a specific machine with a number of drive failures that exceeds a certain statistical percentage compared to other machines doing more or less the same kind of work, which you can see very easily through Riemann or through some Elasticsearch queries, whatever you want to use. Then you could probably predict pretty easily that at some point in the very near future you're going to have to migrate the guests off of that Nova host, if we're sticking with the Nova example. So you could schedule the migrations, and you could try to diagnose further where the issue might be — whether there are RAID-specific issues, or some underlying network issue.
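That per-machine comparison can be sketched like so — the host names and counts are hypothetical; in practice the failure counts would come from Riemann queries or Elasticsearch aggregations:

```python
# Sketch: flag hosts whose drive-failure counts are statistical outliers
# versus peers doing similar work. The counts here are made up; in
# practice they would come from Riemann or Elasticsearch queries.
from math import sqrt

def outlier_hosts(failures, num_devs=2.0):
    """Return hosts whose count exceeds mean + num_devs * stddev."""
    counts = list(failures.values())
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    cutoff = mean + num_devs * sqrt(var)
    return sorted(h for h, c in failures.items() if c > cutoff)

if __name__ == "__main__":
    week = {"node-01": 1, "node-02": 0, "node-03": 1,
            "node-04": 9, "node-05": 0, "node-06": 1}
    print(outlier_hosts(week))  # node-04 stands out from its peers
```

A machine that trips a check like this is exactly the kind you'd want to schedule migrations away from before the all-out failure arrives.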
It's really going to be, again, environment-specific, but those kinds of patterns that lead up to an all-out failure are the ones you can identify — things that, to put it more simply, we as humans can look at and see: well, we had one drive failure here, and we had three over there, so there's obviously a very different problem going on with that specific machine. Those patterns that we can easily see, we should also be able to easily identify through metrics, or through log signatures, and so on and so forth. Does that answer your question? Okay — anybody else?

[Question] Just curious whether you have any lessons learned about managing your signal-to-noise ratio, in terms of monitoring and whether you get alerted or not — and whether you have any automated ways of dealing with that, or if it's all manual tuning?

For metrics, I'd say it's a relatively simpler problem to solve — the signal-to-noise ratio — again, depending on what's going on in your cluster. Excuse me. With logs it may take a little more initial time. You may not want to try to train everything, right out of the gate, as being either something of note or something that's just routine, because that's going to be very time-consuming, and depending on what algorithm you use for classification it can also confuse the classifier and actually lower the confidence it has when it gives you a result. The best way to approach it is probably to first tackle the problems you're seeing on a recurring basis in your environment, focus on those, and then scale out from there. Does that help? Any other questions?

Let me put this slide up real quick. As a quick note, we are hiring for our object storage team. If any of this sounds remotely interesting to you — either from a DevOps perspective, or from a development perspective on the Swift side, where you'd enjoy working with OpenStack on a daily basis — please
get in touch. I'll have my information on the next slide, and you can also find me here — talk to me afterwards and I'd be happy to chat, or take any other questions then as well. So again, there's all my information. Like I said, I'll try to have some good configuration examples up by the end of the week, along with a link on GitHub to the example I mentioned from the Infrastructure project and how they're classifying log messages. So, with that — if there are no other questions…