Why hello everybody, I'm Hynek from the Internet. You may or may not have heard of me; I do quite a bit of writing and coding and whatnot, but I have little time and I'm not that interesting. What's more interesting is the web work: I work for a very small web hosting company and domain registrar. The reason that's interesting to you is that we are big enough that we need proper metrics and logging systems in place to even function, but on the other hand we are small enough that we don't have a team that does that for us. We have to do it on the side; it's just one part of our work, and I think that makes it kind of relatable to you. At least I don't think Google sent their logging team over here to learn something from me. To make it more convenient for you, I made a page with all the links and all the concepts I'm going to mention here, so just relax and listen. The agenda is three things, basically: I'm going to talk about errors and how to get notified about them; I'm going to talk about metrics and how to know what the hell is going on in your service; and I'm going to talk about logging, how to centralize it, and why you even need it. One question ahead: who is happy with their logging and metrics infrastructure? Liar. I'm not promising you happiness — it's computers, after all — but maybe we can make it.
Maybe I can provide you with functional unhappiness, which is nice sometimes. So: errors. I'm going to start with them right away, because they happen, you have to deal with them, and they are the quickest wins to make, so I'm starting while everybody is still fully awake. Again, I have three expectations of my error notification system. First, it notifies me in a timely manner: I want to know right away when something happens. Second, I want to be notified only once. This happens to people who just use an exception-to-email logger; I once had 500 emails from such a thing. And third, I would like to have some useful context for my errors, because monitoring may tell you that something is broken, but that's not really helpful if you have no idea what is broken or what's going on. Obviously, there's a huge market of solutions; I'm going to talk about only one of them, which is Sentry. Sentry has a lot of things going for it. Most importantly, its creator, David Cramer, bought me a burrito once, so consider that my full disclosure. But it's also open source software, and it's written in Python using Django, so if you're deploying Python services you maybe already know how to deploy it. And if you don't want to do that, there's a paid, hosted solution. The plans are pretty affordable, I think, and there's also a free trial and a free plan, so you can be up and running within seconds. If you don't have any error notifications, you should try it out. So what do you get? You get instant, useful notifications by email, but also by Slack or whatever you want.
There are plug-ins for that. An email contains a traceback and some metadata, and the most interesting button, of course, is the "View on Sentry" one. A nice touch, I find, is that those emails have a Reply-To header set to your whole team, so if you're on a train and see something that exploded, you can just hit reply and give them some hints on how to fix it. The web interface offers much, much more, and my favorite button is the Resolve button, because that button is telling me it saved me a hundred emails in my mailbox. Once you fix an exception, it gets marked as resolved, and if it happens again it gets marked as a regression and you get your notification again. So basically it does exactly what you want it to do. As you can see, there's a lot more going on: there's a lot of metadata, and much of it is collected automatically. You can think of it like the Django stack trace view that many people are still serving to their customers — but it's just for you. So how do you get your data in there? The short answer is JSON over HTTP, so you can use it with any language, any framework, even assembly if you really want to. There are nicer clients for various languages; they usually have the name Raven in them. The Python one supports multiple transports — which is how the errors are delivered — such as threads, gevent, Tornado, Twisted, requests, and so on. But it also supports multiple integrations, which is basically how your data is collected automatically without you doing it explicitly. For example logging: you install a logging handler, and every exception that arrives there is forwarded to Sentry. You basically don't have to change anything in your code if you're already logging errors. For Django there's great support, there's general WSGI support, and nine more; maybe it's even more since I did the slide. But let's start simple: how do you do it vanilla?
Just like this: you instantiate a client using a DSN you get from Sentry's web interface, and then you capture. You're done; this is how you capture errors and report them into a nice interface. What I personally like is this pattern for ad-hoc tools, of which you may or may not have a lot in your operations: every exception that happens in this function gets forwarded to Sentry. You don't even have to change your functions; you can just add a decorator to them, and your errors are caught and forwarded. Integrations make it even easier. I mentioned that Sentry is built on top of Django, so the authors know a thing or two about Django, and the support is the best as far as I can tell: you add a single line, you get all 500s reported, and you can import a client from anywhere. So we're basically already done: deploy Sentry, or give David a few bucks so he can buy me another burrito, install Raven, and add a few lines to your project. If you don't have error notifications — and I really have to stress this — you are missing errors. Your customers are seeing those errors; you are not. You're losing customers. Get something done. And to make it even easier, David was nice enough to issue a nice promo code, which is, I think, a hundred bucks. I'm not getting anything out of it, but if you want to try it, there you go. Now on to metrics. What are metrics?
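As a rough sketch of the decorator pattern described above — catch, report, re-raise — here it is in plain Python. The names `capture_errors` and `report` are illustrative stand-ins, not Raven's actual API; the real client does this for you with `captureException()`:

```python
import functools

def capture_errors(report):
    """Decorator factory: forward any exception raised by the wrapped
    function to `report` (e.g. a Sentry client's captureException),
    then re-raise so normal error handling still works."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                report()  # in real code: raven_client.captureException()
                raise
        return wrapper
    return decorator

# Hypothetical ad-hoc tool: any crash inside is reported automatically,
# without touching the function body itself.
reported = []

@capture_errors(report=lambda: reported.append("sent to Sentry"))
def cleanup_job():
    raise RuntimeError("disk on fire")
```

The point is the same as with Raven's integrations: the error reporting lives at the edge, and the function body stays untouched.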
Metrics are numbers in a database. That makes them time series data, because they are associated with a timestamp. They are basically the difference between guessing and knowing, because if you want to make decisions, you need facts — I think that's accepted wisdom — because otherwise you spend weeks and months building something that's useless or even harmful, and metrics are those facts. We'll give them a quick run-through, and I'll just distinguish between system and application metrics. System metrics are something you observe on the server, like the load or how much traffic is going through. Very important; they should be collected using something like collectd, but they're not really part of my talk. What I'm talking about is app metrics, which is something you measure within your app. The simplest metrics you can have are counters: something happens and you increase an integer, which is pretty fast even in Python. Then timers: maybe you want to know how long your database queries take, maybe you want to know how long your requests take on average. And finally gauges, which I find undervalued, because they are really useful if you want to debug something. They are just numbers you want to keep track of: the number of customers online, the number of connections in a connection pool, things like that. I find them super helpful. There are many more, but these three are, in my opinion, the most important ones. So what can you do with metrics? We said they are time series data, so you can plot them, and such a plot gives you a lot of information that bare numbers don't. For example, you see development over time, so you can tell that you are running at 99% capacity every day at 12 p.m.
And if you don't do anything, it might fall over next week when you get one more customer. You also see trends, so you can tell whether you will have to scale out today, tomorrow, next week — or maybe never, because you are losing customers because you don't have proper error handling. And once you have graphs, you can correlate them, so you can see, say, requests per second versus latency: how many requests per second can you actually handle? And since they are just numbers, you can do math on them. For example, a graph of a counter is just a rising line; it's not really interesting. But take the derivative of it and you have requests per second. If you have timers, taking the average is not very useful, but percentiles are very interesting. For example: what is the request time for the slowest 0.01% of your customers? Because what if every 1000th request takes one minute? You wouldn't know it from the average, because it gets smoothed out by the other 999. But some customer regularly gets, for some reason, a one-minute request, and they may leave you over it. Thing is, math is hard. The average human has one ovary and one testicle, which is true, but it's not very useful information, and you can make the same mistake with your system or app metrics. So unless you know what exponentially decaying reservoirs are, use tools by people who do know what they are. You can also do monitoring on top of metrics, of course: you can set a hard limit for acceptable latency and alert if the threshold is exceeded. Or acceptable error rates: if you have a busy application, you usually always have some errors, and if they go out of whack, something is going on. That's actually true for any kind of anomaly: if, for example, benign errors like 404s or 401s go out of whack, something is going on and you should investigate. There's actually a whole stack called Kale, by Etsy, that's made just for finding anomalies like that.
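The every-1000th-request example above can be made concrete in a few lines of plain Python, with no metrics library involved (the latency numbers are made up for illustration):

```python
# 999 requests at 20 ms and one pathological one at 60 seconds.
latencies_ms = [20.0] * 999 + [60_000.0]

# The mean comes out around 80 ms -- it looks almost harmless,
# because the 999 fast requests smooth the slow one away.
mean = sum(latencies_ms) / len(latencies_ms)

# The slowest 0.1% of requests tell the real story.
tail_size = max(1, len(latencies_ms) // 1000)
slowest = sorted(latencies_ms)[-tail_size:]

print(mean)     # roughly 80 ms
print(slowest)  # the one-minute request, plain as day
```

This is why you look at high percentiles (p99, p99.9) of your timers instead of averages: the tail is where your unhappiest customers live.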
That's just made for finding anomalies like that so We've said They're living a database, but probably not sequel light So what are you looking for our so-called time series databases which have various features like? special querying and everything one of the most important ones that you have a roll-up of your data, so you'd Which means you have various resolutions of your data for the past because you probably don't have enough storage to store Of second resolution of all your metrics for the past year that might get expensive really fast even if you're big hard disks So you usually smooth it out somehow So you have like you want to know what the average load was a year ago per day But you want to know it very precise for the past hour So and I'm gonna introduce it to three the first one is paid and Hosted and it's really really nice You can get started immediately but by using curl and uptime and you have a curve of your load of your System I've done that. We started it like that too. The graphs are beautiful. There's a lot of goodies It's a lot of fun to work with it If you want to host them yourself The current in her phone gorilla is still graphite, which has been popularized by Etsy to and it's written in Python The front that is in Django the back end is in twisted called carbon it's finally in trusty you don't have to build it yourself and You can say that it's a widely supported standard nowadays So the the network protocol of carbon is supported by other applications too Just for just for compatibility. So the thing is it is a little bit long on a tooth so the storage configuration I just talked about the roll-ups and limits is a bit finicky and It might be not the most pretty interface you've seen today It's xjs if you don't hit a pleasure to work with it yet. And this is what happens when programmers build interfaces I mean, it's open source of there. 
So I'm not complaining, but it's clearly kind of a problem. That one is solved by Grafana, which exists really just to build pretty dashboards for Graphite, and once you install it you will probably lose a few hours to it, because it's so much fun to play with and it looks so good. Grafana also supports InfluxDB, which is the next-generation time series database, written in Go, because that's what you do nowadays. It has a company behind it that sells hosting — let's hope they don't pull a FoundationDB — and it is used by Heroku, so it's not some obscure toy for nerds; it is in production. It looks better, its storage is easier to manage, and you can tag values, which anyone who ever put server names into their metric names — as you've seen on a slide before — will appreciate. Now you can put any tag on a value and have clean names. It offers a SQL-like query language for those metrics, and a Graphite frontend, which means if you're running Graphite right now, you can point your tools at InfluxDB and it should work — but it's computers, so I'm not sure. If you're starting out today, I would recommend looking into InfluxDB first. If you already run Graphite and are merely functionally unhappy, I would not abandon ship so quickly; it's not that big of a deal. Now, collecting: how do we get the data into these databases? There are basically two approaches. The first is that you aggregate externally: something happens, and you send off a UDP packet to StatsD, or protocol buffers to Riemann. StatsD is older and also comes from the Etsy ecosystem; simple to use, simple setup. Riemann is by a super smart person, and it's configured in Clojure, so you probably have to be super smart too to use it. The good thing is it has no state.
It's super simple to set up and to use. The bad thing is you have no direct introspection, so you need at least one more service just to see what metrics are coming out of your system. In the case of StatsD you even need two, because StatsD does only aggregation and then forwards to Graphite; with Riemann you at least get a kind of rough dashboard. The second approach is that you aggregate your metrics within your application and then deliver them to your metrics database. This approach has been popularized by Coda Hale and his talk "Metrics, Metrics Everywhere", which you totally should watch if you want to get into metrics; it's super interesting and super funny. This approach gives you immediate insight into your application: you get some kind of dashboard out of the application itself, which is useful in development and in production alike. Of course, you've got state, and state is bad — state means bugs — but I personally prefer the second approach because it's more practical. So the question is: how do you do it in Python? For StatsD there's a gazillion Python clients; pick one, they all work the same. You instantiate a client with a URL, you shoot packets around, and you don't look at return values, because there are none. It's UDP; everything's going to be okay. Or not — because if your system is burning, UDP might not be the best way to send a message. The only known working solution for in-app metrics, to me, is scales. It comes with a little plethora of stats, but you have to set them up. These two I use the most: the meter stat is for something that happens per second, so basically a derived counter, and the PmfStat is a timer. Nothing else. So how do you use it?
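To see why the StatsD clients are so interchangeable, here is a minimal sketch of what they all do under the hood: format `name:value|type` and fire a UDP packet, never looking back. The class and method names are illustrative; real clients (like the `statsd` package) offer the same idea with more features:

```python
import socket

class TinyStatsd:
    """Minimal, illustrative StatsD-style client: fire-and-forget UDP."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, name, value=1):
        self._send("%s:%d|c" % (name, value))   # |c = counter

    def timing(self, name, ms):
        self._send("%s:%d|ms" % (name, ms))     # |ms = timer

    def _send(self, payload):
        # UDP: no connection, no return value, no delivery guarantee.
        self.sock.sendto(payload.encode("ascii"), self.addr)

stats = TinyStatsd()
stats.incr("signups")
stats.timing("db.query", 23)
```

No listener, no problem: the packets just vanish, which is exactly the "don't look at return values" behavior described above — cheap in the happy case, silent when things burn.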
For metering you just call mark on it, and for timing there's a context manager: you do something inside it, and you're done. By doing this alone, you get a nice web dashboard out of the app. This is the metering: you get the average for the past minute, five minutes, and fifteen minutes. Even nicer is what you get out of your timings, because you get your percentiles for free, plus some more nice statistics. And you also get all this data as JSON, so you can collect it using collectd or whatever. I personally use the Graphite periodic pusher that comes with scales: you just define the period for how often it should send out the metrics, and you're done. So we are done: you know how to collect metrics and how to store them. Now we come to logging. In an ideal world we wouldn't be logging at all, because you want to know about errors — for which we now have Sentry — and you want to know the state of your system — which is metrics. So there are people, like Armin Ronacher, who just refuse to log anything. I personally cannot get away with that, simply because we need it for some kind of bookkeeping: when customers call us, they always lie to us. They state they did not log into the server, they did not change that file, and we need a way to double-check what they are telling us. And "we" is usually not me; it's someone from support, and those people usually don't have SSH keys to our servers. So this data should be searchable in a central place. We are talking about centralized logging, and I can't talk about centralized logging without mentioning Splunk. You might see there are more money bags next to the name than on the other slides; that is for a reason, because this is enterprise software. It's not just one web interface; it's a veritable platform — they literally have an app store. It works both on-premise and in the cloud. It's great if you can afford it, but it is enterprise software.
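The timing context manager pattern is easy to picture in plain Python. This is a sketch in the spirit of scales' PmfStat, not its real API: samples are collected in-process and summarized on demand, which is where the free percentiles come from:

```python
import time
from contextlib import contextmanager

class TimerStat:
    """Illustrative in-app timer: collect duration samples in-process."""

    def __init__(self):
        self.samples = []

    @contextmanager
    def time(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the timed block raises.
            self.samples.append(time.perf_counter() - start)

query_time = TimerStat()

with query_time.time():
    sum(range(10_000))  # stand-in for a database query
```

Because all the samples live in the process, the app can serve its own dashboard or JSON endpoint over them — the immediate-insight upside of in-app aggregation, with the state (and potential bugs) that come with it.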
So the home page is full of PDF white papers, and there are a lot of webinars for you to attend, if you're into that kind of thing. More down to earth, there are Papertrail and Loggly, about both of which I have heard good and bad things, so it's a matter of taste; I'm sure you're going to be reasonably happy with either. And if you don't want to have your log files on foreign servers — which I personally don't — that's why we are running ELK. You've probably heard about it, right? It's currently the most popular stack, and it consists of Elasticsearch, Logstash, and Kibana. Let me quickly show you how it fits together. We have servers; they generate log files; those log files somehow get into Logstash, which parses them, adds meaning to them, and saves them into Elasticsearch, which is a database that is easily searchable and easily clusterable. Now the data is in there, and you can view it using Kibana, which is a web interface to all of this. And that's ELK. There's a similar solution called Graylog. It also uses Elasticsearch for storage and search, but while Kibana is only a view on Elasticsearch, Graylog does more, because — and I'm quoting here — "Elasticsearch is not a log management system." So overall it's a bit more integrated; they do more. But I'm personally not particularly fond of having yet another vendor in my infrastructure, so you have to decide for yourself. I haven't found a compelling reason to switch from ELK, but I'm sure there is one for somebody. If you have any questions about ELK, Honza Král is around, probably in some pub, and he works for Elastic, the company behind the stack, so he will be happy to answer all your questions. He's also the maintainer of the Python client for Elasticsearch. One more thing: Kibana is much more than just a web grep. There are a lot of nice things going on, like geo features, so there's a lot to discover. Now let's come to the finicky part: how do you get your data in?
How do you produce it? I'm going to say this should be the goal for you: a timestamp and something machine-readable, with as much useful context as possible. Because that makes configuration really simple: you literally tell Logstash "there's a timestamp and JSON" and Logstash will figure it out. Of course, it's just one line, but I thought you might find it more readable at that size. So how do we get there? It's a matter of context and format: you want to log out everything important, and you want to format it in a machine-readable way. And if you've tried to achieve that with the standard tools, you may have found, like I did, that it's rather tedious. So I wrote something of my own called structlog. Does anyone know structlog? Okay, let's change this. structlog is not a logging system. It's not a replacement for Logbook; it's not a replacement for the standard library's logging. Instead, it gives you a bound logger that wraps your logger. So if you are going to ask me "does structlog work with X?", the answer is yes. It also gives you a context, to which you can bind key-value pairs, and once you decide to log an event out, the context you've saved before is combined with the new key-value pairs into one event dictionary. This event dictionary runs through a chain of processors, which are just callables: a function that gets a dictionary and returns a dictionary, nothing else. The return value of the last processor is passed to your original logger. So if you're using the standard library's logging, you would return a string — for example a JSON string, or whatever format you want; return XML for all I care. structlog comes with JSON and key-value formatters. The thing about processors is really cool, because they're really just callables, so you can do whatever you want: you can pull data out of them, you can collect metrics from your log entries.
You can report errors to Sentry from them, enriched with the context you've collected, which is really nice. So this handles both context and format. Let me give you a few examples, because it's a bit abstract. Simple case: you get a logger — pretty much everything is configurable — and you can now log using key-value pairs. Yes, you can stop writing prose. If you're anything like me, I hated writing prose — but what's even worse is parsing prose. This output is completely configurable; this is the default, which is just key-value pairs, and that's human-readable in development. I find this is already a huge improvement over the standard library, but you can do more. This is incremental data binding: again you get your logger, and I can just start binding key-value pairs to it, and this log object is a new object every single time. This is immutable data; there is no mutable state at all. Ask your Haskell friends — it's a great property to have. In the end, everything you bound to the logger gets logged along with the event. Again, the output is configurable, and please notice that you don't care at all how the data is represented within your business code. That's something you care about somewhere else — in a processor, in your logging module — but not in your business code. You just bind key-values and log them out. Now, maybe even more practical: how do you use it in practice? This is a Pyramid view — a very simple one, but it would probably work the same with any other framework. At the beginning you bind the request object to your logger, and then you log something out. And how do you do something useful with that object?
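The immutable, incremental binding described above is easy to demonstrate with a toy version. This is not structlog's implementation or API, just a sketch of the core idea: `bind()` returns a new logger and never mutates the old one, and the final event dict merges context with the per-call key-value pairs:

```python
class BoundLogger:
    """Toy sketch of structlog's core idea: an immutable logger
    carrying a context dict of key-value pairs."""

    def __init__(self, context=None, output=None):
        self._context = dict(context or {})
        self._output = output if output is not None else print

    def bind(self, **kw):
        # Returns a NEW logger; self is never mutated.
        return BoundLogger(dict(self._context, **kw), self._output)

    def msg(self, event, **kw):
        # Saved context + new key-value pairs = one event dictionary.
        event_dict = dict(self._context, event=event, **kw)
        self._output(event_dict)
        return event_dict

log = BoundLogger()
log = log.bind(user="sleepycat")
log = log.bind(ip="10.0.0.1")
log.msg("login", status="ok")  # emits user, ip, event and status together
```

Since each `bind()` produces a fresh object, you can hand a bound logger to helper functions without worrying that they pollute anybody else's context — the same property the real library gives you.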
So you write a processor that extracts the data: you try to remove the request from the event dict, if there is one. Once you've removed it, you add some data from the request, like the IP address of your client or the ID of the user, and you return the new dictionary. And this is what you get out of it, in case you have a JSON formatter installed. Again: you did not care about what gets logged out in your view; that's something you decide elsewhere. That's all I'm going to say about structlog. If you have any questions, just talk to me; I'm pretty proud of that one. Now something slightly sadder: let's talk about the standard library's logging. I'm going to say this is all you should do, and ignore all the rest: just log to standard out and handle the logs outside. Because UNIX has had over 40 years to develop solid logging tools, and there's absolutely no need for us Python people to reinvent the wheel — like timestamping or log rotation; we do it worse. Stop doing it; just go to standard out. Also, I've heard that it's not that much fun to use, but you be the judge. So now you have structured data on standard out — what next? Send it into a file, or send it to syslog, or to a queue like Kafka; pipe it into a logging agent like a log forwarder. You can do whatever you want; it's just a pipe. I'm personally a bit paranoid: I don't log a lot, but what I log is important to me, so I don't want to lose any log entry. And no network in this world is as reliable as ext4, so I save everything to a file. This file is rotated, entries older than 48 hours are deleted, and I ship it off from that file. So while I do not want to have to use grep...
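The processor just described is an ordinary function with structlog's processor signature: it receives the event dict and returns a (possibly modified) event dict. The request attribute names below are Pyramid-style and illustrative:

```python
def add_request_info(logger, method_name, event_dict):
    """Processor: replace a bound `request` object with the fields
    we actually want in the logs (client IP, path, ...)."""
    request = event_dict.pop("request", None)
    if request is not None:
        event_dict["client_ip"] = request.client_addr
        event_dict["path"] = request.path
    return event_dict
```

The view only binds the raw request object; what ends up in the JSON output is decided here, in one place, for every view at once.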
...I still want to retain the reliability of grepping through files on the file system. So let me put it all together. I use structlog to bind data and log things out; structlog makes it a JSON string, which goes into logging, and logging sends it to standard out. I use runit to run my processes — it doesn't really matter what you're using, but runit comes with a daemon that takes standard out, adds a timestamp, and writes it to a file. Now my log entry is safe. This file is watched by logstash-forwarder, formerly known as Lumberjack, which sends it to Logstash; Logstash parses it and sends it into Elasticsearch. Logging is solved. We're done early — let's get some pintxos! Not so fast. We have three nice components; why would we forget about the pragmatic part? How do you put those three things together without making it gross? Because this is gross: you can barely see the logic hidden in the jungle of reporting, measuring, counting, and whatnot. I want it to look like this, which is much nicer: something happens, I tell the log system about it, and I'm done. Of course that's not always possible, but I would really try hard to get somewhere close. With errors it's pretty easy, I dare say: either use some handler that comes with Sentry — just logging, or if you're using Django, the Django app — or just structlog. That's what I do when I'm using Pyramid.
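The Python end of that pipeline — "just log to standard out, let the outside handle the rest" — is a few lines of stdlib configuration. A minimal sketch, assuming the messages are already machine-readable strings (here hand-written JSON; in my setup structlog produces them):

```python
import logging
import sys

# Emit bare messages to stdout; runit/svlogd (or systemd, or a
# container runtime) adds timestamps, writes files, and rotates them.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(message)s"))  # no prose added

root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)

logging.getLogger("app").info('{"event": "user_logged_in", "user_id": 42}')
```

Note there is no timestamping, no file handler, and no rotation in the Python code at all — that's the whole point.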
I just pluck my errors out of the logging stream, and I can drop entries if something is not interesting. In web apps there's usually also a way to define error views, and this is really, really cool. Again in Pyramid: you get the exception and a request object, and you serve back the error ID, which comes from Sentry. So now, when your customer calls you complaining about errors, they can tell you the exact error ID, you can look it up, and you have the very exception the customer saw. And this is so great that we've seen something even rarer than a white rhino: a happy Armin Ronacher. Although I have to say, since I made that slide he joined Sentry, so take it with a grain of salt. But still. On to metrics. Most metrics can be observed from the outside, and "outside" can mean outside of your views, outside of your app, even outside of your server. So let's have a look at WSGI containers; the two major ones both have knobs that will help you with that. Gunicorn offers StatsD integration right there: you add one command line option and you have average request times in your StatsD and in your Graphite, without changing your code at all. uWSGI, as usual, goes far, far further: of course they have StatsD too, they have direct Carbon (a.k.a. Graphite) support, and they have a whole metrics subsystem, including nightmare-inducing things like SNMP. So you get your stuff done with that, and with this you get a big picture of the state of your application without even touching your apps.
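For Gunicorn, that "one command line option" is its `--statsd-host` flag; the app module, prefix, and address here are placeholders for your own setup:

```
# Ship per-request timings and counters to a StatsD daemon,
# with no changes to the application code at all.
gunicorn --statsd-host=localhost:8125 --statsd-prefix=myapp myapp:app
```

Everything Gunicorn already knows about each request — status codes, durations — shows up in StatsD (and from there in Graphite) under the given prefix.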
So go for it. Then you can write middleware — middleware is no dark magic. Again in Pyramid, this is a "tween", which is a very awkward contraction of "between", and it's called on every request that comes in. You have the request object; in this case we just measure the time, but you can of course look at the data inside the request object and start splitting up your metrics by view, or by some argument that you're passing into your view. You probably won't have to, because there are things like pyramid_statsd that already do that for you, but you always have the possibility to do things from within your app, yet outside of your actual logic. And of course you can extract data from logs, because if you log something out, you shouldn't have to also count it or measure it. Logstash will do that for you; it supports all major metrics backends. The drawback is that you have to change the configuration of Logstash, which may or may not be a problem for you. It's not really a problem for me, but it adds friction, which I do not like, so I don't do that: I don't want to annoy the people who are responsible for Logstash every time I add a new metric. Of course, you can do it with structlog — that's what I do: you can just count events by their names and you already have something useful. Finally, you can also leverage monitoring, which is even further outside. Any monitoring system has some support for metrics — numbers. In the worst case, you just measure the time it takes to execute a check and save it. That gives you a really external view of the behavior of your apps, which is not very precise, of course, but sometimes it's useful to see how your system feels from outside of your availability zone or your data center. Okay, so what's left? What do you have to do yourself? If you want to measure a code path, you probably have to add some code to your business logic.
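A timing tween really is just a factory returning a wrapper around the request handler. The sketch below uses Pyramid's tween factory shape (`(handler, registry)`), but nothing in it needs Pyramid; `STATS` is a stand-in for whatever metrics client you use, such as a StatsD timer:

```python
import time

class _Stats:
    """Stand-in metrics client so the sketch is self-contained."""
    def __init__(self):
        self.recorded = []
    def timing(self, name, ms):
        self.recorded.append((name, ms))

STATS = _Stats()

def timing_tween_factory(handler, registry):
    """Called once at startup; returns the per-request wrapper."""
    def timing_tween(request):
        start = time.perf_counter()
        try:
            return handler(request)
        finally:
            # Recorded even for requests that raise.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            STATS.timing("request.duration", elapsed_ms)
    return timing_tween
```

From here, splitting metrics per view is just a matter of looking at the request object before picking the metric name — all without touching the views themselves.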
For example database queries. Or if you have certain major use cases — like a view that sometimes uses only cached data and sometimes hits the database — it's not very useful to average those two numbers together. Not to say it's completely useless, but you may want to split it up. And of course gauges: if you want to expose numbers from within your application, you will probably have to touch your application in some way. And now we are really done. So what did you learn? Proper error logging is important; Sentry is awesome. Metrics are important; InfluxDB is probably the future, Graphite is the present — use whichever of the two you want. Centralized logging saves you a lot of pain, and maybe you even need it; ELK will have your back, and structlog will help you get your data there. And now you know how to use all of them with Python, without gross code duplication. So I hope everyone learned something — go forth and measure. Check the talk page, follow me on Twitter, and tell your German-speaking friends to get their domains from Variomedia. Thank you. And I'm sorry, I'm not taking any questions, because whenever I did, I completely misunderstood the question and said something very stupid. So if you have any questions, I will be outside. I'm here through Sunday, I will be at the sprints, I will be at lunch — just chat me up. I'm happy to answer any questions. Thank you.