Hello everybody. My name is Radoslav Stankov, Rado for short. You can find me on the internet at these addresses; I'm rstankov pretty much everywhere. I come from a small country in Europe called Bulgaria. It looks bigger in this picture than it actually is. I'm head of engineering at a startup called Product Hunt. I usually like to have a lot of code in my slides, so all my slides are already available at this address on Speaker Deck. I'm sharing this because I usually notice that during my talks a lot of people do stuff like this, just taking a lot of photos. My last slide will also have the link to the slides, so if you find the talk interesting you can just check it out.

One of my core beliefs about technology is that context is king. You cannot do anything if you don't understand where a thing comes from. So, a bit about Product Hunt. Product Hunt is this website. It started as a very traditional Ruby on Rails application, then moved to a single-page React app powered by GraphQL. Right now we are beta testing a whole new application called YourStack, which is built with a very similar stack to Product Hunt's but is a totally separate code base. The engineering team at Product Hunt is seven people at the moment. This is us.

Product Hunt has three tiers of apps. There is a Node.js app, which is responsible for server-side rendering, and I'm not going to talk about it today. And there are two groups of Rails containers. One is the GraphQL API server and the base Rails application. The other is a group of containers responsible for background jobs, and most of you probably use Sidekiq for that.

Usually when you start a new application, like I started this new application, YourStack, it starts like this. It looks like this fancy new car. You have a lot of fun.
You're just coding. You are playing with it. You are enjoying yourself. You are moving fast and breaking things, developing, being very happy about it. And your app starts looking a bit like this. It still works. We do some fixes here and there, we adjust some of it. All our exceptions are like cute bunnies. Oh, that's a cute thing. It's just failing, but it's not a big deal, and, you know, it still kind of works. And when you start seeing a lot of this, you start to understand: okay, there is something wrong here. The problem is that you have let the situation linger for too long.

So a couple of years ago at Product Hunt we introduced something called Happy Friday. This was a day when engineers could fix bugs, because we were doing very rapid product iterations, there were some bugs, and bugs usually don't fit into the process. There were also small extras: we shipped something and thought, okay, maybe we should add this fancy JavaScript validation, so on Friday you say, okay, I really want to add this. This was also the time when we paid down some of the technical debt from our transitions. We also used it as catch-up time: if your project would take a week, that's four days of work, and Friday, if you are not... And the final thing was fixing exceptions.

We actually stopped doing Happy Fridays a couple of years ago because we adjusted our process, but it was because of the Happy Fridays that we were able to stop having them: they had done their job. During the Happy Fridays we fixed a lot of exceptions. You just spend two hours, nothing much, and you don't need to fix everything at once. Every Friday for two or three years, basically me and a couple of colleagues spent two or three hours just fixing exceptions, one after the other, and this gives us a lot of flexibility.
The reason we stopped doing this is that now, when I sit down on a Friday to fix exceptions, I see almost nothing. It takes five minutes, and it is usually something that was shipped this week, so I just pass it to the person who broke it. So my first tip of the day is: you should have a process around exceptions, and this process should be very specific to your organization. Does it make sense to have a sprint or a planning session where you say, okay, we fix those exceptions at this point? Or not? Treat exceptions the way you treat bugs. Most probably you already have some process around bugs, and exceptions are bugs in most cases.

For the rest of the talk I'm going to give you some more actionable advice, because "make a process" is not very actionable. So let's go back to the basics. There is this amazing book by Avdi Grimm called Exceptional Ruby. I think it is the best book on exceptions I have ever read, and it has a lot of good insights into how the exception system in Ruby works.

So this is the usual code. You have a method, it does something, and it just has a bare rescue. We don't have exceptions, and that's great, we have done our job. But okay, that's not so good, because yes, our bug tracker will be silent, but the goal of an exception tracker is to give us information. If we start hiding the bugs, I associate it exactly with this image: okay, I have no idea where to go now. My application is bloody red, nothing works, but every monitoring system is silent. So if you do rescue, rescue from a specific error, and add a note about why this exception happens if it is not obvious. File errors and network errors, for example, are obvious; they will always be there. Also, don't use my name in the notes.
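The difference between a bare rescue and a specific, annotated rescue can be sketched roughly as follows. This is a minimal illustration, not the code from the slides: the method names, the settings file, and the author and date in the note are all hypothetical.

```ruby
require "json"

DEFAULT_SETTINGS = { "theme" => "light" }.freeze

# Anti-pattern: a bare rescue swallows every StandardError, so a malformed
# file (a real bug) looks exactly like a missing file.
def load_settings_silently(path)
  JSON.parse(File.read(path))
rescue
  DEFAULT_SETTINGS
end

# Preferred: rescue only the error we expect and leave a note saying why.
def load_settings(path)
  JSON.parse(File.read(path))
rescue Errno::ENOENT
  # NOTE(example-author, 2020-01-15): the settings file is optional on fresh
  # installs, so a missing file is expected. (Hypothetical note, shown only
  # to illustrate the "who and when" convention.)
  DEFAULT_SETTINGS
end
```

With the second version a missing file still falls back to the defaults, while a malformed file raises `JSON::ParserError` and shows up in the exception tracker instead of disappearing.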
We had this policy at Product Hunt: every time you leave a note, you add somebody's name. The reason is that code moves around, and a lot of the time it is really hard to trace in the Git history who added the note. So we had a very good policy around notes in code: we add the user who wrote it, and we often add dates, for example "after this date this can be removed", or "if this is not fixed by this date, it will never be fixed", stuff like that. Just basic stuff. In this case we rescue from a timeout error, the usual "the Wi-Fi doesn't work". So my second tip is: be very explicit about exceptions, handle very specific errors, and try to explain why they happen.

When you start doing that, your second best friend is your monitoring service. If you don't monitor, you don't know what's going on. You just wait for your users to send you a screenshot of how beautiful your 500 page is. At Product Hunt we have a very funny-looking 500 page, and people sometimes send it to us, and our community laughs because they think it is cute. For monitoring we use Sentry as a service, but my tips work for pretty much any other service.

So again, this is our infrastructure. We use Sentry for pretty much all four layers: the browser exceptions, the Node exceptions and the Rails exceptions. One of the tips here is that we use a separate project for each type of server: one for the Node server and one for the browser, which is very easy to do, and the browser one sometimes has errors in Japanese. And something I have not seen many people do: we actually use two projects for our Rails application.
We have one project only for exceptions that happen in Rails and one only for those in Sidekiq, because we found that the exception patterns are very different, and if you mix them you get a lot of noise. Sometimes an exception in Sidekiq gets rescheduled, retry, retry, retry, and you see, okay, I have 10,000 exceptions, but they all come from one single job for one single user. The web ones behave differently.

When I started doing this, I took a more systematic approach, and the first step was to reduce the noise. The biggest demotivator is opening your exceptions and seeing 20 pages of errors, with no idea which one is important, how often each happens and what the different types are. So the first thing I did was exclude everything that was not actionable: Rack timeouts, routing errors, invalid tokens. Sidekiq shutdown is an interesting one, because it happens when you deploy and Sidekiq is being stopped. But you should be really careful with this, because sometimes it can lead to very interesting problems. We were very careful about what we blocked, because I really don't want this situation. But this is an awesome GIF, really. It can also be used for refactoring.

So, when we started cleaning up the noise: how many of you have seen this exception, "invalid byte sequence in UTF-8"? Every time I see it, it's like that. It's just so bad. But every time, again, keep calm and read the error message. I mean, it's really bad to put "keep calm" on a red background, but again, keep calm. What we do is just check, and there is a great blog post on the thoughtbot blog, Giant Robots Smashing Into Other Giant Robots. They actually have a really good fix for it. It's very simple: you just change the encoding. And one of the things we do is put everything related to exceptions in this module, Handle.
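The talk does not show the exact encoding fix, so the following is only a sketch of one common way to do it, using Ruby's built-in `String#scrub`. The `Handle::InvalidByteSequence` name is hypothetical and simply mirrors the "handle" grouping mentioned above.

```ruby
module Handle
  # Hypothetical helper: cleans up strings that would otherwise raise
  # "invalid byte sequence in UTF-8" when matched, split, or serialized.
  module InvalidByteSequence
    module_function

    # Returns a valid UTF-8 copy of the input, dropping invalid bytes.
    def call(string)
      utf8 = string.dup.force_encoding(Encoding::UTF_8)
      utf8.valid_encoding? ? utf8 : utf8.scrub("")
    end
  end
end

# Raw bytes as they might arrive from a request: "café ", one stray \xFF, "menu".
dirty = "caf\xC3\xA9 \xFFmenu".b
clean = Handle::InvalidByteSequence.call(dirty)
```

Here `clean` is valid UTF-8 with the stray byte removed. Whether to drop invalid bytes or replace them with a marker character is a design choice, and dropping is assumed here only to keep the sketch short.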
So everything related to exceptions we grouped into its own folder, basically called handle. We know that if something is there, it is related to our exception tracking. Surprisingly, this folder is quite big now and has a lot of code. So this one is the handler for invalid byte sequences. Every time we would hit this exception, we re-encode the strings, and this works perfectly. Basically you don't get this exception anymore. So my third tip is to reduce the noise. You should only see exceptional errors. You should only see things which are actionable, things you should be able to handle. And if it is something you cannot handle, okay, maybe it is noise, or maybe you have a bigger problem.

Once you have reduced the noise, you can focus on the next thing: fixing the actual exceptions. So this is an exception we had: undefined method `status' for nil:NilClass. I imagine a lot of people have seen this error. We scroll down and we see, okay, this is in our admin. The subscription of the account is nil. Okay, we are calm. That's very simple, I know how to fix it. This is the code. We use the new syntax. Everything is great, everything is fine. No, it's not. If you start fixing code like that, you are slowly hiding your errors. Why doesn't the account have a subscription? Every time you see an error and think, okay, this should not happen, but I will just put a guard there, stop and take a systematic approach instead. First, how many accounts don't have a subscription? Second, why don't those accounts have a subscription? In this case we had a bug where, if you cancelled your trial on exactly its last day, you ended up without a subscription, because there was some logic. It wasn't very logical, but we still call it logic. So then we fixed the root problem. We fixed the bug, because, again, a lot of exceptions are caused by a real bug.
Then we added the missing subscriptions to the accounts. So we actually fixed the root problem, and only then marked the issue as resolved. We don't just hide the problem. We make our system better and actually learn things about it, which is funny, learning something about a system you wrote and created yourself. But I imagine it is the same for you.

The other thing we started doing is adding some guards. For example, if an account doesn't have a subscription, our system also logs that into the exception tracker. At some point we noticed that the exception tracker was clean and we needed to put more stuff in it, because it was getting boring, and also because it is a very actionable place. So we started manually capturing errors for things around data integrity that we wanted to be sure about. For example, when Stripe transactions fail, we log them into the exception tracker. The user sees a hopefully very friendly message, which prompts them to contact support and still give us money, and we are still handling the failure. So don't hide exceptions. Fix the root causes. Don't just rush in with your adrenaline pumping, wipe the error away and be done. Try to fix why it happens.

One of the first talks of the conference, the second one I think, was about GraphQL. We also started using GraphQL very early on. One of the nice things about GraphQL is that it has a single controller. I don't need to wonder whether this should be three controllers, or whether I can just use an update method. It's very simple. But having one controller doesn't help you very much with the stack trace in your exception tracker. Now I see an undefined method `value' for nil:NilClass, and I just see it in some random code, but I don't know which query caused it. I just see some exception. I have no idea what the query is. I don't know why this happens.
In Rails, if you have separate controllers, I can see, oh, that's from the posts controller, so it is most probably about posts. So what we added is this chunk of code. Basically, in our controller we handle the error. And we also added a bit more help. In development, for example, we return the error in JSON format, and our frontend tools pick up this exception and show it in a nice way to the developer. So when developers click around the app and hit an error, they don't need to go to the logs to check it. It's already visible to them. The same goes for tests: if there is an exception during our feature tests, we see it too. In production we don't want to show a lot of info, but again we handle the exception gracefully, we capture it, and we add this extra query parameter. With exceptions we can attach some extra information for monitoring, and here the extra info is: what's the query? So if you go back here and scroll down, we now see which GraphQL query caused the issue. That's the comment query, the one that runs when you try to create a comment. Oh, that's because we don't do some validation. So my next tip, I don't count them anymore, is: invest in your monitoring and exception traceability. Every time you see an exception, don't rush to fix it. Just think about it. Okay, am I missing some information? How can I add this information in such a way that the next time something similar happens, I know what happened?

The other set of exceptions that I was cleaning up in bulk (I remember one day I fixed about 70 of them) is the Sidekiq-related ones. One of the first things we did, again, was to have a separate Sidekiq project, and that was great.
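Going back to the GraphQL controller described above, the idea of capturing the exception together with the query that caused it can be sketched in plain Ruby. Everything here is a stand-in: `FakeTracker` plays the role of the exception tracker client, `GraphqlEndpoint` plays the role of the single controller, and capturing in every environment is an assumption rather than something the talk confirms.

```ruby
# Hypothetical stand-in for an exception tracker client.
class FakeTracker
  attr_reader :events

  def initialize
    @events = []
  end

  def capture(error, extra: {})
    @events << { error: error, extra: extra }
  end
end

# Hypothetical single entry point, in the role of the one GraphQL controller.
class GraphqlEndpoint
  def initialize(tracker:, environment:, executor:)
    @tracker = tracker
    @environment = environment
    @executor = executor # anything that responds to call(query, variables)
  end

  def execute(query, variables = {})
    { data: @executor.call(query, variables) }
  rescue StandardError => e
    # Attach the query so the tracker shows which operation failed.
    @tracker.capture(e, extra: { query: query, variables: variables })
    error_response(e)
  end

  private

  def error_response(error)
    if @environment == "development"
      # Developers get the details directly in the JSON payload.
      { errors: [{ message: error.message, backtrace: Array(error.backtrace).first(5) }] }
    else
      # Production users only get a generic message.
      { errors: [{ message: "Something went wrong" }] }
    end
  end
end
```

Rescuing `StandardError` is acceptable at this one boundary because nothing is hidden: every error is still reported, only now with the query attached.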
The separate project reduces the noise, but Sidekiq exceptions have this quality: when something fails in Sidekiq, it keeps failing until you fix it or remove it. So here is how it works. A user performs an action, and let's say our action creates an achievement. The action schedules an achievement job, and the achievement job creates an achievement record. You cannot read it here, but this is unique: you can only win an achievement once. The code for this is quite simple. We have a job, its perform calls a service object that creates the achievement for the user, and everybody is happy.

Okay, but users are fast. Users can click quite fast. So this user, for example, does two actions really quickly, faster than our system. And we start to have a race condition, where the second job starts executing before the first one has finished. When Rails does a uniqueness check, it asks the database whether the record is already there and gets a response, and while that response is on its way back, another transaction can commit the same record. So the safest way to get uniqueness is a unique index. In this case the second insert would blow up, with one of a couple of exceptions. One of them, if you are on Postgres, is the Postgres unique violation. So you can imagine that in the achievement creation we first ask, okay, does this user already have the achievement, but Rails does not handle the race for us. So we have another handle thing, something we call handle race condition, which is not the best name, but it kind of stuck. The idea of this helper is very simple. It executes a block of code, it rescues the exceptions that can be caused by a unique violation, and it just runs the code again. The way this works is that it expects your code to know how to handle duplication.
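The retry-on-unique-violation helper described above might look roughly like this. It is a sketch under assumptions: `Handle::UniqueViolation` is a stand-in for the database error (in a Rails app this would typically be `ActiveRecord::RecordNotUnique`), and `AchievementTable` is an in-memory substitute for a table with a unique index.

```ruby
module Handle
  # Stand-in for the error raised by a unique index (hypothetical).
  class UniqueViolation < StandardError; end

  module RaceCondition
    module_function

    # Runs the block; on a unique violation, runs it again (once by default).
    def call(retries: 1)
      attempts = 0
      begin
        yield
      rescue UniqueViolation
        attempts += 1
        raise if attempts > retries
        retry
      end
    end
  end
end

# In-memory stand-in for a table with a unique index on [user, achievement].
class AchievementTable
  def initialize
    @rows = {}
  end

  def exists?(key)
    @rows.key?(key)
  end

  def insert(key)
    raise Handle::UniqueViolation, "duplicate key #{key.inspect}" if @rows.key?(key)
    @rows[key] = true
  end
end

table = AchievementTable.new
key = ["user-1", "first-comment"]
stale_check = true # simulates an existence check that ran before the other job committed
table.insert(key)  # the other job wins the race

outcome = Handle::RaceCondition.call do
  exists = stale_check ? false : table.exists?(key)
  stale_check = false
  if exists
    :already_awarded
  else
    table.insert(key)
    :created
  end
end
```

On the first pass the stale check misses the row and the insert hits the unique index. On the second pass the existence check sees the row, so the block finishes without raising. This is why the block itself has to know how to deal with a duplicate.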
For that to work, you need to have added a check for whether the thing already exists or not. And this helper is surprisingly resilient and has cleaned up so much code, because, again, users click fast.

Another thing we do in Sidekiq is send notifications: we need to send emails, push notifications and other things like that. So how many of you have seen this nicely named error? It is one of the best error names. You have that one and you think, oh man, again, one of those. And you say, okay, I will fix this: we just rescue from it and retry the job, that's great. Then you see another exception, because now there is a timeout in there for some reason. All right. And then you add and add and add, and, I mean, there is a limited number of these errors. So what you can do here is what we did: we created this handle network errors module, which lists all the network errors and friends, and it redefines the triple equals method. By the way, that's a trick from this book.
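The triple-equals trick works because `rescue SomeMatcher` asks `SomeMatcher === error` to decide whether the clause applies. A minimal sketch follows, with a hypothetical module name and an illustrative, certainly incomplete, list of error classes:

```ruby
require "net/http"
require "openssl"
require "socket"
require "timeout"

module Handle
  module NetworkErrors
    # Illustrative list only; a real app would add the errors raised by
    # whichever HTTP gems it depends on.
    ERRORS = [
      Errno::ECONNREFUSED,
      Errno::ECONNRESET,
      Errno::EHOSTUNREACH,
      Net::OpenTimeout,
      Net::ReadTimeout,
      OpenSSL::SSL::SSLError,
      SocketError,
      Timeout::Error,
    ].freeze

    # `rescue Handle::NetworkErrors` calls this method, so the module can
    # stand in for the whole list.
    def self.===(error)
      ERRORS.any? { |error_class| error_class === error }
    end
  end
end

# Hypothetical helper showing the matcher in a rescue clause.
def with_network_retries(attempts: 3)
  tries = 0
  begin
    yield
  rescue Handle::NetworkErrors
    tries += 1
    retry if tries < attempts
    raise
  end
end
```

Any error outside the list, such as an `ArgumentError`, does not match the clause and propagates immediately, so real bugs are not retried or hidden.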
So now I can handle all network exceptions with a single rescue. It groups all my network exceptions, and it just retries. Every time I need to handle the network, I just use this. But that's a bit repetitive, because I still have to write the rescue and the retries every time. We had a monkey patch for this on Rails 5, but now on Rails 6 you can just use retry_on. You can say, I wait two minutes, because if it is a network timeout, maybe the servers on the other side are slow, and you give it five attempts, all in one line. I started adding those, and I noticed that every time my colleagues and I saw one of these exceptions, we just copy-pasted this code. We didn't change any of the numbers, we just copy-pasted it. So we moved it into a simple module, handle job network errors, and now every time we have something related to the network we just include it. Over time, if we need more logic, it will apply to our whole system. And again, it is a very simple module with an included hook, nothing special here.

Another interesting thing when you work with the network: a mature Rails project is one which uses every popular, or not so popular, Ruby networking gem. If I check all my network dependencies, I think I use every networking library out there. It's like, wow. So yeah, it's bad, but what can you do? I mean, I pick my dependencies, but I don't pick their dependencies, so I just have all of those in there. And really, when it is a network error, all of those are the same, so I just list all of them in one place. We don't want to care which one it is, because a lot of the time it is not even code written by us.

So the next tip is: you should have tooling around your exception handling. If there is something which is very common to handle, like network errors or Twitter errors (if you use the Twitter API, as we do), build tools so that your engineers know: okay, I just use this and I don't need to worry about it.
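Moving the copy-pasted retry line into one module could be sketched as below. To keep the example runnable without Rails, `FakeJobBase` is a stand-in that only records the call; in a real application the job would inherit from `ApplicationJob`, whose Active Job `retry_on` accepts the same `wait:` and `attempts:` options. The module name, the job name and the error list are hypothetical.

```ruby
require "socket"
require "timeout"

# Illustrative list; a real app would list the errors of its own HTTP gems.
NETWORK_ERRORS = [Timeout::Error, SocketError, Errno::ECONNRESET].freeze

module Handle
  module JobNetworkErrors
    # Runs when a job class includes the module and registers the shared
    # retry policy, so the numbers live in exactly one place.
    def self.included(job_class)
      # In Rails this would usually be written as `wait: 2.minutes`.
      job_class.retry_on(*NETWORK_ERRORS, wait: 120, attempts: 5)
    end
  end
end

# Stand-in for ActiveJob::Base (assumption): records what retry_on received.
class FakeJobBase
  def self.retry_on(*errors, wait:, attempts:)
    retry_policies << { errors: errors, wait: wait, attempts: attempts }
  end

  def self.retry_policies
    @retry_policies ||= []
  end
end

class SendPushNotificationJob < FakeJobBase
  include Handle::JobNetworkErrors
end
```

Any job that talks to the network then needs only the `include` line, and a later change to the wait time or the number of attempts applies to every such job at once.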
Yeah, so now again we are kind of done. No exceptions, I'm clean, I can go back to the mud. Now I can start making exceptions again. And my talk is close to its end now. So again, these are my tips for dealing with exceptions. You should have a process around them and treat them like real bugs. Be explicit about your exceptions. Reduce the noise, because if you have too much noise you don't know what's going on. Don't hide exceptions: reducing the noise doesn't mean you need to hide everything. Invest in your monitoring. This is really important, because if you cannot monitor your system, you don't know if your system is working. And have tooling around the way you handle exceptions. Let's make our lives easier. We are engineers, we like making abstractions, so just build things around that. So yeah, that's basically my talk.