So I'm Michael, I'm part of the CKI team, which runs the kernel CI pipelines internal to Red Hat. And I'm going to talk a bit about incident management. As we run a service, this is something that's very important for the user experience. Our customers are kernel developers and they tend to be grumpy, so we try to keep the quality of the service high enough.

To give a bit more background: we are a service team which runs pipelines. This was not something we understood in the beginning. In the beginning we thought we develop the CI system; it took us quite a while to actually understand that we are mainly running it. The point of the system itself is that we want to prevent broken kernels from hitting the internal Red Hat composes. But we also try to shift testing as far left as possible, moving to integrate into the upstream kernel development workflow, providing feedback on patches on mailing lists. Whatever that means, right? Integrating into the upstream kernel development workflow is fun.

And because we are a service, we're also running quite a bit of infrastructure to make this all happen. We are running a main GitLab pipeline. Nowadays kernel developers sit on GitLab.com, they have a merge request and it's all kind of like normal: they get pipelines and green checkmarks and all those kinds of things. And we also hand testing off to Beaker. In addition to this, we run all kinds of microservices. We run stuff in EC2, in Lambda, we use OpenStack, we host our own messaging cluster. So that's quite a bit of infrastructure that runs and obviously is also able to fail. If you want to know more, I've linked the homepage; the code is on GitLab.com, so you can take a look at what this actually means.

Normally, on a normal day, we run this pipeline, we run our microservices, and we deploy quite a few changes, right? Because we run a service, we change it; product owners come along with requirements that need to be done real fast. So basically there's some churn, and that's also the opportunity to actually break this thing.

So what is actually meant by incident management? If you try to define an incident, the internet helps us by saying that an incident is an event that disrupts the service — that's the easy case, it basically doesn't work anymore — but also one that reduces the quality of the service. That means pipelines might take longer, sometimes it fails, sometimes it works, people have to click buttons instead of stuff happening automatically. And incident management means what we do in response to them: how we can mitigate them so that they are not as visible anymore, and how we can resolve them. And in the best case, we might actually do something to prevent them from happening again in a similar way.

Now, this talk is about small service teams, and the reason I put that in there is that if you have a huge organization, you might have dedicated teams that do this incident management for you. You might have dedicated roles, dedicated site reliability engineers that take care of running the service, handling incidents and recovering. But for small teams, that's mostly just the same people that also develop the service. So I will talk in two parts. One is how to detect incidents, and the second part will be about how you actually recover from them after you've detected them. So yeah, why do we actually want to detect incidents, right?
It's not strictly necessary to detect incidents yourself — your users will normally tell you, right? It's a bit cynical, but it's okay-ish in some way. If this is good enough, if your users are maybe not as grumpy or they don't rely on it too much, they might actually tell you. Eventually you will notice the biggest problems that your service has. But if you want to know before they do, or if you want to be able to detect the things that are not as obvious, you need to do some work. And that normally comes in the form of a monitoring and logging setup.

So this is the first part of the talk: trying to detect these things as early as possible, so that fixing them is a relaxed affair — because if your users haven't noticed yet, maybe they're still asleep, right? Like with an international team: you're sitting in China or in Europe and your customers are US-based, so you have a couple of hours to fix things. If you notice before them, you might just fix it before they ever actually see that something was wrong.

Now, depending on how you built this thing, whatever you are running, a lot of different pieces might fail in a lot of different ways. And that makes the whole thing as complicated as it is. The normal components of such a setup: you have logging, so you can check what went wrong last night. You have metrics, so that you can measure how long something takes and actually have some numbers — you can create some pretty graphs, not only for management, but also for yourself to debug stuff. You have something to collect exceptions, because if your code fails somewhere deep in the stack, sometimes it's really hard to see this in any aggregate measure, so surfacing these exceptions is one part of it. And then alerting means getting a pager alert at night, getting emails, having a Slack channel, those kinds of things.

And to actually make this happen — because when talking to different teams, some onboard to some of these things and others don't — it makes sense to set it up in a way that makes it really easy to onboard whatever you develop in the future, because normally stuff just accumulates: you get another microservice, another cron job somewhere. Finding ways to make this very systematic, so that you don't have to do any extra work to add another of these pieces, will make sure that people actually onboard to these things.

So I will go through the pieces a bit. I'm not sure how many people are actually familiar with this stack. Could I see maybe who knows at least one of those? All right. Two. Still. Three. Okay, that's good. Four. And how many do you have? Five. Okay, that's interesting. And for the ones that didn't raise their hand at five, which part is missing — which part do you not know? You don't have the exception part — you don't have Sentry, do you? But you know what it is. You know what it is, yeah. Okay. And for the others, what piece are you missing in your setup? Which is the most unknown one of the list? Anybody want to say? Yeah, tracing is missing from this list. But from this list, is there anything you are not doing in your own service, or which one is the most unknown? If you take this list of five pieces, which one do you know the least? They all know them. Now let's just go over them really quickly.

Let's start with logging. There are lots of ways to aggregate logs. The easy one to set up is Loki.
It's basically, yeah, an HTTP endpoint you send logs to. There's a client tool called Promtail which pushes them. What's interesting, and what makes this thing pretty light on resources, is that it's not going to index your logs — it just stores them in S3 buckets, depending on how you configure it. The only thing that gets indexed is the labels that you put on them, so you might put the service name on them, but there will be no indexing of the content. So if you actually go looking for something, you will basically download all the logs for a certain interval from the S3 bucket and then go through them locally. And that makes this thing pretty simple in some way, because there's no index database that needs to be kept in shape. It's really just an S3 bucket — basically blobs of logs, and you download them when you search.

And the pieces you want to put in there: if you run Kubernetes, the logs from pods; if you have cron jobs, you need to figure out a way to tee their standard output into it. If you run nodes, the journal is kind of nice — this is, for example, one of the pieces we are missing — but if you want to know what happened on a node before it went down, it's kind of interesting. And if you run stuff on AWS or any of the hyperscalers, getting the logs from those systems in there as well is also kind of nice.

One other thing, next to just being able to debug these incidents, that logs give you — at least with Loki, but I think also with other systems — is that you can alert based on what is in them. Some weird issues you might only be able to find through some message logged somewhere, and Loki allows you to say: if this message is in there, complain, or send an email, or something like that. And one way of applying this, if you run Kubernetes, is trying to figure out how you can put it into whatever you use for your Kubernetes YAML templating — Kustomize, Helm, whatever — so that it becomes really easy to onboard the next microservice.

Prometheus is for metrics. Why do you need metrics? They allow you to measure stuff, stuff that's not obvious, like durations, the number of jobs, those kinds of things. It's basically a time series collection system that puts labels on stuff, and it just takes a text file from a metrics endpoint. So this thing is as simple as it comes — that's the reason why it's so popular. There are very few data types. You have counters that can only go up; these can cope with restarts, because when the counter starts again at zero it's possible to unwrap these curves, so that works very nicely in the Kubernetes context. You have gauges — how do you pronounce that, actually? — which can vary, so that could be, I don't know, something that you measure continuously. There are histograms if you care about distributions. And that's basically it.

Onboarding this thing is pretty easy. In Kubernetes you can scrape all pods; it takes a bit of configuration to do that, but after you've set that up, you deploy a pod and it will be scraped. And exposing metrics in something like Python is all of four lines: importing the package, defining the metric, doing something to the metric, and starting the HTTP server. This thing will start up an HTTP server, you put it into your pod description, or the deployment description and the service definition, and then you can curl it and it gives you what's shown at the bottom of the slide. In this case it's a counter.
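The slide itself isn't reproduced here, but as a rough sketch of those four steps with the prometheus_client package, it might look something like this (the metric name, port, and work loop are made up for illustration):

```python
import time

from prometheus_client import Counter, start_http_server

# Define the metric once; Prometheus sees it as a counter that only goes up.
PIPELINES_STARTED = Counter("cki_pipelines_started",
                            "Pipelines started by the service")

def do_work():
    # Whatever the service actually does...
    PIPELINES_STARTED.inc()  # ...and bump the metric while doing it.

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics so Prometheus can scrape it
    while True:
        do_work()
        time.sleep(60)
```

Curling localhost:8000/metrics then shows the counter together with its HELP and TYPE lines, and Prometheus just ingests that text.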
This is basically it, right? That's all there is to it. This number will get ingested — Prometheus will hit this endpoint again and again at a certain interval, and that builds your time series over time. And then you can make graphs, right? That's why you do it. But even if you don't graph it, you can write these queries, like the one at the top of the slide. In the beginning it's weird; even after a while it's weird, but it gets slightly easier. There are ways of aggregating across different instances, linking stuff together — it's a very powerful query language — and you can define alerts based on these rules as well.

Now for the exception handling. That's an interesting one, because normally, if you have code deployed and an exception comes up that you don't handle specifically, what you usually try to do is catch something, log something, and go on. Which is the right thing to do, because you don't want your service to go down just because there was a small problem — but that way these weird things get lost. So Sentry is a system that basically hooks into your code. It can hook into web frontends, Python, Golang; I'm pretty sure there's an SDK for nearly every programming language available, I'm not sure about Bash. Sentry will collect these exceptions and aggregate them, and you can watch them develop. The only thing needed in Python is one line of code setting this thing up and configuring it, and it will collect all the information — a lot of information — that should make it easier to debug the problem you had.

In this case, for example, I chose one of our services. At the top you see a list of exceptions that happened in the service — right, it's not perfectly stable. And then you get these little graphs that show you how often this happens; here, in the last 24 hours, there's something consistently wrong with the thing. And it shows you the message of the exception. If you click through, you get the stack trace, you get the variables of the stack frames, you get the HTTP requests that happened before. Normally that's good enough. The most important feature is that you can assign them to somebody, right? You can find somebody in your team that needs to care, you assign it to them, and hopefully they will care.
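As a rough sketch of what that one line of setup looks like with the Python sentry_sdk (the DSN here is just a placeholder — you would take the real one from your Sentry project settings):

```python
import sentry_sdk

# One-time setup; after this, unhandled exceptions (and, depending on the
# enabled integrations, logged errors) are reported to Sentry automatically.
sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")

def main():
    # Any exception escaping from here ends up as an event in Sentry,
    # complete with stack trace and local variables.
    raise RuntimeError("something went wrong deep in the stack")

if __name__ == "__main__":
    main()
```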
And the last piece is alerting. If you find any issues — especially around Prometheus, where you have some metrics and some SLAs, whatever your service level agreement is — you need to alert at the end, depending on the severity. One thing you can do is send emails, but you can also send stuff to a pager. And this is what it looks like; there's a web interface to it as well, so you can go there and silence these alerts — after you've acknowledged it and found somebody to handle it, you can silence it so that it doesn't keep spamming. So this is kind of like a pretty normal alerting and monitoring stack.

Okay. Now you know, right? The next problem becomes: what do you actually do? You've found issues in your code — and now what? If you tell this to an engineer, normally what you get is basically this. And that's eventually, for sure, what you should do. But the point is a bit: what exactly do you want to fix? There are both technical and social components to handling incidents — otherwise I wouldn't be giving this talk here. One thing is that you want to fix the immediate problem: if it breaks your customers, you want to unbreak them as fast as possible. You also most likely want to fix it properly — something that happens in the right place to be a real solution, not just somebody logging into some machine and changing a config file, those kinds of things. And if you're really good, you will actually find the root cause, improve on the problem that was causing the incident in the first place, and do something so that it will never recur.

Now the social problem is: who does these things, and do you actually do all of them? As I said, if you're a small service team, there are no people dedicated to this handling process. It's a team responsibility, which normally means nobody does it unless somebody is inclined to do these kinds of things. And normally — what I know from our own team and from other teams in our proximity — the person to fix it is basically the senior engineer that knows how the system fits together. That's the one that knows exactly what to do, will do something real fast, might also do a proper fix. But if this person is on PTO, you're kind of out of luck. It's really hard to learn from this person as well, because basically they just do it, right? There's no visibility into it. And then, depending on how stable your service is, it might actually mean that this person burns out from having this on their shoulders, because they're the only one.

Does this process include dispatch or not? Yeah, so the question is: does this process include how the work is spread in the team, if I understood right — dispatch. Is there a rotation of who cares, or does the most senior technician just pick it up? So, is somebody just picking it up, or is it actually delegated to somebody? And normally, in these small teams that I've seen at Red Hat, for example, there are a couple of people hanging out in a chat channel, an alert comes in, and then there's somebody who cares and they pick it up. That's the bad example, right? Like, there's another one coming in, right? But this is something that I've seen happening, which actually works and is very often the case if there's no formal process around incident management. So yeah, you shouldn't do that. Maybe I should have put this on the slide — in the next version I will put something on the slide like: don't do that, this works but is not recommended. Yeah, that's something we figured out.

So we thought about how to come to a better process, because having one in place will hopefully reduce all the disadvantages on the previous slide. Now, if I say something about process, people get all blurry in their eyes, right? This is not something that engineers enjoy too much — talking about process — and it all feels like bureaucracy and complicated and management and those kinds of things. So we tried to come up with something really simple that would actually be something engineers do, instead of just something that's on some website. But yes, the first thing you have to do is create a ticket. It is that bad. It's actually pretty easy to create tickets on most git forges — you can press a shortcut. But this is the first thing you have to do, and also one of the main building blocks: if something fails, create a ticket.
So even if you ignore all the rest of the slides, that's the one thing, because otherwise you will not have a conversation about the incident, and there will be no place for any team member to learn from. There's also no place to delegate — if you want to hand it off to somebody, you need a ticket. You can post screenshots in there, and most git forges allow you to have confidential comments if it's a public tracker, which is really highly recommended. And then do it in a structured way.

So this is the most advanced figure that's going to be in these slides. These are the states. Normally you have an active incident: something is broken, something exploded. The first thing that people normally do, and should do, is try to reduce the impact, to give you some breathing space. You will figure something out — normally people have some idea of what's going on. This can be the most senior person in the room doing something real quick; that is quite acceptable. And you will get to this state called mitigated, which means your customers hopefully don't notice that much anymore. Then you can work on solving it properly. Maybe you're using GitOps and need to change something, maybe you need to change some Python code, go through review cycles, all those kinds of things — do it properly. That would be the second step. But normally there is this first step where you get it fixed real quick. Maybe it shouldn't be that way, but being realistic, that's mostly what happens. If you get to that state, the ticket can be resolved.

And now comes the interesting part: just because you resolved your incident doesn't mean you're done. The last part, which mostly got ignored and which is why we designed this process in the first place, is that there's more work to do beyond resolving the incident. And that's mostly improving on the root cause. There was something that caused the incident — not just the thing that actually exploded, but the fact that it was possible for it to explode in this way in the first place.

Do you need a problem or a change ticket, or is that part of the incident? Yeah, that's the thing, right? There are teams that basically create a new ticket, and then it gets put in the backlog, and then it might disappear in the backlog. But the recommendation is not to do this, and instead keep your incident open in this resolved state — that's what we do — until you actually prevent the recurrence of the incident, because otherwise you accept the fact that it will happen again. That's basically what it is: if you're not improving on the root cause, you accept that it will happen again.

Now I'll give an example, something that happened three weeks ago. Spot instances got more expensive on AWS — they increased in price, they hit the on-demand prices. And the tooling we use, Docker Machine, requires you to give it a limit, a price limit for how much you want to pay for the spot instances. You need to configure this because the tool requires it; it has a default that's kind of unrealistic, so you need to set a limit, and our limit was too low. So we didn't get any spot instances anymore. That's always beautiful, right? GitLab jobs did not run anymore. Okay. So the first fix was to SSH into the GitLab runner and change this variable in the config file. All right, that works, right?
That's just vim, regular expression, replace all. Boom, they spawn again, right? That was going from active to mitigated. The second stage was to configure it properly in whatever GitOps solution you use. In our case it's a deployment repository: pipelines, reviews, configure it properly, it deploys. And it's still open — if you click on the link, you'll see I created the ticket this morning, not three weeks ago, so that tells you something about how well we stick to the process. What is still open about it is that Docker Machine should not require you to specify this limit at all, because it comes from a time when AWS actually had bidding for spot instances. Nowadays spot instances are not done with bidding: they have a price, you take it or you leave it, and they are also capped at the on-demand price. So the root cause fix, so that this can never happen again, is basically to remove all these limits from the configuration.

And now, for the last five minutes or so, I will ask you about something. I don't know how many of you are Red Hatters — how many of you read emails? We have enough mailing lists internally, but anyway. So there was a site at some company where SSL certificates were renewed. Before, the SSL certificate was issued by an external CA. The renewal was done, but the CA used was an internal one. So basically, any customer needs to have this internal CA configured on their system to connect to this site. This is what happened: it broke all the customers that needed to connect to this site internally, and it surfaced on a mailing list. This is just a hypothetical example.

And now the game is: what do you think the fix would be, or what actions would you take as a team, going from an active state — where basically somebody complains on the mailing list — to closing the incident ticket? That's the interactive part of the presentation.

SSH to the server and run certbot? Yeah, okay, so the answer is: SSH into the machine and use certbot. I can tell you it doesn't work, because it was a service that's not SSH-accessible. Next one. Yeah — so that users get access to the site back, the answer is to roll back to the old certificates. And that is exactly what was done. They rolled back, because the certificates had been renewed but the old ones were still valid, and they were from the public CA, so it restored access. Doing this mitigates the incident, right? Customers are now able to connect to your service again. It's only mitigated, because if you don't do anything else, it will break when the certificates expire — there was a reason for the renewal.

So what would be the next step people would need to do? The answer is: use the right CA and create new certificates. Did I understand that point? No? Distribute the CA to the clients instead? That would be one way, yeah — you could provide the internal CA to all customers. So yes, that would be one way to resolve the incident: track down all your customers and give them the internal CA. What actually happened in this case is that they rolled new certificates with the public CA, because they could do it the first time around, so it was done correctly the second time. You renew it correctly — that would move it to resolved. Now, what would you need to do to actually close the incident ticket?
So that it will never happen again. Policy. What was that? Policy? Right, a policy related to this — so one answer is to write a policy so that people do it correctly. Okay, what else could you do? Automate? Yes, write automation so that these certificates automatically renew and nobody needs to touch them. That would be the second part. What else would need to be done — what else went wrong? Yeah, so the answer is to add monitoring, because here they only found out after the users complained on the mailing list. So there are like three steps you need to do after you resolve the incident to actually close it.

And that's basically it — those are the answers, and this is all there is to the whole process. If you take anything away from this talk, it is this: if you resolved an incident, it's not yet closed. There's work to be done. And if you don't do this work, if you move it to another ticket and put it in a backlog, you accept the fact that it will happen again. There's a social aspect to the whole thing, and you need to account for it. The phases are defined, they make sense, they are necessary, and you shouldn't skip them. If you have phases, you can delegate them to other people — the senior person can respond, and somebody else can handle the rest. So there's learning involved. You can track them, but make sure that you don't drop these issues from your view, because otherwise, next year when these certificates come up again, you might make the same mistake.

And yeah, this thing actually works. The process is something where we have a Kanban board and we move these issues along. And if you tell me that there are a lot of active incidents — yes. But it is totally worth thinking about these processes, defining something, and trying to surface and track them over time. This is one of those tickets we have, basically an example. You see there's still stuff outstanding: it's a resolved ticket, but it still needs work, and it's still very annoying to have it on the board. That's kind of the point, right? It should annoy you. The process shouldn't hide things; it should make things visible.

Okay, so that's it. Do you have any questions? Maybe on the social aspect — how did you convince everyone in the team that this was important? So the question is how we convinced people that it's important. We have something called a request for comments, so there was a process agreed on. In this case I think we didn't write it all down, but there basically was a problem: these incidents needed to be handled, and we needed to involve the team, because one of the things that comes up here is that if you as a senior engineer just fix this stuff, you will not fix it as well as if you talk to the people in your team. That was one of the reasons we tried to implement this, so that it becomes visible. Did we convince all of them in the beginning? No. Are they now convinced? I'm not sure. But yeah.

Can you say a little bit about SLAs, so you can see how you're getting better over time? So the question is: do we have SLAs or SLIs — like an indicator of how we are doing related to these incidents? No. You could create one out of these tickets. For us, the main thing was that they would actually be visible and that there wouldn't be too many tickets in this resolved column.
So this has happened, but yeah — as you see in this case, since we switched to this process a couple of months ago, there are a lot of tickets in the active column. These are kind of like weird issues that sometimes happen and are hard to track down. So no, we haven't done that; that would be the next thing to do. But just having an SLI never really gets the work done either, right? It's the agreement of a team to work on it that is more important than having a number symbolizing it.

You could start, like, measuring over time. Yeah — so the point is that using an SLI would actually allow you to figure out how you're doing over time. Totally agree, it would be nice to see this. I wouldn't be too sure what would come out of it, right?

So the question is: is there a priority to incidents? There's no priority between incidents; incidents are all labeled very similarly. The priority in some way comes out of where they are on the board, and of course it's related to the state, if there is an active one. So it's related to what you would agree on. But normally, incidents are things that should not happen, right? I forgot the exact quote, but somebody said that if you don't handle incidents correctly, you are basically breaking the promise to your customers that you care about their experience — because they tell you something broke, and you're not making sure that it will not happen again, which means that whatever you do instead, you consider it more important than whatever broke their workflow. Now obviously, as you see on our incident board, there are just things sitting in there. So it's hard, but this is what it comes down to.

So the question is: do we have any knowledge base to prevent recurrence? We have something, an Operations Manual, where we put instructions on how to fix things, but the focus is really on preventing these things from happening again in a structural way. Most of the time there's something you can do to prevent it — some architectural thing. Say you need to change two pieces instead of one, and sometimes you forget to change the other piece and then it explodes: that's an architectural thing, and moving to one source for the configuration, so that it gets used in both places, would be the fix.

There is currently a stack helping to detect and protect from continuously repeating problems. As you mentioned, on one side we can add monitoring, yes — but that just detects that the incident would happen again. Right now I know of one stack which would help with this: Grafana OnCall and Grafana Incident. This stack continuously attaches to these metrics, can create issues related to them, and can connect them to problems, so when you handle an incident, you know that it happened previously and can also get some information from the previous one. So if I understood correctly, there's a Grafana stack that can be used to kind of like link incidents together and link them to the metrics — because the goal is to help do this in your space as well. Okay, okay — so thank you very much for your attention.