Okay, hi everyone. I hope you can hear me all the way in the back. My name is Henrik, I'm the moderator for these sessions. We'll get started here in just a second. Just want to remind you: once the session concludes, we'll have a microphone here in the middle for questions, so please line up and ask any questions you might have. I'll try to get a couple of questions from the virtual meeting room as well, to make sure those folks are included. After the session, please don't forget to rate the session, and if you have additional questions after the session concludes and we're out of time, please continue those discussions out in the hall so we can clear the room and prepare for the next session. And with that, I'm going to leave it over to you. Thanks.

Good morning everybody. Can you give me a thumbs up if I'm perfectly audible? Great. Okay, so this is the title of our talk. Originally the subtitle was the title of our talk, but our PR person just about had an aneurysm when they saw that I was submitting that talk. But I explained to her: first of all, everybody (at least all of our customers) knows we had an outage, and second of all, everybody has outages. It's not news that a SaaS provider has an outage. So we wanted to come here and share our experience with the community, to maybe help you avoid an outage, or deal with one when it comes up.

So my name is Rick. I'm currently the VP of product at InfluxData. At the time of this incident I was the VP of engineering for the platform team, so I was there during the day that we deleted and restored one of our production environments.

And hi, my name is Vojtek. I am a platform engineer on the deployments team at InfluxData, and I've been involved in many parts of the incident and the follow-ups, including the postmortem and the fixes around what broke and how we make sure it doesn't happen again.

Okay, so: who are we? Why are we using Kubernetes? Why might our experience be relevant?
So at InfluxData we view ourselves as a development platform for writing time series applications. We don't really view ourselves as a Kubernetes tools vendor. That said, a lot of our customers use us to monitor their Kubernetes clusters, and there are companies that have actually built SaaS solutions for monitoring Kubernetes on top of us. So while we are an application development platform for time series, at the heart is a database called InfluxDB. It's an open source database that's purpose-built for time series, but currently our flagship product is called InfluxDB Cloud, and this is a multi-tenant SaaS solution. It's built on top of Kubernetes, and the reason it's built on top of Kubernetes is that we offer it on the top three clouds, in multiple regions in all of those clouds. I think we have something like 12 or 15 production instances running around the world right now that we manage, and Kubernetes provided the cloud abstraction layer that we needed to be able to manage the same application in all of those different regions and clouds.

Right, so the timeline, which is basically how to delete your production in a few easy steps. As we mentioned, InfluxDB Cloud is our flagship product. It's a stateful Kubernetes-based application, and we use GitOps and CI/CD to keep it running and keep it up to date. We have multiple tiers in the application; some of them are stateful, most of them are stateless. One of the key pieces is the storage tier, which uses PVCs and keeps data in Kubernetes-native volumes. It also uses cloud-native object stores like S3 for long-term persistence, but the key is that it keeps the data so it's queryable and writable really fast on disk. Then we use Kafka and ZooKeeper for the write-ahead log, meaning as soon as we get a request to add new data, it goes into Kafka, and the storage tier processes it as soon as possible. At the time of the incident we used etcd for all of the metadata that we had, so things like bucket names (which is where we put the data), organizations, users: all of the metadata for the application. We're moving away from that to Postgres right now for various reasons, but at the time it was etcd. And like I said, we have multiple stateless microservices. So when you run a query against InfluxDB, it goes to our query engine, which is entirely stateless. It parses the query, sends the commands to the proper storage tier pods (by knowing which shards to ask for the specific data), gets the data back, reconstructs it, and returns it. Most of our components are stateless.

And we use Argo CD for deploying all of our instances of InfluxDB Cloud. So we have, as mentioned, 12 to 15 production instances, plus multiple staging and testing ones, and we use GitHub to manage all of them. We use CI/CD to get everything built as soon as the code changes, and then CD, obviously, to deploy. We use Argo CD's auto-sync feature, which means that as soon as something shows up in our GitHub config repository, it gets deployed. And we also use prune, which means that if something that was in that repository is no longer there, it gets deleted. I think you know where this is going. And we use something called the app-of-apps pattern, which means we use Argo CD to configure how and where Argo CD deploys its things, which is also important.

So, how it all started. I suspect this may be a bit out of focus, but that's actually by design. So this PR was merged, and it is basically just adding new data; there are like 500 new lines.
Most of it is generated YAML files. It did not remove a single thing. But as soon as it got merged, within minutes, we had deleted our entire production environment. So I'll show what happened, still out of focus, but that's also by design. What happened is: we use the code name IDP for the production instances, the thing we want to keep running in each of our clusters, and we also have an open source project called IOx that we wanted to deploy alongside it. But what happened, and you may not even see it because it's a tiny detail, is that we had a naming collision. So IOx was deployed as IDP, and you can see that the first arrow points to where it should say IOx but it says IDP. I'll show it again in a bit in a picture. The point is that it was really difficult to spot in code review, the same as it may be difficult for you to spot now.

So what we wanted to show is how Argo CD works and what we wanted to get to. On the left side, we have the app-of-apps pattern, just for one cluster; we have that for all of our clusters, but this is just one of them. What we had initially was IDP, our production environment, and that was an Argo CD Application. Argo CD has a custom resource called Application that has a name, defines the location (the Git repository to get the definitions from), where it should be deployed to, and also the path in that Git repository. Then that repository specifies things like a namespace and defines all the Deployments, StatefulSets, and anything else that you want deployed as part of that application. What we wanted to do was add IOx alongside it. So we added a new Argo CD Application, or rather we wanted to add a new Argo CD Application, and we wanted it to point to a different path in the repository that would have all the objects for IOx. But what really happened, because we had a typo and it was IDP (it's on the left side in the middle, in red, because it said IDP instead of IOx, which is what we wanted), is that Argo CD went in and said: okay, I have two definitions for the IDP Application, so I'm going to apply the last one, which was actually pointing to the wrong path in the Git repository. So Argo CD applied that, and then the actual Argo CD deployment for that application decided: okay, I no longer have IDP to deploy, so I should be pruning everything. So I'm just going to go ahead and delete the whole production cluster and deploy this IOx testing environment instead.

And this is a snippet from our status page for the incident, once we had the postmortem in place. This was exactly what happened: due to some mistakes when we were adding IOx, we should have used the iox name for that application, but instead we used idp, and this way we managed to basically delete the whole thing. And this is where the incident began.

Okay, so I woke up at 7 in the morning. I live on the East Coast, and for anyone here who manages a SaaS environment, you know that you never stop thinking about it, and the first thing you check in the morning is what happened at night. So this is what I woke up to. Good times. By the way, if you're trying to quit coffee, this will help you wake up without coffee. So I just went straight to my computer and started working on this. This is the first part. This is when the damage was done. This was actually all right before I got up.
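The naming collision described above can be sketched as two Argo CD Application manifests. All of the names, repository URLs, and paths here are illustrative placeholders, not InfluxData's actual configuration:

```yaml
# The pre-existing production application.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: idp                 # the production code name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/config-repo   # placeholder URL
    path: clusters/prod01/idp
  destination:
    server: https://kubernetes.default.svc
    namespace: idp
  syncPolicy:
    automated:
      prune: true           # deletions in Git are propagated to the cluster
---
# The new application, intended to be named "iox". The typo reused "idp",
# so this definition silently replaced the production one above.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: idp                 # BUG: should have been "iox"
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/config-repo
    path: clusters/prod01/iox   # points at the IOx objects instead
  destination:
    server: https://kubernetes.default.svc
    namespace: iox
  syncPolicy:
    automated:
      prune: true
```

With auto-prune enabled, Argo CD then saw every production object as no longer declared and deleted it, exactly as described.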
So these are all times in UTC, but I'm based on the East Coast. The PR was merged, and then all our monitoring systems started to report API failures. But we had a bit of monitoring fatigue at the time, so we had told the monitoring: look, you really need to wait a while before you start paging people, because bothering our staff with these transient API failures is just annoying people and it's not helping much. Then a customer called in, and then another customer, and said: hey, it seems like something's not working. So the support team jumped in and said customers are reporting a problem, and we looked around, and sure enough, all the alerts had started firing by that point, and it was clear something was wrong. In our culture, when we deliver code that causes a problem, we just revert it immediately, and we really try to develop so that a revert of code is the path back to stability. But this was not a code change; this was an infrastructure change. So the developer actually followed that process and submitted the revert PR, but then the team realized that just reverting a change like that, since it was an infrastructure change, is probably not the way to do it. Fortunately, we also have a culture of anyone being able to stop the line.
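In an Argo CD setup, "stopping the line" amounts to turning off automated sync so nothing is applied or pruned until a human re-enables it. A minimal sketch follows; the Application name and fields are illustrative, and the actual "big red button" tooling described next is internal:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: idp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/config-repo   # placeholder URL
    path: clusters/prod01/idp
  destination:
    server: https://kubernetes.default.svc
    namespace: idp
  # syncPolicy intentionally left empty: with no "automated" block, Argo CD
  # still reports drift but only syncs (and prunes) when triggered manually.
  syncPolicy: {}
```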
So we stopped all synchronizations. We have this thing that we call internally the big red button: anyone can press it, and it just stops deployments across all of our deployment environments. Then the engineering team said, okay, we really need to start planning a proper recovery process, and, by the way, did anybody update the status page? So we updated the status page to alert customers, and then they started calling everybody: hey, this is all hands on deck. That's when the penny really dropped about the severity of the problem.

Okay, so here I don't have individual times, because this phase unfolded over the course of hours. But the first thing the team did was to create a deployment checklist, and then double-check that list, and I was really impressed with that; everyone stayed very calm. Instead of panicking, they said: okay, let's make a plan, but let's double-check the plan before we start stampeding, to make sure we don't make things worse. And I actually credit the short time it took us to rebuild the production environment to that systematic mentality that the developers followed. So then we went through and started to redeploy services carefully and in the proper order, and the main thing was to connect, when we could, the stateful services to their persistent volumes, because that saved us a lot of time in terms of not having to play back backup data, right?
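Reconnecting a stateful service to a surviving volume can be sketched like this. When a PersistentVolume has `persistentVolumeReclaimPolicy: Retain`, deleting its PVC leaves the PV behind in a Released state; after clearing the PV's `spec.claimRef`, a replacement PVC can pre-bind to it by name. All names below are placeholders, not the actual production objects:

```yaml
# Replacement claim that pre-binds to the retained volume. The PVC name must
# match what the StatefulSet's volumeClaimTemplates would generate, so the
# recreated pod picks it up instead of provisioning a fresh empty disk.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-storage-0          # <template name>-<statefulset name>-<ordinal>
  namespace: idp
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp2         # hypothetical storage class
  volumeName: pvc-0b1c2d3e      # the retained PV's name (placeholder)
  resources:
    requests:
      storage: 100Gi
```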
So the data was there; we could just reconnect it and start using it. Then additional services, especially the stateless ones, were redeployed in parallel, with people going through and making sure: did we actually recover the volumes properly? Is the data actually safe and intact? And it turned out it was. We had some services that we just recovered from Velero backups, where the team thought, okay, the best strategy is just to get that from Velero. But then we also realized: you know what, when we turn this back on, everybody's Telegraf instances (that's an agent people can use to write to InfluxDB) are all going to start writing, and everybody's queries are going to start stampeding us; everyone's going to be trying to catch up. So we scaled out, especially the ingress tier, but other tiers too, anticipating that surge in traffic when we came back on. And then finally the smoke started to clear: we enabled the write service, started accepting people's writes, and spent some time verifying that it was all working and that we hadn't lost any data.

And then we realized we were going to have another problem. InfluxDB has something called a task system, which means that if you have a script that is downsampling data, or importing data from another data source and joining it with time series data, or exporting to another data source, or doing some kind of custom calculation, and that's happening on a schedule, you can push all that work down into our platform. Some people run those every second, every minute, every hour. And we realized that when the task system comes back on, it's going to see that it wasn't running tasks for the past few hours, and it's going to try to run all of them. And if it's trying to run all those tasks while users are trying to run queries at the same time, it's just going to be a complete collision, a total traffic jam of queries. So what we decided to do was, before we turned on queries again, we just let the task system run its course, and fortunately that didn't take too long. We let all those tasks run (and fail), and then, when the backlog was done and the tasks were caught up, we went ahead and turned on the query service. At that point the smoke had really cleared and we were back in business.

So we alerted all our customers: hey, the service is back. We also went through and collected a lot of information for customers: here are the IDs of all your failed tasks; here's a script that you can run to rerun them, if they're still relevant; and other things we could do to help them recover. During the course of the day I was busy just writing down and logging what happened, so we were able to write the RCA doc and put it on our status page within an hour or two after the incident, so that people could go back and see what really happened. From that, we got an interesting piece of feedback from one of our big customers, who said: well, we're just glad it was automation and not somebody sitting in front of a terminal. So it actually was almost confidence-building, in a way, that it wasn't somebody fat-fingering at a terminal; it was our automation going crazy.

So we did, obviously, spend a couple of days on internal RCAs. One of the things that we found during the RCA process was that our cross-team efforts were really effective, right? The SRE team, the deployments team, and developers working together, and I really chalk that up to our blameless culture. It was about what was wrong with our systems; it was never about any person. No person made a mistake; our systems were lacking. And that, I could see, really enabled that kind of collaboration. While we had downtime, we did not lose any data, and I mean: if anybody tried to write data while we were down, they got back a 404 or a 500 or something, right?
What would have been terrible is if they had gotten back a 204, "got your data", while we didn't actually write the data, because then they'd be in an inconsistent state and wouldn't know how their application was going to behave. As it was, they could go back and decide whether they wanted to backfill the time that we were down, or however else they wanted to handle it. We did avoid panic: as we mentioned, while we did immediately sort of knee-jerk to that rollback attempt, we were able to stop and create a plan first. Props to Velero, and also props to our SRE team; just the week before, they had been practicing Velero backups, so how to use Velero to get things back was right at their fingertips.

Some things we were not so happy about: how did our automation allow this to happen, right? CI/CD, or GitOps, is a very sharp knife. They say a sharp knife is the safest, but it cut us deeply in this case. And the other thing was that we had never anticipated deleting a cluster. Our alerts were tuned to errors in our code, or to some user doing something we didn't anticipate, but we never imagined we'd need an alert saying the cluster was deleted, and we had no runbooks for recovering a cluster. We do now, because we wrote them over the course of that day.

Okay, so in terms of the recovery, a recap and maybe some technical information. As we mentioned, our first instinct was to revert the change. When you're making a lot of small code changes in a microservice-based architecture, that's what you often do: you make small changes that aren't breaking, so they're easy to deploy incrementally and easy to roll back. But in this case it wasn't a good idea, because we would have been creating new volumes instead of reusing the volumes that were retained by Kubernetes and the clouds. So at that point, as we mentioned, the team stopped and we started creating a proper plan. The goal was to restore all the stateful items manually and then recreate the rest via CD, with maybe some manual tuning that we'll get into in a bit.

So, what we did, and maybe some questions around it. First: why didn't we just redeploy the storage tier? The storage tier is the part of the system that keeps all of the time series data, so all of your metrics for the last n years. The reason is that if we had done that, without reattaching the volumes we still had, we would have needed to fetch the data from the cloud-native object stores (S3, Google Cloud Storage, whichever storage system) and replay it. That would take several hours. Keep in mind that we were able to recover in about six hours, I believe even less, and had we done that, it would probably have rolled over into more like 10 or 12 hours.

The next question: but maybe I'll back up and mention one thing first. Most of our volumes in that cluster were defined to be in retain mode, meaning that even if you delete the PVC, the actual PersistentVolume and the actual underlying cloud storage volume are kept. But that wasn't the case for everything. Take ZooKeeper: it's just keeping state for Kafka, and it doesn't really change; our topics and everything about Kafka in ZooKeeper are pretty much static. It's the day-to-day data in Kafka that changes. So we were able to just restore ZooKeeper from the hourly backups, and we decided that for ZooKeeper it's enough to have the backups in place. For Kafka and etcd, since, as I mentioned, we had the persistent volumes, we just restored them.
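Restoring ZooKeeper from an hourly backup can be expressed as a Velero Restore resource; the backup and namespace names here are illustrative, not the actual ones:

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: zookeeper-restore            # placeholder name
  namespace: velero
spec:
  backupName: hourly-20230101-0200   # most recent hourly backup (placeholder)
  includedNamespaces:
    - zookeeper                      # hypothetical namespace
  restorePVs: true                   # recreate persistent volumes from the backup
```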
It was initially a manual process; then, once we made sure it worked, we could script it and just run it. We have a lot of pods in Kafka, etcd, and storage, so doing it manually would have been painful and prone to the fat-fingering we wanted to avoid. Once we had that, we could recreate the StatefulSets, and then Kubernetes would recreate the pods. With storage it was pretty much the same thing; the main difference is that, because of the way our storage tier works, it has to index the data at that point. Basically, as each storage pod wakes up, it re-indexes the data in the persistent volume in case it was shut down incorrectly, so it has an up-to-date index, and then it can asynchronously ingest the write-ahead-log data from Kafka.

Last question: why didn't we just enable everything at once? As we mentioned, there are some tricks to that. We started enabling parts of InfluxDB Cloud as they started to work. So even before the storage tier was fully restored, we could start enabling writes to the system, because writes just put the data in Kafka and report success if the write to Kafka was successful. So people could start writing data even though they were not able to query it yet, and it wasn't fully persisted yet, but Kafka, with its availability and replication, allowed us to be confident with that approach. Once we had all the stateful services in place, we deployed the remaining ones and increased the number of replicas. And again, in terms of what was enabled and running: we started with tasks, let all the tasks run, made sure the backlog was empty, and after that we enabled queries. At that point we were confident enough to go to full scale and configure the number of replicas to what it is on a day-to-day basis, or what it should be.

Right, so after all that, the obvious question is: can we not delete production again?
Please! Because it took a lot of people a lot of time, and everyone had to stop whatever they were doing at the time, or maybe use a text message instead of coffee for waking up. So the first thing is, obviously, we don't want a change like that to be merged again. Just a quick introduction to how we do things: we use Jsonnet, which is a tool that allows rendering multiple objects. It's not Kubernetes-specific, but we use it for Kubernetes objects. Prior to the incident, we were just writing everything into a single YAML file. So the PR you've seen was just one big YAML file that had an object with the same name and namespace added to it. That was the problem. We moved to having a single object per file, where the file name is generated from object properties: the API version; the kind (so Service, Deployment, StatefulSet, etc.); the namespace (or, if it's not a namespaced object, I think we use a static string there); and then the name. That alone would have prevented this, because at that point the PR would show that we're overwriting an object at the YAML level, not adding one.

But we went a bit further than that. Because we include the API version, you could have, say, v1beta1 and v1 of the same object, and those would technically be separate files, but at the Kubernetes level one would, in effect, override the other. So we use a tool called kubecfg to generate YAML from Jsonnet, and we've added smarter logic to detect those collisions; basically, it refuses to generate the YAML files if there's a collision. And I think the big upside is that now, when we review PRs, we see exactly which object gets changed. Before, when you had a complex Deployment or StatefulSet and you just wanted to add a variable, another container, something, it wasn't really clear what you were editing, and you had to scroll up or down a lot. So now it's more efficient, in a lot of ways.

And then we looked at whether we could do something at the Argo CD level. Argo CD is a great tool, but you want to configure some things to make it even better, and make it even safer to use, because with all the automation in place, it's really easy to do something really bad when you make a mistake. So Argo CD has annotations that you can add to basically any Kubernetes object, and one of them is Prune=false; that's the value for the sync-options annotation. It means Argo CD will never delete the underlying object. So if you add it to your StatefulSet, even if you delete the object in your YAML files, or wherever Argo CD is getting the object definitions from, Argo CD will just leave it alone; it will simply no longer manage it. One thing we learned as well, and I think this is a valuable lesson: you need to set it on the namespace too, because in one of our first drills testing this change, we noticed that Argo CD deleted the namespace, and then Kubernetes deleted all the StatefulSets in it. So it didn't really help without also setting it at the namespace level. We also added annotations that make Argo CD refuse to update a resource that already exists and has annotations specifying it's managed by another Argo CD application. So if we ever accidentally clash at the object level, with someone else creating a StatefulSet with the same name in the same namespace, Argo CD will just fail to sync; it will report an error, and that shows up in our alerting system immediately.

And one last thing we did, based on Rick's handling of the incident and the team's coordination: the first thing we decided was that we needed to go back and delete an environment again, just to test everything. But in this case we deleted a staging (so, testing) environment, not a production one. We did that after we'd written the runbooks, and this was a way to test them. We also did some exercises, fire drills, around things like: after updating the Argo CD version, do the annotations still work? Because at some point some of the configuration settings changed in Argo CD itself. We learned a ton of new things about what happens when things go wrong by basically doing an equivalent of Chaos Monkey, and I think everyone should think about doing this. Maybe don't start with production, but testing what happens when your GitOps goes sideways, thinking about how you could potentially break it, and making sure you actually can't, is a really important thing.

We also went back and looked at all the volumes across a lot of environments. This was a relatively new environment, created when we already had all the automation in place, and all of its volumes were set to retain properly. But we went back and looked at the first environments, which were created manually, and I believe in some cases we found volumes that should have been retained but weren't, and vice versa, and we went back and updated all those settings.

And one last but very important thing: we looked at our processes and made them more consistent and easier to follow. This was our first incident that affected a large number of customers. So the first thing we learned is that you need a way to list all the customers that may be affected by an issue. Then make sure you have valid contact points, because sometimes that may be a different person than the login someone is using. Have a consistent way of contacting people, have someone leading the contacting effort, make sure we know when an incident is resolved, and proactively update all the customers about the incident status, all the way to it being resolved, so that we don't just depend on someone
refreshing the InfluxData status page and that being the only way they find out how things are going. And with that, I believe that's it.

So if you have any questions, please make your way up to the microphone here in the middle. We have a few minutes left for questions. While you do that, I think I'm going to take one from the virtual attendees. Let's see. I think you touched on this earlier, but someone is asking: in which cases is Velero a better way to recover the application, instead of just redeploying the applications?

So I think that really depends on what data you have, how much time it takes to recreate it, and how accurate the data will be once you recover it each way. In the case of ZooKeeper, the data didn't really change often, so just using Velero was fine: once you have Kafka set up, the data ZooKeeper holds doesn't really change. But, for example, if we were to use hourly backups for the data our customers were writing, there would be the potential of losing up to the last 60 minutes of data. So I think it's always a case-by-case choice; when we were designing which volumes to retain, we chose not to retain ZooKeeper's.

For us there were about three levels. The first: if it's stateless, Velero doesn't really help; just let CD redeploy it. If it's stateful and the data has not changed since the last Velero backup, then Velero can really help you out. But if it's stateful and the data is constantly in flux, like it is for us with people constantly writing, then you're going to lose all of the user data between the last Velero backup and where you are right now. So manually redeploying and reconnecting to the PVCs was the better option for us during the incident.

Okay, great. Thank you. I see some people lining up at the microphones, so we'll switch over to room questions. So, I wonder: if you hadn't stopped the Git revert, wouldn't it have been okay?
I mean, if all the volumes were correctly retained, your revert to the initial state seems to me like it could have worked, right?

I think in some cases we may have been missing some objects. I don't really remember at this point; it was, I believe, about eight months ago. I think we made a conscious decision that something might go wrong. There's a chance things would have just worked out, but the issue is that it's always in doubt in those cases, and when you're risking losing data, I think it's safer to just go with the manual approach. I believe it may have worked, but I remember we had to recreate some of the PersistentVolume objects or something like that. One thing that's missing from this picture is that we use our own CRD and our own controller to manage the storage tier; we don't use StatefulSets for that, and I believe this is one of the places where we may have needed to do something slightly differently. I just don't remember the details right now.

All right, so this was a conscious decision: you made a risk assessment and decided.

Yes, because if we had just redeployed, then Argo would have said: okay, new storage pods, new PVCs, and then we would have had a juggling act on our hands, instead of just creating the new pods and attaching.

Did you investigate, in your CI/CD pipeline, having Argo generate all the objects and then running something like kubectl diff to see what would be applied and what would change, and then reviewing that instead of the code?

Right, so when we started designing our CD process, that was one of the discussions we had: do we commit the generated files, or have a tool render them, and maybe also generate a diff at the PR level? We chose to explicitly commit them. While this isn't an issue most of the time, we were worried that the version of Jsonnet in Argo CD might differ from the version of Jsonnet that our other tooling may be using, and then you get those subtle differences that only show up when you apply. It just seems like a safer way to do things, and it also makes this part of the PR. In GitHub you can configure which files are marked as generated, so when you open a PR, GitHub doesn't show you the generated files, but you can expand them, and at minimum you see the number of additions and deletions. So even from just the numbers you would catch that it's an override. But often, when we're touching stateful things, people open the YAML files and read them. It just seems more natural and safer, and that's what we decided at a company level: in all the CI/CD that we do, for all the tools that we have, we commit the YAML files.

Hi. First question: has this incident changed how your organization works? For example, is the priority of disaster recovery now actually higher, sometimes, than feature development?

I'm going to say no, because for us, safeguarding the users' data and not losing data is always the top priority in everything that we do, so there's really no way for us to make it an even higher priority. But we did learn a lot, and because we hold that as such a high priority, the company was very tolerant of us actually taking the time to do more investigation and apply the learnings, whereas other companies, if they're under more feature pressure, may not have had that luxury. Did that answer your question? Yes, thank you.

Great, time for one final question.

One small question: what was the first thought you had when you read the text message?

It finally happened.

And a second question: how did you manage to deploy the different objects when you had one big single file with all the definitions? Did you split the one file into single manifests to deploy, or do you mean during the recovery process?
Yes.

I can answer that. So, basically, we're proficient with Jsonnet, so there are ways in which you can pick and choose which objects get generated; it's relatively easy to generate just a subset of the files. At the end of the day, the thing we render out of those files is an array of objects, and you can just filter that array and choose the ones you want. But also, we just did a lot of kubectl; that's another thing, in the heat of the moment you can just copy and paste.

Right. I mean, there are multiple ways to get at it; it's not a closed system. You can run `jsonnet -e` and do a lot of things in one-liners as well. This is beyond the scope of this talk, but it's not difficult.

Okay, thank you.

Okay, so I think that wraps it up. A big thank you to our presenters, and thanks everyone for coming. If I can please ask you to take any additional questions outside the room, and don't forget to rate the session afterwards. Thanks everyone.