All right. So, let's talk about when there's a catastrophe with FDB. A little introduction: Brandon and I are both with Wavefront, both on the operations team. I've been at Wavefront for more than three years, but we've been running FoundationDB as the core data platform for the service for more than four or five years. We'll go over a little background on Wavefront and then two case studies. This talk is titled "Restoring FDB After Catastrophe," but it's really about managing FDB when your infrastructure decides to break on you. And that's important because, over the time we've been using FDB, we haven't had issues with FDB itself — it's always been something related to infrastructure. As Evan said, the database runs on computers and disks, and those tend to break.

A little background on Wavefront, for those of you who aren't aware. Wavefront is a cloud-native monitoring platform — a time series observability platform — and, as I mentioned, we use FDB as the primary data store. We take in all this data and make sense of it as either alerts or charts. But mostly charts. Because sometimes your data looks like this, or like this, or even like this. And since we're arguably in the fish taco capital of the world, sometimes your data looks like this. That's an actual Wavefront chart.

So I'll talk a little bit about how Wavefront uses FDB, to set some context. We run a lot of FDB. When I put this together, we had ninety-some clusters — I think it was around 1,700 instances, just under 19,000 processes. Not a small deployment. And since Evan used last year's slides, I'm going to also use some of last year's slides to show you what Wavefront looks like, to help you understand how it can break for us. At a super high level, this is a very simplified view of a Wavefront cluster. There are three tiers: a web server tier, an application tier, and then a database tier. And importantly, Wavefront runs active-active, so everything you see here is duplicated — that's the little blue square that's not filled in. If I take a 30,000-foot view, this is what it looks like: the front end, the web servers I showed you, and then the duplicate services in either two different AZs or two different regions. And on the bottom, of course, is FDB.

So, as I mentioned, sometimes the infrastructure decides it doesn't want to cooperate. I'm going to let Brandon talk about the case of the missing kernels.

OK, cool. All right, yeah, I'm going to talk about the case of the missing kernels. A little background: something happened earlier this year. Our database tier was still pretty much on Trusty, a version of Ubuntu Linux, and for data storage we were using EBS for all of the FDB storage. At the time, we were mostly on the 3.13 kernel, which was the stock kernel. But based on some recommendations from working with Amazon, they'd recommended updating to a 4.4 kernel that was supposed to improve some things. We also had a few clusters running Xenial with a 4.4 kernel, where we were testing a move to the local NVMe instance storage. So this is a graph of TCP retransmits on 3.13, before we started putting out the 4.4 kernel — you can see, scale-wise, it's under 1. Then we switched some clusters to the 4.4 kernel, and we started to see pretty bad TCP retransmits.
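To ground what we mean by retransmits: the signal is the kernel's own TCP retransmit counter. A minimal sketch of sampling it on a host — the counter is standard Linux, the ten-second window is arbitrary:

```sh
# Sample the TCP retransmit rate on a host. RetransSegs is the
# standard counter in /proc/net/snmp (also visible via `nstat`).
a=$(awk '/^Tcp:/ {v=$13} END {print v}' /proc/net/snmp)
sleep 10
b=$(awk '/^Tcp:/ {v=$13} END {print v}' /proc/net/snmp)
echo "TCP retransmit segments/sec: $(( (b - a) / 10 ))"
```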
A lot of other things looked fine — in fact, it didn't manifest as a big problem for a little while, as the kernel trickled out to some of our clusters. Then: big jump. So we said, OK, this isn't working. We were able to look at graphs, look at times, and realize it definitely was the new kernel. In fact, we did some verification and testing to confirm that, switching between kernel versions on some of our test clusters. If I go back, you can see things look much better afterward.

So now at this point we have mixed kernel versions, and we want to get everything back to 3.13. At the same time, we're also doing in-place FDB upgrades, which for us — because we have this active-active model — means we can serve from one mirror while we stop FDB on the other and upgrade it in place. And since we're pausing the world on one mirror anyway, we thought, OK, we'll go ahead and get these kernel versions consistent with a reboot, because that's really, really quick. So we're working through these upgrades, and in general it's going well. But as I mentioned, there's a wrinkle: some of the clusters are Xenial with 4.4. So multiple clusters are being upgraded at the same time, things seem to be going well, we're getting into a groove. And on one of the Xenial clusters, we run the Trusty steps — which include purging all the 4.4 kernels. There are some sanity checks in there, but this is all a manual playbook with Ansible. So, working on the mirror cluster, we go to reboot, and the machines don't come back up. This was Xenial. We have no kernels. We're stuck at GRUB. "Do I still have a job?" Actual quote.

First things first, we call AWS. Is there anything we can do? Is there any way to get at the boot volumes of these instances — the boot volumes are EBS — and slide another kernel in there somehow? Otherwise the data is basically gone. And they say no, for various technical constraints, a lot of them around not being able to get at customer data, encryption, things like that. Basically, no. So that's gone. 20 terabytes of data, gone.

So how do we get out of this? FDBDR. At this point we had upgraded part of this cluster to a version of FDB that had DR available, so we knew it existed, but it wasn't something we'd actually used yet. So what do we do? We start doing some research. As I mentioned earlier, we have test clusters, and this is actually a very cool thing we're able to do: we can build a full Wavefront cluster and replay data out of our production monitoring into it, so the test cluster gets a realistic volume and data shape. So we take one of those clusters and build out the runbook — we basically figure out what we think the process will be to drop in new nodes, set up the backup agents and DR agents, all that. It works. So: testing in production, sort of. I say "sort of" because, like I mentioned, we went through the whole process once on one of our test clusters to verify it works and get everything out of the way. Then, with our detailed runbook in hand, we go through setting it up for real. Wire everything up, data starts moving, and 20 terabytes and many, many hours of data later — it works. It's done. Everything's caught up. So we disable the FDBDR link — basically, we break it — and now this other cluster is independent. We do a bunch of tests for functionality and correctness, the cluster has high availability again, and we're back in action.
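One cheap guardrail against the original mistake, for what it's worth: before any reboot in the playbook, refuse to proceed if the kernel purge left nothing for GRUB to boot. A rough sketch for an Ubuntu host — hypothetical, not our actual Ansible check:

```sh
# Pre-reboot sanity check: make sure at least one kernel image
# survived the purge, otherwise the host will be stuck at GRUB.
if ! ls /boot/vmlinuz-* >/dev/null 2>&1; then
    echo "FATAL: no kernel images left in /boot -- aborting reboot" >&2
    exit 1
fi
echo "Kernels available to boot:"
ls -1 /boot/vmlinuz-*
```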
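And for reference, the DR wiring we just described looks roughly like this — a sketch using the stock fdbdr and dr_agent tools; the cluster file paths are placeholders, and the exact flags may differ by FDB version:

```sh
# Run one or more DR agents that can reach both clusters:
dr_agent -s /etc/foundationdb/source.cluster \
         -d /etc/foundationdb/dest.cluster &

# Start replicating from the surviving cluster to the new one:
fdbdr start -s /etc/foundationdb/source.cluster \
            -d /etc/foundationdb/dest.cluster

# Watch it catch up (many, many hours for ~20 TB in our case):
fdbdr status -s /etc/foundationdb/source.cluster \
             -d /etc/foundationdb/dest.cluster

# Once it's caught up, break the link so the copy is independent:
fdbdr abort -s /etc/foundationdb/source.cluster \
            -d /etc/foundationdb/dest.cluster
```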
OK. So what did we learn? First thing: well, hey, now we have a pretty thorough, pretty good FDBDR runbook — built under duress, but it works. Another thing that came out of this process: realizing that we're going to shift more and more to NVMe-only instances, because they're a significant improvement in available disk IO, we started making some IAM policies in AWS around how these instances can be stopped and started (there's a rough sketch of this below). Because that's the main drawback: if the instance is stopped, those local disks are gone. So we improved some things there.

Automate checks; automate more of the process. Like I said, we had a few things in Ansible, but some of it was copy-paste from a runbook — you copy this command, you copy that command. OK: automate more of that, and add more automated health checks. And that's something we took to other routine procedures and other places too.

And then finally, fleet replace. Fleet replace is the term we use for replacing an entire FDB cluster: we launch a whole new set of instances and do the replacement using the native exclude/include capabilities of FoundationDB (also sketched below). This is something we've had a lot of success with in general operations, and we very much rely on it as we scale Wavefront clusters up and down based on point rate and demand. We have a process for deciding how many instances to use and at what size, and in the case of going up or down, we launch the new instances and bring them in. We have a whole Ansible-based process that makes sure the coordinators are moved and starts the excludes — depending on whether the cluster is SSD- or memory-based, we have a different process for that — and at the end it does a bunch of health checks: is it healthy, are there any unreachable processes, just going through all these things. And with that process in place — in some ways learning from this experience, adding more robust health checks, looking for more corner cases automatically — we also decided that things like kernel upgrades can be done as a fleet replace. Because what we found is that, as long as FDB is running, it doesn't care: you're currently running on Trusty instances and you drop in Bionic ones, and that's fine. In fact, we went through that process with all of our databases during the middle of this year — completely moved everything to Bionic and 4.15 or whatever the current kernel is — which has also been good.

So this is where I'm going to hand it back to MRZ to talk about some other things we learned.

So, we were running on EBS — where we were before, anyway — and in an EBS world we had a couple of problems we'd run into. Either EBS would have a catastrophic failure, and since we had a duplicate copy of our data in another mirror, we could copy it over; or, the other use case for us, we wanted to take an existing data set and just copy it someplace else. So I'll walk you through what we did from a process standpoint. Very, very important: you should keep track of your coordinators. We tracked them using instance tags and volume tags, so we could relate the volumes back to where the coordinators were. Then we stopped the cluster and did an EBS snapshot. Those are the simplest steps, but the important thing is: make sure you track your coordinators. And then the lengthy piece was resurrecting those snapshots on other machines. We were using LVM, so part of it was also making sure LVM came back up, then fixing the coordinators, redistributing the cluster config to all the members, and restarting the database — pray, and hope, and wait. What that looked like in practice: we had a set of origin machines, and we copied the cluster elsewhere. We just simply hand-edited the cluster file — even though it says "do not hand edit," we actually did that — and then, I think we were lazy, we just SCPed it out. And then we started back up. So those were the six steps we had.
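On the IAM guardrails mentioned a moment ago: the idea is simply to deny stop/terminate on instances whose local disks hold FDB data. A hypothetical sketch — the policy name and tag are made-up placeholders, not our actual policy:

```sh
# Hypothetical guardrail: deny stop/terminate on instances tagged as
# NVMe-backed FDB storage (tag key/value are placeholders).
aws iam create-policy --policy-name deny-stop-fdb-nvme \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Deny",
      "Action": ["ec2:StopInstances", "ec2:TerminateInstances"],
      "Resource": "*",
      "Condition": {
        "StringEquals": {"ec2:ResourceTag/fdb-storage": "nvme"}
      }
    }]
  }'
```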
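And to make the fleet-replace mechanics concrete: the core of it is FDB's own exclude/include machinery. A minimal hand-driven sketch — our real version is Ansible with health checks around every step, and the addresses are placeholders:

```sh
# New instances are already running fdbserver and have joined the
# cluster. Move coordinators off the old hardware first:
fdbcli --exec "coordinators auto"

# Exclude the old processes; this blocks until their data has been
# moved onto the rest of the cluster:
fdbcli --exec "exclude 10.0.0.11:4500 10.0.0.12:4500 10.0.0.13:4500"

# Verify health before tearing the old instances down:
fdbcli --exec "status details"
```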
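The snapshot steps, sketched with the AWS CLI — the tag names and IDs are made-up placeholders; the important part is recording which volumes belonged to coordinators before stopping anything:

```sh
# Tag each volume so snapshots can be traced back to coordinators:
aws ec2 create-tags --resources vol-0123456789abcdef0 \
    --tags Key=fdb-role,Value=coordinator

# With the cluster stopped, snapshot every data volume:
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
    --description "fdb data volume (coordinator host)"

# On the restore side, after attaching volumes created from the
# snapshots, bring the LVM volume groups back up before mounting:
vgscan
vgchange -ay
```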
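And the cluster file piece specifically — the format is simple enough that hand-editing works, "do not hand edit" notwithstanding. A sketch with placeholder addresses:

```sh
# /etc/foundationdb/fdb.cluster is description:ID@coordinator,list.
# Replace the old coordinator IPs with the new machines':
#   before:  wf_prod:a1b2c3d4@10.0.0.1:4500,10.0.0.2:4500,10.0.0.3:4500
#   after:   wf_prod:a1b2c3d4@10.0.1.1:4500,10.0.1.2:4500,10.0.1.3:4500

# Then (the lazy way) push it to every member before restarting:
for h in 10.0.1.1 10.0.1.2 10.0.1.3; do
    scp /etc/foundationdb/fdb.cluster "$h":/etc/foundationdb/
done
```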
But I think the important thing was that we had always thought this could work, and then we were in a situation where we had to find out whether it actually did. And I'm just going to read from my notes here. This actually worked because the coordinators' state tracks the topology of the cluster. So as everything came back online, the processes checked in with the existing coordinators and everything got updated — essentially, all the shards are accounted for. And then, behind the scenes, the bytes on disk have to be rematerialized from the snapshot by EBS. That took, well, a matter of hours — it was not fast. But we've done this at least once, maybe twice now. It's a great way to copy an existing production instance someplace else, if you can afford the downtime to do the snapshots.

Yeah, I think that's all we had. I don't know if we have four minutes for questions, if there are any questions.

I'll say, as an aside, that the TCP segment retransmission we saw was not catastrophic until it was, because processes would join the cluster, retransmit so much that they'd fall back out, and so the cluster was thrashing. On larger clusters it was never stable. And then we went and looked at the charts for all the clusters to see where we saw this. That was — if you care — on a 4.15.0-1037 kernel. Don't use that one.

No, this is not using that yet; this was just using EBS snapshots. I have not looked at that. And I'm not sure anymore, because we're not using EBS anymore. We've moved over entirely to NVMe, because then we have effectively infinite IO, which seemed to be one of the gating factors to running a high-performance database. With the caveat that you have to put a lot of guardrails in to make sure humans don't destroy your fleet, and you have to monitor Amazon's infrastructure yourself, because they don't tell you about SMART — you don't get to look at that stuff. So you have to infer disk failures, or pending disk failures. I think we're looking at the disk IO write queue length as an indicator: when it stops looking flat, that's a sign it may be time to just shut down FDB on that node and replace it. The other thing we did is put only a few processes on each actual physical disk, so that if one disk dies, the whole machine isn't down — just the processes on that disk. All right, thanks.
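As a footnote on that disk-queue signal: a rough sketch of watching it with iostat (from sysstat) — the device name is a placeholder, and the queue column is aqu-sz on newer versions, avgqu-sz on older ones:

```sh
# Watch the average request queue size on an NVMe data disk; a queue
# that climbs and stays high is the cue to drain FDB off the node.
iostat -x nvme0n1 5
```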