Okay, good morning, everyone. I'm Matthew, and this is Mike, who needs no introduction. We're both at Wavefront and have been for about two years, on the team that runs the Wavefront service and FoundationDB, and this is our story.

Before I talk about FDB, I'll tell you a little bit about Wavefront. We're Silicon Valley's best-kept secret: a SaaS time series metrics, tracing, and monitoring platform, and one of the few places running FDB at scale. Because of that, we've learned a lot about running it over the last couple of years. Our customers range from startups to large SaaS companies.

We talked about scale earlier. We run about 470 instances across 80 clusters, and some of those clusters span multiple AWS regions. That works out to about 3,300 FDB processes, and our largest clusters often exceed a million writes a second on the memory engine.

Since we're all showing off architecture, I'll show you ours. This is a really simplified view of a Wavefront stack. There are three tiers: a web tier, an application tier, and a database tier. The web servers run a small three-node key-value store on FDB. Wavefront runs two mirrors, a primary and a secondary, and this key-value store holds content that's common to both sides of the mirror: things like alerts, dashboards, and user names. The app tier is where all the magic happens: we take in telemetry, we do stuff with telemetry, we run queries and alerting, and we store it into the database, which is FDB. On the database side, as I mentioned, we run a memory engine, an SSD engine, and a co-processor, and I'll let Mike talk about the rest.

Yeah, so this is a very high-level view of how we architect our clusters. Every Wavefront cluster, as Matthew said, is actually two databases: we run both the memory and the SSD engine. On the SSD side, we stripe across a set of basic GP2 EBS volumes and shim EnhancedIO cache volumes in front of them. We use those cache volumes to cache our reads, which is important on AWS, and we'll talk about that later. It's a write-through cache, so the writes fall straight through into the EBS volumes. The memory engine doesn't use a stripe. There we can take advantage of being a time series database: we know we're always writing at the tip of the database, so we use write-throughput-optimized ST1 disks, which are actually magnetic volumes but are tuned for sequential reads and writes. Never both at the same time, but the FDB memory engine is only ever writing or reading: it does reads during recovery and writes during normal operation, because your reads come from memory.

In between the two databases, on each node, we have what is an FDB co-processor. The co-processor lives between our SSD and our memory engine. Because this isn't a Cassandra-style, consistently hashed database, it's an ordered key-value store, the co-processor performs compression, sorting, and other optimizations, aligning the data on boundaries in the database so that we get the maximum reads and writes out.
And it's aware of the operating space available to the memory tier. The memory tier is very sensitive to how much space is available to be written into it. So our FDB co-processor can shut down writes to the database, shovel data out, and then resume workloads. We queue all the workloads up and keep everything durable, so we don't lose anything.

We run our databases with a one-to-one process-to-core count. Earlier we talked about the different roles: proxies, resolvers, logs, and storage, plus the stateful and stateless transaction processes. Those classes aren't specific things you have to run; they're just annotations on a process saying "this is a transaction process," and it can then hold any of those roles. We make sure we have one-to-one for storage engines, and storage to transaction is two-to-one: we run two storage processes for every one transaction process on the memory tier. Because we're doing such a high volume of writes, we need lots of proxies and resolvers to push into the transaction logs, which are carrying those mutations. I'll pass it back to you.

All right. We run Linux, and so we've become experts at running on Linux, and experts at running in a somewhat hostile virtual environment. We've learned a lot about how to tune Linux, and the machines we run on, around disk IO, CPU and memory, and the kernel. NUMA is a thing that no one remembers until Snowflake told us to think about it. On multi-socket instances like the i3.16xlarge or 8xlarge, we're pretty prescriptive: we bind a process to a CPU and its memory so we don't have the CPU thrashing across NUMA nodes on memory access.

Mike touched on our disk layout a little earlier. We're heavy users of a disk cache called EnhancedIO. We focus mostly on read caching, and we have a one-to-one mapping between each EBS volume and the instance store. Mostly because Amazon doesn't discriminate between read and write IOPS, we sort of cheat: we get free read-caching IO, and we then have a bunch of extra IO left over for our writes. What this means in practice: the blue line is our read cache hit rate across one of the larger clusters we have, and the green is the writes, which we don't really cache at all. Yeah, so we get about 100% read caching because we shim an NVMe in front of each volume. The only overhead on reads is just what it costs in FoundationDB, because the cache is lightning fast. The only other cost is the time it takes to build the cache, which is not very long, about a day or so.

We've also become experts at tuning the kernel. These values work for us on our workload; you should experiment. But this is something that we have done. Yeah, it was really important for us to tune the networking layer in the kernel, especially for the cluster controller. That's where latency can kill your cluster: it's doing the health checks, it's deciding who's going to take what role and which roles get assigned out. If your kernel isn't tuned well, you can actually flood and DDoS your cluster controller and take the cluster down. And we're not saying that's ever happened to us. No, not on production. So it was important to make some of these changes so that we could go from running three- or four-hundred-process clusters up to much higher; we're at almost 500 to 600 processes on a single cluster now, just from tuning the kernel so that we get the best performance. That's all you. All right.
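To make that process layout concrete, here's a minimal sketch of generating a foundationdb.conf along the lines described above, with NUMA-pinned processes and roughly a two-to-one storage-to-transaction class ratio. This is not our landing-party code: the ports, paths, process counts, and the numactl wrapping of the fdbserver command are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Sketch: generate a foundationdb.conf with NUMA-pinned, class-assigned processes.

Illustrative only: ports, paths, counts, and the numactl wrapping are
assumptions, not the actual landing-party output.
"""

NUMA_NODES = 2          # e.g. a two-socket instance exposing two NUMA nodes
PROCS_PER_NODE = 8      # fdbserver processes to pin to each NUMA node
BASE_PORT = 4500

HEADER = """\
[fdbmonitor]
user = foundationdb

[general]
cluster_file = /etc/foundationdb/fdb.cluster

[fdbserver]
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb
"""

def process_class(index):
    # Roughly the layout described above: two storage processes for every
    # transaction (log) process, plus a stateless process for proxies/resolvers.
    cycle = index % 4
    if cycle in (0, 1):
        return "storage"
    if cycle == 2:
        return "transaction"
    return "stateless"

def render():
    sections = [HEADER]
    port = BASE_PORT
    for numa_node in range(NUMA_NODES):
        for i in range(PROCS_PER_NODE):
            sections.append(
                f"[fdbserver.{port}]\n"
                f"class = {process_class(i)}\n"
                # Pin both CPU and memory to one NUMA node so a process never
                # pays for cross-node memory access.
                f"command = /usr/bin/numactl --cpunodebind={numa_node} "
                f"--membind={numa_node} /usr/sbin/fdbserver\n"
            )
            port += 1
    return "\n".join(sections)

if __name__ == "__main__":
    print(render())
```

The per-process [fdbserver.&lt;port&gt;] sections override the [fdbserver] defaults, which is what lets each process carry its own class and its own binding.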
So one of the things that's also important if you're going to operate FDB at scale is the instance lifecycle. It's hugely important, because we have some clusters operating at up to 20 nodes. Being able to ensure that these instances come online ready to go, that they are 100% operationally ready, matters. We use Terraform, we have a system called landing party that configures these instances on first boot, and we also make heavy use of Ansible so that we can do entire fleet replacements at once, because we want to be able to quickly and easily change out all of the infrastructure.

Our database configuration in Terraform is dead simple: how many nodes, of what type, how much storage do you want on them, and what version of Ubuntu are you running. We were locked to 16.04 for a bit; we've just been unlocked because we had to rewrite some of EnhancedIO to support later kernels, which also gets us onto the newer AWS kernels. That, plus some very simple helper scripts, is enough to go out and deploy these databases. And of course, because we're an observability company, it's important for us to immediately see when these nodes come online and how they actually perform: you can see new memory storage nodes coming online, how much CPU they're using, and a spike in the data being processed, because there was likely a slight backup as the node joined.

Honestly, we want to let computers do the hard work for us, and that's what landing party does. We have landing party and post-boot systems. When a system comes online, if it's a brand new cluster, they configure the memory and the SSD tier for us automatically; we don't have to touch it. It just configures new double-redundant memory or SSD engines. It prepares the FoundationDB configuration files, including things like our NUMA settings: it detects whether we're on an i3, whether it's a 16xlarge or an 8xlarge, and pushes out the right NUMA configuration. It gets everything ready so that all we have to do is turn FDB on. That's the one thing we don't automate: we don't let it turn FDB on by itself. The instances join, they're prepared, they're ready to roll, and we turn them on with intent. That lets us stage work; it lets us prepare for a customer to grow their infrastructure and then turn it on under control. In the background, there's a process that takes the cluster file and pushes it into S3, so on launch we can pull it back out and the machines come up ready to go. Yeah, exactly. The landing party also pulls down that cluster file so it's synchronized across all of the machines, all of the nodes, so we never have to manage that by hand either, because that can be rather error prone.

So, fleet replacements. Really: let computers do the hard work for you. The thing with fleet replacements is that we need to be able to change the tires on the car while the car is moving, and we don't want humans to have to deal with it. So we've written tooling that will go in and completely replace an entire cluster without any human interaction beyond starting it. It goes through, identifies which nodes are to be removed, excludes them, re-coordinates the cluster, checks to make sure the excludes have completed, and checks to make sure the coordination state has changed. Let's see if I can speed this recording up for you.
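While that demo runs, the sequence the tooling automates is roughly the following. The real job is Ansible; this is just a hedged Python sketch of the same fdbcli steps, with placeholder addresses and a simplified completion check.

```python
#!/usr/bin/env python3
"""Sketch of the fleet-replacement sequence described above.

The real tooling is Ansible; this only illustrates the fdbcli steps:
exclude the old nodes, move coordinators, verify, then clean up.
Addresses and paths are placeholders.
"""
import json
import subprocess

CLUSTER_FILE = "/etc/foundationdb/fdb.cluster"

def fdbcli(command, timeout=None):
    """Run a single fdbcli command and return its stdout."""
    return subprocess.run(
        ["fdbcli", "-C", CLUSTER_FILE, "--exec", command],
        check=True, capture_output=True, text=True, timeout=timeout,
    ).stdout

def excludes_finished(addresses):
    """Simplified check: the outgoing machines no longer hold any roles."""
    status = json.loads(fdbcli("status json"))
    live = {
        proc.get("address", "").split(":")[0]
        for proc in status["cluster"]["processes"].values()
        if proc.get("roles")
    }
    return all(addr.split(":")[0] not in live for addr in addresses)

def replace_nodes(old_addresses):
    # 1. Exclude the outgoing nodes; fdbcli waits while data is re-replicated.
    fdbcli("exclude " + " ".join(old_addresses))

    # 2. Re-coordinate so no coordinator is left on a node we're about to kill.
    fdbcli("coordinators auto")

    # 3. Sanity check before anything is terminated.
    if not excludes_finished(old_addresses):
        raise RuntimeError("excluded nodes still hold roles -- do not terminate")

    # 4. Terminate the instances out of band (Terraform / the ASG), then clear
    #    the exclusion list so stale excludes don't pile up.
    fdbcli("include all")

if __name__ == "__main__":
    replace_nodes(["10.0.0.5:4500", "10.0.0.6:4500"])
```

The property that matters is the ordering: nothing gets terminated until the excludes have finished and the coordinators have moved off the outgoing nodes.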
So that's actually what the job is going through and doing right now; it's just an Ansible job. When I first started at Wavefront, I was challenged by one of the founders, who said it wasn't possible to do this: you're not going to be able to automate excluding and replacing an entire cluster all at once. So we did. This lets us do in-flight OS upgrades, basically hitless upgrades. We can just drop in a whole new set of infrastructure; the tooling understands what's new and what's old and goes through the exclude process without the human-interaction piece of it. And at the very end, it disables termination protection and turns off the instances for us. Yeah, the last thing it does before that is a sanity check: did the coordinators move for both tiers, are our excludes finished, did we actually exclude the right number of nodes? If everything passes, it disables termination protection, destroys the instances, and then re-includes the excluded IPs so that you don't have wasted excludes sitting out there.

I love that the motivation for the tooling was learning the hard way what happens if you terminate a cluster coordinator without having a replacement for it. Yeah, which is... it doesn't work after that. Yeah, it doesn't work after that.

So this, I believe, is going through the cleanup phase now. All of this is really a prime candidate for automation: we have these in Jenkins as schedulable jobs. We don't want operators to have to remember syntax and go in and do this by hand. This is just a good demonstration of what it looks like and how we do it. Yeah, it's only three minutes, it feels like. It's the longest three minutes of your life, man. No, really, though, the important thing is that, as MRZ said, we did find out the hard way what happens if you destroy your coordinators before you've re-coordinated: the cluster's gone. And what happens if you terminate your node before the excludes finish? Well, if it's only one, it'll heal, it'll re-replicate. If you don't have enough replication factor, or it's too many nodes, you have data loss. So these tools were specifically built to remove that human error, to get rid of the element of "it's possible I could fat-finger something, forget to exclude something, forget a re-coordination step." The goal was to let future engineers join the team without having to worry about those things. They can focus on operating, not on learning the scary parts of FDB.

So, I don't know how many of you are familiar with the FDB trace logs. This is what they look like. If you enjoy reading XML... I don't. They're incredibly spartan, and I actually found reading the source code for FoundationDB lighter reading than trying to grok these. So we have a tool internally called the Wavefront FDB Tailer. It takes this eyesore and turns it into this, which is from one of our larger clusters. It shows you where the roles are assigned, what machine, process, and port they're on, how much data is flowing through each node overall, and then how much per process, so that we can find data distribution issues. Sometimes data isn't properly distributed; sometimes the CPU is hot somewhere and your TLogs, which are CPU bound, are going to have some issues.
It shows us how much storage memory is being used and, for the transaction processes, how much they're using. It's everything that comes out of the trace logs, but readable.

Which leads to the next point: monitoring, which is basically the bread and butter of what we do. We dogfood: we use our own platform to monitor ourselves. Internally we have the Wavefront FDB Tailer and also some Python scripts that pull data out and push it up into our monitoring platform. But instead of just showing you more pictures of dashboards, I'm going to try to show you live what it looks like on our platform and how it works. So while we figure out the technical aspects of whether this is going to work, give us a second. The Wavefront FDB Tailer, by the way, was open sourced today; at the end we'll have a link. Yeah, a big shout-out to Devon sitting over there and Jay Baal right behind him, who put in a lot of work to make that possible, because we weren't sure it was going to happen by today.

So, the big thing in our platform is that we have time series, monitoring, alerts, events, and tracing. But what really sells me on it, and why I love it so much, is just how easy it is to see what's going on. This chart doesn't look like much, but it's actually running a derivative of the roles that are assigned, and we can see immediately when there are changes in the rate. If the role is a log, the value is always some constant amount, and if that changes, you see a flick in the value. And we can see a big red box: that's an alert that we have cache errors. Something just broke. This is how we monitor FDB: we have to know, for every piece from top to bottom, what's working and what's not working.

Yeah, there we go: EnhancedIO cache errors. We can see which host is dead, which cache is dead. This was the start of an event where we actually lost an instance store, and we caught it right here. One of the fun things AWS will never admit to is that things do break all the time in their environment. It is a very ephemeral cloud: things come and go, instance stores die, EBS volumes go corrupt. You have to architect to deal with that. So, as we said, we put an NVMe in front of each of the EBS volumes; when one of them goes, we have to stop and start the system to get an instance store back, and we have to rebuild our caches. But we have the alerts that show us this, and we can see in the FDB data that something isn't performing right. And then here, near the end, you can see a lot more role changes as the cluster starts to restore order and come back to life.

This is more of what comes out of your trace logs. You can see that around 10:45 the latency in FDB starts going awry: we're missing data, it's spiking up and down, the cluster is actually struggling a bit. The storage and log queues, which we talked about a bit earlier (your log queues hold the mutations that are eventually going to make it into the storage queues), typically should never go above a threshold, and for our configuration that's about one gig of data. We can see where we begin to lose operating space; we can see it falling off. This is all stuff that we get out of the trace logs from the tailing system. We can see where the SSD engine is, which is actually this area right down here.
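The arithmetic behind that queue threshold is easy to sketch. This is not our production alert; it's a rough illustration against the output of fdbcli's status json command, and the exact field names should be verified against your FDB version.

```python
#!/usr/bin/env python3
"""Sketch of the log/storage queue check described above.

Not a production alert; just the arithmetic, using per-role counters as they
appear in fdbcli "status json" output (verify field paths on your version).
"""
import json
import subprocess

QUEUE_LIMIT_BYTES = 1 * 1024 ** 3   # ~1 GiB, the threshold mentioned above

def status_json(cluster_file="/etc/foundationdb/fdb.cluster"):
    out = subprocess.run(
        ["fdbcli", "-C", cluster_file, "--exec", "status json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)

def queue_hotspots(status):
    """Yield (address, role, queue_bytes) for log/storage roles over the limit."""
    for proc in status["cluster"]["processes"].values():
        for role in proc.get("roles", []):
            if role.get("role") not in ("log", "storage"):
                continue
            # Queue size = bytes accepted but not yet made durable.
            queue = role["input_bytes"]["counter"] - role["durable_bytes"]["counter"]
            if queue > QUEUE_LIMIT_BYTES:
                yield proc["address"], role["role"], queue

if __name__ == "__main__":
    for address, role, queue in queue_hotspots(status_json()):
        print(f"{role} queue on {address} is {queue / 1024 ** 2:.0f} MiB, over limit")
```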
The key-value store drops out a couple of times, and that spans from when we first lost the drive to when we had to restart. One of the other nice things is that you can see which processes are dying: you have immediate observability into processes dying and restarting. Those are just memory storage nodes that went down with the event, and they begin the process of self-recovery. This is a very, very durable database. It's very operationally... it's a challenge. It's a challenge, but not because you're trying to keep it from corrupting your data, like MySQL or Mongo, and not because it has insane defaults. It's a challenge because it's so durable that it favors keeping your data safe, which means it's going to make some hard decisions for you: it'll stop the world and recover itself if it thinks it needs to. But the alternative is losing customer data, and you really can't afford that.

This is one of MRZ's favorite charts. It monitors the file sizes on disk for the transaction processes: the actual transaction log files and the rate of change in them. As the cluster recovers, pulling data in and reading those files back into memory, the processes grow in size; they're changing, they're recovering, and we need to be able to see that. And that black line right there, which is his favorite line, tells you whether or not the transaction logs are still reporting out to FDB. If those stop reporting, FDB is hard down and you have to go do a bit of deeper surgery on it. But this is how we can quickly and easily identify and show people: hey, FDB is recovering fine, it'll come back on its own, don't worry, it's re-reading TLog files from an earlier outage.

And some of the other items around this event: we track all of these metrics because they matter to us. You can see in the window when it first went out that our IO queue depth wasn't moving, which means IO wasn't being read or written; nothing was happening. You can see where we lost it and where it came back on that particular node. So this is all part of what goes into observability for FoundationDB for us, and without it, I honestly don't know how we would survive. And we build alerts on this? Yeah, we do, we actually build a lot of alerts on this.

One of the charts I've scrolled past quite a few times is this pegged-CPU-process chart. What's fun about it is that the trace logs are constantly emitting values, and a very important one is CPU seconds: how much CPU time is this process consuming? If a particular process dies and becomes unreachable by the cluster, it stops emitting that value. Well, the FDB log tailer continues to emit the last value it saw. So we can abuse this knowledge and write alerts that look specifically for processes that are no longer emitting new data. We can find processes that have fallen out of the cluster, for any number of reasons, because the rate of change for that process is now zero. That lets us nullify all the other processes that are showing rates of change and highlight only the ones that are no longer reporting their CPU time. We use this to find the process, kill the process, and restart it; fdbmonitor goes in and restarts the process.
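That stalled-counter trick is simple to express. The real alerts live in our monitoring platform; here's just a small Python sketch of the logic, with made-up sample data.

```python
"""Sketch of the stalled-CPU-seconds check described above.

A live fdbserver process keeps accumulating CPU seconds, so a counter whose
rate of change drops to zero marks a process that has fallen out of the
cluster (the tailer keeps re-emitting the last value it saw).
"""
from typing import Dict, List, Tuple

# Per-process samples of (unix_timestamp, cpu_seconds) from the trace logs,
# keyed by "address:port". Values here are made up for illustration.
Samples = Dict[str, List[Tuple[float, float]]]

def stalled_processes(samples: Samples, window_s: float = 300.0) -> List[str]:
    """Return processes whose CPU-seconds counter has not moved in window_s."""
    stalled = []
    for process, points in samples.items():
        if not points:
            continue
        latest_ts, _ = points[-1]
        recent = [cpu for ts, cpu in points if ts >= latest_ts - window_s]
        # A healthy process burns at least some CPU over five minutes; a flat
        # line means the tailer is just repeating the last value it saw.
        if len(recent) > 1 and max(recent) == min(recent):
            stalled.append(process)
    return stalled

if __name__ == "__main__":
    demo: Samples = {
        "10.0.0.5:4500": [(0, 100.0), (60, 101.2), (120, 102.4), (180, 103.7)],
        "10.0.0.6:4501": [(0, 250.0), (60, 250.0), (120, 250.0), (180, 250.0)],
    }
    print(stalled_processes(demo))   # -> ['10.0.0.6:4501']
```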
Once that's done, the cluster goes back to normal operations. So these are just some of the operational challenges for us, but having those trace logs turned into telemetry let us really dig in and find these kinds of issues, where before we would typically be banging our heads against the wall asking why isn't this recovering, why isn't this coming back. We built dashboards, we built alerts, and now computers do all the heavy lifting and work for us.

And that's all I have. Yeah, I think that's it. I don't know if you have any questions about how we do what we do. You want to put up the last slide? Sure, give me a second while I find it. While he finds the last slide: it'll have the link to the GitHub repo, and the slides are posted too. For the dashboard that Mike showed you, we've exported the JSON, and it'll be up in the same repo as well. So if you want to try it out, you can experiment with the same sort of dashboards that we use in production. I think that's all we've got.