Okay, now it's on, much better. You ready? Okay, all right, so I'm delighted to be here with all of you today to talk to you about a project that we've had underway for about a year to take an existing open source document database and retrofit it so that it uses FoundationDB under the covers, right? And so, you know, those of you who've spent some time around FoundationDB, I think we recognize its potential for building these elegant pieces of engineering, right, that are just architected and designed and implemented to the T. My project's a little bit different. This is not greenfield development. This is taking something that we're very proud of, that's got all kinds of bells and whistles, and figuring out how to put it on a more solid footing going forward. So I need to tell you a little bit first about CouchDB itself. You know, by the way, it turns out this is something people actually do with some regularity. It's kind of crazy that they pick houses up and move them around. I didn't realize. So, CouchDB at a glance: it is a database that's delivered as a web service. You communicate with it using HTTP, and you send JSON payloads. These JSON documents are the atomic unit of updates to the database; the boundary of an update is the boundary of a document. These documents have primary keys, and you can create secondary indexes, oftentimes using server-side JavaScript to do so. The project kind of pioneered the use of these internet-accessible change capture feeds that have turned out to be a really popular way of interacting with the database and building kind of event-driven applications. And one of the main use cases for that is active-active replication. We have lots of people who run CouchDB in multiple cloud regions, or in their on-premises data center and in a cloud region, and synchronize in both directions. And the database has all the metadata necessary to ultimately converge the state on each side of that, or every corner of that topology. And really, all those features have been around for almost a decade; 1.0 included everything you see on that list of features. In 2.0, we introduced support for clustering, which is gonna be a big topic of the conversation today: how that's gonna evolve in this brave new world that we are entering. 3.0, which is due any day now, is intended to be our best attempt at the classic CouchDB architecture. We've added a few new features, but really it's our attempt to place a stake in the ground and say, great, that's the end of that architectural line, because 4.0, we've committed as a project, is gonna be using FoundationDB under the hood. It's a relatively compact code base, and one of the reasons for that is that it's implemented largely in Erlang, which is a pretty expressive high-level language and a good language for building highly available concurrent web services. It's not such a great language for high-throughput handling of data, and we've worked around some of those issues, but overall it's something that we're pretty happy with. And so step one of adopting FoundationDB was implementing Erlang bindings and putting those out in the community, so that was one of the ones on Alex's list there. So let's talk for a minute about how that clustering that we introduced in 2.0 works, just so you can kind of understand our motivation and our rationale for heading down this path. Every database in Apache CouchDB is split into shards.
Those shards are replicated across a series of nodes, and each of the documents is mapped to a specific shard in the database using consistent hashing. We also have some support for compound primary keys, so instead of the entire primary key of the document determining the routing, you can say, hey, I wanna co-locate these documents that have a partition key that's shared in common, right? So we do that. And crucially, every one of those replicas is able to independently decide whether to accept a particular update. Updates are supposed to be applied against a base version of the document, and so the most common reason for an update to be rejected is because the snapshot of the database has changed, right? The document has been updated underneath you, and now it's gonna reject an update that is applied against an earlier base revision. But crucially, every replica of a shard does that independently. There's no consensus going on. The shards do maintain enough metadata to synchronize after the fact and to ultimately get to the same view of the world, but that can be a multi-version view. If replica one accepts one update, and replica two accepts another update, eventually they'll both see both updates. And then from an indexing perspective, each of those shards builds a local index, which is great for scaling the indexing throughput. It's nice and easy and simple, but it does mean that when we wanna query secondary indexes, it's a full scatter-gather operation, because we don't know a priori which nodes are actually hosting the portions of the secondary index that are relevant for the query that the user executed. So this is a simple, I would argue, clustering design. It is operationally simple in the sense that the cluster is basically homogeneous. It's simple in that the system ultimately gets back to a good state on its own 99 times out of 100. And it's served us well in production for quite a long time. At my employer, in IBM Cloud, we use Cloudant, which is based on CouchDB, both as kind of our answer to DynamoDB on the Amazon side, and also as a core piece of critical internal infrastructure. Lots of the IBM Cloud runs on the Cloudant service. But those of you who know your way around distributed systems can foresee some of the pathologies that can occur with a system like this. And so let's just enumerate them a little bit here. One of the challenges is, of course, the scaling of those queries. There's a top end to the amount of throughput we can deliver for queries against secondary indexes in this design because you're always hitting every shard, no matter what. Another one: guaranteeing that time moves forward when reading these secondary indexes is a challenge. When I'm reading primary documents in an Apache CouchDB cluster, I've got quorum operations. I submit a document and it gets acknowledged when a majority of the copies commit it, and when I read, I also do a quorum operation against a majority of the copies to actually return the response to the client. And so that papers over most of the kind of issues that might be encountered in this eventually consistent system.
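Just to make the routing and quorum arithmetic concrete, here's a minimal Python sketch. Everything in it is an assumption for illustration: the hash function, the shard count, and the n/w/r values are placeholders, not the actual Erlang logic inside CouchDB's clustering layer.

```python
import hashlib

N_SHARDS = 8        # q: shard ranges per database (assumed for the example)
N_COPIES = 3        # n: replicas of each shard
WRITE_QUORUM = 2    # w: acks required before a write is acknowledged
READ_QUORUM = 2     # r: replicas consulted on a read; note w + r > n

def shard_for(doc_id, partition_key=None):
    # Route on the shared partition key when one is present, so co-partitioned
    # documents land on the same shard; otherwise route on the full doc id.
    routing_key = partition_key if partition_key is not None else doc_id
    digest = hashlib.md5(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % N_SHARDS

def quorum_read(replica_responses):
    # Each replica answers independently. Because w + r > n, at least one of
    # the r responses reflects the latest acknowledged write, so returning the
    # highest-generation revision papers over most of the eventual consistency.
    assert len(replica_responses) >= READ_QUORUM
    return max(replica_responses,
               key=lambda doc: int(doc["_rev"].split("-", 1)[0]))
```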
When I'm reading the secondary indexes, though, there's none of that quorum going on, or if you like, it's implicitly an R equals one quorum against the index read, which means that if I let the cluster bounce back and forth in terms of which replica ends up being used, it's entirely plausible that I might put a document in, get my two-thirds majority quorum to accept it, and then right away trigger an index read, hit the third copy, and not see it, right? Or if I've got some sort of recovery situation where the node was down for a little while and it's replicating in its changes and catching back up on the index, I can see a very divergent, mixed-up notion of time. We do stuff to try to protect against that, but it's stuff we have to do to try to protect against that. Similarly, those change capture feeds I talked about end up with another sort of interesting pathology. We make an at-least-once guarantee on the semantics of the changes feed. We guarantee you'll never miss an update to the database that occurred since the last time you checked in on the change capture feed. But that means if we have to fail over to another replica of a shard in order to hand over that changes feed, we don't always know exactly which updates have been observed, right? The question I'm trying to answer is, what is the largest possible update sequence of this shard that guarantees that I don't miss anything that was written over on the other replica of the shard? And we try to do a bit of bookkeeping to make sure that we have a very good bound on how much of the feed gets replayed, but it is annoying that these feeds do get replayed from time to time in failover situations, when people see updates that they had already previously consumed in their change capture feed. And finally, the really big one for me is the unavoidability of edit conflicts. It's really almost impossible for a developer against this system to build an application that never gets into an edit conflict situation. Even something as simple as a retry loop, with some sort of mutable bit of data going in, a timestamp or something in your document, can easily end up in a situation where concurrent edits get accepted by different replicas of the shard simultaneously, and now the system has a multi-versioned story for that one particular document. And I think if there's one thing we've learned in the past 10 years of NoSQL, it's that managing those edit conflicts, correctly handling them in the application layer, is a really fricking hard problem. So it kind of feels like we're here. We've got this house, and we see this cliff looming, and the cliff is getting closer. And so we said, we really need to sit down and work hard to address this stuff in the core of the database. And we did the sizing exercise of what it would take to introduce consensus over top of the individual replicas of the shards, do the work to ensure that there is one total ordering of writes for each of those individual shards, do the work to reorganize secondary indexes based on that, maybe something like RAMP transactions, things like that, that would give us scalability for the queries. But we also said, all right, let's do our homework and see if there are other things that could help us accelerate addressing these different gaps that we felt we had in our clustering technology. And so we said, we need something that preserves our existing API. We've got tons of users in production.
We've got lots of people who are happy with the semantics that we provide. Sure, there are warts, but we can't do a wholesale change, throw everything up in the air, and expect people to join us on that journey. We wanted something that we could be confident about. If there's one thing we've done as a project over the past decade, it's earned a reputation for reliability and durability, and we couldn't afford to go backwards in our posture on that front. We really needed something we could count on. We needed to scale up. Certainly in my environment, we run lots of large at-scale clusters. But we also have a broad-based user community that just downloads CouchDB and uses it in very lightweight scenarios, with little web applications in diverse environments. And so we needed something that could scale down, even if that wasn't necessarily its primary goal. We couldn't have something that had a minimum footprint of seven servers and hundreds of gigs of RAM, right? And we wanted something that just kind of had an impedance match. This is a you-know-it-when-you-see-it kind of thing, but we needed to feel right about the layering, so to speak, of what we might be putting in underneath what we wanted to do here with CouchDB. And so around this time last year, that's when we started taking a closer and closer look at FoundationDB. A few of us came to the summit last year, heard lots of great stories about how people were using it and the way the internals were working, and really gained confidence in our ability to depend on this as part of our go-forward architecture. So what does that do for us? Well, it does a few things. It absolutely eliminates those edit conflicts when apps are targeting a single cloud region or a single deployment of Apache CouchDB. It lets us really refocus our efforts on that active-active multi-region replication, which I think continues to be one of the main differentiating capabilities of the project. And so rather than having this replication system serve multiple purposes, synchronizing stuff within a cloud region, across availability zones, and between regions, now we've got a nice separation of concerns: we can optimize our replication system for that and let FoundationDB handle all the stuff in region. We can redo our secondary indexes in a much more scalable way, included as part of the write transaction, and do it the way it really deserves to have been done in the first place. And we get that totally ordered, sortable list of changes from the change capture feed, which again is a nice upgrade for a feature that a lot of our users find pretty attractive. I don't have time to go through all the data modeling, but I can give you a little bit of a sense. If you've perused the FoundationDB site and gone through the design recipes for the document model and for simple indexes, you get the general idea; it looks a lot like the way that works. Versionstamps, for those of you who know about them, are a great way to build this change capture feed. For those of you who don't, this is a way for you to tell FoundationDB to insert a version of the database as part of the commit. You don't have to know as a client what that version is going to be; you just tell it, hey, this sequence of bytes in the key or in the value, replace this with the version at commit time.
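To make that concrete, here's a rough sketch of the pattern using FoundationDB's official Python bindings and the tuple layer. Our real implementation is Erlang on top of the erlfdb bindings, and the 'docs'/'changes' key layout below is invented for the example, so treat this as an illustration of versionstamped keys rather than CouchDB's actual schema.

```python
import fdb
import fdb.tuple

fdb.api_version(610)
db = fdb.open()

@fdb.transactional
def write_doc(tr, dbname, doc_id, body):
    # Store the document body (bytes, e.g. JSON-encoded) under its primary key.
    tr[fdb.tuple.pack((dbname, "docs", doc_id))] = body

    # Add a change-feed entry whose key is the commit versionstamp.
    # pack_with_versionstamp() leaves a placeholder in the key that
    # FoundationDB fills in with the real version at commit time.
    seq_key = fdb.tuple.pack_with_versionstamp(
        (dbname, "changes", fdb.tuple.Versionstamp()))
    tr.set_versionstamped_key(seq_key, fdb.tuple.pack((doc_id,)))

@fdb.transactional
def changes(tr, dbname):
    # Because the keys are commit versions, a simple range read returns the
    # feed in one total order across the whole database.
    return [(fdb.tuple.unpack(k), fdb.tuple.unpack(v)[0])
            for k, v in tr[fdb.tuple.range((dbname, "changes"))]]
```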
And so by using the versionstamp as a key, you just get that changes feed falling out almost for free. It's really quite nice. You do have to be a little careful, because it's not an idempotent operation at that point. And so we actually write a separate little transaction ID that allows us, in a retry scenario, to see whether the database has already accepted it. Because you get these situations sometimes where FoundationDB just doesn't respond to you if it's not healthy, and then you gotta go figure out whether it did or did not ultimately commit the update. But it works really well. Our data model is organized so that all our transactions end up being self-conflicting. That whole bit about how CouchDB expects a base revision of the document against which you're trying to apply the update? That just means that we end up with self-conflicting transactions, which is a good thing, right? We don't have to do extra gymnastics; that kind of just falls out of the data model. And I think it's a nice example of how the MVCC views of the world at the CouchDB level and at the FoundationDB level are simpatico. They hang together nicely in that respect. We do use the atomic operations inside FoundationDB for maintaining database statistics. Otherwise this would be a fairly high-contention operation and we'd have to do some gymnastics to try to avoid lots of conflicts there. So that's a nice little feature inside FoundationDB that we leverage. And a new piece that, if you've been paying attention to the forums, has shown up recently, I think in 6.1, is the metadata version caching. We use that as well. FoundationDB includes a bit of information with every one of your transactions that says, here's this version value, right? And the way you use that is you bump it if the information that you would like to cache has changed. Let me step back: the metadata version ships with every transaction, and you can check to see whether it's been updated for free, without issuing another read and ending up with a hotspot in your FoundationDB environment, right? One nice use case for this is to enable you to cache certain portions of the key space in your client. So for us, we use it for database metadata, schema information, access control lists, design documents, index definitions, that kind of stuff. And we do it in a two-level hierarchy. So there's one global metadata version for the entire FoundationDB cluster, and if that hasn't changed at the beginning of our transaction, great. If it has changed, then we have to do a second check to see, well, did the database metadata that I'm actually interested in change? Because we've got a hundred thousand databases floating around in this cluster, and most of the time it's not gonna be your database whose metadata changed. We don't wanna go and reread all of that schema information and recompute all that stuff. So we end up having this kind of two-layer caching hierarchy where the second layer does actually trigger a read to FoundationDB, but the first layer doesn't, and most of the time nothing's changed and we go on our merry way. There's a little sketch of how that check looks below. It's a nice little improvement for us at least, so thanks for that. There's more stuff on the data modeling; Garen's got a talk this afternoon specifically on how we did our secondary indexes and the different options we had there for search indexes and all kinds of stuff, so if you're more interested in that topic, definitely attend his session.
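Here's roughly what that two-level check looks like, again sketched with the Python bindings rather than our Erlang code. The per-database version and metadata keys are invented for the example; the only real piece is the \xff/metadataVersion key itself, which arrives with the transaction's read version and has to be bumped with a versionstamped-value atomic op. Presence and first-run handling are omitted to keep the sketch short.

```python
import fdb
import fdb.tuple

fdb.api_version(610)
db = fdb.open()

METADATA_VERSION_KEY = b"\xff/metadataVersion"

# Process-local cache: the global metadata version we last saw, plus
# per-database version values and the cached metadata itself.
cache = {"global": None, "db_version": {}, "db_meta": {}}

@fdb.transactional
def db_metadata(tr, dbname):
    # Level 1: the cluster-wide metadata version comes with the transaction's
    # read version, so checking it costs no extra round trip.
    mv = tr[METADATA_VERSION_KEY]
    if mv == cache["global"] and dbname in cache["db_meta"]:
        return cache["db_meta"][dbname]

    # Level 2: something changed somewhere in the cluster; this read checks
    # whether it was actually *this* database (hypothetical per-db version key).
    db_ver = tr[fdb.tuple.pack(("meta_version", dbname))]
    if db_ver != cache["db_version"].get(dbname):
        # Cache miss: re-read schema, ACLs, index definitions, etc. for this db.
        cache["db_meta"][dbname] = tr[fdb.tuple.pack(("meta", dbname))]
        cache["db_version"][dbname] = db_ver
    cache["global"] = mv
    return cache["db_meta"][dbname]

@fdb.transactional
def bump_db_metadata(tr, dbname, new_meta):
    # Write the new metadata, bump the per-db version, then bump the global
    # metadata version key, which must be written as a versionstamped value
    # (14 zero bytes = a versionstamp placeholder at offset 0).
    tr[fdb.tuple.pack(("meta", dbname))] = new_meta
    tr.set_versionstamped_value(fdb.tuple.pack(("meta_version", dbname)),
                                b"\x00" * 14)
    tr.set_versionstamped_value(METADATA_VERSION_KEY, b"\x00" * 14)
```

In the common case nothing has changed, so the function returns straight out of the process-local cache without any reads beyond the read version the transaction already has.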
We're gonna spend a little bit of time on how the deployment looks. I mentioned that we wanted to be able to scale up and scale down, right? So in CouchDB 4.0, we still have the simple situation where I can run CouchDB with an embedded FoundationDB, and everything's great. I can run two of those things in two different regions and set up replication across them, still just like you were using CouchDB in the past, right? Including all kinds of crazy topologies. You wanna set up a ring across five different regions where each one's replicating to its peer and the next one? You can still do that sort of stuff. But where it gets fun is disaggregating this stuff a little bit and actually having FoundationDB do the stuff that it's good at, in terms of giving me a strictly serializable, consistent, scale-out underlying key value store. And then the stuff that I have on top, my layer that implements the CouchDB interface, is stateless, right? So I get to scale that out nicely, and I get to take advantage of all the fun stuff that the folks in Kubernetes land are doing to make stateless application development in the cloud pleasant. I think we're contractually obligated at this conference to talk about microservices, so here we are, microservices. We've taken the steps to decompose some of the bits of the functionality that we're doing in CouchDB. So now the JavaScript execution engine can go off into its own service, the replicator can go off into its own service, and we get to basically do all the fun stuff, like I said, that goes along with Kubernetes. On the FoundationDB side, we also have a bunch of stuff we can do to take advantage of the scale of large compute infrastructure. The way we are running this at the moment is in what's called the three data hall mode. So we map FoundationDB's concept of a data hall to a cloud availability zone. The way this system works is it does replicate every key value pair that you put into FoundationDB into each of those availability zones, but the writes are actually replicated two times in each of two availability zones. The reason it does this is because FoundationDB has to have every transaction log that's configured to accept a particular key value pair accept the write in order to commit, and if you spread those across three availability zones, now you'd be in trouble if one of those availability zones went down. So instead, we store two copies in each of two zones. That means we can lose an AZ, and if we've got sufficient compute capacity in the other AZs, we can lose a server in each of those, and we can still keep on moving. We're still experimenting with some of this stuff as we go through performance optimizations, but one of the other things that we're starting to think may be important is nudging the stateless transaction processing processes into a single availability zone, because they do have to do a fair amount of communication with one another at the beginnings and ends of transactions, and so making that communication a fairly low latency operation seems to be important. And then finally, the other thing is the coordinators. The coordinators, to be perfectly honest with you, worry me a little bit. They scare me; that's your state, and you lose that, you're toast. So we are keeping those spread nicely across lots of different data centers and keeping them sort of far away from the action, right? You don't have to do that, but I personally feel a little bit better about having those things off in a quiet corner of the infrastructure and not in a place where they're in the data path for lots of unexpected customer traffic. And so that's kind of our design, our deployment architecture.
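For reference, the FoundationDB side of a setup like that is configured along these lines. This is a sketch, not our literal configuration: the zone name az1 is a placeholder, the other fdbserver options are elided, and the exact flag spelling may vary between FoundationDB versions.

```
# fdbcli: switch the cluster's redundancy mode (a one-time operation)
fdb> configure three_data_hall

# each fdbserver process advertises which data hall (availability zone,
# in our mapping) it belongs to via its locality flags, e.g.:
fdbserver --locality_data_hall az1 ...   # other options elided
```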
Where are we in this project, actually? Well, we've been at it, like I said, for close to a year. We've implemented a bunch of the functionality of CouchDB, not all of it, but a lot of the big chunks. We've gone through a lot of the modeling, gone through a lot of the implementation. A lot of these things are V1s, so there's some low-hanging performance fruit that we're working on addressing. And to that end, we've turned our attention now to operational hardening, monitoring, and those sorts of bits of work. One of the things we've started to do, because we've got a distributed systems problem here to a greater extent than we used to, is pick up more tracing technology. It used to be that we were running a distributed Erlang environment and we could just use the tracing functionality inside that one VM, but now we're dealing with FoundationDB and some other stateless stuff on the CouchDB side. And getting an end-to-end view of how the request is progressing through is something that we're finding pretty helpful, certainly in the development phase, where we just don't always know where we're introducing high-latency operations, doing reads that we didn't need to do, that sort of thing. But ultimately, this is something I'd like to be able to turn on at a sampling fraction in production as well, just to keep tabs on the latency of the system over time. What we've technically done here is actually just taken the JSON trace files that come out of FoundationDB and post-processed them into an OpenTracing-compatible format. Our layer has OpenTracing support built in, and so we can load that all into Jaeger or whatever, and visualize things as we go. We found this to be a nice tool and an area where I think we wanna drive some more integration, because I do think getting that end-to-end view is important. What have we learned? Like I said, this is a brownfield exercise. This code base was not designed with FoundationDB in mind, and as a result we find ourselves using maybe more transactions than we intend from time to time. You're running through the request path, flowing from code module to code module, and then you realize, oh, I need this extra little bit of data from the store, and I just use my closure, wrap it in a transaction, and off I go. Not only is that slow, relatively speaking, it's also not correct to read a different version of the data store in the course of responding to one request from a CouchDB user. But that's the beauty of it, right? I can very easily advocate for improving that situation, because the correct way of working with FoundationDB also happens to be the most performant way. That's a very nice little detail, and it so often isn't the case: here, instead of taking shortcuts for performance, the correct way to use the store is also the fastest way to use it. The other thing that's happening, and I think the one aspect of this project that causes a little consternation amongst our community of users, is that we're getting more restrictive in the data sizes and volumes that we allow to be stored. And frankly, this is a case of us taking our medicine.
We should have put these limits in years ago. We didn't, and we've been living with the fact that they're not really documented, but the database won't really work very well if you store that much data in this place, or index that many fields in that document. And so now we're sort of using this, frankly, as a place to say, look, FoundationDB has these limits on keys and values, so our hands are tied. They're not really; if we wanted to work around them, we could, but we're using FoundationDB's rigor in this space to good effect. My slides are still showing up on my laptop. I don't know what's going on here. Anyway, I'm essentially down to the end of my talk, so we can go quiet. My closing thoughts: they say in religion that the faith of a convert is the strongest faith, and after running eventually consistent systems for a decade, hallelujah, transactions are awesome, right? The combination of the key value interface and transactions in FoundationDB is simultaneously flexible enough for us to undertake a project like this, and powerful enough to make it worthwhile to do so. On the community side, it was great that we had the community update from Alex. The thing that I've recognized running my own open source project for a decade is that contributions come in all shapes and sizes. We don't all have to go learn Flow from Marcus and dig into simulation and Joshua and so on in order to make meaningful contributions to FoundationDB. There's a ton we can do to support the project around the periphery, and a lot of that is happening, but it's certainly something that I would echo is super useful. And then finally, enjoy the summit, have a fun time here, and absolutely hit me up if you have any questions about the work that we're doing. I'd love to meet you all. Thanks.