Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and PostgreSQL configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

So, we're happy today to have Adam Kocoloski. He is an IBM Fellow working on their cloud systems. He was also a co-founder of Cloudant, which IBM bought, and that's why he's there. My favorite thing about Adam: I've known about him for a while, though I've never actually gotten to meet him. The only reason I know about him is because he worked on databases and he has a PhD in physics from MIT, which is not something we come across very often in our line of work. So, Adam, I'm super happy and super grateful for you being here. As always, if you have questions for Adam as he gives his talk, please interrupt him, say who you are and where you're coming from, and ask your question; feel free to do this at any time. We want this to be a conversation, not just him talking into the void on Zoom. Adam, go for it. Thank you so much for being here.

All right, Andy. Thank you for the opportunity and for pulling this series together. I personally have found it a great resource to keep abreast of all of the things that are happening in our field across this broad spectrum of different projects and products. So I'm here to talk to you today about Apache CouchDB, an oldie but a goodie in my point of view. Couch has been through the requisite decade of poking and prodding and so on running production applications, and I think we know a lot about what it's able to do, a lot about what it maybe doesn't do such a great job at, and where we want to take it going forward.

So what is it? It's a DBMS as a web service. The HTTP API and the JSON protocol for exchanging data are the way that you interact with this database; it's a REST-style interface. It's a document store: it embraces that model of variant data types and so on as the way that you organize your data. It's always been positioned as something that's intended to be easy to get started with for an application developer who is iterating fairly quickly on the functionality they're introducing and maybe doesn't stop to think too hard about normalizing all of their data right from the ground up. It's also got a big focus on event-driven systems, reacting to events and updates that are happening in the database, and we'll go through the investments that we make under the hood to support that. There's a whole system for materialized view maintenance which is entirely asynchronous and typically driven by users defining those views in JavaScript functions that they upload into the server, which get executed in a sandbox environment. And the big one, the thing that I think probably keeps a lot of people adopting CouchDB, is the flexibility of its support for data replication: not just active-active replication between a couple of cloud regions, but also systems that might be disconnected for extended periods of time, systems running in constrained environments. There's a whole set of workloads that take advantage of CouchDB in those kinds of scenarios. It's been a member of the Apache Software Foundation for the past 12, 13 years.
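To make that HTTP/JSON interface concrete, here is a minimal sketch using Python's requests library. The host, credentials, and database and document names are illustrative assumptions, not anything from the talk.

```python
# A minimal sketch of the HTTP/JSON interface: everything is plain HTTP plus JSON.
import requests

base = "http://localhost:5984"          # default CouchDB port
auth = ("admin", "password")            # hypothetical admin credentials

# Create a database (CouchDB databases are lightweight; apps often create many).
requests.put(f"{base}/demo", auth=auth)

# Create a document by PUTting JSON to an ID of our choosing.
doc = {"type": "post", "title": "Hello CouchDB", "tags": ["intro"]}
resp = requests.put(f"{base}/demo/post-1", json=doc, auth=auth)
print(resp.json())   # e.g. {"ok": true, "id": "post-1", "rev": "1-..."}

# Read it back; the body comes back with _id and _rev metadata attached.
print(requests.get(f"{base}/demo/post-1", auth=auth).json())
```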
It's a database that is largely written in Erlang, which certainly has pros and cons. If you try to do heavy-duty numeric processing in Erlang, you typically drop into C pretty quickly. On the other hand, the crash isolation is really nice: it's rather difficult for one misbehaving connection to do much more than crash its own connection, and that gives us a nice model for isolating different users and different workloads.

So for today, what I'd like to talk to you about is three-fold. One, I want to go through some of the fundamentals: how does this system actually durably store data on disk, and why does that matter for us as administrators and developers using it? I want to go through the event-driven architecture, the view engine and the replication capabilities, because I think these are the things that make CouchDB what it is in the broader DBMS ecosystem. Then I want to jump into the world that we introduced with 2.0, the world that my team and I spent a lot of time on when we were at Cloudant, and that's the clustering system: how we built it, what it's done well for us, and also what it hasn't done so well for us, which motivates where we're taking the project going forward, which I'm super excited about.

The most basic job: how do we actually store the data that comes in? We do use B-trees under the hood, or some approximation of a B-tree; it's not a pure B-tree by any stretch of the imagination. We write JSON documents, and we don't spend a ton of time shredding them or anything like that; we really don't optimize the actual field storage much at all. But once those JSON documents are on disk, we create a tree structure that allows efficient retrieval of them by the ID of the document. All of that goes into a single file in the original editions of CouchDB, and it goes into that file in an entirely copy-on-write, append-only fashion. I write the document down, then I write the updated leaf node of the B-tree and then the path up to the root, all appending to that same file. Then I sit down and write a header: we do a durable write barrier with an fsync, write the header, issue another fsync, and boom, you've got your document durably stored. So that's the copy-on-write update path.

This does a few nice things for us. One is that snapshot isolation falls out of it in a rather straightforward fashion. A reader who starts their connection grabs the last header they can observe in that file, and then uses everything that header points to for all subsequent reads associated with that operation. Any concurrent writes naturally land after that point in the file, so they'll never be observed by the reader who grabbed that particular header. The other thing it does for us is that if we end up in a situation where a write fails halfway through, crash recovery gets way simpler. I still run into folks at IBM who were authors on the original ARIES paper, and that's a great, robust design. But when I think about the steps that DB2 has to go through for the undo and the redo and all of the replay to get itself back up to health, and I contrast that with this brutally simple approach: all this system has to do is open the file, go to the end, and seek backwards on 4K boundaries until it can pattern-match a header. Great, that's the header I'm going to use for the rest of my operations. Everything after it gets truncated, and I've got a healthy database. Essentially, every prefix of our database file is itself a valid point-in-time snapshot of the database.
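Here is a rough illustrative sketch of that recovery idea, in Python rather than CouchDB's actual Erlang code; the magic marker and header layout are invented for illustration and the real on-disk format differs.

```python
# Illustrative sketch of append-only recovery: scan backwards from end-of-file
# on 4 KB boundaries until a block that looks like a header is found, then
# drop everything written after it (the partial update).
import os

BLOCK = 4096
MAGIC = b"db_header"   # hypothetical marker identifying a header block

def recover(path):
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        # Start at the last 4 KB boundary at or before end-of-file.
        pos = (size // BLOCK) * BLOCK
        while pos >= 0:
            f.seek(pos)
            if f.read(len(MAGIC)) == MAGIC:
                # Most recent durable header found: the file prefix up to and
                # including this block is a consistent snapshot.
                f.truncate(min(pos + BLOCK, size))
                return pos
            pos -= BLOCK
        raise IOError("no valid header found")
```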
Do you have a separate structure per table, or is it for the entire database? Yeah, that's the other semantic, nomenclature thing: for us there isn't a notion of a database that contains multiple tables. A database is a lightweight construct in Apache CouchDB. Maybe you could treat it as more analogous to a table, except for the fact that there's no cross-database querying. What we oftentimes see is users create a database and do the typing themselves: documents will have different types, and you can create views that correspond to different tables of your data. But we have systems where you could have 100,000 databases in a single instance of the server. Okay, but in defense of the ARIES guys like Mohan and so forth, they're doing transactions across tables. Absolutely; it's far simpler when your boundary of isolation, your atomicity, is a single record. I think we'll talk about some of the work on that side of things down the line, but yes, the ARIES stuff is great work. Our constraints in this original NoSQL world were far simpler; we threw out a lot of features, and as a result you end up with a system here that allows for some fairly quick recovery.

The downside of this approach, of course, as you can imagine seeing all of the I/O that we do to update one single document, is that these files can become quite bloated over time. If I have a random update pattern, updating documents across the key space with random primary keys, I'm rewriting these B-tree nodes very, very frequently. And the pure append-only nature of this means that the old nodes referred to by an earlier database header continue to exist in that file in perpetuity. So we have to have a vacuuming process that goes through, takes all of the entries reachable from the latest database header, writes them to a new file and drops the old one, and we have to do that on a fairly regular basis. It's a good trade-off for us, but it did mean that we ended up spending an awful lot of time over the years optimizing the throughput of that process to ensure it could stay ahead of whatever write workload was coming in from clients.

And maybe you can just talk about this: your secondary indexes are just more B-trees that you're asynchronously updating? That's right. There are two indexes that we maintain atomically on write; everything else is asynchronously updated. And that's actually a good segue, because the other index is this one. As I said, we do maintain these two indexes atomically. Every time you write a document, we're maintaining the primary key index, finding documents by their document ID. But the other thing we're doing is maintaining this changes feed. This is an index of all the documents that have ever existed in that database, in order of their most recent update. So if you're inserting a new document, it shows up at the end of the changes feed. If, on the other hand, you're updating, say, document "baz" here, referred to in sequence number three, we'll remove it from sequence three and insert it again at sequence five. Because we're paying the overhead of having a second atomically updated index, we try to drive a whole bunch of different bits of functionality off of it.
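As a tiny illustrative model (plain Python only, not the real Erlang code), here is the shape of those two atomically maintained indexes: the by-ID index and the by-sequence changes index, where an update removes the document's old sequence entry and appends a new one.

```python
# Toy model of the two atomically maintained indexes: by_id maps doc_id to
# (seq, body), and by_seq is the changes index ordered by most-recent update.
class TinyCouch:
    def __init__(self):
        self.seq = 0
        self.by_id = {}      # doc_id -> (seq, body)
        self.by_seq = {}     # seq -> doc_id

    def write(self, doc_id, body):
        # Both indexes are updated in the same "commit".
        old = self.by_id.get(doc_id)
        if old:
            del self.by_seq[old[0]]        # drop the stale sequence entry
        self.seq += 1
        self.by_id[doc_id] = (self.seq, body)
        self.by_seq[self.seq] = doc_id
        return self.seq

    def changes(self, since=0):
        # Everything updated after `since`, in update order.
        return [(s, d) for s, d in sorted(self.by_seq.items()) if s > since]

db = TinyCouch()
db.write("foo", {}); db.write("bar", {}); db.write("baz", {})
db.write("foo", {"v": 2})                  # foo moves from seq 1 to seq 4
print(db.changes(since=0))                 # [(2,'bar'), (3,'baz'), (4,'foo')]
```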
The database compaction processes walk the sequence index and write things to a new file. The materialized view engine is powered off the sequence index. The replication capabilities are powered off the sequence index. The other thing we chose to do, one of the last API changes we made before releasing 1.0 of the project, was to externalize this index as a JSON endpoint: the _changes endpoint on the database. Any client can come in and get a line-delimited list of JSON records that have been updated since the last time they checked in. All they have to do is remember the last sequence position in that feed, and they can incrementally ask the database at any point in the future: tell me all the records that changed since the last time I checked in. The nice thing about this is there's no situation where, if you don't check in frequently enough, you have to do some sort of full re-sync of everything. The database isn't going to end up in a situation where the binlogs have fallen off and you're no longer able to replicate; you can just continue to ask what's changed since the last time you checked in. We find that people use this directly: they'll set up an AWS Lambda function or something like that to listen to that change capture feed and drive their own business logic off of it. It's also something that powers integration with external indexing systems. There's what used to be called a river in Elasticsearch: if you wanted to create a full-text search index off the side of your database, you could just have it listen to this change capture feed and off you go. So it's nice that you don't have to have another piece of code sitting somewhere that translates a commit log on the node into an API-accessible list of change capture events; that just falls right out of the API directly.

Deletions, importantly, are the other point I was going to make here. It is every document that has ever existed in the database, and we need that because you need to know if a document was deleted in order to remove it from your external index. Unfortunately, that means if you've got a workload that creates ephemeral records and removes them with reckless abandon, it can cause this index to become very, very large over time. So it's not the set of all documents that exist at the moment; it's the set of all documents that ever existed. Is this also compacted? Yes, it is, but those little tombstone entries do not get removed. All of the extra metadata gets removed, but the individual tombstone entry that says this document once existed and is now deleted stays. Okay, so if I said, give me everything since one, I wouldn't get two, but I would get six? That's correct; you'd get one, four, five, six. Thank you.
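A sketch of a client following that feed incrementally might look like the following; the since parameter and the _changes endpoint are as described above, while the host, credentials and checkpoint handling are illustrative.

```python
# Sketch of incrementally following the _changes feed, remembering only the
# last sequence the client saw.
import requests

BASE, DB, AUTH = "http://localhost:5984", "demo", ("admin", "password")

def poll_changes(since="0"):
    resp = requests.get(f"{BASE}/{DB}/_changes",
                        params={"since": since}, auth=AUTH)
    body = resp.json()
    for row in body["results"]:
        # Each row carries the doc id, its change sequence, and a deleted
        # flag when the change is a tombstone.
        print(row["id"], row.get("deleted", False))
    return body["last_seq"]       # checkpoint to pass as `since` next time

last_seq = poll_changes()
# ... later, pick up exactly where we left off:
last_seq = poll_changes(since=last_seq)
```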
So, topic three: the view engine. In 1.0 this was really the only way that you could do any efficient querying of CouchDB by something other than the ID of the documents. The way it works is you create a special class of document called a design document. That design document can have one or more JavaScript functions inside that create views. Those JavaScript functions get executed against all the documents in the database, and they can choose to emit zero or more key-value pairs for each document that they process. That gives you the opportunity, if you want, for something as simple as creating an index on some property of the document, or something a little more sophisticated. You can fake a certain basic type of join: for example, if you had a posts table and a comments table, you could create a view of a blog post and all the comments associated with that blog post in one order, presuming your comments have an attribute that says, this is the ID of the parent with which I'm associated.

As I mentioned earlier, the views are not the kind of thing that gets run in the commit path; they are all maintained asynchronously. When you go to query CouchDB and say, all right, I'd like to query this view, it's going to refresh that view automatically for you and hand you something that is updated to the current sequence of the database by default. That's the basic gist of it. And as you asked earlier, Andy, under the hood this uses the same basic B-tree structure that we're using for the primary key index and the changes index.

Speaking of that B-tree structure, I mentioned that it's definitely not a purist's B-tree. One of the more exotic things we do with it is that we actually store aggregations in the inner nodes of the tree. In the main indexes of the database, that's just some basic statistics, all server-controlled; there's no user-controlled flexibility on that front. But in the view engine we have more flexibility. On a view, in addition to the JavaScript function that sets up the index itself, you can choose to say, I would like you to maintain these statistics. There's a bunch of built-ins on that front: you can have it do sums and counts of various things that you emit, and you can have it do HyperLogLog-style count-distinct approximations, that kind of thing. You can also give us a JavaScript function that will get run over everything underneath a particular node, all the direct descendants of a particular node in the B-tree. At the leaf layer you'll be running your reduce function over all the key-value pairs that were emitted from that chunk of the B-tree, but your reduce function also has to run at the parent nodes in the tree, and there it's reducing the previous output of the reduce function. So you incrementally build up all the way to the root of the tree. This can be nice because it gives you really low-latency, incrementally maintained statistics. It's expensive to maintain these statistics, but it's cheap to query them: at the root node of the tree, for example, the aggregation over the entire view is something that will always be there and be up to date for you. You can also use it to do reductions, aggregations over every unique key in the tree if you like, say all the events that share a particular event ID or something of that nature. You can even do something a little in between: if your key happens to be a JSON array, you can get aggregations at multiple levels of granularity. With the whole unique key, the whole timestamp, you can do an aggregation of everything that shares that exact key.
You can also do a prefix and say, well, I just want the sum of all the sales for a particular date, where my key was year, month, date, hour, and this system will efficiently recompute that for you. I should probably have a better picture on the slide here, but what ends up happening is that as it traverses the tree, we pick up the aggregations at the highest possible level we can. In some cases that may be a fairly high parent node, because you're interested in the aggregation of everything underneath that inner node. In other cases we may have to drop down all the way to the leaves, because you're asking for an aggregation over a boundary that doesn't cleanly map to some portion of the B-tree. And that's okay: we'll rerun just that portion of the aggregation at runtime, merge the results, and give you the final answer.

What does the query look like for something like this? Because you explicitly, at write time, say I want to do a lookup; you have these _sum, _count, and I understand those are special cases that you guys handle under the covers. But then when I write a query on it, do I have to explicitly say, I know you have this aggregation precomputed? Yeah, it really is that explicit. This is not the talk to come to for cost-based optimizers, or really a declarative language of any kind. It is: I know that I need this aggregation, this aggregation is going to have a unique endpoint, a unique URL, and I can be flexible in terms of the ranges that I want to query and the number of results I want to get. I can skip ranges and things of that nature, but I'm explicitly going after this particular index that I had to define. Okay, that's right.

I noted that there be dragons in the do-it-yourself JavaScript function world, in part because we oftentimes see people write aggregation functions that do more of a projection than an aggregation; they just don't reduce the data. The best reduce functions produce a single scalar value, not some complicated JSON object with all sorts of various things computed inside, and oftentimes we find the latter to be the case. So this is not the easiest bit of functionality in the world to use, for sure, but it has proven to be an important tool in people's toolbox when they're building applications. Your B-trees, are they variable node sizes, or are they always fixed? Yeah, that's the worst part of it. The chunking function on the B-tree nodes will not necessarily produce the same number of children each time, so if the aggregation value gets too large, you can end up with a ridiculously tall B-tree as a result. That's right. That's why you said it's not a pure B-tree. That's right.
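Before moving on, here is a sketch of what those view definitions and prefix-level aggregations look like through the API. The design-document shape, the _view endpoint and the group_level parameter follow the mechanism described above; the database name and the sales document model are illustrative.

```python
# A design document holding a JavaScript map function plus a built-in _sum
# reduce, then a query that aggregates by key prefix with group_level.
import requests

BASE, DB, AUTH = "http://localhost:5984", "sales", ("admin", "password")

design = {
    "_id": "_design/reports",
    "views": {
        "by_hour": {
            # Emit a [year, month, day, hour] array key so aggregations can
            # later be requested at multiple levels of granularity.
            "map": """function (doc) {
                          if (doc.type === 'sale') {
                            emit([doc.year, doc.month, doc.day, doc.hour],
                                 doc.amount);
                          }
                      }""",
            "reduce": "_sum"   # built-in reducer maintained in inner nodes
        }
    }
}
requests.put(f"{BASE}/{DB}/_design/reports", json=design, auth=AUTH)

# Total sales per day: group on the first three elements of the key.
resp = requests.get(f"{BASE}/{DB}/_design/reports/_view/by_hour",
                    params={"group_level": 3}, auth=AUTH)
print(resp.json()["rows"])     # [{"key": [2021, 5, 1], "value": ...}, ...]
```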
So that brings me to replication. As I mentioned at the beginning of the talk, the replication capabilities are the thing that I think keeps people coming back to CouchDB. Let's talk about what that looks like. With just the things I've described so far, the change capture feed, you can set up active-passive replication, no problem: you can get this incremental list of all the documents that have changed on one server and replay those updates over on the other server. But we also have people that use it in different scenarios. We have lots of situations where people want to take a replica of a database and then take that second instance into a disconnected environment. We have folks using CouchDB for airline infotainment systems: the developers keep updating the catalog in some cloud database, and when the plane starts up it has an updated catalog and then it disconnects. It keeps the copy on that side and is able to serve things out to the individual headrest units throughout the flight. We furthermore have situations where that disconnected system is not just a read-only cache of the data, but something that's accepting additional updates in its disconnected state. We've seen retailers do this more and more, where the back of the store might contain a filtered subset of the product catalog for that store, but it's also allowing transactions to be recorded in the event that the connectivity at the store location goes down. They want to be able to continue to record updates in the store and, when connectivity is restored, sync those things back, and in particular they want to be able to do that without losing any edits. And I feel like it's one thing to say, sure, you can run active-active replication between your US East and US West, and your application developers should just be really careful about partitioning their writes so that they don't clobber each other. It's a bit of another thing when you're trying to do that with instances that could become disconnected for arbitrarily long periods of time. We don't want to give people the opportunity to just blow away their own writes with reckless abandon on that front.

So for that we need to introduce an additional piece of functionality, and that is the built-in edit tracking associated with every record. Now I'm talking about what's inside an individual document: that document maintains a history of its revisions. A lot of people jump on that and say, oh, I can build a Wikipedia-style history system inside the document. They're not intended for that. They're intended for concurrency control, essentially: identifying when updates to this individual document, identified by its ID, have occurred on multiple servers concurrently, and being able to detect not only that those edits occurred concurrently, but also a provenance relationship between them. Was one edit a descendant of another? Were they siblings? Some other more complicated history relationship? What it is is a basic hash history: we generate a revision identifier from the contents of the JSON document itself, including the previous revision identifier. So the same sequence of edits applied in the same order to two copies of a document will result in the same revision identifier. In particular, there is no notion of an actor like you would get with a vector clock or a dotted version vector or things along those lines, which can be a good thing and can be a bad thing. A lot of times you'll see people expect that there is an actor: they do an increment operation on two copies of a document simultaneously and then replicate, and those increment operations are perceived to be the same edit from the point of view of the document itself. They don't conflict.
If you want them to conflict, you have to explicitly introduce the notion of an actor via some other attribute of the document or something like that. The other thing, because it is just a hash history, is that there isn't actually a Git-style merge operation here. You can't take a hash history that has branched like this and say, I want to resolve revision three on the one side and revision four on the other side, introduce a merge operation, and make the history linear going forward. The convention is that you submit an update to one of those branches and mark the other one as deleted, so that there is only one branch of the history that is still alive, still preferred, and still served out to the indexers and things like that.

A couple of bookkeeping operations: we don't let these paths grow infinitely long; by default we keep a thousand of them. That can give you this kind of situation: I said servers could disconnect for very long periods of time, but if you have a server that disconnects from another one for so long that server A has a thousand-plus updates to this one individual record, then when you do the replication there's no longer an ability to link together these two edit histories, and the document with a thousand and one edits will appear to be a new sibling of the original one. So it's a balance between how much metadata you want to keep and the potential for these kinds of spurious edit conflicts, where we're no longer able to establish a provenance relationship between the histories of edits happening to this single document in two different locations.

When you clean up the history, do you guys have a dedicated background thread doing this, or is it cooperative? The cleanup of the bodies of the documents is a background thread that happens during compaction. The cleanup of the metadata itself, the maintenance of the revision history information, happens on commit. Okay. I think I might have written a patch at one point to defer some of the cleanup of the metadata history for really expensive merging operations, and it was not a good idea; it just deferred the pain, and it became more painful.

And the last piece of this, an area where I think we've always acknowledged there's plenty of room for improvement, is that all we do is preserve the final part of each edit branch. We'll never throw away the body of a leaf revision: when the background thread goes through and vacuums the database, or compacts it, it will preserve the last entry in every one of the edit branches, and we'll never get rid of an edit branch entirely. So there's a certain amount of metadata that always sticks around. We also don't allow you to say, hey, I get it, thank you for being super careful with all my distributed edits, but can you please just resolve on this one field and throw away the others, I really think it's okay. That would simplify a number of scenarios, giving people the opportunity to opt into that kind of server-side conflict resolution behavior, or heck, to even pursue all of the stuff that's happening with CRDTs and automatic merging and all that sort of thing, but that's another talk.
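The convention just described, pick a winner and delete the losing branches, can be driven entirely through the HTTP API. A sketch follows; the conflicts query parameter and the _conflicts field are the standard way to surface losing revisions, while the host, database and document names are illustrative.

```python
# Sketch of resolving a conflicted document: fetch the losing revisions with
# ?conflicts=true, then delete the losing branches so one live branch remains.
import requests

BASE, DB, AUTH = "http://localhost:5984", "demo", ("admin", "password")

doc = requests.get(f"{BASE}/{DB}/order-42",
                   params={"conflicts": "true"}, auth=AUTH).json()

for losing_rev in doc.get("_conflicts", []):
    # Marking the other branches as deleted leaves a single live branch,
    # which is what replication peers and indexers will then see.
    requests.delete(f"{BASE}/{DB}/order-42",
                    params={"rev": losing_rev}, auth=AUTH)

# Optionally write a merged winner back onto the surviving branch
# (the body still carries its current _rev, so this is a normal update).
doc.pop("_conflicts", None)
requests.put(f"{BASE}/{DB}/order-42", json=doc, auth=AUTH)
```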
I mean, do you find your users, and I guess this history thing is kind of easy to reason about, it's not like vector clocks, which are more complicated, but do people struggle with this concept of, I have these rich histories I have to manually deal with myself? Or do people not even know they need to do it? Ah, yes and yes. It's often the case that people don't know they need to do it, and that's the good, the bad and the ugly of the clustering side of things. But in the case where you're explicitly setting up active-active replication between two different instances, yes, people say, okay, I know I've got to plan for conflict resolution if there's a chance my application is writing on both sides. They'll oftentimes build a view to help them power through conflict resolution in an asynchronous background task, and apply all their logic there rather than sprinkle it throughout their application. But it's still not easy. I've always wanted to do a better job of introducing data structures that do the right commutative operations and merge themselves and all that sort of thing, but it's just never risen to the top of the list of things that get merged into the project itself.

So, kind of circling back to the replication piece, just to summarize: we've got this ability to incrementally query the changes feed, and we get this list of four, five and six as the documents that have been updated. If I drill into document four itself, I say, okay, replicating from DB1 to DB2, I recognize that this revision is not present on DB2, so I create an edit branch in the history on DB2. If I want to do this in an active-active fashion, I just run the process in reverse; it's basically two separate replications happening at that point. I move the revisions that only exist on DB2 back over to DB1, and now my revision trees converge. I just repeat that process for every document that shows up in either server's changes feed. So that, essentially, was CouchDB 1.0. All the features that people use to build applications, everything I've shown you so far, have been present for quite a long time, and that mental model of what CouchDB is was pretty well baked. Any questions on that shape of things? If not, cool.
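For reference, kicking off that kind of replication is a single call per direction against the _replicate endpoint; the hostnames and credentials below are illustrative.

```python
# Sketch of setting up the replication just summarized: one continuous job in
# each direction gives active-active behavior between two instances.
import requests

A = "http://admin:password@db1.example.com:5984"
B = "http://admin:password@db2.example.com:5984"

def replicate(via, source, target, db="demo"):
    # continuous=true keeps the job listening to the source's changes feed
    # instead of doing a one-shot catch-up.
    return requests.post(f"{via}/_replicate",
                         json={"source": f"{source}/{db}",
                               "target": f"{target}/{db}",
                               "continuous": True}).json()

replicate(A, A, B)   # push DB1 -> DB2
replicate(A, B, A)   # pull DB2 -> DB1: two independent replication jobs
```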
Now I want to jump into the 2.0 world. At my company we wanted to get to a place where we were able to manage these CouchDB-based cloud database instances in a highly available fashion. We wanted to support larger volumes of workloads, higher throughputs and so on than what we were able to do just by optimizing the single server, and we had the flexibility with this NoSQL system to be able to do that in a somewhat simple fashion. What we did is take this one CouchDB database, split it into shards, and replicate those shards across nodes. It's still one single CouchDB endpoint, but there are multiple servers underneath, and each of those is responsible for some number of replicas of some number of shards of a given database. To shard the databases we just used a consistent hashing scheme on the document ID itself when we first launched 2.0, so some particular shard would be the owner of an ID in the database. The other thing we did, once we said, hey, we've got this whole changes feed and hash histories and so on, was to use that machinery to converge the multiple replicas of an individual shard if something should happen to one of them over time.

So you submit an update, and each of the replicas of the shard that hosts that document independently chooses whether or not it's going to accept the update. This of course can be problematic, because in this situation the response is, yeah, we've durably stored your document on multiple replicas of the shard, so you should go ahead and assume that it's been committed, but it wasn't committed on that one copy. That copy is going to have to wake up at some point in the future, replicate from its peers, and learn about all the updates it missed in the interim. Where that gets particularly challenging is in the secondary indexing system. Now we've got multiple shards, and the way the secondary indexes work is just a scatter-gather thing: each of those shards independently builds, locally, all of the secondary indexes that the user defined in view functions and so on, and if you hit the API and ask for a view, or use this new find endpoint that we introduced, it's going to ask each of the shards: hey, what's your contribution to the secondary index? Because the secondary indexes aren't being reorganized anywhere else, it's got to do a scatter-gather every time. Even if you're asking for a view that's going to return one record, each shard's secondary index is going to contribute to that result. And there are no quorum operations on the secondary indexes, so it's entirely possible that for the update I submitted there, the one committed on two of the three replicas, when I go ask the secondary index, the third replica might be the one to respond. It hasn't yet replicated that change in, and so you get this kind of potentially wonky view of things. Can you talk publicly about the worst case, how far out of whack it can get? We basically had patch after patch after patch to minimize the chances of really wonky behavior. If a replica goes down for a period of time and wakes back up, it knows that it's not up to date with all of these other replicas, and so it will opt out of responding to most interactive requests. Like I said, it's one precaution and mitigation after another to give people a mostly consistent view of the world, rather than really solving the problem properly.
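An illustrative model of that clustering write path might look like the following, in plain Python rather than the actual Erlang: the document ID hashes into one of q shard ranges, each range lives on n nodes, and a write is acknowledged once w replicas accept it. The hash function and parameters here are illustrative, not CouchDB's exact scheme.

```python
# Toy model of sharded, quorum-acknowledged writes.
import zlib

Q, N, W = 8, 3, 2
NODES = [f"node{i}" for i in range(6)]

def shard_for(doc_id):
    # Consistent placement: the same doc ID always maps to the same shard range.
    return zlib.crc32(doc_id.encode()) % Q

def replicas_for(shard):
    # Place n copies of each shard range on distinct nodes.
    return [NODES[(shard + i) % len(NODES)] for i in range(N)]

def write(doc_id, body, is_up):
    shard = shard_for(doc_id)
    acks = [node for node in replicas_for(shard) if is_up(node)]
    # Acknowledged as soon as a quorum of replicas stored it; a lagging
    # replica has to catch up later via internal replication.
    return {"shard": shard, "acks": acks, "ok": len(acks) >= W}

print(write("post-1", {}, is_up=lambda n: n != "node2"))
```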
So that clustering design is essentially what we had from the 2.0 release up until the latest release, 3.1.1, and there are pathologies associated with it; we've started talking about some of them. Scaling is an issue: we've had systems where users are powering through tens of thousands of view requests a second, the bandwidth between the nodes is just going and going and going, and we're adding servers. We've run these clusters with hundreds of instances in them in some cases, just because we were trying to eke out a little bit of extra global query processing throughput. We already talked about the lagging index reads: because you don't do quorum reads on the individual secondary indexes, you run the risk of not seeing certain updates. The other one is the change capture feeds. If you think about it, the original 1.0 guarantee for the change capture feed was a pretty strong one: you'd check in incrementally on the change capture feed and get the list of all the documents that had been updated since the last time you checked in. In the clustering scenario, the sequence index is not necessarily identical amongst all the replicas of an individual shard. These updates can get applied out of order; there's no leader that would be ordering them. We do guarantee that you'll never miss an update, but that means you may see an update you'd already seen more than once. In the case of a healthy cluster we do a lot of bookkeeping to say, well, it was actually this particular replica of this particular shard that responded to this user last time; we encode all of that into their sequence, and they see a feed that doesn't replay updates too often. But then if that replica goes down, we have to replace it with another one, and now we've got to do some mental gymnastics to figure out, all right, what's the least number of updates we have to replay while still guaranteeing they won't miss anything? What can we do to pick the sequence on this other replica we're promoting that is the largest possible? Users that were building event-driven architectures didn't always plan for the idea that they might have to reprocess something they'd already seen, and to be fair, it's not a fun thing to have to do. The last one that really gets you, back to your question, is the possibility of edit conflicts. It's one thing when you've got two distinct instances on two corners of the globe replicating with one another. It's entirely another matter when this is us-east-1a and us-east-1b, and you just happened to concurrently edit the same document, and client one landed on replica one first while client two landed on replica two first. The system is not going to throw either edit away; it'll converge them, it'll surface both. But this means that designing for conflicts becomes an essential element of building an application any time you have concurrent writers on the same piece of data, and that's no fun.

So that led us to think: all right, what can we do going forward as a project to really solve this thing properly? We talked about whether we could introduce Raft-style consensus mechanisms among the individual replicas of individual shards, and we worked through the effort to try to do that. But we also started looking at, call them, inorganic solutions to this problem. We wanted to preserve the existing API. We wanted something that was super reliable. We wanted something that could scale up to our needs as a cloud service provider but didn't leave behind the community of people that are just running CouchDB on VPS instances to power their blog. And we wanted something that was an impedance match; I mean, you could always build this on Postgres if you wanted to, but somehow that didn't quite feel right. So we ended up looking closer and closer at FoundationDB. I know Marcus came in a previous iteration of these talks and gave you the rigmarole around all of the deterministic testing and simulation that they do, and it's really, really cool. I'm here to tell you that this combination of transactions and a key-value interface is also a pretty flexible solution for retrofitting into an existing database. It is a complicated project; I feel like we're in a situation where we've got this structure we're all proud of that's slipping into the ocean, and we've got to pick it up and put it on more solid footing. But the project, I think, is one that is turning out really well for us.
In particular, FoundationDB, if you're not familiar with it, provides strict serializability in the underlying key-value store. That's an awesome building block to be able to rely on, and it massively upgrades the semantics we're able to offer compared to our current eventually consistent clustering architecture. It eliminates those edit conflicts when we're talking about a CouchDB cluster in a single cloud region, for example. It lets us refocus replication on the kinds of use cases that I described, rather than having it be both the solution for availability within a region and for data distribution across regions and edge locations. It also lets us reorganize the way we do our secondary indexes, so that whole scatter-gather mechanism can go away: now I can transactionally update my secondary index and keep it organized efficiently in another portion of the FoundationDB key space. The key space is ordered, so I can build my views over there, and I don't have to have this parallelizable but super-tough-to-scale query side of things. And I get back to the case where the changes feed gives me a totally ordered thing: FoundationDB has a little baked-in feature that allows you to have the commit version of the database injected into your key, and by doing that you can have this ordered list of updates to the database just fall out the bottom. I guess that takes it a little bit beyond a basic key-value store, but having that ability for the server to inject the versionstamp gives us a really nice building block there.
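A rough sketch of that versionstamp idea, using the FoundationDB Python bindings, is below. The key layout is illustrative rather than CouchDB's actual FDB data model, and exact binding details can vary by API version; the point is that the document write and the ordered changes-feed entry land in the same transaction.

```python
# The document body goes under one key; a changes entry goes under a key whose
# suffix is filled in with the commit version by the server, so the
# by-sequence index stays totally ordered.
import fdb

fdb.api_version(630)
db = fdb.open()

@fdb.transactional
def write_doc(tr, doc_id, body):
    tr[fdb.tuple.pack(("by_id", doc_id))] = body
    # The Versionstamp placeholder is replaced with the transaction's commit
    # version at commit time. (A real implementation would also clear the
    # document's previous changes entry here.)
    seq_key = fdb.tuple.pack_with_versionstamp(
        ("changes", fdb.tuple.Versionstamp()))
    tr.set_versionstamped_key(seq_key, doc_id.encode())

@fdb.transactional
def changes(tr):
    rng = fdb.tuple.range(("changes",))
    return [kv.value for kv in tr.get_range(rng.start, rng.stop)]

write_doc(db, "post-1", b'{"title": "hello"}')
print(changes(db))
```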
I jumped through that really quickly in the interest of time, but as I said, I've given a separate talk on this topic and there are different slides linked there. The basics of it are that the combination of the transactions and the key-value interface is both flexible enough to satisfy this kind of brownfield scenario, where you've got an existing API and existing semantics that you want to preserve and improve, and also powerful enough to make the effort of slotting it in a project that's worth doing.

So now that you have all this extra code to handle divergent revisions and merging, does all that go away? It's not that it goes away, because the code still exists; there's just nothing to merge anymore in that setting, so it just never does it? Well, yes and no. I think it allows us to refocus that machinery on the use cases where you want a copy of the data in Europe and another copy in the US and you don't want to run a serializable transaction across those two environments. There are cases where you actually want disconnected, asynchronous updates, but I think you want to opt into that explicitly as an application developer and understand: my application is targeting an instance in this region, and it might receive updates coming in from this other region. You really scope in on the cases where you might have to handle an edit conflict, as opposed to those just falling out during the normal course of operations for one instance of your app deployed in one region. I mean, FoundationDB by default, everything is serializable; you're basically saying you can still choose lower guarantees, lower consistency levels in some cases, and then all the CouchDB machinery that you'd written before still applies in that world? That's right, that's right.

And in fact, I think this illustrates it a little bit: in this future world where CouchDB embeds FoundationDB under the hood and delegates all of the persistent state management to FoundationDB, you can then take two separate CouchDB instances in geographically distinct locations and run CouchDB replication over the top. Absolutely. And the nice part about that is that it scales to all kinds of crazy topologies. We've got people running CouchDB instances in every one of the cloud regions and replicating amongst all of them, so they've got a local cache of their data in every region. But you can also have the situation where FoundationDB as a cluster scales itself out underneath the layer. Our view right now is that most of those use cases involve FoundationDB running with relatively low-latency links between the different members of the cluster, but FDB also has the ability, if you want, to stretch into multi-region use cases. Typically those are designed so that reads are entirely local in one region, but you have serializable failover to another region in the event of a disaster. That may be a reasonable trade-off, and we've designed the system to be able to run in that model if you so prefer, so that if you had to do a failover, you'd never have to deal with eventual consistency, edit conflicts and all that sort of stuff. But you're constrained in terms of flexibility: you're not going to get active-active behavior in both regions in that case. It's really more of a failover system, in exchange for those upgraded semantics. The last thing that's fun about this, though, is that when you push all of the state management down into that transactional key-value store, we're now application developers on top. We're building essentially stateless systems. We're concerned with the generation of indexes and query mechanisms and all of that sort of stuff, but we're not dealing with the actual state. We can shoot any one of these instances in the head, we can use stateless load balancers on top, we can start to participate in all the fun that the app developers are having in terms of managing stateless clusters of instances, so long as we commit to doing everything that needs to be done to coordinate amongst these replicas through FoundationDB. That was one of our design principles going forward on that front.

I wanted to make sure we left a little bit of time for questions, so let me close it out. CouchDB has done a lot of kind of novel things; these NoSQL systems were just different from what the relational database management systems of the time were doing. I think some of that was motivated by the original founder of the project having worked on the internal databases inside Lotus Notes, taking some of the concepts from that world and looking to modernize them. I think the things that have stuck have been that focus on event-driven interfaces, an API we've since seen copied and exposed in a bunch of other DBMSs, and the flexibility of the replication capabilities, which I think opens up a simpler way of deploying state management across these different environments than what people would have to cobble together on their own if they were just picking up an ordinary database.
And so those continue to be the things that people pick up CouchDB for. On the other side, that combination of strictly serializable key-value transactions is something we've found to be a nice accelerant for solving, once and for all, the consistency issues that we have with our existing clustering system when running what's logically a single CouchDB database. And with that: if you like working on CouchDB, if you like working with FoundationDB, or if you just like developing systems that manage large fleets of vanilla database instances across cloud regions, my team at IBM Cloud is hiring and would love to be in touch. So, thanks.

Okay, awesome. I will clap on behalf of everyone else. We have a few minutes for questions. I have a bunch of them, but I'll open it to the floor first; unmute yourself and fire away. All right, you're all suckers, so I'll go. I have a couple. I have questions about the NoSQL market: you guys and Mongo started at roughly the same time, and both were widely successful, right? And there are a lot of people that tried to replicate what you guys have done, take RethinkDB for example, that were not successful. So, with you and Mongo being the ones that succeeded, is there anything you think you guys did right that helped that happen? And then, on the other hand, Mongo certainly has the mind share: if you want a document database, you go with Mongo. So I'm curious why that's the case, why Mongo beat Couch. I guess the first question is why did you guys succeed as a NoSQL project when others failed, and the second is why does Mongo beat you, and again, not on performance or anything like that, but on mind share.

Absolutely. So I would say that our success was that we backed into the cloud database-as-a-service business model, because we couldn't think of another way to make money. At the time a lot of people were thinking, who's going to trust you with their data? And it turns out lots of people were actually really excited to offload the need to be a systems administrator and a database administrator. So we just hit the right trend at the right time in terms of offering a database as a service, before RDS, before DynamoDB, before any of that, and we found a way to offer that value proposition and have it scale. For Mongo that was absolutely important too, and now you see all of these other vendors with huge market caps shifting into SaaS as fast as they can, because the market really values that kind of revenue model, for sure. So that was our success: being early in the database-as-a-service market, proving that out as a business model, proving out the low churn and the good recurring revenue. That annuity business just worked for us; we didn't need tens of thousands of customers in order to build a viable business and something that became an attractive target for an acquirer like IBM. What Mongo did well, I think, honestly: when I saw the size of their client libraries and developer experience team, it was much larger than the core database team for quite a period of time. Every new framework had a way to support MongoDB, and that was something that CouchDB neglected, to our detriment.
We said, well, it's a web service, API calls, you just interact with the database that way. Pragmatically we saved on staffing that way, our development costs were lower as a result, but it meant we weren't able to create really idiomatic experiences for developers and really improve their velocity. And at the end of the day, with NoSQL you could mimic the scalability of the system and reconstruct all sorts of other properties in other systems, but the speed of development, the iterative nature of saying, yeah, this database is not going to put a ton of constraints on me as an app developer, is the particularly attractive, differentiating characteristic of these document databases, and I think it's something Mongo prioritized more.

I think another thing is that Mongo promised a bit more than you guys as well, all the scaling stuff, right? And again, is that a good thing or a bad thing? I remember the days when Mongo was promising much more than it could deliver. That's not the case anymore, it's a much more solid system, but there was a point in time where you sort of wondered if they were going to build fast enough to keep up with the claims that were being made before too many bad blog posts landed. Yes. Yep.

Okay, cool. So, you started in 2008, 2011: if you had to start it all over again, would you still go with Erlang? If yes, why; if no, why not? Because you guys are the most famous Erlang database out there, right? Riak is dead; there are maybe a few others that are still active projects, but you guys are the main one. Yeah. Erlang did some really great things for us when we were the startup that was releasing a new version of the database every week, not always entirely battle-tested. The runtime saved us several times. People talk about the isolation model and so on, but the runtime debuggability of the system was also a thing: you could log in and interrogate the individual processes and start changing them if you really wanted to. You could do some crazy stuff in that environment; I would never recommend it for a mature system, but in the early days you could move pretty quickly. Hiring is an issue: there's not a massive pool of developers who are simultaneously good at Erlang and good at databases, so you have to train them on one or the other. I think the community now is doing interesting things. You've got the Elixir project, which is a much more expressive, easy-to-use language for building APIs on top but runs on the same underlying runtime, and you've got the WhatsApps of the world and a few other major production users who are driving some pretty good enhancements and innovations in there. Would I do it all over again? Maybe with a more opinionated view about what types of functionality go where: do the lower-level systems stuff in Rust or something like that, and use Erlang for the things that have to do with concurrency and connection management and all that sort of stuff. I wouldn't try to do the whole thing in Erlang. Yep. Nice.

Okay, what are the biggest engineering challenges you're facing now with porting CouchDB to FoundationDB? Great question.
Honestly, it's probably that we had been too lax in terms of introducing limits into the database. We weren't real prescriptive about how long a request could run or how large an individual record could be, and so now we've got this deployed estate of petabytes of data all over the place, and somewhere somebody has done something that runs afoul of the limits that we're trying to introduce. The horse is out of the barn, and migrating that estate is a challenge for us as we try to introduce not only a more powerful system, but one where FoundationDB is fairly restrictive about what it wants you to be able to do. Are you going about this in a principled way, or is it whack-a-mole? It is currently principled, but as the desire to move the estate gets more urgent, I can imagine it will become a little bit whack-a-mole. We're pretty good now about static analysis tools that look at your data model and say, yeah, this is compatible, we can upgrade this. The runtime behaviors can be a little bit different; it's pretty hard to know whether somebody's request pattern might fit into a single short-lived FoundationDB transaction or not. Got it, got it.

Right, so before we go, and before I ask my last question, does anybody else have a final one? Super awesome; I appreciate you being very candid with us. I guess my last question: you're high up in IBM, so you're in these discussions now, you're not a foot soldier in the trenches. Are there any concerns you can publicly talk about with building a major product on FoundationDB, which is open source but is controlled by another giant corporation? And Apple is known to be, you know, not shy about lawsuits. A major revenue source for you guys is built on a core technology that you don't have full control over. You could always fork it, I suppose, but are there issues there? That is a good question. We're no strangers to the value of open governance, and I think what you see in our open source efforts is oftentimes a concerted push on our part for exactly that, though I'm not committing to anything on that front. It's a topic we've discussed. At the moment we have a solid, positive working relationship across all the teams, but we also have to recognize, as you said, the strategic risk, and be prepared to carry it forward ourselves if we need to. That's really all I can say about it at this point, but yes, it's something that comes up. I don't think Apple lets non-Apple people merge commits to FoundationDB, right? At least that was the case the last time I checked. There have been steps in that direction, in part because for a commit to get into FoundationDB you've got this massive set of simulation runs that have to execute, and the framework for running those wasn't actually open; it is open now. Okay. So I think you're seeing the kind of practical steps that would need to happen in order for a more open mechanism for getting these things to the point where they get the green light. Yeah, okay, awesome.

All right, with that, again, Adam, thank you so much for being here; this was an awesome talk. Thank you. This was a good deep dive into CouchDB.
I've never actually looked at the source; it's Erlang, so I'm not going to read that. But I didn't really know a lot of the internals, so I really appreciate it; I knew about the copy-on-write B+tree, but that was about it. Super helpful. I'll also say you have the classiest background of any of the speakers we've had; it doesn't look like you're in a bus terminal, it looks good. So, okay, bye guys, and again, thank you, Adam, for being here.