My name is Chris Anderson. I'm one of the co-founders of Couchbase. I've been working on these same technical problems since about 2008, pushing forward the data structures and the code to build a mobile technology for NoSQL. I guess my question for the folks in the room is, who here is a mobile developer? And who here is a database technologist? Then I guess that means I'm in the right place. You're probably familiar with a lot of these industry trends. Three years ago, you could wow audiences by showing these kinds of numbers and trends. Now we all know them because we're living them. Everyone has a phone. Everyone has a smartphone. This set of graphs is from Asymco.com, which is a great place to look at industry trends for mobile. And I think what's most notable is that if you look at the pinstripe up the left-hand side of the orange graph, that is the date since which more than half of smartphones were purchased. The rate is so high that most smartphones are about six months old. Any time you have an exponential curve, you're going to get that sort of thing. That plays even more into the trends, of course. My latest tablet is a quad core, and I don't even look at how many gigabytes are on the drive when I go to purchase a tablet, because there's enough now. So the capacity and what you can do with these mobile devices is growing along the same curves as the audience. Mobile is about to beat TV in terms of audience engagement and audience size. So if TV is the gold standard of American media, it won't be for long. Mobile is poised to unseat Hollywood. And in this new reality, you've got to be ready for new kinds of growth of data and of audience. You hear a lot about big data. That's probably a buzzword at this conference. At Couchbase, at least, another buzzword that we like is big audience.
Because when you have millions of users already on, say, iOS, as Instagram did, and then they flipped the on switch for Android, they gained a million users overnight, just from people picking up Instagram for the first time on an Android device. So when you're faced with those kinds of growth curves, the old SQL technologies really weren't built for that. They were built for deploying to applications where you knew in advance how big the audience was going to be. Now all you know in advance is that the hockey stick is going to be exponential. So this is an oldie but a goodie: about a year and a half ago, Draw Something launched. And at the time, and maybe even today, it held the record for the fastest growing online anything ever. They have a mobile app that is storing the user interactions, the drawings they make, the invites and everything in Couchbase. And this is old media meets new media, because a lot of their growth was triggered by television. Some celebrity would mention it in a TV appearance, and next thing they know, they get another million active users. So this talk isn't really about how to scale and keep up with that kind of growth. This is the reality that we all live in today. This kind of scale is the new normal if you're building a consumer app. If you don't plan for millions of users, then you're not planning for success in consumer-driven mobile. So here's just a little technical diagram of what happens in a Couchbase Server cluster when you need to increase capacity. This diagram illustrates adding two nodes to a three-node cluster and clicking rebalance. The whole goal of Couchbase engineering is to make it so those kinds of operations are completely online and non-disruptive, so that your application continues to function and your users don't see any degradation in quality of service while you're doing things like provisioning for your new million users that just showed up overnight.
So I won't go into the details really. I'm sure Couchbase has more talks tomorrow if you want to see how all the mechanics of the rebalance operation work. There's plenty of places you can learn about that. But the gist of it is that we move the data to rebalance it across the cluster and do atomic handoffs of which node owns which piece of data. There's no eventual consistency happening here. A given piece of hardware is in charge of a given data item at any time, and so your application servers are directly accessing that hardware to talk about that data item. There's no proxy, or a write going to some old machine and then having to be cleaned up after the fact or merged with the new write. So, on the developer side of things, JSON is everywhere these days. Maybe five years ago, three years ago, you could have said there was a contest with other formats for web APIs. So here's just the inspector view of the Facebook API. Of course, it's JSON. Has anyone here written an app that interacts with a web API in the last year where the web API was not JSON? Was it fun? So I think that ship has sailed; JSON won that battle. And I guess one of the key takeaways from JSON having won is that we have a whole new kind of flexibility that we didn't even really have with XML. And I think it goes beyond being schemaless, because when we talk about schemaless in the context of databases, we're really talking about whether the database and application agree on the format, and certain databases are designed to reject stuff when it doesn't fit the format. But in a schemaless world, in a world that is consuming JSON from different services around the web or from the back-end database, your applications are typically not going to be picky about data they weren't looking for being there also. JSON is very friendly to new and unexpected fields being alongside the old and expected fields.
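That tolerance is easy to sketch. Here's a minimal Python example, with made-up field names and defaults just for illustration, of a reader that ignores fields it wasn't expecting and supplies defaults for fields that are missing:

```python
import json
import time

# Hypothetical document from an external JSON API (field names are invented).
tweet = json.loads('{"id": 123, "text": "hello", "lang": "en"}')

# Annotate it with our own internal bookkeeping fields, right alongside the
# fields the API gave us. No schema migration needed.
tweet["_ingested_at"] = int(time.time())
tweet["_source"] = "twitter"

def render(doc):
    """A tolerant reader: unknown fields are ignored, missing fields get
    defaults, so older and newer versions of the data can coexist."""
    return {
        "text": doc.get("text", ""),
        "lang": doc.get("lang", "und"),  # default when the field is absent
    }

print(render(tweet))                    # our annotated doc still renders
print(render({"text": "old format"}))   # an older doc without "lang" works too
```

The same `render` function keeps working whether the upstream API adds fields, our own code adds fields, or an old client wrote a document before a field existed.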
So very rarely do I look at some example code that says here's how you use the Facebook API, or here's how you use the Twitter API, where it's not a script. What you do in these examples is you download some Twitter data into your script and you annotate it with additional fields. So you may add your own internal timestamps or your own internal identifiers on top of the Twitter schema and then dump it into storage somewhere. And your application is not going to be impacted if Twitter adds a new option to their JSON feed. So it makes this resilient system where the schemaless properties aren't just at the data layer; they flow all the way out to the application. And the application can safely deal with older and newer versions of the data as the back end evolves, and as other requirements or other code modules may be adding fields that your code module doesn't know or care about. Even in XML, and certainly in a relational world, there are a lot more handcuffs on the developer when they're trying to go through those kinds of steps. So where I think this becomes a practical concern in the mobile world is: you've got an app and it's out there in millions of hands, and you want to change something on the back end. How practical is it for you to ask all your users to upgrade right now? I mean, you can't do a two-phase commit across the world to have users all run the latest and greatest version to deal with your latest and greatest data formats. So we don't do that. We just let the app be comfortable with this schemaless JSON data. I think that's, for mobile, the real benefit to having these relaxed types in schemaless data formats. So the other thing we know, at least from the web world, is that users don't want to wait. We know that so much so that when you go to Google's company philosophy page, they have ten points, and one of them is that fast is better than slow. And they've done a ton of research to back this up.
Especially on the web, there's no shortage of figures like this, estimating that if your site is one second slower, if you were to just put an artificial one-second delay into the kind of site that makes $100,000 a day, that one second is going to cost you $2.5 million over the course of a year in terms of decreased user engagement. And Amazon sees a 1% revenue increase for every 100 milliseconds they shave off of their response times. So we know fast matters on the web; we're only just now beginning to understand the impact of that in the mobile environment, where the device in your hand is that much more personal than the device on your desktop. If it's slow, you're not going to multitask over to something else and then come back to it later, because you don't have multitasking. If an app is slow, you're going to go to a different app, and chances are you'll never come back. So if someone bounces from your app for performance reasons, that's bad for business. And that means any investment you can make in performance is going to pay off if you have the kind of application you want people to use. So it's worth it to do things to keep it fast. This last graph here is one of the few pieces of research I was able to find that was really definitive about the impact of perceived performance on user retention. Google did some accidental research. They released a Gmail mobile website, and it had a slick new UI, and everything was really cool about it. So you see they got this bump in traffic when they released the new UI. Everyone was like, hey, let's check this out. But it was slow. So users didn't turn out to like it. They tried it, and then they went back to checking their Gmail on their desktop. But then they did the optimizations to that new UI. It didn't look different. It was just faster. And as they were able to trim latency from the user experience, they saw that the engagement went up.
And so you can see here they sort of hit a threshold where all of a sudden people started using it again. So tuning for latency is worth it. It may seem like an afterthought, and in this case it was: they released something slow and then made it fast. But if you build for latency and design for it from the ground up, then you can beat even their fast numbers. And how do you do that? That's kind of my personal obsession. So, local data is the fastest data. We're not blessed with completely, perfectly reliable mobile networks and backhaul carriers and everything. So as fast as your server can be on the back end, it kind of doesn't matter. I mean, it's always better to be faster, right? But it's not up to you that the user just drove into a tunnel or got on an overloaded local cell tower and their latency numbers spiked. And the user is not going to blame that on AT&T. They're just going to say, oh, this app is broken. And they're going to go use somebody else's app. And by then they're out of the tunnel, and your app was the slow one. So you can get some insurance against that kind of problem by not using the network for every single click. And you can do that. It's kind of the best practice these days: as a developer, you're going to jump through as many hoops as you have to in your code to make it so that when the user clicks something, they get that feedback right away. One of my personal hunches about why Draw Something saw that great growth curve was really almost not even technical, but in terms of how it felt to use it. I think what they had done is that on touch start, instead of on touch end, the moment your finger touches the glass, they played audible feedback. So it felt really snappy, even if there was something happening on the back end that took a while.
So with this hypothesis that local data is fastest and that developers will jump through all kinds of crazy hoops to get the benefits of local data, my team is building technology to make it so that local data is easy, so that you're not having to jump through hoops, so that you get the benefits of local data and also the benefits of being lazy, because as developers, we all know that laziness is sort of the source of all innovation. You get all your user interactions on the device. The developer rarely has to deal with reachability: is this resource available, is the network slow, was there a timeout? That should be an infrastructure problem, not a user interface problem. And then you get increased reliability. Instead of sending a write off that looked like maybe it was acknowledged, but something happened far away where a server went down or whatever happened on the back end, you do all the writes locally, and it replays them to the cloud when it can, opportunistically. And then the data comes back down when it can, too. So with that as the baseline, let's just work with this local data. And I like to say that your phone, people sometimes call these always-on devices, but I don't really believe it. I say they're occasionally connected. Every once in a while, your device is lucky enough to have a live connection to the cloud. And so if most of your interactions assume that you're not necessarily connected, then inevitably you're going to have scenarios where two people are collaborating on the same data item, and they've both modified it at the same time, without knowing about each other, and they both think their update succeeded. And that is kind of the hard problem for sync. If you go talk to the Dropbox engineers or the iCloud engineers about what they scratch their heads over, it's this conflict management: what should we do when there is a conflict, when two people have been messing with the same data at the same time?
So there are a lot of different approaches you can take to conflict management. Most of the simple ones, or most of the ones that I've looked at, basically boil down to trying to make an intelligent guess as to which of the two conflicting versions should be chosen, and then picking one. I don't believe that should happen in the data layer. That's not the database's job. A database's job is to never throw away data, especially by surprise. So I'll talk in a minute here about the data structures, and this is the part where we get to be really geeky. I hope you all like graphs of trees and stuff. We'll talk about the data structures that we use for conflict detection and management. So the technology we're building is called Couchbase Lite. It's an embedded database. Our aim is to build the best database for mobile devices. I think this is an achievable goal, if only because there's not a whole lot of competition out there. But also, we take it really seriously. So it's a NoSQL database with JSON documents and binary attachments, so you can do social media. And it's a full-featured database: you can query it to build your user interface. We've been working on this for a long time. Like I said, I've been working on it since about 2008, and with a focus on mobile since about 2010. And so we've had some prototype versions out there, open source and available, that people have been using. The biggest feedback we got from our early versions is they weren't lightweight enough. I think the first thing we ever got running on an Android phone was like 10 megabytes, maybe 15, to download, and that's just not going to cut it. So we've got the whole thing down to less than a megabyte. And the other thing about mobile-first, you'll see this a lot, not just in the context of databases: it has to be native, because every ounce of performance matters. Even if you've got four cores, you're burning battery you don't have to if you're running some kind of virtual machine.
So we've written a native database for iOS and Android. And in order to cut down on the size, well, we're a database company. We've got more storage engines than we know what to do with. We could plug our own storage engine in there just fine. But SQLite's already on there. The operating system vendors ship it. So we use SQLite as a storage engine, because it makes our download, in addition to your application, just that much smaller. And the goal here is, when I talk to mobile developers, the thing they want to do is make an awesome UI and an awesome user experience and workflows, and just concentrate on that thing in your hand. So when you're an application developer, you don't want to think about the cloud. I think that's a big part of the appeal of all these backend-as-a-service companies that have been doing so well lately. And they all do a pretty decent job of making sure that you don't have to deal with the cloud. But you still end up writing essentially the same stack as you would with the traditional three-tier architecture. You've got a database. You've got some business logic controller, some application server thing. And then you've got the mobile application that's connecting via a web API. And that three-tier architecture, even with some of those backend-as-a-service providers that try to hide the database from you, where you're just really thinking in terms of web APIs, even those force you into this request-response paradigm, where maybe you can do a little bit of work on a local object, but at the end of the day, the developer still cares whether or not that local object has done what it needs to via the cloud, rather than letting it be truly transparent. So we spent a lot of time with those prototypes. Couchbase Lite is the third generation of technology we've built around this.
And the first two we just released as open source, essentially experimental projects, to see what the community reaction was. And we did a lot of user research and interviewing with people who had shipped things. There's plenty of stuff in the App Store using that technology. What sucked about it, what was good about it. And it turns out that despite the 10-megabyte download of the ancient first version of our stuff, people still loved that experience of developing against a NoSQL database on the device. But the thing that was lacking is the back-end server was still too heavyweight. It was not simple enough to compete with the other backend-as-a-service vendors. And we're not a cloud company. So we still have that barrier where you've got to at least install the thing on a server somewhere, and that's kind of hard. So we had to make up for it by making the back-end APIs just so simple that your overall complexity tax is lower if you use our stack than it would be even if you used a backend-as-a-service provider. So we came up with this Sync Gateway. And I won't dwell on the API details too much. It's all on GitHub, so go play with it and take a look at the APIs and let me know how you feel about them. But my goal was to make it so that rather than having that three-tier architecture, or even the kind of compressed backend-as-a-service architecture where you're still thinking in terms of request-response, we completely do away with one of the responsibilities of that application tier. Typically, in an application tier, you're going to be modifying the format of the data. It comes in from the client. You're going to add some fields that only the cloud knows about and then save it. And then you're going to load it from the database later and censor those fields before you push them over the wire to the mobile device.
But if you throw out that and just say these data items are these data items, and you can either see that data item, the JSON document, or you can't, then the back-end logic becomes vastly simpler. It's just a data routing problem. Who's allowed to see this data item? That's essentially the only question you have to ask yourself. So we came up with a really simple access control model that allows you to build complex applications. I'd say I'm not targeting the 99% case with this simple API, but you should be able to do 80%: you should be able to build an Evernote kind of application with 100 lines of JavaScript for your whole back end, and not have to have a bunch of complexity in different moving parts on the back end. So that's what the Sync Gateway is about. It uses Couchbase Server for storage, so you get the same scalability that we saw with Draw Something, who are doing the traditional three-tier architecture, but with Couchbase Server for storage. And then any kind of custom logic you want, if you want to go beyond that 80%, is all handled over an asynchronous background connection. That's why I've got that green app server up in the corner. It's listening to events coming off the Sync Gateway and processing them, maybe sending a push notification, or doing some more custom security lookups in an LDAP store and modifying some user permissions, that kind of stuff. So the hope is that this simplifies the development enough that mobile developers who just want to deal with the stuff on the phone, the stuff in your hand, can write that app without really thinking about the cloud. Like, I'm putting together a demo app here. It's just a little to-do manager where I can have a list, and have the grocery list I share with my wife, and have the list of open bugs on our technology I share with the team, or whatever.
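To make the "data routing" idea concrete: Sync Gateway's real sync functions are written in JavaScript, but the routing decision itself is tiny. Here's a Python sketch of that decision, with hypothetical document fields and channel names, answering the one question that matters: who is allowed to see this document?

```python
# Hypothetical document shapes for a to-do app: "list" docs have an owner and
# a sharing list; "item" docs belong to a list. This is a sketch of the
# routing idea, not Sync Gateway's actual API.
def route(doc):
    """Map a document to the set of channels (audiences) that may see it."""
    if doc.get("type") == "list":
        # A shared list is visible to its owner and everyone it's shared with.
        return {doc["owner"], *doc.get("shared_with", [])}
    if doc.get("type") == "item":
        # Items inherit visibility from their list via a channel name.
        return {"list-" + doc["list_id"]}
    return set()  # unrecognized documents are routed nowhere

grocery = {"type": "list", "owner": "chris", "shared_with": ["amy"]}
print(route(grocery))                         # {'chris', 'amy'} (some order)
print(route({"type": "item", "list_id": "7"}))  # {'list-7'}
```

Everything else about the back end, storage, sync, and conflict tracking, flows out of this one function.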
And those kinds of essentially CRUD apps, which is what most of these three-tier apps that support a mobile device turn out to be, the goal is to have those be supported such that you almost forget there's a back end. You just write this one data routing function, and everything flows out of it. So I wasn't planning on it, but I got a special request from the conference organizers that I should do the Couchbase rap. So there's a few copies. I'm working on a beat, but I don't have it here. One of these days, when I'm not busy writing software, I might actually put out a recording of it. So the genesis of this is, A, Couchbase is an acronym. That's half of the story. The other half of the story is that when you land in Frankfurt and you have no idea what time it is, and you get on stage in your super jet lag, you have to do something to wake yourself up. So yeah, look for the acronym. Cluster of unstoppable commodity hardware, basically available, enterprise, low latency thanks to memcached, watch the disk write queue, Couchbase. Keep it clean, index your data with JavaScript, it's a couple lines of code, pick the keys you emit. Port 8091, 11211, for Couchbase data is a serious obsession. Simple, fast, elastic, sized to your workload, we are the reliable bits. So I hope I have enough time to talk about the geeky data structures. This whole section of the talk is essentially meant to answer the questions that come up, because I hear from a lot of people: oh, sync isn't that hard, we're writing our own sync engine, we just have a couple more bugs to fix, and it'll be great, no problem. Or: who really needs sync anyway? But I think the big one is if you think you can write your own sync. It's not that hard, but I've been looking at the problem for a long time, and there's kind of one way to do it, and it involves not compromising, because you can't throw data away.
So if we're going to not throw data away, what kind of data structures do we need? And there are a few other requirements that we'll talk about along the way. So the synchronization problem: there are kind of two halves to it, and we'll take them independently. The first half is, I've got a million documents here, and I want to make sure that any time one of them changes on this end, it changes on that end. This is going to be the same kind of problem that you see with a clustered NoSQL database doing cross-data-center replication. So we actually use the same approach for Couchbase Server across data centers as we do between Couchbase Server and a mobile device for the collection sync. And then there's the actual tracking of the state of a given document. That's the part where conflicts come up, because it's only when two users touch the same document that you could have a conflict. If two users put different documents into the same collection on different data centers, that's a trivial merge. That's not a conflict. But if two people both update the same contact record for a customer, you want to detect that and not throw any data away. So we're going to look under the hood here, and you'll see how we manage conflict detection. And if you want to write your own synchronization engine, and you try to take any shortcuts and not build it this way, you've voided the warranty. So let's look at the data structures. I want to compare three data structures for collection sync, or rather, these are approaches supported by data structures. So there's the brute force approach: like, right before you ran rsync, you deleted the target directory and just recopied everything every time. Obviously, that's going to be reliable, but it's not efficient. There are also Merkle trees, which various databases have used for a long time to make sure that two collections of data have the same items.
And then the one that we use for our cross-data-center replication, and also for replication from the cloud to the mobile device, is a sparse sequence. Merkle trees and sparse sequences are at about the same level of efficiency for certain use cases, but there's an additional requirement that I'll talk about that makes sparse sequences the preferred data structure for what we do. So the initial sync is always brute force. You've got to copy everything if the target is empty. So we'll copy everything. It takes a few copy operations. Maybe you could batch it; I'm not showing the potential batching in these animations. So now we have the same thing on both sides. And let's go ahead and do some mutations on A. Now we've got 2 prime, 3 prime, and 5 prime. We modified those documents, and we want to make sure that B is the same. For the brute force approach, we're going to copy 1 over. And we're going to copy 2 over, and that actually made a change. We're going to copy 3 over. We're going to copy 4 over for no good reason. And then we're going to copy 5 over again. Obviously, we can do better than that. We like algorithms, and the point of algorithms is to do the same work in fewer steps. So let's look at the Merkle tree approach. This one is clever and interesting. I'd be a fan of it, except it doesn't meet one of the requirements that we'll have. With a Merkle tree, what you do is logically divide everything up into, say, a hash ring or something, and then you compare segments of the hash space and make sure they're the same. And any segments that differ, you have to copy. The way that ends up being implemented in database engines is typically in a B-tree.
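A minimal sketch of the Merkle comparison, assuming the items are already grouped into leaf buckets (this is an illustration of the rolled-up hashing idea, not Couchbase's actual implementation): hash each bucket, hash the hashes up to a root, and copy only the buckets whose hashes differ.

```python
import hashlib

def leaf_hash(items):
    """Hash one leaf bucket of items (sorted so order doesn't matter)."""
    h = hashlib.sha1()
    for item in sorted(items):
        h.update(item.encode())
    return h.hexdigest()

def root_hash(buckets):
    """Roll the leaf hashes up into a single root hash."""
    h = hashlib.sha1()
    for bucket in buckets:
        h.update(leaf_hash(bucket).encode())
    return h.hexdigest()

# Collection A modified documents 2 and 5 ("2p", "5p"); B still has the originals.
a = [["1", "2p"], ["3", "4"], ["5p"]]
b = [["1", "2"],  ["3", "4"], ["5"]]

print(root_hash(a) == root_hash(b))   # False: they're out of sync
# Walk down and find just the buckets whose hashes differ:
changed = [i for i, (x, y) in enumerate(zip(a, b))
           if leaf_hash(x) != leaf_hash(y)]
print(changed)                        # [0, 2]: only two buckets need copying
```

When the roots match you're done; when they don't, you only descend into the subtrees that differ, which is where the efficiency comes from.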
So if you've got a B-tree with a bunch of leaf nodes of data items, and another leaf node over here with more data items, and another one, and then you've got inner nodes built up on top of that, each leaf node takes a hash, like a SHA-1 hash, of its leaf items. So it has basically a value such that if that value is different, you know something in there has changed. And you roll that up as you go up the tree. So if you've got an inner node that points to 10 leaf nodes, you just hash those hashes, and now that inner node has a hash. And you do that all the way up to the root. So you can really cheaply, with a Merkle tree, decide that you're done syncing, because the root hash is the same. And that'll only ever happen if the hashes all the way down, recursively, are the same. So it's pretty easy to reason about, and it's fairly efficient. In this case, we're simplifying: let's say we've got four leaf nodes here. And you can see that we're only going to have to copy half of them. So we touched three-fifths of the objects, but we only have to copy half of the nodes. This works out well if you have your replication happening fast enough that the set of items you're copying in any given batch is a really small proportion. But as soon as you have updates evenly and randomly distributed throughout the data set, with the number of updates roughly equal to the number of leaf nodes, you lose the efficiency gains, and it turns into a brute force copy. So, sparse sequences. They're not as elegant mathematically. You don't get to run a hash function or anything. Essentially, you just keep a data structure around that's like an ordered list of everything that happened, except any time something happens, you also remove the old record for that item. So when we first did that insert to fill up A, the sequence was 1, 2, 3, 4, 5. But now we touched some of those documents again.
And so 1 and 4 would be the first things you'd hit when you follow that sequence, and then you'd find the things that have been modified since then: 2, 3, and 5. And I guess the thing that's important here is what would be worse than brute force: if the sequence were 1, 2, 3, 4, 5, 2, 3, 5, with those repeated in there, you'd have to copy some things twice. That's not fun. So instead, what we do is we always push the latest update to an item to the bottom of the stack, essentially. Any time one of these items is modified, it loses its old position and takes up a new position at the bottom. So the number of live sequences in the database is always going to be equal to the number of objects in the database. But because that order is monotonically increasing, you can always pick up from where you left off. So you've got 1, 2, 3, 4, 5 copied on the brute force copy, and then we're going to copy sequence 6. I'm starting to realize that if I'd named these documents with letters, then I would not be saying that document number 2 has sequence number 6. But document 2 has sequence number 6. And the seventh item in the sequence is 3 prime, and the eighth item in the sequence is 5 prime. So you did three logical copy operations to copy only the changed documents. It doesn't have quite the elegance of the Merkle tree. But the requirement it adds support for, which a Merkle tree can't do, and which is really the deciding factor here, is the ability to synchronize subsets. If you've got a social game or some application with a big audience, you've got millions of users, and you've got a bucket in the sky with all their data, I just want to sync my data to me. And you just want your data. We might have some data items that are shared between a handful of users. But for the most part, users don't want everything.
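The walkthrough above can be sketched in a few lines of Python. This is a hedged illustration of the sparse-sequence idea, not Couchbase's actual implementation: every write takes the next sequence number and drops the document's old entry, so live entries always equal live documents, and a replica can resume from the last sequence it saw.

```python
class SparseSequence:
    """Sketch of a sparse sequence: an ever-increasing change log where a
    document's old position is removed whenever it's written again."""

    def __init__(self):
        self.next_seq = 1
        self.by_doc = {}  # doc id -> its current (latest) sequence number

    def touch(self, doc_id):
        self.by_doc[doc_id] = self.next_seq  # old position is dropped
        self.next_seq += 1

    def changes_since(self, since):
        """Documents changed after sequence `since`, in sequence order."""
        return sorted((s, d) for d, s in self.by_doc.items() if s > since)

db = SparseSequence()
for doc in ["1", "2", "3", "4", "5"]:
    db.touch(doc)                              # sequences 1 through 5
db.touch("2"); db.touch("3"); db.touch("5")    # sequences 6, 7, 8

# A replica that already copied up to sequence 5 needs only three copies:
print(db.changes_since(5))   # [(6, '2'), (7, '3'), (8, '5')]
```

The `changes_since` call is also what makes subsets work: a client can filter the changes it asks for and still track its own checkpoint, which is exactly what a Merkle root hash can't give you.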
You probably don't want to give users everything for security reasons. And even if you did want to give users everything, you can't, because it won't fit. So you have to be able to support subsets. That's a really key requirement. And if you were to try to do subsets with a Merkle tree, you would end up in a position where that root hash is never the same. So you'd have to have a bunch of different logical Merkle trees, all overlapping. With sparse sequences, we can sync some items to some devices and other items to other devices, and at the end of the day, still have the system quiesce. So for collection sync, we have these three approaches we looked at, and for the mobile use case, we decided that we like the sparse sequences. That's the winner there. If you didn't have some of those requirements, maybe you'd pick another one. If you just did brute force, sure, essentially the sparse sequence is an optimization. But it's an optimization that you need to make if you want to avoid recopying the entire data set every time, and because Merkle trees won't quiesce. So now we can talk about the item level: logically tracking and understanding who touched what, and when a document is in a conflict state because you have updates that are not coherent across the system. So, vector clocks. They're, like, too complicated for me. I've been doing this for a long time, and every once in a while I go back and read what I can find about vector clocks, and I don't quite understand how to reason about an application using them. The thing I can tell you is that in this picture, we're looking at it from the perspective of the dot right in the middle that has red coming out of one side and blue out of the other side. And anything that is in the white space is a conflict. It was made without having a causal relationship to the update that's in the middle.
And so vector clocks can allow you to detect when things have gotten out of sync, where you're going to have to have something smarter than your database do something about it. So it satisfies the requirement that we can detect conflicts, and it satisfies the requirement that it's not throwing away any data. But it has a scaling property that isn't quite what you want when you're dealing with millions of users distributed around the world: the logical clock stored on each document scales with the number of processes that have touched the document. So if you've got a document that's really popular with lots of users, its vector clock is going to be like a hundred entries wide. This works great if you know how many clients you have. And so there are database systems that use vector clocks that fix the number of clients: essentially, rather than each end-user process having its own client ID, an application server or database proxy that you hit holds the client ID. If you do it that way, you can bound the scale a little bit. But it's not suitable for mobile applications. So we never want to throw data away. That's our rule that we're not allowed to break. And this is how we do it in Couchbase Lite. Maybe there's a more clever way; if so, please let me know so that we can use that instead. The revision trees are essentially this: on each copy of the document, you have a little tuple with a generation counter and a content hash. So logically, it's that thing I've got at the top, a counter and a content hash. And at the bottom, that's what it looks like on the wire: the counter as a number, then a dash, and then a SHA-1 or MD5 or whatever of the content. And the way I've chosen to represent them in the slides we're about to step through is to imagine that those hashes are stuck on a color wheel.
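The "counter, dash, content hash" wire format can be sketched directly. The exact digest inputs Couchbase Lite uses differ (it hashes more than just the JSON body), so treat this as a sketch of the shape, not the real algorithm; MD5 is used here just because it was mentioned as one of the options.

```python
# Sketch of a revision ID in the generation-dash-hash form described above.
import hashlib
import json

def make_rev_id(generation, body):
    # Canonicalize the body so the same content always hashes the same way.
    canonical = json.dumps(body, sort_keys=True).encode("utf-8")
    digest = hashlib.md5(canonical).hexdigest()
    return f"{generation}-{digest}"

rev = make_rev_id(3, {"name": "widget", "count": 2})
print(rev)  # "3-" followed by 32 hex characters
```

The key property is that two devices that independently make the identical edit to the identical parent produce the identical revision ID, with no coordination; in the color-wheel slides, that's two circles with the same color and the same generation.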
So green is a different hash value from blue, which is a different hash value from red. When you see two things that are the same color, that means they have the same content. If they have the same color and the same generation, then it's the exact same document in more than one place. So let's play through a conflict scenario. First of all, I should say that in real applications, conflicts are rare most of the time. You can make an application that's going to be conflict-heavy by having coarse-grained documents, a small number of documents under high contention with lots of users; you'll end up with conflicts. But a lot of the time, developers will tend toward making smaller, more fine-grained documents that aren't typically going to be contended over across multiple users. But still, even so, you don't want to throw data away when a conflict occurs. So to set the stage, we had a document that was three generations old that, let's say, originated on the iPhone and got synchronized over to the Android device. So now both devices agree. Maybe the Android device has never done any edits, or maybe the edit was ping-ponged, but there was no conflict. So we've got the same history happening. The yellow circles mean that if you were to do a naive fetch, as a client that doesn't know or care about conflicts, you just get 3 in this case. You just read that, and when you do a write, it'll be based on that. So both of the clients go ahead and create a fourth generation. And they disagree about that fourth generation. So we have a purple one and a green one. And now let's synchronize. Let's just do a one-way push from the iPhone to the Android device. So now we have the green one and the purple one. And remember, I said the hashes are mapped to a color wheel.
Typically, what we do when we have two conflicting head revisions with the same generation is pick the one, arbitrarily, whose hash value sorts lexicographically later. But let's just say that lighter colors win. So the green is lighter than the purple, and that means that the end user, or the application developer, doing a naive fetch of this document is going to get the yellow version, the yellow-circled green number four. And the nice thing about the lighter-color hash winning is that both devices can agree about that without talking to each other. You don't have to do a two-phase commit to decide that green is lighter than purple on both devices. So when you synchronize back the other way, you now have two devices that both agree about the current state of the document. Now let's bring in a third device out there that, before we started the story, had actually differed about the third generation. They decided that three was going to be red instead of black, and they've built a four on top of that that's blue. So let's share that data with the Android device. Now the Android device hasn't thrown anything away; it's just accumulating. It's got three different versions of that fourth-generation document, and two different versions of the third-generation document. So how can we get out of this mess? Well, in this case, everyone who knows about the blue four agrees: blue beats green according to the scheme we described, because it's a lighter color. So a naive read is going to see four. One of the nice things about this is that a naive database client is going to read their own writes. So you're going to read off that green four, and you're going to stick a five down there. And when you go back to get the five, the five will have the yellow circle. So in this case, this iPhone made a fifth generation. And I think in this scenario, they've also deleted.
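That coordination-free winner rule is easy to make concrete: highest generation wins, and among leaves of the same generation, the lexicographically greatest revision ID wins. Every replica that has the same set of leaves computes the same winner with no two-phase commit. The function names here are illustrative.

```python
# Deterministic conflict winner: deepest generation first, then the
# lexicographically later rev ID as the tiebreaker ("lighter color wins"
# in the slides). Pure function of the leaf set, so all replicas agree.

def generation(rev_id):
    return int(rev_id.split("-", 1)[0])

def pick_winner(leaf_revs):
    # The sort key is (generation, full rev id string); max() of that
    # pair is the same on every device, with no coordination needed.
    return max(leaf_revs, key=lambda r: (generation(r), r))

leaves = ["4-aaa111", "4-ccc333", "4-bbb222"]  # three conflicting fours
print(pick_winner(leaves))                     # '4-ccc333'
```

Note that it's arbitrary, not "latest wins": the hash says nothing about which edit happened last. That's fine, because the losing revisions are still in the tree for real conflict resolution later.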
See how the purple four turned white? It turned into a ghost four. So they've put a deletion on that other leaf node. And so according to the iPhone, that document is no longer in a conflict state; the new five is the undisputed winner. If they hadn't marked the purple four as deleted, then we would have the purple four and that five. The five would still win because it's a later generation, but we wouldn't have reclaimed the space. So what happens when we share that information again and propagate it back to the Android device? What we get is that the Android device now knows that the purple four is no longer relevant; it's got a delete on it. And a naive client there is going to read the five. And if they do it right, it'll be a six on top of the five, rather than going back into history and messing with that four. So what you can see is that the iPhone was able to supersede this Ubuntu phone here just by persistently working on the same document. Basically, whoever does the most updates wins, at least for the naive client. But we're still not throwing any data away. It's not stomping on anything; it's just deferring that until you get some conflict resolution. So after we fully replicate, one neat thing I'll show here is that the ghost four didn't even have to sync its content. So now the Ubuntu phone has the cleanest representation of it all. And we can do cleanup of old generations: we just stem the old revision history. So eventually that's how you can trim and avoid taking up too much space. So to wrap it up, we chose revision trees over vector clocks here because they more accurately model what happens in the data structures you care about. So yeah, I think we're out of time. But hit me on Twitter, I'm @jchris, or find me afterward. I've got a stack of business cards in my pocket. Thanks for listening.
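The tombstone step above can be sketched the same way: a deletion marker on the losing leaf takes it out of contention, so the document stops being in conflict without any history being stomped on. This is an illustrative model of the mechanism, not Couchbase Lite's API.

```python
# Sketch of conflict resolution by tombstone: a document is "in conflict"
# while it has more than one live (non-deleted) leaf revision. Deleting
# the losing leaf leaves one undisputed winner, without discarding history.

def generation(rev_id):
    return int(rev_id.split("-", 1)[0])

def live_leaves(leaves, tombstoned):
    return [r for r in leaves if r not in tombstoned]

def current_rev(leaves, tombstoned):
    # Same deterministic rule as before, applied to live leaves only.
    return max(live_leaves(leaves, tombstoned),
               key=lambda r: (generation(r), r))

leaves = ["4-purple", "5-winner"]   # two branches after replication
tombstoned = set()
print(len(live_leaves(leaves, tombstoned)))  # 2: still in conflict

tombstoned.add("4-purple")          # the iPhone ghosts the losing four
print(current_rev(leaves, tombstoned))       # '5-winner', undisputed
```

When this state replicates, the other device only needs the tombstone itself, not the dead branch's content, which is why the ghost four was cheap to sync.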