My name's Ben Engber, I'm with a company called Thumbtack Technology, and I'm here to talk about how you make some of the decisions in trying to determine which NoSQL database to use for an operational workload. This is one of those topics that seems relatively straightforward in theory, and then the more you look into it, the more complex it starts to be, and the more questions start to come up. And you end up realizing that at the start you were asking exactly the wrong questions. So I'm hoping with this talk to give you a framework for evaluating these things, show you some of our experience and some of the things we've discovered, and hopefully save you a lot of time in understanding how to think about these kinds of systems. Very quickly, just about us: we're a consulting company with a focus on scalability. We've been doing NoSQL-like solutions, as has anyone who's focused on scalability, for quite some time. And as the market has matured and these technologies have gone mainstream, we've jumped on this bandwagon with all our hearts. As such, for the past year, we've had a team, or a set of teams, working nonstop researching how these systems perform in various circumstances, failover and so forth, as well as a whole bunch of benchmarking studies. I'm not going to go into the benchmarks today because that's another whole complicated thing, but we can talk a bit about that at the end. What I want to talk about now is something that's coming up from our clients more and more, which is that they want to use NoSQL. They have problems with big data that they want to solve, and they want to understand how best to solve them.
So what I have on this slide right here is a list of features taken from a couple of vendor websites, from 10gen and DataStax, and this list of features is completely fair. But if you're looking at websites to try and understand which NoSQL database to use, it becomes pretty hard to make a decision because, as you can see, these features tend to focus on things like usability and then sort of generic scalability. Scalability in particular can mean a lot of different things, and it's very hard to know what trade-offs each individual database is making to achieve different levels of scale, and they're all making trade-offs. So it really helps to take a step back and ask, first of all, why am I interested in NoSQL? When we go to clients and ask them why they're interested in this, far and away the most common answer we get is, "because I've heard of it." And it's really not such a stupid reason, right? They've heard of it because there's a lot of buzz around it, and there's a lot of buzz around it because there are real problems that are starting to be solved. But it is a pretty stupid reason to just throw something into production. The next thing that comes along is rapid application development. This is something I'm a big believer in. I'm a believer in agile development, and I'm all for tools that speed up your development process. You've got to be a little careful about this when you're talking about your data tier, because this is data you're going to have to live with and understand, and all the benefits of schema-less storage are going to come back and bite you when you have to interpret it. It's also a little bit hard to quantify. So I think the most interesting reason is when people say, well, I want to do something with big data. What does big data mean? To the extent it means anything concrete, I think you can divide it into two categories. One is your analytic workload: things like, we've accumulated terabytes or petabytes of data.
We want to find patterns in this data, we want to match things together, and we want tools that enable us to do this. This tends to focus on batch jobs, things like Hadoop. The other side of this is a completely different kind of workload: our production servers. We want them to handle more and more types of stuff, capture more and more data, and see what we can do with all this extra data. These tend to be much simpler transactions, but a huge number of them. So I'm going to focus on the operational workload today and talk about how to evaluate which database will fit your workload. Traditionally, people would divide these databases into three major categories: key-value stores, document databases, and column-family databases. And I think this is really not a terribly interesting way to divide it, mostly because the lines between these are blurring. This is a very rapidly evolving space. For example, probably the two leaders among high-performance key-value stores have, within less than the past year, added secondary index capability. What does it mean to take a value and put an index on it? How is it different from taking a document and putting an index on it? Maybe there's a little bit of difference there, but it would be in the margins, like the types of queries you're planning to run. But the main thing is that these systems are all converging towards one kind of database backed by key-value storage. And so once we've made the decision that we want to handle more operational load than a traditional relational database can give us, we're really asking these kinds of questions: How do I handle all these transactions without going crazy? Am I sure that when a node comes down, this cluster will continue to function, and will this happen automatically as advertised? And all these kinds of questions focus around the key-value part of the database.
All these databases, column store or document, have at their core the ability to take keys and rows, put them in various places, and perform this plumbing for you. And the question we want to focus on for our operational load is whether the guarantees of the various databases meet the requirements our applications have when we want to scale them. The way a lot of these companies started talking about this early on was in terms of the CAP theorem. There have actually been a number of presentations here so far about why that's not the best way to think about it, so I'm not going to spend a lot of time on that. But it boils down to the fact that trading consistency against availability is a real thing; it just often doesn't mean what you think it does, because a lot of these terms were defined in very academic ways. Also, your application probably lives in all kinds of places on this spectrum. Even if you're doing purely real-time stuff, I bet somewhere there's a cache you're running, and you're doing something inconsistently. And the databases themselves live in multiple places here. So I think a much more interesting way to think about this is not in terms of consistency versus availability, but in terms of speed and reliability. And by reliability, I mean something that depends on what your application thinks reliability means. It might mean something like immediate or eventual consistency. It might mean something related to transaction isolation or how we resolve conflicts. Or it might be much more focused on the durability of the data and how much data you can afford to lose in the event of a catastrophe. Each of these databases allows you to trade off different dimensions here and live in multiple places on this spectrum. To show how these databases make these kinds of trade-offs and guarantees,
I'm going to spend a little bit of time walking through the topologies of these various types of databases and what they do to achieve some of these things. Each of the upcoming diagrams represents a hypothetical cluster of six nodes, with six virtual shards on those nodes. By shards here, I don't necessarily mean physical shards; I mean each piece of data is going to live in various places in this cluster. And for each of these topologies, I'm going to discuss a couple of the possible scenarios in which these databases make that trade-off between speed and reliability. The easiest one to talk about, I think, is MongoDB, because it's very similar to how we're used to thinking about relational database replication. It really is pretty much a sharded solution, and each shard has masters and slaves. So when we have a client, by default, if we want to run in the fastest mode, we're going to write to a master node and get a result back, and then that master will replicate asynchronously, at its convenience, to any number of slaves. Now, if you think about this, this is an immediately consistent solution, because every client is always writing to the master copy of the data. But we have a durability problem here, because if that one node goes down, any unreplicated data is going to be lost. If we want to add some durability, Mongo will of course let you do that: you simply request that Mongo wait until that replication is complete. That will, of course, increase your latency, but give you some added durability. Something like Couchbase works in a similar fashion. Couchbase doesn't have dedicated masters, but for every row in your database, there is one node that functions as the master for that row. Again, in the fast case, you'd write and it would replicate asynchronously. And again, the way to make sure you have reliability is to wait until at least one of the replicas has the data as well.
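As a rough illustration of that trade-off, here's a minimal Python sketch of a master/slave shard that can either acknowledge writes immediately (fast, with an async replication backlog) or wait for a replica (slower, durable). All the names here are made up for illustration; this is not any real database's API.

```python
class Shard:
    """Toy master/slave shard: writes go to the master, and either
    replicate synchronously (durable, slower) or sit in an async
    backlog (fast, but lost if the master dies first)."""

    def __init__(self):
        self.master = {}
        self.slave = {}
        self.backlog = []  # writes acknowledged but not yet replicated

    def write(self, key, value, wait_for_replication=False):
        self.master[key] = value
        if wait_for_replication:
            self.slave[key] = value            # block until a replica has it
        else:
            self.backlog.append((key, value))  # ack immediately, replicate later

    def fail_master(self):
        """Master catches fire; a slave is promoted. Returns what was lost."""
        lost = dict(self.backlog)              # the unreplicated backlog is gone
        self.backlog.clear()
        self.master = dict(self.slave)
        return lost

shard = Shard()
shard.write("a", 1)                             # fast mode
shard.write("b", 2, wait_for_replication=True)  # durable mode
lost = shard.fail_master()
assert "a" in lost and "b" not in lost  # only the async write was lost
```

The same shape applies to Couchbase's per-row masters: waiting for at least one replica before acknowledging trades latency for durability.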
Systems like Cassandra or Riak take a different approach, without any single master. The fastest case for these kinds of databases is to write to whichever node in the cluster they happen to feel like. So client one might be writing row A to node one, and client three might be writing row A to node three. And since they're replicating asynchronously, there's no guarantee that these writes won't conflict with each other. This is the traditional eventual consistency that we hear about. To bring additional consistency into this type of arrangement, we have a number of options. We can request that the client write a copy to every node that holds the row, so that when we read one, we're guaranteed to get the latest version. We can do the converse: always write to one node, but every time we read, hit every node and just take the most recent row. Or we can do a quorum, say two and two, and make sure we're guaranteed to get one copy of the latest write somehow. You've probably heard this before, but as long as the number of writes plus the number of reads is greater than your replication factor, you're guaranteed to be consistent, in the sense that you'll get a copy of the latest data. And then we have some databases that build on top of this a bit: the same basic idea, with multiple masters at some level, and then, in various ways, they make some additional transactional guarantees to get a level of ACID support into their database. So in a general sense, what this means is that systems like Mongo and Couchbase will tend to be making that trade not with consistency, but with durability: you can go fast, or you can be more durable. I say quick and dirty here because, for instance, you can request that Mongo read from a secondary node and force yourself to lose consistency as well. And we have some benchmarks around this too.
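The quorum rule just mentioned is easy to state as code. A small sketch (the function name is mine, not any database's API):

```python
def quorum_consistent(n: int, w: int, r: int) -> bool:
    """True if writing to w of n replicas and reading from r of n
    guarantees every read set overlaps every write set, so a read
    always sees at least one copy of the latest acknowledged write."""
    return w + r > n

# Replication factor 3, the three options described above:
assert quorum_consistent(n=3, w=3, r=1)      # write all, read one
assert quorum_consistent(n=3, w=1, r=3)      # write one, read all
assert quorum_consistent(n=3, w=2, r=2)      # quorum writes and reads
assert not quorum_consistent(n=3, w=1, r=1)  # fast mode: eventual only
```

Note that this only guarantees what you see once a write has completed; it says nothing about what reads observe while the write is still in flight.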
It actually doesn't really buy you anything and often hurts you. But in a general sense, that's how those work. Whereas Cassandra and Riak take the consistency approach: they can sacrifice consistency for speed. And then Aerospike and FoundationDB build upon that and provide some layer of ACID guarantees on their transactions. Okay, so the question is, what's the difference between durability and ACID? And the answer is that the D in ACID is durability. So let's go back a slide, right? MongoDB and Couchbase can trade speed for durability; they can get durability there. Are they ACID? Well, probably not, depending on that. Now, ACID itself can mean a couple of things as well, and I'm about to talk about what that means. But on the durability piece, I'm going to show you an example of where you're durable but not ACID. The reliable case for Aerospike does have that consistency guarantee, and it does have that durability guarantee. But there's also the transaction isolation component, so there are additional guarantees beyond just adding durability. And that's actually what this slide is about to show; this is an example of exactly that. This definition of consistency here comes right from the DataStax documentation. It says that in a database like Cassandra, something is consistent if, after your write completes, all the reads see at least that most recent write. And that's clearly a guarantee it makes when reads plus writes is greater than your replication factor. But think about the context of, for example, the first example I gave, where you write to all the nodes and then read from one. Once that write is complete, obviously you're going to get a recent copy of the data. But while those nodes are being written, there's no saying what's going on.
You can get an old value followed by a new value followed by an old value followed by a new value; you're just getting dirty reads. And to what extent are these dirty reads an actual problem? Well, it depends what you're trying to do. This graph here is just one of the tests we ran. The incidence of these dirty reads was pretty small, but we artificially injected a little bit of latency into our cluster, and you can see that very quickly we started getting between 10 to 15, and as high as about 50, percent of all our reads being dirty. Maybe that's fine; relational databases allow you to do dirty reads as well. But there's a good chance it wasn't what you had in mind when you thought you were working with consistency. So that's where this question of ACIDity comes into play, with this isolation level. And here too there's a spectrum, in terms of how ACID your database is. At the basic level, you can get a row-level optimistic lock, the compare-and-set operation. That's something that's being added in Cassandra 2.0, and a number of databases have it. You can go a step further and apply the same isolation level across multi-key transactions. And then for something like FoundationDB, even longer-running transactions across multiple queries, a long transaction. [Audience question] I'm sorry, I'm trying to understand the question. So the question as I understand it is, does compare-and-set count as ACID? Is that sort of your question? Right, well, it depends what you're comparing against. So if you're comparing against 10... all right, so I'm trying to repeat this question, right? To what extent does compare-and-set provide this ACID property? Well, the idea behind compare-and-set is that you can do this optimistic locking.
So if you lock on... the example was, if you have an account at $10, and you lock on the value of $10, and two separate transactions try to lock on that same value of $10 and set it to $9, you're going to end up with a problem. Which is true, I agree. And I would argue: don't do it that way. Lock on something else instead, lock on a version stamp, and then you're not going to have that problem. What you will have is that when the second transaction tries to commit, you get an error, because someone has already modified the value out from under you. Well, it's an optimistic lock; there are no locks in the database. Your application just sees that some other transaction got there first and retries. Nothing is being locked in the database. Now, there are actually other approaches here too. Riak, for instance, is doing something called CRDTs, conflict-free replicated data types. These are data types that converge. So one way of doing this, if you have an eventually consistent system and you want to do something like a decrement: decrement operations will eventually converge, so you can define your field in this way, and all these decrement operations, even though they might each report nine in their respective local zones, will eventually combine together and give you the right value. So that's another approach. I'm not going to spend a lot of time on this chart in the interest of time, but what I want to show here is basically what you can start to think about; there are just too many dimensions to fit comfortably in a slide. You need to think about what your application cares about in terms of consistency and in terms of durability, and then flesh out all these things, and then you can create a framework for how you want to benchmark your own workload. Because benchmarking in the abstract is going to give you misleading information.
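The version-stamp retry described above can be sketched in a few lines of Python. This is a toy in-memory store, not any real database's API; the point is only the shape of the compare-and-set loop.

```python
class OptimisticStore:
    """Toy key-value store with compare-and-set on a version stamp.
    Names are made up for illustration; real APIs (e.g. Cassandra's
    lightweight transactions) look different."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def cas(self, key, expected_version, value):
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            return False  # someone modified it out from under us
        self._data[key] = (version + 1, value)
        return True

def debit(store, key, amount, retries=5):
    # No lock is held: on conflict, re-read and try again.
    for _ in range(retries):
        version, balance = store.read(key)
        if store.cas(key, version, balance - amount):
            return True
    return False

store = OptimisticStore()
store.cas("account", 0, 10)  # balance starts at $10
debit(store, "account", 1)
debit(store, "account", 1)   # both debits apply: no lost update
assert store.read("account")[1] == 8
```

Because the lock is on the version stamp rather than the $10 value itself, two concurrent debits can never silently overwrite each other; the loser of the race simply retries against the fresh balance.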
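And the converging-decrement idea mentioned for Riak can be sketched as a simple state-based CRDT, a PN-counter: each replica records its own increments and decrements, and merging takes per-replica maxima, so replicas agree no matter what order updates arrive in. Again, the names here are illustrative, not Riak's API.

```python
class PNCounter:
    """Sketch of a convergent (CRDT) counter. Each replica tracks its
    own increments and decrements; merge takes per-replica maxima,
    so all replicas converge to the same value."""

    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.incs = [0] * n_replicas
        self.decs = [0] * n_replicas

    def increment(self, amount=1):
        self.incs[self.id] += amount

    def decrement(self, amount=1):
        self.decs[self.id] += amount

    def value(self):
        return sum(self.incs) - sum(self.decs)

    def merge(self, other):
        self.incs = [max(a, b) for a, b in zip(self.incs, other.incs)]
        self.decs = [max(a, b) for a, b in zip(self.decs, other.decs)]

a, b = PNCounter(0, 2), PNCounter(1, 2)
a.increment(10); b.merge(a)         # both replicas start at 10
a.decrement(); b.decrement()        # each locally reports 9...
a.merge(b); b.merge(a)
assert a.value() == b.value() == 8  # ...but they converge to 8
```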
The last major topic I wanted to discuss is availability. I glossed over this a little earlier when I talked about CAP, but availability is a real thing. And as far as we're concerned, in terms of our business, it means: is the cluster there for our application to connect to? So we've run just a ton of different kinds of experiments, taking different databases and just messing with them. This picture here represents the hypothetical way we'd want our NoSQL database to behave. The red line represents the throughput of the system. If we mess with a node and take it down, we'd expect throughput to drop for a bit, but we'd also expect it to recover. And when we add the node back, we'd expect something similar to happen. What we would hope is that these little gaps in availability are quite small. So we ran a lot of these tests, and I'm going to highlight a few cases that I think illustrate what to think about when you're planning your capacity and so forth. The simplest case is taking a cluster and just clobbering it: put all the load you can on it, have it go as fast as it can. It's not the most interesting case, because what you see is probably what you'd expect: if you knock away 25% of your capacity, your throughput goes down by about 25%. Unless you're something like MongoDB, where you have master-slave replication, in which case you haven't lost any capacity; you've just had the slave take over. So I'll go back. Yeah. Right, so in the Couchbase graph, what you're seeing here is one node going down. You've lost capacity, of course. And then when we reintroduce the node, you actually see some distortions as the data gets replicated to the node that's coming back up. In Couchbase in particular, that node is treated as junk, basically, and so it just replicates everything over.
So there is a long tail here; you can see it's a little bit volatile as Couchbase gets that node ready to hit maximum capacity again. If you think about Mongo: Mongo is using strict master-slave, so it never actually lost capacity, because it wasn't using half of its capacity to begin with, right? In the case of Aerospike, yes, for this scenario it recovers faster, as you can see. I think a more interesting case is not when you run at full capacity, because that's not how you're going to provision your servers anyway. You're going to provision them with the assumption that you're occasionally going to have failures. So here we run at 75% load; this is essentially the worst-case scenario in that event, like you're saying. If I lose a node, my cluster should then be maxed out, but able to handle it. And the nice thing is that for all these databases, it appears to work quite well: there's very little downtime, and they're able to function at very close to that initial capacity. Now, I think it gets more interesting with that Aerospike replication we mentioned. Once you take it away from that fastest possible mode and start adding some of these reliability guarantees, and this is with synchronous replication and writing to disk, you can see it still functions, but the replication delay becomes very, very significant. There's a lot to do here: hundreds of millions of rows have already been written that this new node now needs to be brought up to speed on. And this should affect how you think about capacity planning, because in theory, again, you have enough capacity for this 25% outage, but in practice you've got this other replication traffic that needs to happen. The Cassandra case is the typical case of no availability.
If you're telling Cassandra, I need to write to every node, and one of those nodes goes down, and you're trying to write data that lives there, your cluster is just not going to respond. The bottom chart is the same example using just a quorum, where you see some effect, but it tends to keep working. All right, and I'm going to... [Audience question] The middle chart is showing Cassandra using that write-all mode. So if you're trying to write a transaction in write-all mode, and one of those nodes is down, the database is just not going to be available to you. Yeah, that's about the simplest consistency-availability trade-off you could see. So what does it mean for Cassandra? In practical terms, it means don't use a replication factor of two if you want to be consistent. Get more nodes in there, and use a quorum, because this can happen. And you might not care, because you can bring your cluster back up, which is what most of us do with relational databases anyway: we manually panic and get it back up. So it's not ridiculous. [Audience question] So these charts are talking about availability; the question is, where do we see data loss on these charts? There are two answers here. This is with synchronous replication, so there's no data loss in any of this slide's cases. If we go back to one of the earlier ones, where I'm doing asynchronous replication in the fast mode, when that node goes down, there's some implicit data loss there. The amount of data loss is a little bit difficult to quantify, because it depends on what happened. Did your machine burst into flames, or did it just power down? How much was in RAM versus on disk, and so forth? I like to think of it this way: let's just assume the whole machine caught fire, in which case your data loss is your replication backlog. All right, I really don't have a lot of time, so let me just quickly wrap up.
The main takeaway here is that because this is such a rapidly evolving marketplace, and these vendors keep adding new features, the right way to think about this is not in terms of those feature sets, but in terms of that underlying key-value storage. And the right way to think about that key-value storage is in terms of the kinds of trade-offs a particular database can make, and whether those trade-offs are in line with what you want to do with your application. Once you have that set out in your mind, that's when you can really focus on benchmarking your particular use case and thinking about all the other features that make the system usable or desirable. But that should be put off; you need to do the first part first. I did mention at the start that we've done a bunch of benchmarks around this, and just trying to get a baseline between databases is a pretty significant task. I'm happy to talk about it; if you want, we have some papers, fairly long ones, that describe how to do it, along with some of our findings. If you have any other questions, I still have maybe a minute or two. Yes? [Audience question] Yeah, so the question is, what does this periodicity in the Aerospike replication mean? I can tell you roughly what it means. If you think about what's going on here, the database needs to recover the node, and it needs to write all this old data that's been saved on other nodes to this new node. At the same time, we're clobbering this cluster. Pretty much all the databases let you configure how much priority the replication should have versus servicing current requests, and I think what you're seeing here are these batches of replication traffic. There are a lot of settings that can change the shape of that a bit.
If you want, you can come up later and I can get you in touch with the engineers who actually ran this, who might be able to give you lower-level insight into what was going on there. Any other questions? All right, well, thank you very much. Oh, did I show the contact information? Oh, yeah, I'm sorry. Here's my contact info. Please feel free to reach out, drop me an email. I'm happy to discuss this in more depth, and if you have specific questions, I can get you in touch with the various people who actually ran the various tests. We have a ton of these kinds of results, and I've just shown a smattering from different studies we've done. All right, thank you.