Thank you all for coming. Thank you, Quintero, for hosting. Hi, my name is Philip Thompson. I'm here to talk specifically about distributed NoSQL databases. To give you some context, I'm a software engineer at DataStax, where I work full-time contributing to Apache Cassandra, which is a distributed NoSQL database. It's going to be one of the case studies I'll be examining in this presentation. I'm also the maintainer of an open source project that developers and companies use to mock full Cassandra clusters on their laptop. That's CCM. So we're all very familiar with a client-server architecture. Here we're talking about a database, so really it's more of a database-server/app-server architecture, where the app servers are the clients of the database, and those app servers have the users as their clients. This is a really simple, abstracted architecture where we have a single machine. That's the traditional way we look at it. Historically, a long time ago, that's how things were. You had your one database. You built it on a very large machine. You built a lot of redundancy into it so it never went down, and it had to handle your entire application's load. So why would we leave that architecture? Why would you want more than one server? Well, there's load. It comes down to the fact that scaling up has diminishing returns. If you have a single-server installation and it needs to handle twice the load, what changes do you need to make to that box for it to handle twice the load? There is no formula for that. Doubling the memory, the disk, and the CPU might not get you twice the load. It might get you 150%. You might need to quadruple the size of that box to handle double the load. Eventually, for very large companies, you know, a Fortune 500 company, the absolute biggest, best box you can buy can't handle their entire application's load. Then there's failure.
As redundant as you can make a single machine, something can and will go wrong with it eventually. You will suffer downtime as a result until you can fail over to your backup, until you can fix the problem, until you can restore it somehow. And ultimately maintenance: if you need to bring down that machine to perform manual maintenance, you're causing that downtime. For a lot of the customers I deal with at work, downtime measured in minutes equates to millions of dollars. That's a situation many companies don't want to or cannot be in, and that's the biggest reason for moving off a single-server architecture. So this is a settled issue that happened a long time ago: they came up with a master-slave architecture. You have one master, which has many slaves. All writes go to that master; all reads come from your slaves. The master is responsible for forwarding the incoming writes to all of those slaves. That allows you to take all of your read workload off of the master. So we're in a multi-machine setup, but really one machine is still responsible for all of the write work. Potentially, in some master-slave setups, you have failover built in, so if your master goes down, a slave becomes the new master. There are, however, issues with this setup. The biggest one is slave lag. If the amount of data coming into your master node exceeds, matches, or even approaches the amount of data it can pump out to slaves, then eventually, due to network hiccups or downtime, those slaves are going to get further and further behind in time, to the point where it's possible they might never catch up. There are real cases of people who have to take a physical disk, plug it into their master, fill it with data, and walk it over to their slave, simply because that's more bandwidth and more throughput than they can get over the network to catch those slaves back up. That's something you don't want to be an expected part of your IT workflow.
Scaling: ultimately a master-slave architecture can only handle the amount of data the master can take in. You can scale reads infinitely, you can add as many slaves as you need, but write throughput is still bound by what the master can handle. There's partition tolerance, which is to say, if a partition occurs where the master is unable to communicate with the slaves, you might have a situation where reading from different slaves gives you different results. That could cause issues in your application if you assumed that would never happen. And finally, it's really hard to add more masters to this setup. I don't know if any of you have ever had to take a master-slave setup and horizontally shard it manually, double it. I've had to do that on MySQL. It is incredibly difficult to set up an active-active system on a database that was not intended to do that from the start. You have to manually split your data so that half of it will always be going to one master and the other half will always be going to the other master. If you want to split it into more masters, you have to keep manually sharding your data. There are external tools to do this on relational databases like MySQL and Postgres, but ultimately it's a lot of work. So where do we go from here? Well, I want to take a quick aside to discuss the CAP theorem. I'm sure this is something some of you are familiar with. It was presented by Eric Brewer in the year 2000, and it discusses an incredibly important tradeoff that you have to consider when building any distributed system whatsoever. It is regarding what you do when you have a partition. The C stands for consistency, the A is availability, the P is partition tolerance. It's considered a pick-two system, which is to say you can't have all three of these at once. Really, that's a bad way to look at it. The way you should look at it is: if you have a partition, what does your system do?
Does it choose consistency or availability? Because not having partition tolerance is not an option. To say "I'm not partition tolerant" means you're choosing consistency, because you're saying whenever there's a partition, my machine just shuts down. That's what it means to not be partition tolerant: I refuse to answer the client application. That's choosing consistency over availability. So what's going on here is the classic illustration: if you have even a two-machine setup and there is a partition, they're completely unable to talk to each other. These are two database machines in your system. A client sends a write to the database on the left side of the partition and says, here is some new data. What is that database going to do? It has two options: it says okay, or it says I can't take this. Ultimately, it's a forced binary choice. There's no third option. If it says okay, you're choosing availability. You're choosing the ability to respond at the cost of consistency, which is to say, all of your machines agreeing. If we accept this write, the machine on the other side still has an old state, a state that no longer matches what the client thinks has happened. The client has sent a write, it's gotten an acknowledgement, and it is positive that whoever it talks to will agree that that data is there now. That's not the case; you are inconsistent here. You could instead have chosen to say, no, I can't accept that write, because I can't make myself consistent. That's choosing consistency over availability. You are not responding in the case of a partition. Now, you might think of a partition as a hard stop, absolutely no connectivity. But a partition is any time there's a lot of latency. If you have to wait for all nodes to reach some sort of consensus, whether that's through a protocol like Paxos or Raft, or through dealing with some distributed lock, you're choosing consistency there over availability.
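The forced choice just described can be sketched in a few lines of Python. This is a toy model, not any real database's code; the class name, the mode names, and the return strings are all invented for illustration.

```python
# Toy model of a node on one side of a partition: it can no longer reach
# its peer, and a client write arrives. "AP" mode accepts the write and
# diverges; "CP" mode refuses and becomes unavailable.
class PartitionedNode:
    def __init__(self, mode):
        self.mode = mode              # "AP" (availability) or "CP" (consistency)
        self.data = {}
        self.peer_reachable = False   # the partition has cut us off

    def write(self, key, value):
        if self.peer_reachable:
            self.data[key] = value
            return "ok"
        if self.mode == "AP":
            # Choose availability: acknowledge the write even though the
            # peer now holds stale state, so the replicas disagree.
            self.data[key] = value
            return "ok (replicas now inconsistent)"
        # Choose consistency: refuse to diverge; the client sees downtime.
        return "error: unavailable"

print(PartitionedNode("AP").write("k", "v"))  # ok (replicas now inconsistent)
print(PartitionedNode("CP").write("k", "v"))  # error: unavailable
```

Either branch is a valid design; the theorem only says you cannot have both answers at once while partitioned.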
You're saying, I have to wait to answer until I am positive that I am consistent. That loss of availability is not a full loss of availability; it's an increase in latency for the benefit of consistency. So, are there any questions about that? I think I have a question. What would be the safe bet? Would you prefer consistency or availability? That absolutely depends on your application. There are many databases that make the consistency choice, and there are many databases that make the availability choice. They're both equally valid options. It really depends on your needs. Can you give examples of applications that would prefer one or the other? Well, I'm going to give examples of databases that follow one or the other. That works. Okay. So I have four NoSQL databases here to discuss: MongoDB, Apache HBase, Aerospike, and Apache Cassandra. MongoDB is, by many measurements, the most popular NoSQL DB. It reached that popularity for a number of arguable reasons, one of which is that it is incredibly easy to set up and incredibly easy to use from a developer standpoint. If you are a web developer and you're dealing with JSON documents, this is the most friendly database you could ever have. You just open up a connection, hit save, and pass in your JSON object in whatever language you're using with your Mongo driver. It's in the database. Totally schemaless. You can search on any subfield in your JSON document. Beautiful. The issue you run into is performance. But we're here to talk about distribution and what choices each of these databases made. So MongoDB, despite being such a popular NoSQL database, didn't really experiment with distribution. The way MongoDB works is it splits into replica sets, which are just master-slave subsets. Each one handles a section of your data, just as we discussed with sharding a typical relational database.
The difference here is Mongo will handle that sharding itself, which is nicer for you, a bit easier to set up. But ultimately, you're going to run into all of the same issues you had in a master-slave setup. If the master goes down in a replica set, it has automatic failover built in, so you have to wait for an election to occur, where the slaves choose a new master from among themselves. For the period of time while you wait for that election, this causes a loss of availability: you're not going to be able to put in new writes for the data that replica set is in charge of. So we gained automatic sharding here, but we didn't gain any increase in availability, and we kept the same amount of consistency. MongoDB is absolutely what we call a CP system, it chooses consistency and partition tolerance, versus an AP system, which is availability and partition tolerance. Just to describe the high-level architecture of Mongo: your client, your application server using its driver, is going to talk to a specific Mongo node called a query router. The query router knows which replica sets are in charge of which sections of the data. It's going to forward your query to that replica set or those replica sets, they're going to answer, and it's going to send that back to you. For some queries, it is unable to determine which replica set is responsible, and it's going to ask every replica set. So when you have replica sets that are down, due to the loss of a master, you're going to see problems with those queries; you're not going to get responses. That's going to come back to your application as a timeout error or an unavailability error, some sort of response saying it could not meet the consistency you needed. One thing Mongo has, which we will see in every database I'm demonstrating except HBase, is configurable consistency on a request. When you send a write or a read request, you can tell it how many replicas in a replica set you want to check.
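That per-request knob can be sketched abstractly. This is a hypothetical model of the idea, not MongoDB's actual driver API: a write is only acknowledged to the client once the requested number of replicas has confirmed it.

```python
# Hypothetical write path with per-request consistency: the client asks
# for `required_acks` confirmations before the write counts as done.
def replicated_write(replicas, key, value, required_acks):
    acks = 0
    for replica in replicas:
        if replica["up"]:
            replica.setdefault("data", {})[key] = value
            acks += 1
        if acks >= required_acks:
            return True   # enough replicas confirmed; tell the client "ok"
    return False          # requested consistency level could not be met

replicas = [{"up": True}, {"up": True}, {"up": False}]
print(replicated_write(replicas, "user:1", "alice", required_acks=2))  # True
print(replicated_write(replicas, "user:1", "alice", required_acks=3))  # False
```

In a real system the write would still propagate to the remaining replicas in the background; this sketch only shows the acknowledgement decision.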
That is because, essentially, of the slave lag problem we discussed earlier: the master might not have gotten all the data to the slaves yet. They might be at different points in catching up to what's on the master, so they might have different ideas about the state of the data. So you can increase the consistency, which is to say you can check more replicas, compare the responses from those replicas, find the one that is the most correct, and that is the response that will be returned to you. On a write, you're saying, I want to make sure that at least this many slaves have received the write from the master before my application feels comfortable knowing that this data is in the database. So next is Apache HBase. HBase is also a consistency-first system. HBase has a pretty complicated architecture. It has a master, or a number of masters. Masters are responsible for region servers, which are not quite slaves. Region servers each hold up to, they recommend, 100 regions. The best way to understand it is from the bottom up. You create a table in HBase and start writing to that table. That table has one region, which corresponds to all of its data. As you keep writing to that table, it will realize this is too much data, and it will split that region into two regions, each containing half your data. That process continues forever as you write data. The regions are split onto region servers: regions are files on disk, region servers are physical machines. The master or masters of your HBase cluster are responsible for assigning and moving regions between region servers. So if you have certain region servers which are handling too much load, the masters will take regions off of those region servers and assign them to other region servers to balance your cluster. If you have a multi-master setup, what you really have is a master and backup masters, which are managed by ZooKeeper. ZooKeeper is an external distributed locking system.
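The region-splitting process just described can be sketched. This is illustrative logic only, not HBase's implementation; the threshold and the dict layout are made up for the example.

```python
# A region covers a sorted range of row keys. When it grows past a
# threshold, it splits into two regions, each covering half the range.
SPLIT_THRESHOLD = 4  # made-up size; real systems split on bytes, not row counts

def maybe_split(region):
    keys = sorted(region["keys"])
    if len(keys) <= SPLIT_THRESHOLD:
        return [region]               # small enough: leave it alone
    mid = len(keys) // 2
    return [{"keys": keys[:mid]},     # lower half of the key range
            {"keys": keys[mid:]}]     # upper half of the key range

table = {"keys": ["a", "c", "e", "g", "i", "k"]}
left, right = maybe_split(table)
print(left["keys"], right["keys"])  # ['a', 'c', 'e'] ['g', 'i', 'k']
```

The master's job is then just bookkeeping: deciding which region server each of these region files should live on.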
You use it to achieve consistency in distributed setups like this. HBase is probably the most prominent user of ZooKeeper. So what would happen here is, if a master goes down, or if ZooKeeper believes it goes down, ZooKeeper would contact a backup master and inform it that it's the new master. The new master would then begin taking over that role. While your master is down and a new master is being chosen, you can continue to read and write from your region servers, so the loss of a master does not cause a loss of availability. However, the region servers do need to continue to be managed, so the master cannot be down indefinitely. Ultimately, you can have downtime there on the master, but on the order of minutes, maybe hours, absolutely not longer than that. The region servers hold many regions each, and each region belongs to only one region server. So you might conclude that data is only stored on one machine at most. Well, HBase sits on top of HDFS, the Hadoop file system, which HBase uses for replication. HBase will be reading and writing to and from HDFS, and HDFS is handling the replication of data. So you can rest assured your data is durable and replicated across multiple machines, even though at most one region server is ever handling your client's interaction with that data. Ultimately, because in HBase data is only served by one region server, there's no configurable consistency amount on queries the way we'll see with these other databases. Any questions on HBase or Mongo? Yes. You said MongoDB was the most used NoSQL database? Yes. Isn't it Microsoft's Active Directory? Isn't that the most used NoSQL database? Active Directory is definitely a non-relational data store, but I would say it does not market itself as a NoSQL database. I'm not defining NoSQL as everything that isn't relational here. I'm defining it specifically as the databases that define themselves as NoSQL. It's sort of an opt-in buzzword. Okay. Aerospike.
Aerospike is an availability-first database. It, I feel at least, has a much better architecture than HBase or Mongo. It has exclusively masterless, completely identical nodes. They use a hash algorithm; I don't know exactly which one Aerospike uses. What it does is, you can think of the output of that hash algorithm as a circle from the minimum value to the maximum value, wrapping all the way around. Each node in your Aerospike cluster holds a value, and that value matches a point in the range of the hash output. When data comes into the Aerospike cluster, it can be sent to any node. That node will hash the primary key of that data, then look on this ring, see the hash of this value, and go: this is the node that owns that data. It will walk that ring to find who owns the slice of the hash output that your write's primary key fell into. That's how data ownership is chosen, automatically, completely algorithmically. So you can horizontally scale this just by adding nodes, which will take up new slices of the token ring. And the other way around, you can also remove nodes, and their data will be distributed back out to the machines that now own what previously were their slices. Any questions? Is it really a primary key, is it unique for that data, or can it be duplicated? I would say you should not think of it the way you think of it for a relational database, where there is only one row with that primary key. I don't know Aerospike's exact architecture there. I would say I'm more than 80% sure it works exactly the same as Cassandra, where there are different parts to that primary key, but I can't guarantee that. When you say identical, it doesn't mean that the data is identical? Yeah, the data is not identical. But no node holds a special designation over any other node.
There's no master; no node has a special responsibility. Every node is just a member of the cluster. It owns a subset of the hash ring, and it gets whatever data falls into its slice of the ring. You then specify a replication factor, that is to say: in addition to the node that owns this data, how many other nodes should it be on? This is for a few reasons. One, durability: if you lose that node, you want your data to be in other locations. Two, availability: if that node goes down, you still want access to the data it owns. If you have set a replication factor of one, you don't have access to that data until that node is restored. So if this is where our data fell and you specify replication factor three, it just chooses the next three nodes. It's a very simple system. Easy to reason about where your data sits on the cluster, and easy to reason about why you have hotspots if a certain part of the ring is getting too much data. So the user sets the replication factor? Yes, you specify your replication factor. You decide how many machines each row of data should be on. This is all happening within a data center. Aerospike allows you to define other data centers, which exist as separate clusters. You then run a separate process on each of your Aerospike nodes that is informed of a sort of larger data center topology, which I've demonstrated here. You can set up active-passive, you can set up active-active, you can set up a really complicated star topology where you have three active data centers and each one has its own passive data centers. All of these data centers are going to be holding all of the data that is in your Aerospike cluster. So if you have a row of data with replication factor three, it's going to be on three replicas in every data center. You also have configurable consistency on requests: you can say write to two replicas before you return, or write to all replicas before you return.
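The token ring the last few paragraphs describe can be sketched concretely. The hash function here is an assumption for illustration (MD5 via hashlib); the talk notes Cassandra uses Murmur3, and Aerospike's exact choice is unknown.

```python
import hashlib

RING_SIZE = 2**32  # the hash output space, arranged as a circle

def ring_position(key):
    # Hash the primary key onto the ring (MD5 purely for illustration).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

def replicas_for(key, nodes, replication_factor):
    """nodes: list of (token, name) pairs. Find the first node whose token
    is at or past the key's position, then walk clockwise to collect
    `replication_factor` nodes ("just choose the next three nodes")."""
    ring = sorted(nodes)
    pos = ring_position(key)
    start = next((i for i, (token, _) in enumerate(ring) if token >= pos), 0)
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]

# Four nodes with evenly spaced tokens; RF=3 puts each row on three of them.
nodes = [(0, "A"), (RING_SIZE // 4, "B"),
         (RING_SIZE // 2, "C"), (3 * RING_SIZE // 4, "D")]
print(replicas_for("user:42", nodes, replication_factor=3))
```

Adding a node just claims a slice of the circle, and removing one hands its slice back, which is why scaling out and in is purely mechanical.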
The default is to write to all replicas and read from only one. That guarantees consistency with the defaults, at a loss of availability: if any single node goes down, you can still keep reading, but you cannot keep writing. You can configure that consistency to use fewer replicas so that you stay up even on node failure. So for this, they use a distributed hash table? Yes, it's absolutely distributed. Including the index within the cluster, or only data? The primary key index is how it's distributed. As far as secondary indexes, like on other columns, I don't know how Aerospike does it. I would expect, again, the same way as Cassandra, they're very similar, in that each node is responsible for indexing the data it owns, only locally. So finally, I have Apache Cassandra. Same masterless, identical-node setup, and again it partitions data via a hash algorithm. I know which one we use: it's Murmur3. The biggest difference is the cross-data-center replication. In Cassandra, all data centers are active, which is to say they can all accept writes at any time, and they will forward those writes to any and all data centers. The cross-data-center replication is built in. It is not something you need to set up a separate process for. The reason it's a separate process on Aerospike is nothing technical; it's simply that the open source version doesn't have it and the enterprise version does, so they sell you that separate process. A limitation Cassandra has that Aerospike does not is the active-passive and the star topology. You can set up something similar and simply treat one of those data centers as passive by not writing to it, but everything is always active. Here, same thing as Aerospike, you specify a replication factor. Here we're using replication factor three, so data goes to three nodes in each data center. And same system, you can configure the consistency of your requests. On writes, you can go to as many or as few replicas as you need.
Same with reads. Cassandra is also an availability-first database, so you can, like Aerospike, take down nodes and still continue to serve requests. They operate on what's called eventual consistency, which is to say that the database is not always consistent: every replica might have a different opinion of the state of the data it owns, and they might disagree with the other replicas. That is because you might have chosen to write to only two replicas before getting a response, and the third one was down; when it comes back up, it is missing data. Both systems have catch-up mechanisms built into them. Some are automatic; some require manual operator intervention. But it's something you need to build into your application, the assumption that if you're writing with a configurable consistency, replicas might disagree. You have to assume replicas will disagree all the time. What's the difference with HDFS? It's write once, read many times. Is Cassandra similar? Cassandra does not use the Hadoop file system whatsoever. HBase sits on top of HDFS. HDFS is typically used for bulk analytics; you know, it is the Hadoop file system. I don't know what HBase does to get it up to transactional database speeds. But HDFS there handles the replication of the data; it does not handle serving the data. HBase is doing that, and every single HBase slave, which is a region server, has a monopoly over its section of the data. They say Cassandra is faster for writes, why is that? So that is unrelated to the distribution mechanism. But in Cassandra, the reason the writes are faster is that it's a log-based write system, so it does not read before it writes. When you insert, it appends to a log, and it will handle checking for duplicates on the read. So writes are faster than reads. Any other questions?
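The log-based write path just mentioned can be sketched in a few lines. This is a toy model of the idea, not Cassandra's storage engine: inserts are blind appends, and the read reconciles duplicates by timestamp.

```python
# Writes never read first: they append to a log and return immediately.
# The read pays the cost, scanning for the key and keeping the newest.
log = []

def write(key, value, timestamp):
    log.append((timestamp, key, value))   # fast path: append only

def read(key):
    versions = [(ts, v) for (ts, k, v) in log if k == key]
    return max(versions)[1] if versions else None  # highest timestamp wins

write("k", "old", timestamp=1)
write("k", "new", timestamp=2)   # overwrites nothing; just another log entry
print(read("k"))  # new
```

Compaction later merges the duplicate entries on disk as a background process, which is why writes outrun reads in this design.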
So if you have an application, how do you verify that the consistency is where you need it to be, if it's of critical importance? Yeah. So typically, what users will do is read and write both at a consistency level of quorum, which is to say more than half of the replicas. If you read at quorum and write at quorum, then from, you know, CS 101, the pigeonhole principle, you're guaranteed to always get the correct data back, because there must be an overlap between your writes and your reads. So when you get that read back, one of those replicas must have the most up-to-date information, and what you get back is the most up-to-date information. That allows you to suffer the loss of one fewer than half the replicas in your cluster at only a performance setback. Obviously, you just lost a significant number of nodes, but there's no availability or consistency loss to your application. Very, very rarely do users use the ALL consistency level, simply because it adds high P99 latencies and a lot of difficulty when you're performing maintenance. These seem to have wildly varying complexities in setup and probably administration? Absolutely. So can you give ballpark numbers? Like, if I know I'm going to start with one machine and only ever need five, I use Mongo? Absolutely. I know what the party line is here, but I would say, as users in the wild on IRC will tell you: if you're never going to have more than five nodes, don't bother with Cassandra. That's simply because they recommend just staying on Postgres in that situation, or Oracle, or whatever your relational DB is. If you don't need the scale, if you don't need the multiple data centers, and a very nice thing about Cassandra and Aerospike is that these data centers can be anywhere in the world, all around the world, if you don't need that, it's better.
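The pigeonhole argument above can be checked exhaustively for a small replica set. A small sketch; `quorums_always_overlap` is an invented helper name.

```python
from itertools import combinations

def quorums_always_overlap(n, w, r):
    """True if every possible write quorum of size w intersects every
    possible read quorum of size r among n replicas."""
    replicas = range(n)
    return all(set(wq) & set(rq)
               for wq in combinations(replicas, w)
               for rq in combinations(replicas, r))

# QUORUM on both sides of a 3-replica set: W + R = 4 > N = 3.
print(quorums_always_overlap(3, 2, 2))  # True: some replica saw the write
print(quorums_always_overlap(3, 1, 1))  # False: a stale read is possible
```

The general condition is W + R > N; quorum reads plus quorum writes is just the symmetric way to satisfy it.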
We recommend, and most people recommend, you stay on a big-box system, a master-slave architecture. If you're using Hadoop heavily, if you have all your data in HDFS, and if HBase is going to put up the performance numbers you need it to, because ultimately it is constrained by the speed of HDFS, HBase might be right for you. If it's a personal project, or if it's okay if it goes down, or if it's not going to handle a lot of load, you're okay to stick with Mongo. But I would never point a customer at a Mongo installation. I mean, I'm absolutely biased in every way, but I was watching a presentation earlier today by Macy's. They were examining a number of different NoSQL solutions for a new system they were setting up. All the numbers were coming in really close to each other, and it was a tough decision, except for Mongo, which completed none of their tests at the scale Macy's was operating at. So, Aerospike's documents say 10 to 100 node clusters; 5 to 1,000 nodes for Cassandra. HBase clusters, I have no idea; I think it's on the same scale of 10 to 100. That's mostly because HBase is capable of being incredibly, incredibly more data-dense than a Cassandra installation is. If you need ten 20-terabyte servers, you should be running HBase. If you are going to run 100 2-terabyte servers, Cassandra or Aerospike are more up your alley. Anything else? Just a broader question: what's the general cause for a node going down? Is it like hardware failure? Yeah, any sort of hardware failure, any sort of corruption. Most of the time, a partition issue is that a node is slow for whatever reason, and all the other nodes think it's down, or it's responding so slowly that it's effectively down. It's not that it isn't meeting the requests; it's simply taking much longer than everyone else. Or, you know, it could be that you've knocked over the database system you're running; that happens a lot.
We've definitely seen people have issues with Cassandra, with Couchbase, with Mongo, with really anyone, where you set up a cluster of a certain size, you scale too far past that size, and your nodes are falling down simply from out-of-memory errors, from too much garbage collection, from getting caught up in IO. They can't handle it, and it appears that your nodes are down. I remember hearing, at least a while ago, that people kind of screwed up with MongoDB by overwriting their same data too much, and, you know, it was a really bad use case. It's in the documentation. Yes, if you write a document and then you write a longer document to the same key, it doesn't like that. Are there similar gotchas with the other systems? Yeah. A lot of that with Mongo is they have write locks to achieve consistency; also, when you're overwriting, you're rewriting the entire document. HBase and Cassandra are both more or less Google Bigtable architectures, which is to say they have column families, and overwriting is cheap, really. They have that log-based architecture, so an insert is always going to be written immediately to disk. Then on the read, what's going to happen is they're going to find all of the results for that key, and they're going to say, which one is the newest? That's what gets returned. Then, as an asynchronous background process, they're going to be compacting the files on disk together and removing all the old duplicate data. There are absolutely gotchas with those systems, but I guess I'd need you to narrow it down more than just gotchas; nothing jumps to mind. Do you know of a beginner mistake you could point at? Oh, okay, then that's definitely easy. Deletes. If you have a queue-based workload, you're inserting data and then deleting it later. Due to the eventual consistency these systems have, they can't just delete data on disk.
Because if two nodes delete the data on disk and the third one didn't catch the delete, later when you read, the data is still there. So what they do is they write a tombstone. That tombstone says: at this time, this data was deleted; if someone else thinks it's still there, tell them they're wrong. Those tombstones persist for a long time. In Cassandra, it is a week. I don't know how Aerospike handles it. So what I'm saying is, deletes are building up data on disk. They are creating data that the processes in Cassandra and Aerospike have to read through when doing reads, and they have to handle those tombstones on compactions. Essentially, you're bogging down the database with non-data. So if you have a delete-heavy workload on an eventually consistent system, it's not going to be for you. You're going to want a strongly consistent system. So also, if you have a legal requirement to delete data, does that more or less rule these systems out? No, the data is deleted. It writes the tombstone just in case another machine is missing it, hasn't gotten that delete yet. And then the asynchronous catch-ups, or the operator intervention, which is not an unexpected intervention, it's like a once-a-week process you kick off that goes and catches everyone up on everything, that will cement the guarantees of the delete. But if you have some sort of delete SLA measured in minutes or hours, not days or weeks, it's quite probable that these couldn't meet it. Anything else? How similar is Aerospike's replication to Cassandra's? I'd have to Google how old Aerospike is to answer that with any authority, but the replication systems are incredibly similar, and that's the same with Riak and DynamoDB. These are all children of, yeah, they're all borrowing from Amazon's Dynamo white paper. Dynamo was the originator of this cross-data-center, token-ring distribution style. It is arguably, for cross-data-center replication, the best.
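Going back to the tombstone mechanism for a moment: it can be sketched on top of the same append-only log idea. A toy model, not either database's code; using `None` to stand in for a tombstone is this sketch's own convention.

```python
# A delete is just another write: it records *when* the key was deleted,
# so a replica that missed the delete gets corrected at read time.
log = []

def write(key, ts, value):
    log.append((key, ts, value))

def delete(key, ts):
    log.append((key, ts, None))   # None marks a tombstone

def read(key):
    versions = sorted(entry for entry in log if entry[0] == key)
    if not versions:
        return None
    return versions[-1][2]        # newest entry wins, tombstones included

write("job:1", ts=1, value="pending")
delete("job:1", ts=2)
print(read("job:1"))  # None: the tombstone shadows the older write
```

Until compaction finally purges them, those tombstone entries are exactly the "non-data" that a delete-heavy workload piles up on disk.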
No one makes it as easy to set up and handle automatically multiple data centers around the world, all active, the way Dynamo and its, you know, children databases do. At the node level, like the query language, how each node handles data, their capabilities, Aerospike, Cassandra, Riak, and Dynamo all get really different all of a sudden. But at this higher level, they all look really identical. A big difference is vector clocks. When you have multiple writes coming in at once for the same key that are conflicting, Cassandra has a last-write-wins policy: the highest timestamp is correct. If you need a transaction-like system, where you say, I need to absolutely guarantee that this is the write that wins, we allow you to use Paxos as a consensus algorithm, but that's intended only for a very small subset of your workload, where you're like, I need to make sure only one person has this username, that kind of deal. Riak and Dynamo use vector clocks, where they keep both conflicting writes, give them vector clock values, and return both to the client on the read, to say: you sent me both of these, you deal with which one is right, let me know so I can update. Now, that may be more useful for some workloads. Ultimately, it's a case-by-case basis, and we felt last-write-wins was better. Okay. Do all these database systems have PAM support? Pluggable authentication modules? Yes, yes, yes. I don't know about Aerospike, but I want to say almost certainly; I mean, it has an enterprise edition, and you usually can't sell that without authentication of some sort. All right, thank you.