Thank you. First of all, let me ask, just by a quick show of hands: who's an engineer? Who's a programmer or developer? All right — these are my kind of people. So I'm going to switch gears a little bit here. Joe's been talking primarily about the left-hand side of big data, the analytical side: I've got lots of data from different sensors, or from users contributing things, and I'm going to crunch through it and spit out some recommendations or action items to work on. The NoSQL side of things is more about: I've got this web application or software-as-a-service, and I need to scale it from running inside one company to running for a whole country of users. That's the space Cassandra plays in, and that's what I think is interesting about NoSQL. So what I'm going to do is talk about five questions to ask as you think about and evaluate different NoSQL solutions — questions that will let you make sense of the different offerings out there. The problem with NoSQL as a term is that it's incredibly broad: you have everything from Cassandra, a scalable system of record, to Neo4j, a graph database mostly interested in data sets that fit on one machine. Very different kinds of products. So I wanted to break it down into some questions you can ask to think about the differences between these solutions and what they're good at. And if, as I go through, you start to think, "wow, Cassandra is good at a lot of these things," then I'm doing my job right. I actually came to Cassandra, by the way, as part of a second generation of people working on it. It was open-sourced by Facebook back in 2008; I came to it from Rackspace and became the first committer on it in the Apache Foundation outside of Facebook.
So I had the opportunity to evaluate Cassandra against things like HBase and Voldemort and MongoDB. I did that evaluation myself, and I chose to get involved with Cassandra because of some of the reasons I'm going to discuss today. The first question I want to talk about is this one: how do I model my application? What does building an application on different NoSQL databases look like? There are a bunch of different options, of which I think the most interesting are the document databases and the tabular ones — the ones that have tables of rows and columns. Key-value databases are out there too, and some are reasonably popular, such as Voldemort, but a key-value store is kind of like a very simplistic tabular database where each row has a single column — that's my value. So I'm going to focus on the document and tabular models. The document model says: I'm going to let you throw in balls of JSON — JavaScript Object Notation — where I have fields that are named and have values. Those values can be strings (it turns out all my data types here are strings), but they can also be integers. At the bottom here we have a list of email addresses. And documents can contain other documents; they can be nested that way. That's very nice when you're prototyping something, because you don't have to go in and say ALTER TABLE ADD COLUMN or anything like that — you just throw a new field into the JSON that you're handing to the server. Now, Cassandra started off with this kind of schemaless design, but what we found is that as applications evolve, you typically start with one kind of customer of the database — one code base interacting with it. As the application grows, you start adding on different parts, often authored by different teams than the original one: you might add some Python to go with your Ruby, add in some analytics with maybe Hive, add in a Java component, and the database is shared between all of these. With the schemaless approach, all of those teams besides the original one have to go back to that first code base to figure out what it's doing with the database. So what I would claim is that having a schema that lets you ask things like "what does this users table look like?" is actually a good thing. That's the direction we've moved with Cassandra: towards allowing and encouraging you to tell the database what you're actually putting in it, so that it can be the source of truth for everyone who touches that data. We've also found that SQL — as much as it gets maligned, and some people came to NoSQL because they don't like SQL, though that's not me — is actually very good at what it's supposed to do, which is giving you a domain-specific language for defining and accessing data. So what we've done with Cassandra is create a subset of SQL. All of the statements on this slide are valid CQL, the Cassandra Query Language: CREATE TABLE, CREATE INDEX, SELECT * with some WHERE predicates — all valid CQL. Now, an interesting mismatch between SQL and what we're doing with Cassandra and other NoSQL databases is that SQL is very much organized around thinking of your data as normalized tables. So if I wanted to create a relation, in the relational world, between users and email addresses — I want to allow users to have multiple email addresses — I would do something like this, where I have a table of email addresses.
It has a many-to-one relationship with my users table, and when I access it, I say SELECT with a natural join — or a left outer join, something like that — to get those email addresses out. This is a non-starter in the NoSQL world. Joins are a dirty word for us, because we're targeting a space where we're scaling across multiple machines, and we don't want to have to fetch data from machines across the cluster to join them together — that would kill our performance. So what we do instead is denormalize. In the document model I showed you earlier, the denormalization looks like this: in the red here, I have email addresses as a list, and I can just put that into the document. With Cassandra, we're taking a more schema-oriented approach, so we have this syntax for CREATE TABLE — in the red again, email addresses is a set of text. We're telling the database that this is what the column is going to hold, and then I can write an UPDATE statement like the one below, which says: make the email addresses column the union of what's there with this new set that I'm giving you. So the Cassandra Query Language does away with joins, as well as subqueries. And then — maybe a little less intuitively — we also don't give you aggregation or anything of that nature, because what we're saying is that Cassandra is strictly about being the system of record for serving your application. If you want to analyze the data later and ask something like: who are my top 10 users in contributing new content?
You go ahead and do that offline with our Hive integration, which lets you run Hadoop MapReduce jobs against Cassandra and come back with that data. We're not encouraging you to run that kind of analytical query against your live database, and I'll show you how we split that up towards the end of my presentation. The last interesting thing here is how we deal with ORDER BY. It's kind of like what Henry Ford said about his cars: you can have any ordering you want in Cassandra, as long as you pick it up front. Let me give you an example. In the relational world, here is what a toy Twitter application might look like: I have users, users make tweets, and users follow other users. The main query I'm concerned with in this Twitter application is: what tweets have my friends made? So here we're looking up the tweets for the username driftx and asking what tweets his friends have made. First we need to look up, in the red, who his friends are, and then for each of those friends we look up, in the blue, what tweets they've made — so we're doing a join through a subquery. I've already said we don't do joins in Cassandra, so the way we'd do this in Cassandra is to create a separate table, in which I'll put the tweets of every user's friends. So I've got CREATE TABLE timeline, where user_id is the user whose friends' tweets we're asking about — driftx in this example — and then the next fields — the tweet ID, the tweet author, and the tweet body — are the tweet data that we're going to denormalize into this new table.
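As a toy sketch of this denormalized timeline, here's a hypothetical in-memory model in Python — not Cassandra storage, just modeling each partition as a list kept sorted by tweet ID, with a fan-out write per follower (the `followers` data and user names are made up for illustration):

```python
import bisect
from collections import defaultdict

# Each partition is keyed by user_id and kept sorted by tweet_id,
# mimicking Cassandra's clustering order within a partition.
timelines = defaultdict(list)               # user_id -> [(tweet_id, author, body), ...]
followers = {"alice": ["driftx", "bob"]}    # author -> users following that author

def post_tweet(author, tweet_id, body):
    """Fan-out on write: one timeline row per follower, inserted in sorted order."""
    for user_id in followers.get(author, []):
        bisect.insort(timelines[user_id], (tweet_id, author, body))

def read_timeline(user_id, descending=False, limit=10):
    """Reading a partition is a contiguous slice -- no sorting at query time."""
    rows = timelines[user_id]
    return list(reversed(rows))[:limit] if descending else rows[:limit]

post_tweet("alice", 1, "first!")
post_tweet("alice", 3, "third")
post_tweet("alice", 2, "second")
print(read_timeline("driftx"))                    # ascending by tweet_id
print(read_timeline("driftx", descending=True))   # reverse clustering order
```

The point of paying the insert cost up front is that reads become cheap slices of already-ordered data, which is the trade Cassandra's clustering columns make.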
So every time someone makes a tweet, we're not just going to insert one row into the tweets table — we're going to insert potentially dozens or hundreds of rows into the timeline table, one for each person following him. The interesting part is that I've created a compound primary key at the bottom of this table, on user_id and tweet_id. When you give Cassandra a compound primary key, it breaks it up: the first part, the user_id, becomes the partition key — that determines which servers the row gets replicated to. The remaining parts of the primary key — the tweet_id here — cluster the data within that partition; that is, the data is sorted, or ordered, by tweet_id. That's what I'm trying to indicate with the orange arrow: each of these orange groups is a partition — they have the same user_id — and the arrow denotes that the rows within each partition are sorted by tweet_id. So when I run the query at the bottom — SELECT * FROM timeline WHERE user_id = 'driftx' — I can say ORDER BY tweet_id if I want (it's redundant, since Cassandra does that automatically), and I can also say ORDER BY tweet_id DESC, because Cassandra can scan efficiently either forwards or backwards, ascending or descending. But I can't say ORDER BY author.
Cassandra will say no, I can't do that — it would have to sort potentially thousands or hundreds of thousands of rows at runtime, and it might not be able to do that performantly. If you want another ordering, you should denormalize into another table ordered that way. The next natural question in our progression is: how do these different NoSQL solutions perform? One way to think about that is, well, okay, I'll run a benchmark. A group of academics at the University of Ottawa in Canada presented a paper at the Very Large Databases conference this year. They ran about half a dozen different workloads and measured how performance looks as you scale the cluster. This is just one of the workloads — Cassandra did very well — but we actually want to look a little deeper than a graph like this, because of what you have to do to create a graph like this. They're comparing Cassandra and HBase, which are tabular databases, with Redis, which is sort of a key-value database, and MySQL, which is obviously a relational database. To make a graph like that, they used a benchmark suite called YCSB, the Yahoo! Cloud Serving Benchmark, which takes a lowest-common-denominator approach: we're just going to give everyone key-value pairs to read and write. It's a very limited kind of test when you're comparing systems as different as these. Which is why, just looking at a graph — I like this graph, because Cassandra is winning — you want to look deeper and understand: as I move beyond keys and values and try to use the database the way it's meant to be used, what kind of performance can I expect? What is it good at? What is it not so good at?
One of the most important aspects of a scalable database is its approach to locking. As you start scaling across more and more machines, with more and more clients hitting the database, what is the approach to locking and concurrency? Cassandra's approach is to use what are called persistent collections. Rather than locking rows explicitly for reads and writes, we use persistent collections, which means the copy of a row that I know about is immutable; when I go to change it, I clone that row and make the changes in the clone, which then becomes the new row of record once my changes are complete. What this diagram shows is how these persistent data structures work: cloning a row is not an expensive operation, because the new copy actually shares most of its information with the old one. It's basically a copy-on-write design. We're using a library called SnapTree — SnapTreeMap is the specific implementation — which is very useful because not only does it give us this copy-on-write performance, it also preserves ordering within a partition, which is one of the things we want to give users: within your partition key, everything is clustered on the rest of your primary key. Another important part of performance is efficiency, and this is what I mean by efficiency: if I say "update this user, append this new email address," do I rewrite the entire user record — which a lot of NoSQL databases do — or do I just append the new data? Obviously that affects how many operations per second of this type I can do, if I'm having to rewrite everything that was in that row or document. Another important question is related to performance.
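The copy-on-write idea just described can be sketched in a few lines of hypothetical Python — a toy model, not the SnapTree implementation; the user and email values are made up for illustration:

```python
# A "row" is treated as immutable: updates build a new version instead of
# locking and mutating in place, so concurrent readers keep a consistent
# snapshot, and an email append is just "old row + delta".

def append_emails(row, new_emails):
    """Return a new row whose email set is the union; the old row is untouched."""
    updated = dict(row)                              # cheap shallow clone
    updated["emails"] = row["emails"] | set(new_emails)
    return updated

v1 = {"user_id": "jbellis", "emails": {"jbellis@example.com"}}
v2 = append_emails(v1, ["jbellis@datastax.example"])

print(v1["emails"])   # snapshot readers still see the old version
print(v2["emails"])   # the new version becomes the row of record
```

No reader of `v1` ever needs a lock, because nothing it points to is ever modified — the writer just swaps in `v2` when it's done.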
It's not about performance specifically, but it impacts your performance: durability. This is the same durability as in ACID, in relational databases. If I give you my data and say, "write this data," and you tell me, "yes, I have written your data," and then you lose power — is my data still there? That's durability. To answer these questions for Cassandra: Cassandra uses what's called a log-structured storage engine. When a write comes in, in the upper left, the first thing we do is append it to the commit log — that's the thing in the middle left here. I just append, append, append to that commit log. It's a similar design to what Oracle, PostgreSQL, and so on all use; they might call it a transaction log or a write-ahead log, but it's the same thing. That's what gives Cassandra durability: we always put your data in the commit log first, so that it's on disk. The next thing we do is put it in this structure — you might not be able to see the gray-on-gray — called a memtable, in the upper part of the screen. As writes come in, I batch them up in the memtable. I'm not doing any random I/O: I might be changing data that already exists somewhere, but I'm just appending to the commit log and writing to the memtable — no random I/O on disk. When the memtable gets full, I write it out to disk, in sorted order, at which point it becomes what we call an SSTable file on disk and can be queried there, whereas before we were just querying memory. The importance of this is that we're never doing random writes, which turns out to be the right thing to do both for rotational disks and for solid-state disks. For rotational disks, it's the right thing because seeks are expensive. On solid-state disks, random writes aren't so much expensive for performance as they're expensive for your disk's lifespan. There's a concept called write amplification with solid-state disks; I don't have time to go into it today, but if you Google "Cassandra solid-state disks" there's a presentation by a fellow named Rick Branson where he goes into the gory details of how write amplification affects SSDs, and why that makes log-structured storage engines a good fit. The last thing I wanted to mention with respect to performance is that it's important to know how a database deals with larger-than-memory data sets. This is a graph from a company called Urban Airship, where they were using a non-Cassandra NoSQL database. I've blacked out the name of the database, because I don't really want to call anyone out by name here, but I do want to point out what happened: when their data set got larger than RAM, performance went through the floor. So you want to be careful about this, and ask: how does this database deal with data sets larger than memory? Because if your data set fits in memory, you might as well be using Oracle; you might as well be using MySQL, right? The interesting part about NoSQL, to me, is not getting rid of the relational data model — it's going beyond the relational data model to scale. That's the important part. If we don't need to scale, who cares? If you do need to scale, then you need to know how the database deals with this situation. Another thing you need to know is: how do you handle failure? Because once we've spread our data across multiple machines in a cluster, it's not as simple anymore as "when my database is down, I can't get to my data," right?
Hopefully we're in a situation where, if I lose one machine, the rest of the cluster can deal with that and I still have my data. That's what we want. There are a couple of ways to scale out and design for this. The classic model — the one you'd probably end up with if you started with MySQL and then needed to shard across multiple MySQL machines — looks like this: here I have four partitions, or shards, each of which has a light blue machine, which is my master, and the master replicates to the dark blue machines, which are just read-only replicas. They don't handle writes; they're basically there as backups. In this situation, if the master node goes down, I need a new master election: I promote one of the replicas to be master and carry on from there. There are a couple of problems with this — philosophically, I think it's the wrong approach. First of all, while you're doing that master election and failing over to a new replica, you have unavailability: you can't handle new writes during the failover. Maybe your failover is super quick — that's hopefully what you're shooting for — but you still have a small amount of downtime. The second problem is what Rick's referring to in this second quote here: in a master–slave design, failover is a rare occurrence. (If it's not rare, you have lots of downtime during failover.) But precisely because it's rare, it's relatively untested, and there are lots of corner cases to deal with. For instance: what if my light blue master isn't really down, but lost network connectivity? It still thinks it's the master. So I fail over to a new master, the first master regains connectivity — and now I have two masters. Not a good situation.
Dealing with that kind of corner case is where the pain comes in with this design. Google App Engine — where they run a data store as a service as part of App Engine — has had day-long outages, twice, because of master failover in this kind of scenario. So even Google, with six years more experience at this than most of us, still struggles with it, and I think that's inherent to the design. So what does Cassandra do instead, now that I've explained why I think this master design is bad? We have a fully distributed system, by which I mean every node is a master, in the sense that it can always accept writes for any replica and forward the data where it needs to go. In this diagram, the client is talking to a node in the upper left about data that lives on the other three nodes. The machine the client is talking to does not own the data — which is a good thing, because if that node fails, I can reconnect to any other node in the cluster, and it will also be able to route my request to the right place. One of the things this gives me, by having every node be a master in this sense, is that I can have clusters spanning multiple data centers, and clients in each data center can talk to a local Cassandra node, which will forward data wherever it needs to go in the cluster. What I've tried to show with this diagram, by the way, is two data centers on-premise and two data centers in the cloud. We actually have customers with this exact scenario, spread between their own machines and Amazon's machines, with Cassandra connecting it all together in a single cluster. We do this efficiently, too: if I'm forwarding a write to another data center, and that data center has three copies of the data, I'll just send one copy over the WAN link and have that replica forward it to the others locally. Cassandra also takes care of healing itself, in the situation I mentioned earlier where one of my replicas goes down and starts to miss updates. In a completely healthy system, the request life cycle goes something like this: the client makes a request to the coordinator node, which sends the data to the replica node, which responds to the coordinator that everything went fine, which then sends the response to the client. When things don't go right — if the replica fails — the coordinator has to reply to the client: hey, something went wrong; the replica didn't acknowledge the write; I waited for a while and it never got back to me, so I'm going to have to tell you that it timed out. But before I tell the client it timed out, I store a hint. A hint means I store a copy of the update the client gave me, and when that replica regains health or connectivity, I forward it the update it missed. That's one way we deal with self-healing in Cassandra. We also have a more heavyweight repair, called anti-entropy repair. What that's useful for: in our diagram here, the coordinator has hints on it — but what happens if the coordinator dies too, before the replica comes back up, so it was never able to deliver those hints? Then we do what we call anti-entropy repair with one of the other replicas. It's basically rsync for databases: we go through and say, what data do you already have? Here is the data you don't have. Maybe more interesting is: how do you deal with partial failures?
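The hinted-handoff cycle described above can be sketched as a toy model — hypothetical Python; the classes and names are illustrative, not Cassandra's actual code:

```python
# If a replica misses a write, the coordinator stores the update as a
# "hint" and replays it once the replica regains health.

class Replica:
    def __init__(self, name, up=True):
        self.name, self.up, self.data = name, up, {}

    def write(self, key, value):
        if not self.up:
            raise ConnectionError(self.name + " is down")
        self.data[key] = value

class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas
        self.hints = []                      # (replica, key, value) to replay later

    def write(self, key, value):
        for r in self.replicas:
            try:
                r.write(key, value)
            except ConnectionError:
                self.hints.append((r, key, value))   # store a hint instead

    def deliver_hints(self):
        """Called when replicas regain health: replay the writes they missed."""
        remaining = []
        for r, key, value in self.hints:
            try:
                r.write(key, value)
            except ConnectionError:
                remaining.append((r, key, value))    # still down; keep the hint
        self.hints = remaining

a, b = Replica("a"), Replica("b", up=False)
coord = Coordinator([a, b])
coord.write("k", "v")          # b misses the write; a hint is stored
b.up = True
coord.deliver_hints()          # b catches up from the stored hint
```

The real system layers timeouts, acknowledgment counting, and hint expiry on top of this, but the core idea is the same: the coordinator remembers what a sick replica missed so the cluster converges without operator intervention.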
Here's a situation where I have three replicas of the data, and the node in the upper right is 90% busy, so it's going to be slower responding to my requests. What Cassandra does is track how each replica is performing and route requests to the replica that's least busy. Often what you'll see happen with a disk failure is that the disk won't just die — it gets really slow first. So you want to handle the scenario where a node is nominally alive but slowed down — it's sick — and route requests away from it. Cassandra does that as part of its normal operation, so no matter what it is that's making a node slow, we deal with the situation naturally. So, I've talked a bit and waved my hands about Cassandra's architecture and why that makes it good, but the proof is in what the boots on the ground see — what people running Cassandra actually experience. I like Bill's quote at the top here, that Cassandra is kind of indestructible. That's really the design we're shooting for: no single points of failure; you lose machines and it keeps on going. The middle quote here is about an Amazon outage — you may have noticed, if you're deployed across Amazon's American data centers, that US-East keeps going down. Netflix uses Cassandra extensively, and when US-East went down: no problem, the other data centers kept going. Related to this designing for failure is the scalability question: how does it scale? What's the design for scale? Coming back to the slide I showed you earlier: as the number of nodes grows, you want the performance to keep growing, and ideally you want it to keep growing in a straight line. Now, 12 nodes isn't a whole lot — that's as far as the VLDB benchmark went — so Netflix actually ran a benchmark scaling up to 288 nodes. At that point they were doing about 1.1 million writes per second, and again the line is nice and straight, which is what we want to see. Interestingly, they were keeping three replicas of each of those writes, so the cluster was actually doing 3.3 million physical writes per second to handle this workload. I'm not going to spend too much time on scaling, because if you have a fully distributed system, scaling is mostly a matter of not doing the wrong things. You don't want a metadata server that clients have to connect to to find out where their data is; you don't want router nodes that become a bottleneck. What about when you're adding capacity? That third one is actually tough: when you add capacity to your cluster, we have to move data to the new nodes. How do we do that while creating the least impact possible on the existing nodes? This is something we worked on for a while with Cassandra; it took us a while to get right. The last thing I wanted to talk about is flexibility. There are a lot of ways to look at flexibility, but one of the ways we tackle it with Cassandra — and at DataStax specifically — is that we allow you to do what I was talking about earlier and run analytical queries against your Cassandra data, in a single cluster. The way we do that, in this diagram: my blue nodes are my live, more or less real-time application nodes, and we keep a copy of the data on the green nodes — you may not be able to read it, but it says "analytics" (for Hadoop) on those green nodes. Those have their own copy of all my data.
I don't need multiple copies there — I'll just put a single copy on those analytics nodes. They'll probably be machines with more disk than the blue nodes, because all I'm going to be doing against them is sequential scans for my analysis. As an example, we ship a demo application called the Portfolio Manager demo. Each of these boxes represents a user's stock portfolio: he has a certain allocation of each stock, and we know how much the portfolio is worth, and its gain or loss for the day. But what we don't know — because it's something we can't compute without doing a bunch of joins and subqueries — is the largest historical 10-day loss. So what we do is use the Hadoop integration to calculate the largest historical 10-day loss and replicate it back to the real-time side, for presentation in this manager view. The way we do this: we have three tables in the application. There's a live stocks table, which says: what is the price of a stock at this exact point in time, right now? There's the portfolios table, which says: this is how many shares user X owns of stock Y. And there's a historical table of each stock's closing price on prior dates. What we want to compute is the historical 10-day loss — for every portfolio, what was its largest 10-day loss, and when was it? We compute that in several steps. First, we compute, for each stock, its 10-day return on each closing date, and we do that with this Hive query, which you can run against Cassandra — so it's a dialect of SQL.
We're doing a self-join from the stock history table to itself, across 10 days' worth of data, and summing that up. The next step in the computation is, for each portfolio, its 10-day return on each date: here we join the 10-day returns we just computed with the portfolios and query that. The final step is to take that portfolio-returns table and join it together to get a single row for each portfolio with its largest 10-day loss. So this is what we bring to Cassandra at DataStax: integrating the real-time, system-of-record side with the Hadoop analytical side, along with management tools. And — I don't have time to cover it today — we also integrate with Solr, to give you full-text search against your Cassandra data. Today, my best guess is that there are over a thousand Cassandra deployments in production out there; these are a few of them. We're really seeing a ton of adoption from companies that need to move beyond their MySQL or Oracle installation and make it scale. So I think I have time for one or two questions — he's not glaring at me, so yeah, I'll take one, and I can repeat your question. [Audience: By replicating the data over so many master nodes, and not having any slave nodes, aren't you just moving the memory issue from one side to the other? Now I've got great indexes and can look up my data faster, but I have to process more data on the back end.] So the question is — well, let me ask a slightly different question, the question he meant to ask: does Cassandra replicate all your data to each node?
And the answer is no. You tell Cassandra how many copies of your data you want in each data center. I didn't make this clear, and that was my fault: I said that each node is a master, but by that I don't mean that all of the data is replicated to it — I just mean that it's able to handle and route queries for any data and send them to the right nodes. Frequently people have clusters of dozens or hundreds of nodes, and they'll keep three copies of their data — or sometimes three copies in each of two data centers, for a total of six. That's the more common way to do it. One more question? [Audience: I didn't understand how you avoid random I/O — could you explain that again, please?] Catch me outside — I don't really have time right now, but the basic idea is that we group your updates in memory before we write them out as data files. If I have user jbellis with password X, and I change my password to password Y, I don't go to the old row and overwrite it; that new password becomes part of a new data file that I write out sequentially. Then, in the background, I take the older data files and merge them together — also sequentially, which I can do because I've written them out in sorted order. That background process is called compaction, and that's what lets me avoid doing random writes. And if that doesn't make sense, catch me outside.
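The write path and compaction process from that last answer can be sketched as a toy model — hypothetical Python, not Cassandra's storage engine:

```python
# Log-structured storage in miniature: sequential appends only, with a
# background merge (compaction) instead of in-place overwrites.

commit_log = []        # sequential append for durability
memtable = {}          # in-memory buffer, written out sorted
sstables = []          # each flush produces an immutable sorted run

def write(key, value):
    commit_log.append((key, value))   # append-only: no random I/O
    memtable[key] = value

def flush():
    """Write the memtable out as a sorted, immutable SSTable."""
    sstables.append(sorted(memtable.items()))
    memtable.clear()

def compact():
    """Merge the sorted runs sequentially; newer files win for a given key."""
    merged = {}
    for table in sstables:            # oldest first, so later writes overwrite
        for key, value in table:
            merged[key] = value
    sstables[:] = [sorted(merged.items())]

write("jbellis", "password_x")
flush()
write("jbellis", "password_y")        # new value; the old row is never touched
flush()
compact()
print(sstables)   # one sorted run; the obsolete value is gone
```

The old `password_x` is never overwritten in place — it simply becomes garbage that compaction drops when it merges the sorted files, which is exactly why every disk operation here can stay sequential.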