All right, let's get started. I'm going to talk about the top five questions you should be asking as you're evaluating big data technologies. The first one is: how do I model my application? This is going to directly affect your developers and even your ops team as your application ages and adds new features. Some of the popular options are a key-value store, the most basic; tabular data; graph databases, sort of. I don't know of any graph databases that are native big data platforms, but there are graph databases like Titan, which is built on Cassandra, that you could legitimately say provide a graph interface to big data. And the last one is a document data model. Now, I came to Cassandra as an outsider. I wasn't part of the original team that created Cassandra; I evaluated scalable databases for Rackspace's needs and settled on Cassandra. And one of the things that I noticed as part of my evaluation is that schema is a good thing. Document databases and some others eschew schema and say you can put anything in a document that you want. This could be one user record; the next user record might have an integer for its user ID instead of a UUID field. You can have different fields across different record types. That flexibility can be useful as you're prototyping and evolving your application. But what you see as applications get successful and get further along in their lifecycle is that multiple code bases start to touch that data. The data is like the common room of the dormitory: everyone meets together at the database. So you might start off with a Ruby on Rails app that's creating records like this, but then someone wants to write a cron job in Python, and he needs to know what kind of data is in your documents.
There's no way for him to tell, other than maybe grabbing a sampling of those documents and looking for common patterns, or diving through your Ruby source code. So having a formal schema that says, this is the kind of data that's in these tables and these columns, is a good thing. It helps your development as your application gets to that stage of having multiple languages and multiple code bases interacting with it. With Cassandra, we take the approach of a mostly tabular data model, and I'll get to the "mostly" part in just a second. This is legal Cassandra data definition: we have a language called CQL, the Cassandra Query Language, that gives you a subset of SQL appropriate for doing data modeling and data access with Cassandra. Now, one thing that is good about document databases, and that we've wanted to adopt in Cassandra, is that they make it easy to have collections of different data types. That maps very naturally to programming languages, where I would model my user record as having a list or a set of email addresses. In the relational world, if I wanted to say the user records in this table have multiple email addresses, I would do it with a many-to-one relationship between the email addresses and the users, and then I would query that by doing joins. Well, Cassandra does not support joins, because they're a great way to kill your performance in a distributed system. But what we do support is creating collections of data in a column. So in red here, the email addresses column is declared to be a set of text, and this is how you would update it in Cassandra: you say, take the union of this new set I'm giving you and add it to the existing set of email addresses.
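As a sketch of what that looks like in CQL (the table and column names here are illustrative, not necessarily the exact ones from the slides):

```sql
-- A users table with a multi-valued column, declared as a set of text
CREATE TABLE users (
    user_id         uuid PRIMARY KEY,
    name            text,
    email_addresses set<text>
);

-- Adding an address takes the union with the existing set;
-- only the new element is written, nothing is read back or rewritten
UPDATE users
SET email_addresses = email_addresses + {'alice@example.com'}
WHERE user_id = 550e8400-e29b-41d4-a716-446655440000;
```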
So in a sense, we've taken the best of the tabular, formal-schema world and the best of the document world, and we give you both. I said that we don't do joins; there are a lot of reasons for that. We also don't do GROUP BY, and we don't do subqueries. And there's an interesting limitation on ORDER BY. Our goal here is to give you your data in the order that's appropriate for your application, but without having to sort millions of rows at runtime to do it; that would kill your performance. So what we have you do is declare your ordering, or clustering, of your data ahead of time, and then we preserve that for you as you update the data in Cassandra. What I'm going to show you is how we would do a query like this in Cassandra. What I'm envisioning here is modeling tweet data, where I have users who follow other users, and users create tweets. What I want to do is find the tweets that my friends have made. The second user ID, in red here, is the user whose friends we're trying to find, and the blue user ID is the friend who made the tweets. So we're doing a subquery here. The query plan would look something like the diagram at the bottom, where we grab out those red users, which are the friends, and then for each of those users we go and find the tweets they've made. You can see how this quickly becomes bad for performance: look at all the random I/O I'm having to do. I didn't even draw all of the blue lines for all of the red ones; I just drew them for one. So we do lots of random I/O for this kind of query.
So what we do in Cassandra, instead of doing that subquery, which could also be expressed as a join, is denormalize the tweets that each user's friends have made into a table called timeline. Here's how I would declare that. I have the user ID, which is the user whose friends I want to query, and then for each row I also denormalize the tweets those friends have made: the tweet ID, the tweet author, which is another user ID, and the tweet body. Then I declare a compound primary key on the user ID and the tweet ID. What that tells Cassandra is that the first part of the primary key is my partition key. So what I have over here is, for each user, a partition, in orange, of the tweets his friends have made. And Cassandra will sort each of those partitions on the rest of the primary key, the tweet ID. So each of these partitions is in sorted order by the tweet ID column. What that means is I can say SELECT FROM timeline WHERE the user ID is this partition, and I get the most recent tweets his friends have made without doing any extra computation at runtime. I just do a sequential read, and I get the data back in the order I want. This is the key concept when data modeling with Cassandra: you give up joins in exchange for denormalizing into sorted partitions. The second question to ask about data stores you're evaluating is: how does it perform? This touches on a lot of things. One thing I wanted to highlight is: does it handle data sets larger than memory? This is a graph from Urban Airship's data store monitoring system, from a talk they gave a few months ago. I've blacked out at the top which database this is, because that's not what I wanted to talk about.
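A sketch of the timeline table and query he describes (the names approximate the slide; the DESC clustering order is one way to get most-recent-first):

```sql
CREATE TABLE timeline (
    user_id  uuid,       -- the user whose timeline this is (partition key)
    tweet_id timeuuid,   -- clustering column; partitions stay sorted on it
    author   uuid,       -- the friend who wrote the tweet
    body     text,
    PRIMARY KEY (user_id, tweet_id)
) WITH CLUSTERING ORDER BY (tweet_id DESC);

-- One sequential read returns the partition already sorted,
-- newest tweets first, with no sorting at query time
SELECT author, body
FROM timeline
WHERE user_id = 550e8400-e29b-41d4-a716-446655440000
LIMIT 20;
```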
I just wanted to point out that with this storage engine, when the data set exceeded RAM, the performance went through the floor. Of course, if your hot data set exceeds RAM, you're going to be doing more random I/O and your performance is going to be worse; that's normal, and that's not what this is showing. This is saying that when the total data set size exceeded RAM, performance went through the floor. So that's something to be aware of: does this product really support data sets larger than memory? There are some that explicitly say that's not their goal, like Redis. They say: we're an in-memory storage system, we don't care about data sets larger than memory, take it or leave it. That's fine; it has its place. But you want to be cautious of systems that ostensibly support data sets larger than memory but then give you this when you actually push past that limit. Another thing that's important to know is how your database deals with locking, with multiple threads and multiple requests coming in simultaneously. In the relational database world, the gold standard is row-level locking. You also have MVCC in Oracle and PostgreSQL, which accomplishes the same thing in a different way. In the NoSQL world, you have to be a little more careful, because a lot of products do the equivalent of table-level locking, and that can cause a lot of contention when you're throwing a mixed workload at your system. I'll talk a little bit about how in Cassandra's storage engine we don't even lock on writes: we have a different way of reconciling conflicts on writes, so we don't need to lock, and we use lock-free concurrency algorithms to give you isolation between the reads and the writes. Another thing you want to know about is: what's the efficiency story of my database?
Where I'm going with this is that a lot of document databases, in fact all of the ones that I know about, when you update one field in a document, actually have to pull the entire document back into memory, update that field, and then rewrite the whole thing back out. That's highly inefficient. Again, I'll show you how this works in the context of Cassandra. With Cassandra, even with the collections I showed you, like the email addresses set, when I add new elements to that set, I'm only writing out the new elements; I don't have to rewrite any other part of your data. The last thing that's important to talk about in terms of performance and the storage engine is durability. What's the story: when I send you new data and you say, okay, that's been written, and then there's a power failure, is that data still there when you come back up? That's another important thing to keep in mind. The way Cassandra addresses these aspects is with a log-structured storage engine. What I'm going to show you is, here in the upper left, data coming in, with the primary key in light green and the data in that row in a darker color. Cassandra first appends that update to a commit log, and then puts it in an in-memory structure called the memtable. It's important to note that this is what happens for any part of a row that gets updated: if I update a single column, there will be a tuple of the primary key and the single column being updated, or it could be an entire row at once. It's flexible that way. So we add those to the commit log and the memtable, and we keep appending more data to the memtable. Here we've updated another column in the same row, so there's a separate entry in the commit log; the commit log is append-only.
In the memtable, though, we've combined those entries, because they're part of the same row: we just merge them into a larger entry. We add some data from other rows and some more data to the first row, and then when the memtable gets full, we flush it to disk. So we're never going to do random I/O against data that's already been written; we never update in place. And as a corollary, we never have to go read what the old value was for a column that you updated: we just put the new value into memory, and eventually it gets turned into a data file. Then, when we do reads, we examine the existing data files and keep the most recent version of every column that you're requesting. And in the background we do something called compaction, where we merge the data files and throw out obsolete data. Throughout all this, we're only doing sequential I/O on writes. That's why we don't need to do locking on writes: having an old version from several minutes ago is the same case as having an old version from a microsecond ago, and the same code path handles both. Because we're not doing any random writes, it turns out that Cassandra is extremely well suited for modern solid-state disks, because you don't have to worry about the write amplification that happens when you do small random I/O against SSDs. There's an entire presentation about this that Rick Branson gave; you can Google "Cassandra solid state drives" and it'll come up. That's as much detail as I want to go into today. One thing I did want to bring out here is that because we're focused on doing sequential I/O on writes, good write performance came naturally to us. Good read performance took a little work.
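To make that column-level merge concrete, here's a minimal CQL sketch (hypothetical table): two independent updates to the same row each append only the column they touch, and a read reconciles the newest version of each column, whether the fragments sit in the memtable or in different data files.

```sql
CREATE TABLE profiles (
    user_id uuid PRIMARY KEY,
    name    text,
    city    text
);

-- Each update appends only the column it names; neither one
-- reads the old row or rewrites the other column
UPDATE profiles SET name = 'Alice'  WHERE user_id = 550e8400-e29b-41d4-a716-446655440000;
UPDATE profiles SET city = 'Austin' WHERE user_id = 550e8400-e29b-41d4-a716-446655440000;

-- The read merges the most recent version of each requested column
SELECT name, city FROM profiles
WHERE user_id = 550e8400-e29b-41d4-a716-446655440000;
```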
So early on, in the dark ages of Cassandra, the 0.6 version a couple of years ago, it did have a well-deserved reputation for being slower on reads than on writes: in Cassandra 0.6, read throughput was only about 30% to 40% of write throughput. With Cassandra 1.0, which we released last year, we knew we had to narrow that gap. For balanced workloads, people need fantastic write performance and read performance. So we put a lot of effort into narrowing it, and even though we increased write performance another 30% over 0.6, we increased read performance 300% and really caught up. The third question to ask about big data solutions: how does it handle failure? When you have a 3% chance of a hard drive failing per year, you can get by with RAID and backup servers, and you cross your fingers that you don't lose another disk while you're healing the first one. But when you're talking about terabytes and petabytes spread across clusters of dozens or hundreds of machines, you need a more systematic, thought-out solution. The classic approach to scaling out is to say: I'm going to partition my data across different replica sets. Each of those replica sets has a master, in light blue here, and there's some kind of router that tells the clients which partition each row they want is in. The master replicates incoming updates to the slave nodes, in darker blue, and if the master dies, there's some kind of failover to promote a slave. Well, there are a couple of problems with this. One is that, fundamentally, if the failover process involves replaying some kind of write-ahead log, then there's a gap during which the data in that partition is unavailable. You can't get to it, because the nodes are busy deciding who the new master is going to be.
The other problem is that failover should inherently make you nervous, because it's not something that happens all the time. It only happens when something has already gone wrong, so it tends to be less well tested. You have corner cases, like: what happens if the master was part of a network partition? It was still alive, but it couldn't talk to the rest of the cluster, so you failed over and brought up a new master. Now the network gets fixed, the original master can talk to the cluster again, and you have two nodes that think they're masters. There are complex corner cases there that can bite you. Even Google, who has been doing this kind of system longer than just about anyone else, has had two different day-long outages of Google App Engine where different corner cases of master failover bit them. A better approach is to design a fully distributed system, and this is what Cassandra does: all the nodes in your cluster are equal, and there are no masters. Here I'm writing data to partition number one, but the client doesn't have to talk to any of the nodes that actually store partition number one. It can talk to any node in the cluster, which will know how to route the data to that partition, and if the node the client is talking to fails, it can connect to any other node in the cluster without waiting for any kind of failover. So it's a very robust system. This extends to multiple data centers as well. Once you have a fully distributed system, it's much easier to take it into a multiple data center world, where clients in this data center down here can do local writes that get propagated asynchronously to the rest of the cluster. Whereas if you have a master-slave design, everyone has to talk to the master; if you're 200 milliseconds away in another data center, your update is going to take an extra 400-millisecond round trip to complete. So this is a much more powerful design.
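Configuring replication across data centers like this comes down to the keyspace's replication settings; a sketch (the keyspace and data center names here are illustrative, and the actual names come from how you've configured your cluster's topology):

```sql
-- Three replicas of each row in each of two data centers; writes in
-- either data center complete locally and propagate asynchronously
CREATE KEYSPACE app_data
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'cloud_dc': 3,
    'onprem_dc': 3
};
```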
What I've tried to call out here, by the way, is that up at the top, the clouds represent things like Amazon EC2 and Rackspace Cloud, and down at the bottom we have on-premise data centers. Cassandra is totally comfortable spanning that gap, and this is an actual design that Cassandra users have deployed in production. By the way, when you're doing writes across multiple data centers, Cassandra is efficient about your WAN link; you're going to have less bandwidth on that link. So if I want three replicas of the row I'm updating in this data center, Cassandra will send one copy of that row to one of those replicas and then tell that replica to forward it to the others. We want to be efficient with your cross-data-center link. And it's very easy to configure Cassandra to do this: all you need to do is tell Cassandra which nodes are in which data centers. You can also tell it which nodes are in which racks, and it will take it a step further, because what you want to avoid is a rack failure taking out multiple copies of your data: cooler failures, network failures, and power failures can all affect an entire rack at once. Cassandra is smart about placing your replicas across multiple racks and avoiding those correlated failures. Fourth question: how does it scale? This is closely related to how it handles failure, because if you have a design that handles failure well, it will probably scale well too. Scaling well is more about avoiding anti-patterns than doing anything particularly right: if you put more machines in your cluster, they ought to be able to do more work, unless you screw it up somehow. That's how I see the scaling problem. One way you can screw it up is to have a central metadata server, where you have to go talk to that metadata server to know, for instance, where all the replicas are for a row.
Another way, in my original diagram about sharding across multiple partitions manually, is that the routers can be a bottleneck. What you want is to be able to just add machines to your cluster, without any special roles that need special care and handling. The more peer-to-peer, the more uniform your cluster is, the better that's going to work. That works extremely well in Cassandra: Netflix did some testing on this late last year, where they took a cluster from about 50 nodes to 288 and got a nice flat line of scaling, which is exactly what we want to see. The last question to ask is: how flexible is your database? What I mean by this is that you can have hyper-specialized databases, and you do see some of that in the NoSQL world, where you might have a database that is extremely well tuned for time series data and doesn't really do anything else. What we've tried to do with Cassandra, and largely succeeded, is make it a general-purpose tool that can solve multiple problems. We can even take it out of the NoSQL real-time, small-query world and start doing more analytical work with it, which is what we've done at DataStax with DataStax Enterprise: we've integrated Hadoop with Cassandra so that you can get the best of both worlds. To illustrate that, I want to go really quickly through a demo of a portfolio manager. These different pie charts are how much of a given stock I have in my portfolio. What we want to do is compute, for each portfolio, the user's largest historical 10-day loss with that mix of stocks. On the real-time side, we're handling live stocks: what's the current price of each stock in the market. Then I have a bunch of portfolios, where for each user and stock he owns, I record how many shares he has. And then I have a historical table that says, for each stock, what its closing price was on every day.
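The three tables just described might be declared roughly like this in CQL (hypothetical names; the actual demo schema may differ):

```sql
-- Real-time side: the current market price of each stock
CREATE TABLE stocks (
    symbol text PRIMARY KEY,
    price  decimal
);

-- Each portfolio's holdings: one partition per portfolio,
-- one row per stock it contains
CREATE TABLE portfolios (
    portfolio_id uuid,
    symbol       text,
    shares       int,
    PRIMARY KEY (portfolio_id, symbol)
);

-- Historical side: the closing price of each stock on every day
CREATE TABLE historical_prices (
    symbol     text,
    close_date timestamp,
    price      decimal,
    PRIMARY KEY (symbol, close_date)
);
```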
So what I want to compute is the historical loss for every portfolio: when your worst 10-day loss was, and how much you would have lost with that mix of stocks. I'm going to help you manage your risk that way. What we can do is use a Hadoop tool called Hive against this data in DataStax Enterprise, in a couple of intermediate steps. First, we compute the 10-day return for every stock ending on a given date, and you can see we're using full SQL and doing joins here; that's totally fine. In the next step, we compute the 10-day return for each portfolio, and here we're doing aggregation with GROUP BY, again fully supported. Finally, we bring that together and pick the worst of those portfolio return rows; we do a subquery, another GROUP BY, and another join to compute that. So we've gone beyond the strictly real-time side that Cassandra gives you out of the box, and we've married that with being able to do analytics against your data, without having to dump it into a separate system and without having to manage HDFS or anything like that. What we do in DataStax Enterprise is give you the Cassandra real-time side; we give you the Hive and Hadoop analytical side; we also give you, in the upper left, enterprise search through Solr; and then we give you management tools that help you make sense of all this and know when to add more capacity, and so forth. We've written all of this on top of Cassandra, so you don't have to manage multiple systems: once you know how to manage Cassandra, your ops team is good to go for the entire thing.
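The three Hive steps described above could look roughly like this in HiveQL; the table and column names are guesses at the demo schema, not the actual DataStax demo code:

```sql
-- Step 1: the 10-day price change for each stock ending on each date
CREATE TABLE stock_returns AS
SELECT a.symbol, a.close_date, (a.price - b.price) AS price_change
FROM historical_prices a
JOIN historical_prices b
  ON  a.symbol = b.symbol
  AND b.close_date = date_sub(a.close_date, 10);

-- Step 2: aggregate that change across each portfolio's holdings
CREATE TABLE portfolio_returns AS
SELECT p.portfolio_id, r.close_date,
       SUM(r.price_change * p.shares) AS ret
FROM portfolios p
JOIN stock_returns r ON p.symbol = r.symbol
GROUP BY p.portfolio_id, r.close_date;

-- Step 3: keep each portfolio's worst 10-day result
SELECT portfolio_id, MIN(ret) AS worst_10day_loss
FROM portfolio_returns
GROUP BY portfolio_id;
```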
Not all Cassandra users use DataStax Enterprise, but there's a healthy amount of overlap. We can't talk about all the DataStax Enterprise customers, but I can show you Cassandra users: there are a lot of people running their business on this today, and one of those will be talking at two o'clock about big data in healthcare and how they're using DataStax Enterprise there. So I have time for one or two questions. The question is: what do you do when you want more incremental updates to things like the historical 10-day loss in this portfolio example? Cassandra and DataStax Enterprise give you the tools to do that. For instance, Twitter built a real-time analytics service called Rainbird, on top of Cassandra, where they monitor ad click-throughs and so forth. So you have the tools there. There isn't anything out of the box like Esper, where you create a streaming SQL query that gives you the updates as they happen. We give you the tools, but we don't really productize that yet. The next question is: if I have short-lived data, he's giving the example of 60 days' worth of data, how does that affect compaction performance? First, let me back up. You've probably used Cassandra or know a little bit about it, because Cassandra actually has a concept of expiring data built in, where you tell it, I want this data to last for 60 days, or six months, or whatever, and after that Cassandra will automatically throw it away. You don't have to explicitly delete it, which is one source of extra compaction work: merging those deletes with the original writes. So Cassandra will throw that away automatically, without you having to go see what's old enough and delete it. A further step that's on the drawing board for a future release would be to take it further and say, I can just throw away entire data files' worth of data all at once.
I don't even need to scan through them and see what's old enough, because I know all of the data in this file is obsolete, and I can just throw the entire file away, which would be even more efficient. The next question is: what's the efficiency loss from connecting to a node that doesn't store the data, so that it has to route the request to another node in the ring? Roughly 10 to 15%. People have done experiments with smart clients, and in fact we support one called the storage proxy client, Java only, that will route directly to the replica that has your data, and it's about 10 to 15% more performant. In general, I would say that's premature optimization. It adds enough complexity, because now you have to tell your client about your cluster topology and so forth, that we recommend just having dumb clients that connect to any node and let Cassandra route it. If you get to where you have hundreds of nodes, and that 10% is starting to cost you real money, then you do have that path to optimize. And we basically need a fully connected cluster anyway, because even if I'm talking to one of the replicas, there are going to be other replicas I need to talk to, so there isn't really any penalty in terms of extra open sockets. Last question. In one sentence: use Cassandra. There it is. Yeah, what I've tried to do here is give you the questions you can ask to evaluate these systems and decide when something may be more appropriate for your use case. What we target with Cassandra is data sets that don't fit on a single machine, whether because of data volume or requests per second. A lot of the other players aren't even interested in that. They're saying: we're best on a single machine, we're giving you developer productivity, we're giving you ACID semantics, but we do that by being on a single machine that maybe replicates to some others.
So we're playing in that other space. Thanks for your time.