Okay, so to start out, I just want to ask really quickly: how many of you have actually used Cassandra in an app somewhere, even just playing with it? So we've got a couple here. How many of you have used it with Rails? One or two? Okay, let's see if this works.
Okay, so this is a little bit about me real quick. I've been doing Ruby and Rails since 2005. I was doing tech support. I couldn't get my boss to buy us a tool, so I started building one, and I realized really quickly that management wasn't as fun as coding. So I switched. I went freelance last year. Some of you may listen to some of the podcasts that I do: Teach Me to Code, Ruby Rogues (Avdi's on our panel most of the time), and Rails Coach. So those are a few of the things that I do, and I like to play with that kind of stuff.
Now, when I was getting ready to prepare this talk, I had a client, and he came to me and he said, I want a Twitter clone, right? And I looked at him and I said, you know, Twitter isn't making any money, so this probably isn't a great idea. And he explained to me that he had this unique selling proposition. He wanted some of the functionality that Twitter offered, but he didn't want Twitter itself. And I figured that was probably something that wouldn't kill him, and he might actually be able to make it work. He had some interesting ways of advertising and stuff on the site. And so I said, go ahead, I'll do it for you. And so he offered to pay me a bunch of money to do it.
So then a few months later, we're getting into this, and his brother-in-law is one of the founders of Dentrix, which is dental software, and kind of keeps up on the tech scene. And his brother-in-law said that Twitter was using this NoSQL solution to handle all of its tweets. And so he says, I need a NoSQL solution. And I looked at him and I said, you don't need it now, but you might get to the point where you need it. And he insisted that he wanted it right away. So I was like, okay.
So, you know, I said, fine, if that's really what you want, then we'll go ahead and do it. And so he continued to pay me money to work on that. And then I had this brilliant idea: well, if I'm going to be learning how to put Cassandra into this Rails app, I may as well talk about it, right? So then a few months ago, like right after I submitted this talk, he comes to me and he says, I really want to get this into beta so I can get people using it. I'm like, okay, well, you've got to cut some stuff out, right? What does he want to cut? The conversion to Cassandra. Okay. So anyway, this is kind of the conversation we had.
So then I'm thinking, okay, well, I'll just build my own Twitter clone. How many of you are freelancers? How many of you have time for a large project like that? I didn't see any hands. Yeah, me neither. So I've been working on it, and I have a semi-working, not completely functional prototype that I cannot show you because there's not a lot to show. I've got the schema worked out. I've got it mostly working, but it's not good enough to really demonstrate here.
Anyway, let's talk about Cassandra for a minute. We had a few hands about Cassandra. Most of you know it's a NoSQL solution. I was first confused by people talking about it as a column-oriented database, and I was like, well, as opposed to what? A row-oriented database? How would that work? Do you just call things different things? We'll explain the schema here in a minute, but basically, the column orientation is really just how they think about the data, as opposed to the structure of the database. It was started by Facebook. They open-sourced it in 2008. The Apache Foundation picked it up, and they've been supporting it since, and it's been in kind of this rapid development mode since then. So the whole idea behind Cassandra is you have the CAP theorem, or Brewer's theorem.
And basically, the way that it works is you have availability, consistency, and partitioning, or partition tolerance. Availability means that your client can always make a connection and get data back or send data in. Consistency means that if I query the database from several different clients, I will get the same answer back every time. And partitioning is, at some point your data is going to grow so large you can't keep it on one machine. And so if you split it up, how robust is it? Can it stand up to that? If one of the other servers goes down, am I hosed? Things like that. So the theorem is that you can only have two of these, or you can only be really good at two of these, at a time. You can't have all three. And so your relational databases tend to work more toward consistency and availability. Most of the NoSQL solutions like MongoDB and Cassandra work more in this area, in availability and partitioning.
So why would you use it? How many of you have kids that watch PBS? So you know what this is, right? Super Why. Okay. So why would you use Cassandra? I mean, what does it give you that maybe a relational database doesn't? Because I hear a lot of people talking like, you know, the SQL solutions are no good, or they're dead, or you're better off using the NoSQL solutions. And I don't necessarily agree with that. I think it's more about choosing the right tool for the right problem.
So why would you want to use something like Cassandra? Basically, there are a couple of things that you're going to want to use it for. The big one is large deployments. I mean, that's why Twitter is talking about using it for their tweets: they have millions or billions of them, to the point where a lot of the other solutions just can't handle that many records. And so that's what they're doing.
If you're doing a lot of writes, like for statistics, analytics, things like that, Cassandra works really well for that because it is write-optimized. For geographic distribution, it works really well too. You can add any machine or set of machines to a cluster, and it has functionality that allows you to replicate an entire cluster off-site. So there are some reasons there for doing that. And the other thing is, if your schema is constantly in flux, if it's changing a lot, then Cassandra's good for you because it doesn't have a set schema. You don't have to change the fields that are on a table.
Or in our case, we're going to talk about the schema, because it's not database, table, row like you're used to in a relational database. The top level is a keyspace. And you can think of a keyspace as kind of a hash. It's like this giant hash in your cluster. Okay. And inside of the keyspace (I keep pointing it at the screen and it doesn't work) you have a column family. And the column family, again, is like a hash. So the keyspace really just manages the consistency levels for that set of data. And then the column family has keys that give you a reference to your row. And so you can think of the rows in Cassandra the same way you would kind of think about them in a relational database. That's your record.
The main difference is that you have the columns. A column is just a key and a value. So in your hash, you have your key-value pair, and that is a column: both of those, not just the value. And that's how they manage the data. They sort the columns and they sort the rows, and we'll talk about that in a second here. So all of your queries occur by the key. If you're looking something up, you're going to look it up by the key, just like you would in a hash. So your keys are unique.
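The "hash inside a hash" picture above can be written out directly as nested Ruby hashes. This is only an in-memory illustration of the shape of the data, not how Cassandra stores anything, and every name in it (the `Users` and `Tweets` column families, the row keys, the column names) is made up for the example.

```ruby
# In-memory picture only: keyspace -> column family -> row key -> columns.
# All names here are hypothetical and exist just to show the nesting.
keyspace = {
  "Users" => {                            # a column family
    "alice" => {                          # a row, found by its key
      "name"  => "Alice",                 # a column is the name AND the value
      "email" => "alice@example.com"
    },
    "bob" => {
      "name" => "Bob"                     # rows need not share the same columns
    }
  },
  "Tweets" => {
    "t1" => { "body" => "hello", "user" => "alice" }
  }
}

# Every lookup starts from a row key, just like indexing into a hash.
alice = keyspace["Users"]["alice"]
alice_name = alice["name"]

# Columns are kept sorted by column name; sorting the hash imitates that.
sorted_columns = alice.sort.to_h
```

Note that `bob` has no `email` column at all, which is the schemaless part: nothing forces two rows in the same column family to carry the same columns.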
And one thing that I figured out really quickly around Cassandra was that when you're setting up your queries, you're not querying against some table that has all the information that you can look things up by. Because again, you can only look things up by one key at a time. So what you wind up doing is deciding what you need to query. For example, in the case of Twitter, you might be thinking, okay, well, I need all of the tweets for user X. And so what you do is you have your table and your key is your user, so that you can look it up by user. Or you could also look it up in your tweets column family by setting up a secondary key, and you have had secondary keys since Cassandra 0.7. So then you just look it up on the user, or the user identifier secondary key. And then you just pull them out and you can slice the data. But like I said before, it's all pre-sorted, so you can't do ordering at query time because the ordering is already done in the database. It's kind of optimized that way.
So it's not uncommon for people to set up an entire column family for one query. If I'm going to be looking up the tweets by hashtag, or by user ID, or if I'm indexing certain words out of them or things like that, then I may have a column family for each one, keyed by whatever I'm looking it up by. And then the values inside are the identifiers for the tweets, so that I can pull them all out at once.
Really, for your CRUD operations, you have four of them, and I've found that it's easier to deal with things on the level of the Cassandra gem as opposed to the level of the Cassandra CLI, because Cassandra converts everything to byte arrays. And so when you get it back, you get a set of numbers. That's not very useful, but the Cassandra gem will actually serialize it for you and give it to you in a way that you can read. There are a few methods that you use.
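The one-column-family-per-query idea above can be sketched with two plain hashes: one holding the tweets, one acting as a per-user index whose column names are tweet ids. This is a minimal in-memory model, not the Cassandra gem's API; `TWEETS`, `USER_TIMELINE`, `post_tweet`, and `tweets_for_user` are hypothetical names invented for the example.

```ruby
# Column families modeled as plain hashes: row key => { column name => value }.
TWEETS        = {}  # row key: tweet id
USER_TIMELINE = {}  # row key: user id; column names: tweet ids (the index)

# Write the tweet once, then record its id in the index row for its author.
def post_tweet(tweet_id, user_id, body)
  TWEETS[tweet_id] = { "user_id" => user_id, "body" => body }
  (USER_TIMELINE[user_id] ||= {})[tweet_id] = ""
end

# "All tweets for user X": one key lookup in the index row,
# then fetch each referenced tweet by its own key.
def tweets_for_user(user_id)
  ids = (USER_TIMELINE[user_id] || {}).keys.sort  # imitate sorted column names
  ids.map { |id| TWEETS[id] }
end

post_tweet("0002", "alice", "second")
post_tweet("0001", "alice", "first")
post_tweet("0003", "bob",   "hi")

alice_bodies = tweets_for_user("alice").map { |tweet| tweet["body"] }
```

The cost of this design is on the write side: every new query you want to support means another column family that has to be written whenever a tweet is created.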
There's get, and you just pass it the column family and then the key, and you can actually pass it the key for the column you want too, and it'll just give you back the one value. There's multi_get, and what you do there is you give it the column family and a set of keys, and it gives you all of the records back. You get them back as ordered hashes. And then there's remove. You also have insert, and insert is both create and update. So if you're concerned about overwriting data, then you have to do the check on your own to make sure that it's not already in there.
So here we're going to talk about scaling, and I was sitting next to T. Where are you, T? And he thought that this was a good idea for scaling. So, yeah. You can follow him at Tparum.
But anyway, so scaling. When we're talking about data replication across the cluster, there are a few things that you can do to make this work. If you just have one Cassandra server, you're kind of defeating the purpose, because the whole idea is that you can partition your data. You just pull machines into your cluster, and then you set up a replication factor. The replication factor is set up in your schema on your keyspace, and what it does is tell Cassandra how many copies of your data to keep in the cluster. And it does all this automatically. So if you write something to the database, then it will go ahead and send that to two or three or four or however many other servers in the cluster. And that way you have multiple copies. So if one node goes down in the cluster, you have the other three or four that you can still query, and you can get the information back.
There are two magic numbers to this, and the other one is more for consistency, which is the C in the CAP theorem, the one that we ignored when we were talking about Cassandra. And you can tune this so that you effectively say, on your query, I want this to have a consistency factor of three.
And let's say you have a replication factor of four. Then what that does is it tells Cassandra that you don't trust the answer until three of the nodes that have that information have responded back and said this is the right answer. And it will then reconcile all of that on the read and make sure that it all works. The trade-off is, let's say you have a replication factor of four and a consistency factor of four: that means that it has to check all of the copies and make sure that they agree, and that'll slow down your reads. So you wind up trading efficiency for consistency. And it does do all of the eventual consistency stuff for you. So as long as you get kind of a majority support, then it'll work. And if you use the consistency factor when you write, and set that to a consistency factor of three, then it will wait on the write until it gets three responses saying yes, I have it. So your consistency factor basically says, this is how concerned I am that this data is correct, and this is how I want it verified.
So, Ruby and Cassandra. There are several things that you're going to learn if you start looking into this. There is a Cassandra gem, and it works really well. I've been really happy with it. The API is a little bit ugly on the Cassandra gem. I'm sorry if any of you work on that; I think it's ugly. But at the same time, it reflects the Thrift API, and so if you think about the job that it's doing, that may be appropriate. So it just depends. But I found that it was a little bit inconvenient, so I wound up writing an ORM that kind of sits on top of it.
So these are some of the ORMs that are out there. I started looking for ORMs when I started preparing for this talk, after I started writing my own. And as far as that goes, that's something that I highly recommend that you all try.
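Going back to the two magic numbers described earlier, the interplay between the replication factor and the "consistency factor" (Cassandra's own documentation calls it the consistency level) can be shown with a toy simulation. This is a sketch under stated assumptions, not Cassandra's implementation: it assumes each copy carries a timestamp and that the newest timestamp wins when answers are reconciled, and `Replica`, `write_at`, and `read_at` are hypothetical names.

```ruby
# Toy model: each replica stores key => [value, timestamp].
Replica = Struct.new(:data) do
  def write(key, value, timestamp)
    current = data[key]
    data[key] = [value, timestamp] if current.nil? || timestamp > current[1]
  end

  def read(key)
    data[key]
  end
end

# Imitate a write that returns once `consistency` replicas have acknowledged it:
# here only the first `consistency` replicas receive the value.
def write_at(replicas, key, value, timestamp, consistency)
  replicas.first(consistency).each { |replica| replica.write(key, value, timestamp) }
end

# Ask `consistency` replicas (the last ones, as a worst case) and keep the
# answer with the newest timestamp.
def read_at(replicas, key, consistency)
  answers = replicas.last(consistency).map { |replica| replica.read(key) }.compact
  newest = answers.max_by { |_value, timestamp| timestamp }
  newest && newest[0]
end

replication_factor = 4
replicas = Array.new(replication_factor) { Replica.new({}) }

write_at(replicas, "status", "old", 1, 4)  # all four replicas hold "old"
write_at(replicas, "status", "new", 2, 3)  # only three replicas hold "new"

stale_read  = read_at(replicas, "status", 1)  # hits the one replica left behind
quorum_read = read_at(replicas, "status", 3)  # 3 read + 3 written > 4 copies
```

With four copies, writing to three and reading from three means the two sets must share at least one replica, so the read always sees the latest write. Reading from only one replica is faster but can return the older value, which is the efficiency-for-consistency trade described above.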
If you have some kind of technology that you like, that you want to learn, then I highly recommend that you take the opportunity to do an exercise like this and figure this kind of stuff out, because you start learning about not only how these systems work, but also how to write good code: code that does certain things, has a clean API, and things like that. So the one that I started writing was Cassandra. And I put "build, then look" because that's what I did. I started building one and then I went out and saw that there were other ones out there.
I'm using Active Model. How many of you out here have used Active Model in some way? How many of you have used Rails? Rails 3? Then you're using Active Model, because they took all the good stuff out of Active Record and they put it in Active Model. So far I'm doing validations, and I don't even remember what else I pulled in, but there were a few other things that I had to pull in to make it work. I'm using the Cassandra gem, and I've been learning a ton about API design, just trying to make this work cleanly.
So why an ORM? This is where I get into the code. So the red is what you have to do with the Cassandra gem to make it work. And you'll see right over here, this is something that took me forever to figure out: in order to define a column family, you have to use the Cassandra Thrift gem and create a CfDef object, and then you pass that into the Cassandra gem in order to get a new column family, which is really awful. And so this is what I came up with. The migrations are going to be really simple because you don't have to do anything with fields, so really it's just creating and modifying column families and keyspaces. So you have a keyspace, and then once you have the keyspace, you create a column family, and you're good. So here's another example, and this is more along the lines of doing an insert.
So you have an insert: you give it the column family, you give it the key, and then you tell it what data to stick in there. Now, the problem with insert, like I said, is that it'll overwrite something if it's there. And by overwrite, I mean, since it's column-oriented, it'll replace the data on the column. So if there's another column that's not listed in this hash up here at the top, then it won't touch it. It'll just stay there and it'll leave the value the same. So it's effectively just updating the attributes that you send in. And you can remove them with a remove command. You can actually remove all the way down to the data level, so you can remove a column out of a row with a remove command.
So anyway, what I wanted was something like this, and this should look pretty familiar because it looks like Rails. And so that's kind of what I've been working on, something that'll look like that. Now, you do have to keep in mind, though, that at the top there's no way of knowing if that's an update or a create. So you have to build that in when you're dealing with it if you don't want it to clobber other data.
So, a few considerations here. When I was building the ORM, it's schemaless, right? So if I want my classes to be free-form as far as the attributes go, then I could use method_missing. The problem is that if I reflect on the class or reflect on the object, it won't show any of those methods. It won't show me that I can change any of that. The other problem with method_missing is that you can't name any of your attributes after any other methods that exist. So you could define the attributes when you instantiate your objects, just creating them based on the keys that are already there, or you can tell your class ahead of time, you know, I'm going to be dealing with these attributes, and then have the class set up the instance methods for you. And that's what I opted to do.
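The two ideas above, declaring attributes ahead of time and an insert that only touches the columns you send, can be sketched together. This is not the speaker's ORM or the Cassandra gem: `STORE` is a hypothetical in-memory stand-in for a single column family, and `Model`, `attribute`, `save`, and `create` are names invented for the illustration.

```ruby
# Hypothetical in-memory stand-in for one column family: row key => columns.
STORE = Hash.new { |hash, key| hash[key] = {} }

class Model
  # Declaring attributes up front defines real accessor methods, so they are
  # visible to reflection (unlike attributes answered through method_missing).
  def self.attribute(*names)
    names.each do |name|
      define_method(name) { @attributes[name.to_s] }
      define_method("#{name}=") { |value| @attributes[name.to_s] = value }
    end
  end

  attr_reader :key

  def initialize(key, attributes = {})
    @key = key
    @attributes = {}
    attributes.each { |name, value| public_send("#{name}=", value) }
  end

  # Insert-style save: only the columns set on this object are written;
  # any other columns already on the row are left untouched.
  def save
    STORE[key].merge!(@attributes)
    self
  end

  # Explicit create that checks first, so it cannot clobber an existing row.
  def create
    raise ArgumentError, "row #{key} already exists" if STORE.key?(key)
    save
  end
end

class User < Model
  attribute :name, :email
end

User.new("alice", name: "Alice", email: "a@example.com").create
User.new("alice", name: "Alice B.").save  # rewrites one column, keeps email
```

Because `save` cannot tell a create from an update, the existence check lives in a separate `create` method here; that is one possible way to "build that in", not the only one.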
I opted to just, when you set it up, you tell it, I have an attribute named this, and it makes it work. And anyway, these slides are a little bit out of order, but we had a talk yesterday about APIs. And the main thing for me with the APIs is I wanted them to be pretty. Anthony lives in France, so I had to use a beret. I usually use a sombrero. And I want it to look a lot like Rails, or work well with Rails. So that's kind of the approach that I took, and it's been working out really well. And the API is really simple because you're dealing with just get, insert, and remove. So it's made it really easy to build an ORM around that and make it work. And the scaling is really encouraging.
So that's pretty much all that I have. If you want to get ahold of me, you can. If you have a phone that'll take a picture of that, then you can get all the information on the left out of the, what do they call those things? QR codes. I really like getting input back from people, so if you go to speakerrate.com (I think Marty's been talking about that), I would really appreciate any thoughts. If I'd known I'd have more time, I would have put more code in. But anyway, if you want to listen to any of the podcasts that I'm involved with, teachmetocode.com, rubyrogues.com, and railscoach.com are the ones that I'm doing right now. I'm pretty sure we have a few minutes for questions. So, are there any questions about Cassandra?
Well, that's the nice thing about schemaless databases: if you're changing a column, you just add it in. So he's asking me about secondary indexes, for example. If you want to add a secondary index, then theoretically you want to have data there for it to index, right? If you're trying to be consistent with your data, then all you really need to do is add the index to that column. If the column doesn't exist on a row, it won't index it.
Because, I don't know if I can scroll back, but if you look at the schema example, basically the rows aren't constrained to have any particular column. Let me see if I can go back. So each row can have an arbitrary number of columns with arbitrary keys on any of them, and you can index them based on whatever. The only key that's required is the key that references the row in the column family. Any of the other values that you stick into the row can have any value they want and any key they want.
One cool thing about Cassandra is that your keys don't have to be strings. They can be time objects, they can be numbers, they can be whatever, as long as Cassandra has a type that corresponds to it. And then you can set it up to sort. And that's one thing that I missed; it was in my notes, but Cassandra sorts your keys. So it'll sort all of your rows by the key, and that's preset. If you want something in a different order, then you'll probably wind up creating another column family, or setting up some kind of secondary key, that's ordered differently. You just set up the column family, and then when you do the creation, you make sure your callbacks create the records in both places.
So usually what I would do is set up a column family that's like my master list and sort that by the thing that I would be ordering it by most frequently. And then I'll set up another column family whose key is sorted by whatever I want it sorted by, and it'll just have a reference. So it'll have one column in it, and that column will have just the key, and the key is the reference. So then it can say, okay, what are my keys? I want those. And you can also do super column families. It's a column family, but it's two levels deep.
So its row is a hash, and then each of the hashes inside that hash has columns in it. And so you can set up some of the sorting that way too. Any other questions?
Okay, so Jeff just pointed out that Gowalla just released Chronologic, which is a Ruby and Cassandra project that, what did you say? It manages some kind of timeline management. Cool, sounds great. I always like seeing code samples like that. So are there any other questions?