I do have the prime speaking spot right before lunch, so unfortunately I'm going to have to try and Micro Machines guy this talk, because it's actually really long, and I'm sure you'd all be very excited if this thing started running through lunch. So let's get started. I'm going to skip a lot of the jokes and warm-up things that I normally do, and I'm just going to say thank you for having me here at RubyConf. I am from Portland, Oregon. I used to live here in Kansas City. I used to work at Cerner, and I think half the people in this room used to work at Cerner, so no surprise there. I currently work for MongoHQ. If any of you use Heroku and have used Mongo with Heroku, chances are you've run across MongoHQ, because we're a database-as-a-service provider. I also have a book coming out very soon, Seven Databases in Seven Weeks. If any of you were at RailsConf, I think I said the same thing there: oh, it's coming out at the end of the summer. It turns out writing a book is a much more arduous process than one would think, but it is coming out of beta November 30th, so look for that.
So the thing that I wanted to talk about is just sort of an overarching view. We've only got about 30 minutes to go over this, and unfortunately it's not one of those topics that's best condensed into such a small slot at the level of detail that I'd really love to go over. Considering that this is a Ruby conference, there's going to be some Ruby code, and I will make this as Ruby-centric as possible. But more importantly, I want everybody to be able to walk away with at least a good 30,000-foot understanding of a lot of the things that are going on in the database world right now, and there are copious amounts. If you've ever listened to or read anything about the NoSQL databases, you'll hear about the big four: graph databases, columnar databases, key-value, and document data stores. That's fine, but I enjoy throwing in relational, so there's really five, because let's not forget the databases that all of us have been using for the past 40 years.
So real quick, let's talk about the ecosystem. This map, I think, is a great example of why it's easy to get lost in the databases that are out there. It is far and away not comprehensive, but it does a really good job of breaking down the groupings and classifications of databases that are out there. We're going to cover these, and there are several reasons. The ones in orange are the open source databases, and I only really want to talk about the open source databases. There are several closed source ones out there, and there are ones that exist only as a service, like App Engine and Amazon RDS and things like that. Those are fine, but let's stick with the code that we can read.
So here's a quick history of these databases. This is sort of a map of the ones that we're pretty much going to cover. If you notice at the very top, if I zoom in, this whole big database movement kind of started around 2006. I mean, up until that point things were more or less
relational databases were king. So it is still really a new, modern thing. If you look up there at the very top, this is a Google project called Megastore, and it is just out, like, now, and it's only internal to Google. But there are new databases popping up. Voldemort, Redis, those things really only came around in 2009, and it's amazing how in the past two years they've become almost considered standard, which is a pretty amazing thing for an industry that has more or less been stagnant for the past 40 years.
You should notice that up at the top a little corner says Bigtable, and down at the bottom it says key-value. These are grouped by the different types. Bigtable is the Google implementation that a lot of these databases, like HBase and Hypertable and Cassandra, have spawned off of. We like to call them columnar data stores because of the way they store data, and I'll explain the reason for that nomenclature. Document data stores are the big ones that people are probably aware of: CouchDB, Mongo. MongoHQ is a provider that spawned off of that, MongoMachine is another one, MongoLab, all of that. And then graph data stores. There are several out there. GraphDB and Neo4j, in my opinion, are pretty much the only ones worth looking at. There are other ones, like Twitter has their own little FlockDB, which, unless you're Twitter, I don't think you're going to find much use for. So these are the sectors of the database world we're going to cover.
Let's start with relational models. Here are a few of the open source relational databases: Postgres, MySQL. I mean, almost anyone in this room is pretty much familiar with most of these. And by the way, if I'm going too fast, I am going to have these slides available for download, so you don't need to be scribbling to write these down. These are a lot of the really good Ruby drivers that I would recommend if you're using any of these databases.
A lot of people say, you know, why would you use Postgres over MySQL? Well, this is a fight that's been going on for 15 years now. The way that I like to split them up is that Postgres is fantastic if you want sort of an academically focused database, a database that came from the world of database research, not as interested in speed as in correctness. I'm not saying it's slow at all, but Postgres is very much about being modular and being able to plug things in. If you're familiar with database indexing at all, you can write your own custom indexes on their framework called GiST, which is a generalized search tree index. This is something that MySQL was not concerned with at all in the very beginning. Again, as all open source projects go, over time they're sort of reaching parity, so there's not a huge difference between them, but the groups of people are very different, and as we all know, the community can be very important. MySQL is what I recommend if you want something lighter, if you don't need a lot of the details that Postgres provides. For example, if you're not writing your own custom indexing and writing your own really custom data structures and all of that, it's fantastic.
Drizzle split off of the MySQL project. I guess there were some political or technical reasons; I'm not entirely sure why they split off, but it's a lighter version of MySQL. VoltDB is a relative newcomer in this relational database space, and it's really focused on scalability. It has a lot of built-in support for auto-sharding and things like that, so you can distribute your relational database out.
One thing, and this is just a note because of the crowd, that I like to say: don't get obsessed with this whole ORM thing. Object-relational mapping is fantastic. I mean, I love ActiveRecord, I love Arel. One of the problems is this can get to be almost a little religious in the
fact that the idea that you can't write SQL in code is just absurd. SQL is a fantastic language, and I think one of the best arguments for why it's fantastic is that it's been around for ages, relatively unchanged. It's actually one of the oldest languages that is still actively used all of the time, and that's because it's a very powerful mechanism for querying and deciphering data. I got in an argument with someone at RailsConf about this kind of query, about how you can definitely write an Arel wrapper around these functions, and I'm like, to what point? It doesn't make it any more portable, because these functions only work in Postgres, and beyond that it doesn't really make it any clearer than this. Okay, if you're not a SQL person this might not be clear, but that's sort of a bunk argument, because I could look at any language that I don't know and say, oh yeah, this doesn't make any sense, just because I don't know it. What this does is find the shortest-distance object within a bounding cube in three-dimensional space, which actually made me really excited, because I used to do research in high-dimensional database indexing. So that's all I'm really going to say about relational databases, because it's something we all know pretty well.
Bigtable-style, or columnar, data stores: HBase, Hypertable, Cassandra. Cassandra is sort of a weird little anomaly here, but it is technically a column-based data store. HBase is what I would say is probably the most true to the Bigtable concept. Bigtable, if you're not familiar, excuse me, is the data store that Google invented about ten years ago in order to deal with the massive amount of data that they have. Here are some good drivers too. I find Stargate hit or miss because it goes through the REST interface, and if you're doing Bigtable, you're really looking for massively scalable, high-throughput systems. So I actually recommend using Thrift, and I believe
Thrift was a project created by Facebook. The neat thing about Thrift is that all of these column-oriented data stores have Thrift interfaces, so in a very real sense Thrift is almost like an ORM for column-oriented data stores. You can write your Thrift-based code and your Thrift-based hooks, and there's a lot of similarity between them. Hypertable is kind of neat because it's got a little HQL, so it's a lot more queryable. I don't think its community, it's sort of a corporate-backed thing, has the lion's share that HBase does. HBase is a Java-based project, though, and that does rub some people the wrong way.
Cassandra is a hybrid, and I'll explain what that means when we get on to talking about Riak, but in a nutshell, I guess I'll just explain it right now. There was another big data store, created by Amazon, called Dynamo. Cassandra took the architecture of Dynamo and the column-oriented interface of Bigtable and mashed them together. This was created by Facebook, and they thought it was a fantastic idea until they started having problems with the architecture, so they're actually moving a lot of their code, especially their messaging system and everything, over to HBase. I don't not recommend Cassandra, because unless you are at Facebook scale or something, you're probably not going to run into the problems that they have. But I'm more of the mind that if you want to use a Dynamo-style data store, use something like Riak, and if you want to use a column-oriented data store, use something like HBase. I don't understand the point of the hybrid at all.
So when I'm talking about columns, I keep throwing out these terms, and I know they get confusing, especially if you're from a relational world where columns mean something and rows mean something. A column-oriented data store, rather than, you know, a table in a
relational database, where you might have a person table, and then you have a name and a social security number, and they're all stored together, clumped together in rows, sort of flips it on its head and actually stores data in columns. Now you might say, well, what's the point of this? There are actually several benefits. One of the big ones is that if you have multiple data centers and multiple systems, you can put columns on systems that are dedicated to being optimized for dealing with that sort of data. Think about it in terms of Google, because they created it. They have this giant data store that is dealing with, say, scanning web pages, where you might have one column dedicated to holding the titles of a page and another column dedicated to versioning all of the details of a page. You might want those in two different contexts, because the title of a page won't change very often, whereas the contents of the page might change very often. So it sort of has built-in versioning too. Also, a row is sort of an amorphous thing, where you say, okay, give me the most recent version of the title, and then give me this version of the page body, and then it clumps them all together, and that is sort of a row. The thing is, all these column-oriented stores have this built-in version control thing, so I highly recommend using them for something like a wiki.
And here's some Ruby code. I'm not going to go over too much about it, but you can tell by line three, include Apache::Hadoop::Hbase::Thrift, that this was created by Java people, so you've got this nested namespace. I'm going to have this code available too, but this is just a real simple wrapper so that you can execute code within Thrift, and it connects to the server. This was a wiki, but you still need to do migrations in the
same way that you would do migrations for a relational database. Here's retrieving all, where you have the scanner and it opens up. The thing about a column-oriented data store that I should also mention is that all the keys are in order, so it's really fantastic for anything where you need to scan data. Again, it makes sense for search engines like Google, because they key everything by the reverse nomenclature of the URL. So you'll have com.google.maps, which is right beside com.google.zzz, whatever, and they're in order. If you say, okay, give me all of the data about com.google, you can just start at the beginning and start scanning, and once Google is done, you stop scanning.
Here's a way to get historical versions, because as I said, it has built-in version control. You can set a time-to-live on versions as well and say, okay, I only want this data to live for this amount of time, or say, I only want to store the last five records. So use Thrift; they all implement it. I think we covered some of the HBase benefits. Cassandra has some benefits: because of its architecture, it's configurable to be as consistent or as available as you want, and I'll talk about that when we get to the Dynamo style.
Document data stores. If any of you have ever used a NoSQL database, it's probably going to have been a document data store, probably MongoDB or CouchDB. There's another one called RavenDB. I've got to say, if any of you are aware of a Ruby driver for RavenDB, that would be fantastic, because I could not find one. One thing that should be mentioned, though, is that RavenDB is written in .NET, so I think that's part of the reason it hasn't gotten very popular in this sort of community, but I think it's a pretty fantastic data store. MongoDB and Couch are what we'll talk about.
If you're not familiar, this is a document. Unlike the previous two we saw, relational and column-oriented data stores, you don't migrate anything with a document data store. You just stick values in. Now you might wonder what the value of that is, and it makes a lot of people uncomfortable, the idea that you can just add fields or not have fields willy-nilly. I think it runs on this principle: schemas are pretty awesome until they're not. When you're creating new projects, a lot of times you have no idea how they're actually going to end up. You start off writing a pet shop application, and then two years later it's a social network for dogs. I always love cell phones as a great example of something that isn't necessarily used in the way it was originally envisioned to be used. The number two use of a cell phone is text messaging. Are there any guesses as to what the number one use of a cell phone is? Checking the time. I think making phone calls might be number three, but I think that's even moving on to number four, after playing Angry Birds. So the document principle just accepts this as a reality.
Now, the difference between Mongo and Couch. Mongo was built from the ground up for huge, scalable datasets. That's originally what it was built for. It really wasn't concerned about the durability of an individual data store; the idea was, well, if one of them crashes, it doesn't matter, because it'll just go to a different instance while we reboot that server, and then all will be good. So it does auto-sharding, and all of this is based off of configuration. It has this really simple way to interface through a service called mongos that handles the routing of all of your requests to the correct server, and it replicates, but it can be a bear to set up. The one thing I should also mention about Mongo is that it does ad hoc queries.
If you're familiar with SQL and like dealing in that sort of environment, where you just stick in data and say, okay, I want to grab this value with this query, Mongo is the way to go. Couch, on the other hand, has a very heavy reliance on MapReduce. What you do is run these MapReduce functions and create views and grab data. I know there are a lot of tools to help facilitate that, but under the covers that's pretty much what it does. It was originally built to be very, very durable, so what I would say is, if you were building something like a clinic in Uganda or something, use Couch. Don't try to connect to some Mongo data center somewhere; just install Couch, because it's very federated. You'll have an instance of Couch on one device and an instance of Couch on another device and another, and they'll sync up eventually, because they have this really amazing master-master replication, which is something that Mongo does not have. Mongo is very much master-slave replication centric.
There have been some attempts to make Couch a lot more similar to Mongo, and the first attempt was Lounge. I know a lot of people still use Lounge, but it's quickly being overtaken by this thing called BigCouch, and I would generally recommend BigCouch if you're starting from scratch and you want to create a cluster. But Couch is pretty simple. You start up Couch, and the whole thing is REST based. It's just REST interactions: you put values in and you get values out. You use HTTP URLs, that's where all the data is, and it stores documents and gives them back. It also has a pretty classy interface. I like Couch's interface. This is Mongo's interface. It's not quite as sexy, but to be honest, don't let this fool you. It's not really as powerful as you would hope, but it's a little more powerful than this.
So I threw out the term MapReduce, and I wanted to show the idea of what MapReduce is. I believe this is a Mongo MapReduce. It's in JavaScript; Couch and Mongo are both very heavily JavaScript-centric. If you have a little trouble reading this, then let's look at it in Ruby really quickly. Actually, I think this is a little more of a Rails kind of thing, but the very first line is get all rooms. So imagine you have a hotel and you have rooms in it, and you get them all. Step two is called mapping. When you map, you take an array of some values and you convert them into an array of something else, and that's exactly what this is doing. If you're familiar with the map command, all it's doing is saying, out of all of the rooms, extract the capacity of every room and put it into this array called caps. There's also this real shortcut you might be much more familiar with, where you do map and then ampersand, symbol capacity. And then, starting with zero, on reduce (these are real Ruby functions, they've been there since day one) you add up the capacity of every room, and the result is the total capacity of your entire hotel.
Now you might wonder what the point of doing all of that is when you could have just iterated through and added them together. The reason is that you can split that array: a little part of each map can be done on a different server, and then they can be reduced on different servers, and then they can be re-reduced, and finally you end up with one single result. It works on the idea that when you have these massive data systems, these very federated, distributed systems, it's much cheaper to send the algorithm out to the data than it is to pull the data to you and then perform some function on it.
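The map and reduce steps described above can be sketched in plain Ruby. This is a minimal illustration, not the code from the slide: the `Room` struct, the room numbers, and the capacities are made-up assumptions, and `each_slice` only stands in for data living on different servers.

```ruby
# Hypothetical Room struct standing in for the hotel rooms in the talk.
Room = Struct.new(:number, :capacity)

rooms = [
  Room.new(101, 2), Room.new(102, 4), Room.new(103, 3),
  Room.new(201, 2), Room.new(202, 5), Room.new(203, 1)
]

# Single-machine version: map every room to its capacity,
# then reduce the capacities to one total, starting from zero.
caps  = rooms.map(&:capacity)
total = caps.reduce(0) { |sum, cap| sum + cap }

# "Distributed" version: each slice pretends to be the data held by one
# server. Every server maps and reduces its own slice, and the partial
# results are then re-reduced into a single answer.
partials = rooms.each_slice(2).map do |shard|
  shard.map(&:capacity).reduce(0) { |sum, cap| sum + cap }
end
distributed_total = partials.reduce(0) { |sum, partial| sum + partial }

puts "total capacity: #{total}"
puts "re-reduced from #{partials.size} shards: #{distributed_total}"
```

Both totals agree because addition is associative, which is what makes it safe to reduce partial results and then reduce them again.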
So, Mongo versus Couch. We've been talking about this a bit. I would say again, if you're not familiar and not comfortable with MapReduce, I would try out Mongo first, because you can quickly delve in. I'm not saying that Mongo doesn't support MapReduce; it does, but that's not really its primary mode. That was something they sort of tacked on at the end. But if you need something that's really durable, and you need something that you want to embed, because you can embed Couch in anything. I think there's even a Couch for Android, and it works really well. You stick your little Couch instance inside all these cell phones, and then they sync up to the cluster eventually. This is something that Mongo has a very difficult time with. And then what about Raven? It's .NET, so I'll just, you know.
Key-value. Key-value stores are sort of not very sexy, but I hope Riak and Redis will change your mind on this. I kept talking about the Dynamo implementation by Amazon. Riak is a very good, faithful implementation of Dynamo, with a bunch of other awesome things like vector clocks for dealing with versioning. Ripple is really good. There's this ORM called Risky. I don't know why it's called an ORM, because ORM stands for object-relational mapping, and there's nothing relational about this in any way. One of the big things people use in Riak is secondary indexes. They use it for caching, second-level caching, and things like sessions, any data that you would use a normal key-value store for. But the nice thing about Riak is, again, this should look familiar. This is just like BigCouch, sort of the same server setup, where you have your nodes. This is called consistent hashing, if you're familiar with the concept, and the idea is that nodes can come up and down and you don't have to rehash the entire system, just part of it.
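The consistent hashing idea mentioned above, where a node can leave without rehashing everything, can be sketched as follows. This is a generic textbook-style ring, not Riak's or BigCouch's actual implementation; the `HashRing` class, the node names, the use of MD5, and the count of 64 virtual points per node are all illustrative assumptions.

```ruby
require "digest"

# Hypothetical, simplified hash ring. Each node owns several points on a
# ring of 32-bit positions, and a key belongs to the first point at or
# after the key's own position (wrapping around at the end).
class HashRing
  def initialize(nodes = [], replicas: 64)
    @replicas = replicas
    @ring = {}     # ring position => node name
    @sorted = []   # sorted ring positions, for binary search
    nodes.each { |node| add(node) }
  end

  def add(node)
    @replicas.times { |i| @ring[position("#{node}:#{i}")] = node }
    @sorted = @ring.keys.sort
  end

  def remove(node)
    @ring.delete_if { |_, owner| owner == node }
    @sorted = @ring.keys.sort
  end

  def node_for(key)
    return nil if @sorted.empty?
    pos = position(key)
    idx = @sorted.bsearch_index { |point| point >= pos } || 0
    @ring[@sorted[idx]]
  end

  private

  # First 32 bits of an MD5 digest, used as the position on the ring.
  def position(str)
    Digest::MD5.hexdigest(str)[0, 8].to_i(16)
  end
end

ring = HashRing.new(%w[node-a node-b node-c])
keys = (1..1000).map { |i| "user:#{i}" }

before = keys.to_h { |k| [k, ring.node_for(k)] }
ring.remove("node-b")
after = keys.to_h { |k| [k, ring.node_for(k)] }

moved = keys.count { |k| before[k] != after[k] }
puts "#{moved} of #{keys.size} keys changed owner"
```

Only the keys that lived on the removed node change owner; every other key keeps pointing at the same node, which is the "just part of it" property described in the talk.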
I didn't talk about CAP, but the CAP theorem in a nutshell is this idea that your system can be consistent, it can be available, and it can be partition tolerant, but you can only have two. It's impossible to make a system that is completely partition tolerant, completely available, and completely consistent. So it can't be beaten. It's just a rule. But it can be tweaked, and that's what Riak has sort of pioneered. Dynamo pioneered it, but in the open source world they really took it and ran with it. It's this idea that even though it can't be beaten, it can be tweaked. You can have parts of your system that are really consistent but not as available, if it's really important data like billing data, and then you can have parts of your system that are highly available but not as consistent, where it's not as important, like session information. We can go over some of the details on how, and I've got these on the slides.
Other key-value stores. Memcache is fast. You don't need to use it. Kyoto Cabinet is really durable. It's awesome. You don't need to use it. Redis is amazing. It's fast. It's durable. It also allows for multiple data structures. Memcache and Kyoto Cabinet aren't built for multiple data structures. They're built for holding strings. Redis, on the other hand, can do all sorts of things. You can store data structures like lists, where you just keep pushing. So the first command is RPUSH, right push: push things onto the right. Lunch is the key, and pizza is the value. So you can start pushing things in. Here it's like, okay, I want to push pizza onto lunch. I want to push pie onto lunch. I want to push juice onto lunch. And then, give me the range of just the first two, and it returns it. So you can do almost little subqueries on something, which is pretty powerful.
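The list commands just described are Redis's RPUSH and LRANGE. To show what they return without assuming a Redis server or a client gem is installed, here is a small in-memory stand-in. The `TinyListStore` class is hypothetical and only mimics the semantics of those two commands (append on the right, inclusive index range, negative indexes counting from the end); it is not a Redis client.

```ruby
# Hypothetical in-memory stand-in for two Redis list commands.
class TinyListStore
  def initialize
    @lists = Hash.new { |hash, key| hash[key] = [] }
  end

  # Like RPUSH: append values on the right, return the new list length.
  def rpush(key, *values)
    @lists[key].concat(values).size
  end

  # Like LRANGE: start and stop are both inclusive, and negative
  # indexes count back from the end of the list.
  def lrange(key, start, stop)
    list = @lists.key?(key) ? @lists[key] : []
    start = [start + list.size, 0].max if start < 0
    stop += list.size if stop < 0
    return [] if start > stop || start >= list.size
    list[start..stop]
  end
end

store = TinyListStore.new
store.rpush("lunch", "pizza")
store.rpush("lunch", "pie")
store.rpush("lunch", "juice")

first_two = store.lrange("lunch", 0, 1)
everything = store.lrange("lunch", 0, -1)

puts first_two.inspect
puts everything.inspect
```

Against a real server the equivalent session would be the commands `RPUSH lunch pizza`, `RPUSH lunch pie`, `RPUSH lunch juice`, and `LRANGE lunch 0 1`.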
If you've ever used Memcache, there's nothing like it. It can store hashes in the same way, and it can store sets. What's neat about storing sets is that it can do set operations. So here I'm adding two people to a set. I'm adding one of the names to a different set. And then the very bottom command is the intersection. I'm saying, give me the intersection of all values that are in both sets. You can do unions, you can do all this pretty advanced stuff, and you can even build publish-subscribe systems using Redis, where you fire up one client and publish things out, and multiple clients subscribe to that, and they'll all get the data at the same time. On top of all that, Redis is actually pretty fast.
So I've got about one minute left to go over graph databases. Neo4j, FlockDB, GraphDB. Pretty much forget Flock; Neo4j is the one that I highly recommend if you really want to play with it. The idea of a graph DB is exactly what it sounds like. It stores data in graphs. Each of these would be a node, and they link to each other in different ways. This is a graph of the movie The Matrix and how the people are connected to each other. Neo4j is actually fantastic. It's also kind of hard to get a grip on originally, because there are so many ways to communicate with it. I prefer Gremlin. This is Gremlin. Gremlin is a Groovy-based language. It's kind of esoteric. But if you don't want to run Java and you want to run Ruby, there's a REST interface that can execute Groovy scripts on the back end. So if you see the line in the middle that says script, that's actually Groovy, and it's just passing that in to REST. So you can have your cool little Rails app and just send that Groovy to the server that has Java. You don't need to install Java anywhere else, unless you wanted to. FlockDB is fine. Check it out. But finally, I would just say, just try them all.
It's really easy to get started, especially if you have a Mac or something, because you can install every database I talked about here through Homebrew. So try and run them. Here's just a real quick cheat sheet. On the far right side are some examples of things that you might want to use these databases for in particular. The sites, the papers you should read. And yeah, apparently that's it. But I will have all these slides available for download. We'll have them on the website. And I hope that didn't go too long. I don't know if anyone has any questions, because I think we're actually four minutes over, and I don't want to stop anyone from having a delicious lunch. I'm going to give the slides to the organizers and they will post them on the website. Also, if you check my Twitter feed, I'll tweet about it at coderoshi. Also, oh, I forgot. I have this amazing t-shirt, so I was going to give it to whoever asked the best question. Are there any other questions, though?
Just a quick one. One of the things I'm trying to evaluate in these databases: a lot of them discuss durability, but I kind of question that. For instance, my understanding is that speed is achieved by just sacrificing the durability?
I'm sorry, what was that?
My understanding about durability, like for VoltDB, is that it just keeps everything in memory and the whole goal is not to have the entire cluster go down. Can you discuss issues about durability?
Yeah, yeah, there will always be a trade-off. The more durable you want something, the slower it's going to be. I mean, it's like physics, because you're writing something to disk or storing it in a secondary system. So the more durable you want it, the slower it's going to be. I think Redis is probably one of the purest examples of that, where it stores everything in memory and then occasionally forks and writes it out to disk.
And you can increase that. You can have a sort of write file, where every command is written to this file, and that's definitely more durable. But if you're literally writing every single command to disk, obviously you're not going to get the 100,000 writes a second that you would normally get if you're dealing purely in memory. I will say, as far as durability, Couch is actually probably one of the more durable databases, and I'm not even a Couch fanboy. I work for MongoHQ, so in the document data store world I should be a bigger Mongo fan, which I am for certain use cases. But Couch is actually fantastic because of the way it just streams all writes for records. It doesn't do inline updates. When you add more values, it just appends them onto the end of the file. What's fantastic about that is, if the system crashes in the middle of a write and you boot the system back up, it's just going to look at the incomplete write and be like, oh well, and then it's going to continue on. So you're never going to get corrupted data. But if you need something super consistent, that's when I would use something like Postgres, or something with some sort of ACID locking. Again, one of the things about Volt is, right, you can't be ultra durable, because it's very difficult to be consistent. You can't be consistent and available when you're partitioning in the way you're partitioning. And I would really love to explain to anybody who's interested why the CAP theorem is the case. I just sort of declared by fiat that it is a fact. But that's a big, like, 15-minute discussion in and of itself.
Sounds like a great lunch discussion. Thank you very much.