Morning everybody. Friday, yes. It's been a long week, I'm excited, I'm highly caffeinated. So without further ado, I present An Ode to 17 Databases in 33 Minutes. I'm gonna mangle a large number of metaphors. There'll be a lot of animated GIFs. I've learnt this week that you say it like that. There's Star Wars, Dungeons and Dragons, and all of that's very unfortunately stereotypical, so a bit of an indictment. This whole thing started as a joke. 17 databases, I actually did it in five minutes. 33 minutes is worse. The whole thing is just a catastrophe really, but anyway. We're gonna cover a whole bunch of different databases and a little bit of the underlying theory, and hopefully you'll walk out understanding why to use Postgres.

I'm Toby. You can find me on the internet. I work at a company called Ninefold. Oh, no screens, is that me? Before there was no red, so now there's no anything. Hey, I have no slides. Well, you missed my beautiful slides. You missed the first animation, that's a shame. You missed the list, it's awesome. You missed me, and my excellent job titles. So yes, I work at Ninefold. They have very kindly flown me over here from Australia, which explains why I sound like I come from the deep south, because I do. Most of this week, this has been me. So today I'm finally over the jet lag, just in time to go home and have it all over again next week.

So a couple of quick facts about Australia. There are far fewer syllables than you're used to using. This is a genuine Australian politician. He's a mining magnate billionaire and he is currently running MVP Jurassic Theme Park with giant fiberglass dinosaurs, and I'd for want them for it. Then I realized there weren't enough Star Wars references, so this is just completely gratuitous.

Anyway, the thrust is that distributed systems are hard and databases are fun. Pictured here is a distributed system.
You can see there's two app nodes, and then there's a master-slave kind of setup going on here as well. So we're gonna talk about some of the complexities of running these types of systems, and it's really fun stuff once you get under the covers and start thinking about it.

So, NoSQL is a thing. We have NewSQL now. I'm gonna be covering some of these things. We've also got post-SQL, post-rock ambient SQL, and there's a whole gamut of these things. They all make my brain explode, and I think the trick to understanding all of this stuff is to actually think about what's happening underneath, and then you can make decisions about your databases.

Hopefully you're all familiar with some of the concepts of traditional relational databases. We have ACID, which provides certain guarantees about the way that your data behaves. You can update data and be sure it was updated. Things are isolated from each other. Things persist over time.

Another thing that you may have heard of, this is a leap, but I needed another animation, is a thing called the CAP theorem. This gets talked about a lot when we start talking about this new generation of databases. CAP stands for Consistency, Availability and Partition tolerance, and it basically provides a strong foundation for reasoning about the way that distributed systems behave, how they interoperate and how they communicate. So I'm gonna give you a brief introduction to how that all kind of works.

The original CAP theorem, as stated, is called Brewer's Conjecture. A guy called Brewer just sort of had this idea. It's actually on some really awesomely designed PowerPoint slides from something he did. And he was saying that with consistency, availability and partition tolerance, the data can only be two of these things at any one time. So the data can be consistent, or it can be accessible, or it can handle network failures.
Some people then took this conjecture and made a formal proof in much more rigorous computer science terms, and said it's impossible in an asynchronous network model to implement a read-write data object that is simultaneously available and atomically consistent. And so all of the stuff around NewSQL and NoSQL is about manipulating these different variables. There's also a thing called BASE, but I'm not gonna talk about it because it's actually just a made-up acronym that has no relevance to anything.

So what are we actually talking about with CAP? And why is it important? It's important because everything is already distributed. What we do today is inherently a distributed system. You have a browser talking to a server, an app server, a Rails server, because we're at RailsConf, and then that's talking to a Postgres database or a MySQL database or something even fancier and shinier. That's a distributed system. And as we move into more heavy client-based operations, that distribution is getting much more front-loaded. You've got state in the browser that's now synchronizing with the state on the server. So we already suffer many of these problems.

This is a handy and completely untrue guide to NoSQL systems, breaking them up by this idea that some things are available and some things are consistent. All of that is almost, but not quite, entirely untrue. What the actual theorem says is that under a network failure, so you've got multiple nodes and they can no longer communicate, you can choose whether the data is consistent or whether the data is available. And I have some demonstrations here, because it actually ends up being very easy to understand.

So here we have a typical cluster of nodes working together. We're gonna model some communication between them. So there's a write on this system.
It comes in, that gets replicated across, and then on the other system we now have that data coming out; someone's doing a read. And so this is the kind of situation that we're talking about. Whether you're doing a master-slave setup in a relational database or something trickier, this is kind of the way it works. A node gets some data and it gives it to another node, and they have the same information.

So when there's a network partition, they can no longer communicate. A write comes in and now we have to make a decision. And all of this is actually just science, as you can tell from this diagram. If those two nodes can't communicate, you can talk only to the one that got the write. That's consistent. It got the write, it can now read out that same data. That's all cool. Or you can have both nodes still answering, and now you have someone reading data that is no longer in the right state. So we've updated a bank account. It's got $100 in it. It used to have $10 in it. These people are reading 10, these people are reading 100. That's available. The data is now not consistent, but all of the nodes can send back that data.

And so with all of the discussion about the CAP theorem, and people even claiming, we've defeated the CAP theorem, our database at low, low prices is incredibly awesome, just remember this image. Two things that cannot communicate cannot communicate. It's science. And then when they can communicate, we're back in the realm of normal operations and things get a lot easier.

If you are interested in any of the guts of how these things work, definitely have a look at a thing called Jepsen, which is this crazy motherfucker who is just analyzing the network operations of a whole variety of distributed systems, and it will blow your mind. Okay, good. That's why. Now I remember. So, here is our cast.
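
As an aside to the partition demonstration above, here is a minimal sketch of the same choice in Python. Everything in it is a toy made up for illustration (the `Cluster` and `Node` names, the `mode` flag, the single replicated balance); it is not how any of the databases in this talk are implemented. Writes always land on node `a`, and the only question is what node `b` does when it cannot hear from `a`.

```python
class Node:
    """One replica holding a single bank balance (toy model)."""

    def __init__(self, name, balance):
        self.name = name
        self.balance = balance


class Cluster:
    """Two replicas; `mode` picks what happens during a partition."""

    def __init__(self, mode):
        self.mode = mode            # "CP" = stay consistent, "AP" = stay available
        self.a = Node("a", 10)
        self.b = Node("b", 10)
        self.partitioned = False

    def write(self, balance):
        # In this toy, every write arrives at node a.
        self.a.balance = balance
        if not self.partitioned:
            self.b.balance = balance    # replication only works when connected

    def read(self, node):
        # A consistent system refuses to answer from the cut-off replica,
        # because that replica cannot know whether it missed a write.
        if self.partitioned and self.mode == "CP" and node is self.b:
            raise RuntimeError("node b is unavailable during the partition")
        return node.balance
```

With `mode="AP"`, a write of 100 during a partition leaves one reader seeing 100 on `a` and another seeing 10 on `b`. With `mode="CP"`, the reader on `b` gets an error instead of the stale 10.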
We're about to go on an adventure through a tortured maze of ridiculous Dungeons and Dragons metaphors, but first of all a shout-out to the owlbear. Yeah, the thing I love about the owlbear is they've taken the least scary aspects of a bear and an owl. Like if that was an owl with a bear's head and wings, that would be way more scary. Anyway, it's just been bugging me for months.

So, Postgres. As we all know, it's MySQL for hipsters. It's actually pretty good. So here's its character reference sheet. It's a relational database. It has a consistent model, so under conditions of network partition, you know, your slave is not in contact with the master, it's essentially unavailable. That's the way we treat it.

Postgres is actually really, really interesting tech because it has a bunch of cool stuff hidden underneath it. So there's this thing called hstore, which is a key-value store that's baked right in. So if you need a lightweight key-value store and you're already running Postgres in production, you have one. You don't need to spin up any other thing. You can actually do that today. The really interesting thing about that is you can index those keys. You can do joins from an hstore reference across multiple tables. It looks and feels exactly like the kind of thing that you're already working with. There's some things already baked into the Rails ecosystem that make this really easy if you're working with that kind of information.

But the really exciting thing about what Postgres is up to at the moment is JSON. 9.2, 9.3 and the upcoming 9.4 have pretty much a fully baked-in JSON document database. And it is crazy awesome. The new one is super high performance. It's the same thing: if you're thinking, you know, documents would be easier for this use case, let's install something else, well, you already have one, and it has all of those same properties.
You can index, you can do joins across from your normal tables into the documents. It's crazy cool.

MySQL. It's pretty much the same as Postgres, is my answer. But there's a slight caveat. So, you know, Oracle, they're a company. Many of the same things apply. This is why, you know, they're kind of in the same bucket. For me, it doesn't particularly matter at the end of the day. Whatever you happen to have expertise in is cool.

It's got some kind of interesting things that you can do. You can switch out storage engines to get different performance profiles. It is everywhere. It's got a thing called HandlerSocket, which is essentially raw write access through a low-level socket into the table infrastructure. So for some people with really high performance kind of things, you can actually just sort of bypass the whole SQL engine, which is kind of interesting.

The other thing that's happened since Oracle took over, which is kind of a really good thing, is that there's some alternatives. So MariaDB is sort of the more open fork. There's a semi-commercial edition that has lots of really high performance features, and they basically run binary-compatible patches, that's Percona, and they have like huge expertise. And this TokuDB is quite interesting. They're doing all of this crazy fractal indexing and things for particular use cases on very large data sets, but it still just looks and behaves in many ways like the MySQL that you are kind of used to. So there's some interesting things happening there.

So hopefully none of that's a huge surprise. That's databases, you use it. It comes in the box and ActiveRecord talks to it. So now we're gonna get slightly off the beaten track.

A lot of what we know as NoSQL comes from Dynamo, which was actually a paper that Amazon released years ago. I'm not gonna labor too much on this one. The paper's quite interesting. It talks about how you make a distributed system.
The interesting thing is actually that Riak is essentially an implementation of the underlying Dynamo theory. So Riak is crazy awesome. This is what happens to you when you run Riak in production. It's a conversation I often have with people, which is like, wouldn't it be awesome to have a problem that needed Riak? And it would just be like, yeah, that would be so cool. I'd be like the awesomest engineer.

So Riak is just crazy well engineered. They're doing all sorts of interesting stuff. It just inherently understands clustering. You add a new node, and it's just there. With those older kinds of databases, it's a pain in the ass to actually get that working. So, yeah, they're doing some really interesting things. It's got a cloud storage thing, so you've got an S3-compatible API and all of this kind of stuff.

A lot of the magic of the way this works is through consistent hashing. So, my slides are all mucked up, but anyway. Basically what it does is it just partitions all of your data into a giant hash ring. Excuse me. Physical nodes then just own parts of that hash. You add a new node or take a node away, and it repartitions the data across the remaining nodes. And all of that is just completely in the background of how Riak works operationally. So for large-scale data it has some really nice operational characteristics that make it quite cool to manage.

And then the other thing is it's a very simple API. It's a key-value store. You can store JSON documents in it, and it's just a bucket that has keys, and then it's got other stuff on top to retrieve data, do secondary indexes and searching and all of that kind of stuff. So it's a very cool piece of tech.

So the other one we've got is Google. Fucking annoying. And you'll see why in a second. So Google have this thing called Bigtable that again kind of comes out of their internal research.
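
Going back to the consistent hashing described for Riak a moment ago, here is a rough Python sketch of the general technique. It is a generic hash ring with virtual nodes, not Riak's actual ring implementation; the `HashRing` class, the `vnodes` parameter and the use of MD5 are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a point on the ring (MD5 chosen arbitrarily here)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Generic consistent-hash ring; each physical node owns many points."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []                 # sorted list of (point, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Scatter several virtual points per node so load spreads evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def owner(self, key):
        """Return the node owning the first point at or after the key's hash."""
        if not self._ring:
            raise ValueError("ring is empty")
        idx = bisect.bisect_left(self._ring, (_hash(key),))
        return self._ring[idx % len(self._ring)][1]   # wrap around the ring
```

The property that matters operationally is that adding a node only moves keys onto the new node; every other key stays where it was, and removing that node puts them back.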
You have access to it through some of their cloud properties. As you can see, it's actually a sparse, distributed, multi-dimensional sorted map, which is good, I imagine. It's awesome. The stuff they're doing with this is crazy. So this is actually a couple of years old now, I think, some of the information. So hundreds of petabytes of data, ridiculous numbers of operations a second. You do not have any of these problems.

So then they took this stuff. They were like, oh, we've got Bigtable. That was fucking easy, whatever. And so now they've got two other things. They've got one called Spanner and one called F1, where they're basically doing proper sort of relational-looking data across multiple data centers. They're really pushing the boundaries of some of that CAP stuff that's going on. But all you need is GPS in every server, a couple of atomic clocks in each data center. And great. So Google's basically telling everyone to just fuck off.

So another one that I really like, and used a long time ago in tech time, is Cassandra. Cassandra is a column-oriented database. Eventually it's awesome. It's really all about eventual consistency. And you can see here, this is a man, he eventually gets it right, so well done to him there. So Cassandra's a lot like that. And again, the cool thing is it's a sparse, distributed, multi-dimensional sorted map.

When I was working with it, you described your tables kind of thing in XML and hated yourself. And then every time something changed, you rebooted the server and that took a while. And yeah, the whole thing was really difficult. What it basically does is it takes the availability side of the question; that's its world model. It has, again, a very simple clustering system: new nodes add in, the data gets streamed out. It has a data model that is really complicated. And even though I've used it, it's really hard to explain how it actually works.
So column databases basically kind of invert the whole table structure that you're used to from the relational world. And the advantages are that for some types of data and for some queries, it is crazy blazing fast. Time series data is a good one, where you have long streams of time series data and it'll actually put that on disk all next to each other, and you just pull it all out.

The cool thing in the new versions of Cassandra is that they've abstracted all of that out and you actually just get tables. So you can create a table and give it a primary key, and under the covers it's setting up rows and column families and columns and all of these really abstract concepts, and they've completely made some of that go away, which is really nice. So you end up with something that looks a lot like just SQL and a normal table kind of structure. It's just clustering out across lots of nodes. It's very tunable, so it writes to a node and you can say, I actually want to write to five nodes and that's a quorum, and now we're cool. And so you can tune how much redundancy you have. So that's kind of cool.

There's a reminder, thank you. And that went cold really fast. Thank you.

So the next one on our list is Memcached. There was a talk earlier in the week that was describing using Memcached and caching, and it had a very interesting observation, which was: it just works. He didn't even know what version he was running in production, because it doesn't matter. The API has been stable for ages. And I know what you're saying. It's not a database. It's a cache. Technically true. But it's interesting to think about, because the moment you add caching, even if you've been ignoring the fact that you had a distributed system before, with caching you now really have a distributed system. You've got data in one thing that may or may not be fresh, and you've got data in your database that you assume is up to date, and now you've got a synchronization problem.
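
To make that synchronization problem concrete, here is a tiny Python sketch of a read-through cache in front of a database. Both are plain dictionaries standing in for the real things, and the function names are made up for this example; nothing here is Memcached's or Rails' API.

```python
database = {"balance": 10}   # stand-in for the source of truth
cache = {}                   # stand-in for a cache node


def read(key):
    """Read-through: serve from the cache, fall back to the database."""
    if key in cache:
        return cache[key]
    value = database[key]
    cache[key] = value
    return value


def write_without_invalidation(key, value):
    """Updates the database only, so cached readers keep the old value."""
    database[key] = value


def write_with_invalidation(key, value):
    """Updates the database and drops the cached copy so the next read refetches."""
    database[key] = value
    cache.pop(key, None)
```

After a first `read("balance")` has warmed the cache, `write_without_invalidation("balance", 100)` leaves readers seeing 10 while the database says 100. Invalidating on write is one common way out, with its own race conditions in a real multi-process setup.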
So Memcached is just rock solid, old-as-the-hills technology, completely simple. The API is everywhere. Lots of people who made their own key-value store in a hack night, which is a useful hobby if you wanna annoy everyone, actually use the Memcached API as their API. It's got a handful of things. You can set a key, you can replace one. It does have some atomic operations, so you can increment and decrement. So there is some flexibility to actually do a little bit of data storage in a more traditional sense.

It's actually a client-server model. Your driver is responsible for the clustering, in a way. So you can have multiple Memcached nodes, and the hashing algorithm determines which node a particular piece of data is gonna be on. That has the property of making it very, very simple to use. There's no cluster state, there's no coordination between nodes. A lot of the heavy lifting all of these other things are doing is about coordinating around all of that information.

There's a whole bunch of awesome stuff just baked into Rails, so you can easily cache into Memcached all your normal Rails fragment caches, view caches, all of that kind of stuff. And there's even some things where you can actually push that into ActiveRecord and have caching at that level as well.

Redis is an interesting one for the Rails community, because it's basically a queue now. Everyone seems to be running Resque or Sidekiq. And Redis is, again, one of those pieces of technology that is beautifully engineered, incredibly simple, incredibly robust. The maintainers are just absolute scientists, I guess, just a whole other level of crazy algorithms stuff, and they make a blog post and you go, I'm so stupid, I don't understand what you're talking about. It's really fast. It's slightly hard to distribute. A lot of that's in the pipeline with Redis. It's much more simple to stick it on one node and increase the RAM. It's more complicated than Memcached.
It's essentially just an in-memory cache. It has a bunch of really interesting data structures though. I've been confused all week now about which country I'm from and whether I say data or data, so I now just change them randomly. So you have hashes, you have lists, you have strings, you've got all sorts of other interesting things. You can do optimistic locking and have a bunch of operations that are essentially batched. There's lots of ways of doing this kind of stuff. Resque and Sidekiq both make it super simple to do background tasks with Rails: install the gem, have a worker, and it's all just magic.

It has Lua baked in, which is a whole other thing, but Lua is a really cool programming language that is designed for embeddability. One of the things that happens is you can actually write little Lua scripts that end up going into the Redis server to do more complex operations. So in this case, this is a little script that grabs some things off a sorted set, then deletes them, and then returns what we grabbed, but in an atomic, kind of transactional way. And good news, everybody, we've just invented stored procedures, so that's very exciting. Except now they're much more hip, because it's an in-memory database with a language no one's heard of, so we are rocking it.

Also, maybe use a queue. I know it's crazy, but if you're actually using Redis as your queue, maybe you have a queuing problem, and there are queues, they exist, they're a thing. It's ridiculous, I know. So RabbitMQ is sort of the gold standard, and Kafka is another one that was talked about earlier this week, and it's crazy cool.

Where am I? Man. Just stretch. I've lost count, now I'm just gonna talk faster, cool. Neo4j is really interesting, it's a graph database.
That's slightly hard to explain, but the way I actually think about it, we'll just jump straight here, is that it's almost, but not quite, entirely unlike a relational database. The difference essentially is that it is optimized for the connections rather than for aggregated data. A relational database puts things in a way where you can get a sum and a count, and that's kind of the heritage of that worldview. Whereas what the Neo4j people are doing is actually thinking about connections between pieces of data, and for some use cases this is really, really amazing stuff.

So a graph is basically a collection of nodes, those nodes can have relationships between each other, and then a node just has properties. It's essentially an object database in a way. It's very similar to the way that we think about objects, so it has some really nice properties if you're working in a language like Ruby. And then it just does stuff in a really intuitive way. So if we've got a graph of movies and actors, you actually define a relationship by name: an actor acts in a movie. And then when you are doing your queries, this is a language called Cypher, that relationship is a first-class thing. Whereas in a relational world, you're using a foreign key, which has no semantic meaning at all. You just have to remember that, you know, there's a table with an actor ID and a movie ID and we're joining across them. Whereas Neo4j actually makes those relationships first-class citizens. So if you've got problems that are graph problems, like social network friend-cloud stuff, Neo4j just makes some of that trivially easy, in a way where you would have had to do a recursive self-join in Postgres and hate your life.

Couch is cool, I guess. Pretty much that's my opinion of it. It's really awesome. But you can't query it, so, cool. That's it. That's a slight disservice to Couch, but you know, whatever.
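
To illustrate the earlier point about named relationships in a graph database, here is a small Python sketch of a property graph with an `ACTED_IN` relationship. This is an in-memory toy, not Neo4j or Cypher; the `Graph` class, the relationship name and the sample actors and movies are all invented for the example.

```python
from collections import defaultdict


class Graph:
    """Toy property graph: nodes have properties, edges have a name."""

    def __init__(self):
        self.nodes = {}
        self.out_edges = defaultdict(list)   # source -> [(relationship, target)]
        self.in_edges = defaultdict(list)    # target -> [(relationship, source)]

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def relate(self, source, relationship, target):
        self.out_edges[source].append((relationship, target))
        self.in_edges[target].append((relationship, source))

    def outgoing(self, node_id, relationship):
        return [t for r, t in self.out_edges[node_id] if r == relationship]

    def incoming(self, node_id, relationship):
        return [s for r, s in self.in_edges[node_id] if r == relationship]


def co_actors(graph, actor):
    """Everyone who acted in a movie that `actor` also acted in."""
    found = set()
    for movie in graph.outgoing(actor, "ACTED_IN"):
        found.update(graph.incoming(movie, "ACTED_IN"))
    found.discard(actor)
    return found


# Hypothetical sample data.
g = Graph()
for person in ("alice", "bob", "carol"):
    g.add_node(person, kind="actor")
for movie in ("movie_a", "movie_b"):
    g.add_node(movie, kind="movie")
g.relate("alice", "ACTED_IN", "movie_a")
g.relate("bob", "ACTED_IN", "movie_a")
g.relate("bob", "ACTED_IN", "movie_b")
g.relate("carol", "ACTED_IN", "movie_b")
```

The query walks relationships by name rather than joining through an actor ID and movie ID table, which is the difference the relational comparison above is getting at.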
MongoDB, as we all know, is web scale, that's excellent. If you think of it as Redis for JSON, that's good. 60% of the time it works every time, everyone's familiar with that. So the thing is, Mongo reminds me of MySQL. Mongo is kind of terrible, but MySQL was kind of terrible too. Like when that came out, it didn't do transactions, for example, and I was working in Enterpriseyland and transactions are actually a thing, and you're like, you script kiddies with your database. So Mongo feels like that, and what we learned is if you make something that's awesome and useful and everywhere and ubiquitous and it doesn't work, you can make it work, and eventually MySQL is a real database. So Mongo feels a bit like that. It's come a massive way. I got burnt really early on with very early versions.

It stores JSON. Well, sort of, it stores BSON. That's just binary JSON basically, and it's a really beautiful model to work with in a development cycle, which I think is why there's so much appeal. People treat it like an object database. You've just got an object that's in there, and you can pull out objects and manipulate them and do all of this kind of crazy stuff. The people who know what they're talking about with distributed systems, though, will tell you: if the reason you're using Mongo is because you think, we need to be web scale and do all of this kind of stuff, that is not a good reason to use it, because there's still a lot of operational problems going on.

This one is interesting. RethinkDB is essentially coming from the Postgres worldview. MySQL was like, whatever, we'll fix it. Postgres was like, we'll do it right, and you can't use it because it's so slow, but at least it's correct, and they took lots of iterations to make it usable. So Rethink is kind of that school of thought.
It's like, we're gonna make it all correct first and then we'll make it usable. So it's a very similar idea. It's JSON, and they're trying to make it operationally great with automatic clustering and all of this kind of stuff. Who knows what it is and how it's actually gonna behave in the real world. It's still a very early piece of tech.

And that leads me into a whole world of databases around what I'm loosely calling the commercial fringe. So Couchbase is the Couch guys and some commercial Memcached guys who got together to make a hybrid something. Aerospike: their marketing is great. It's about the best you can say about it. So there's a whole bunch of people trying to solve these problems in interesting ways, but all of these ones cost money and your mileage varies and all of that kind of stuff. The cool thing about open source ones is you get it and you try it and you hate it and you go back to Postgres, so it's all fine.

So HyperDex, this is my favorite, because they have hyperspace hashing and it is so cool. These guys are making some really broad, amazing claims about the kind of things that they can do. Crazy fast. It's a key-value store, but it's not just the key that it will index; it will index the properties of a value. So now you can do genuine queries into the structure of the objects that you're storing. They've got a whole bunch of papers around what they're doing. So you can read that as, who knows what it means. It maps objects to coordinates in a multi-dimensional Euclidean space, a hyperspace. I'm like, take my money. And there's a picture of hyperspace, and I've read that eight times, I don't understand what's going on. But it does seem to be true. They're trying to solve some of these problems, and they call themselves a second-generation NoSQL thing, in a similar way to Google kind of taking all of this stuff and trying to push the science underneath it forward. So it's got a Ruby client, you can use it now. It's got just normal key-value.
It's got atomic stuff. You can do conditional put. So this is some code that's basically only updating the balance if the current balance is what we think it is. Otherwise, some other thread has updated it. So there's some really interesting stuff they can do, and they're guaranteeing those operations across the cluster. And it's also got a transactional engine as well. So that's really exciting.

Running out of time. HBase and Hadoop: you don't have any of these problems. Don't worry about it. You probably don't wanna have any of these problems, because you just end up needing to install every fucking thing the Apache foundation has ever made. And this isn't even the full list. This is like, you probably need those. I have a friend, he's a bit of a dick, and because he works in an actual big data organization, he goes, oh, you people with your small-to-medium data. So yeah, most of us, we don't have big data in any sense of the word really. Like if it's got a GB on the end of it, nah, you're not there yet. So again, Facebook is using the hell out of this stuff, and this is all out of date. They just can't buy hard disks fast enough now. It's crazy.

Yeah, there was a punchline at the end of all of that. But my friend, the guy who I said was a bit of a dick, he recommends having a look at this. And this is his quote: if you want to appear really cool and underground, then I reckon the next big thing is the Berkeley Data Analytics Stack. So there's a whole bunch of people who are looking at that crazy big data situation and trying to work out what that means and what the future is. And so Apache and Berkeley are kind of in a cold war over that at the moment. And then there's heaps of people in the enterprise space, because you can sell lots of products and services to large companies who think they have a big data problem. So that's cool. That's fine.
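
Circling back to the conditional put described for HyperDex: the code from the slide isn't in this transcript, so here is a minimal Python sketch of the general compare-and-set idea it describes. This is a single-process toy, not the HyperDex client API; `Store`, `cond_put` and `deposit` are hypothetical names, and a lock stands in for the cluster-wide guarantee.

```python
import threading


class Store:
    """In-memory key-value store with a conditional put (toy model)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def cond_put(self, key, expected, new_value):
        """Write new_value only if the stored value still equals `expected`."""
        with self._lock:
            if self._data.get(key) != expected:
                return False        # someone else updated it first
            self._data[key] = new_value
            return True


def deposit(store, key, amount):
    """Retry until our update applies on top of the value we actually read."""
    while True:
        current = store.get(key)
        if store.cond_put(key, current, current + amount):
            return current + amount
```

A caller that read a balance of 10 and then loses the race gets `False` back from `cond_put` rather than silently overwriting the other thread's update, and can reread and retry as `deposit` does.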
This is just a little thing that's an embeddable document key-value store. It's just kind of a fun toy, it has an API that looks very similar to the Mongo one, and it just sits in process.

Elasticsearch. Every time I use it, I think, why can't you be my database? It's awesome. But it loses a couple of points there because of its configurability. It works when you know how to make it work, and it's crazy complicated sometimes.

So anyway, four minutes over technically, I think. Yeah. So that's good. That's databases in a nutshell. I've been Toby Hede. I'm around the conference if you want to talk about databases. I think of myself as a lepidopterist, a butterfly collector I guess is the word I'm looking for, of databases. Yeah, so come and say hi. Cool.