I don't know the speaker very well; I just remember the black vodka at LCA 2007 in Melbourne, so I'm hoping it's going to come out again at this LCA. Today he's talking about dropping acid, which is an interesting one, and actually eating data in the Web 2.0 cloud. So let me introduce Stewart Smith.

Thank you. So the full title of the talk is, of course, Dropping ACID: Eating Data in a Web 2.0 Cloud World. It was pointed out to me that the only things really missing there are "software as a service" and "on demand", so it's positively wanky buzzword compliant, which always gets people in here because it sounds fun. Today I'm going to talk about exploring the data consistency models of systems. I'm not talking about the data models themselves or what applications they're good for, so this is not the session to attend if you want to learn how to make My Web 2.0 Cloud App for Next Generation Web Fanboy Technology Preview Beta. Of course, if you looked in the program, the title was truncated to "Dropping Acid". So if you are in fact here looking for sugar cubes, I'd say see me after, but it's being recorded. That said, some of these systems' data consistency models, you would swear, were designed by someone on some pretty bad acid.

I currently work for a company known as Rackspace. I'm a software engineer working on the Drizzle database server; Drizzle, as we call it, is a database for cloud, which is a nice little subtitle, where we forked off from the MySQL server ages ago and it now looks completely different and is awesome. Before Rackspace, I worked for Sun, where we did some work on Drizzle as well. I never worked for Oracle, so that's a win in my life. And before that was a crazy company called MySQL, where I worked on a product called MySQL Cluster, which was, or still is, a great distributed, high availability, low latency, pseudo real-time database system that probably runs everyone's mobile phone networks. That has a few availability requirements: when the cluster database goes down, 30 million people's mobile phones stop working. That's an interesting support call you never want to get. I worked on stuff there like geographic asynchronous replication, online add node, a bunch of the online backup stuff, getting statistics out of the cluster without affecting performance, and I even did a bunch of work on the Win32 port, which meant we had the best team name ever inside MySQL: a Windows Task Force. Yes, the team was called WTF. On the internal IRC server: /join wtf. It was great.

So this is a bit of a follow-up to a talk I did at linux.conf.au 2007 called Eat My Data: How Everybody Gets File IO Wrong. I gave it again at OSCON to a great packed room, talking about how everybody gets POSIX file IO wrong. The big takeaway, in case you didn't see it, was that most programmers cannot write a file safely to disk. Purely in user space, most people cannot write a piece of software where your file survives pressing the reset button. You also have to know that a world without failure does not exist. Computers crash; sometimes they fall off things and literally crash. It turns out that sometimes you have power failures. UPSes are not always uninterruptible. It turns out generators can not only fail, but also run out of fuel, or never get tested, and things like that. Brilliant, brilliant things. Another thing you should always be aware of: close and rename do not imply sync. If you believe that, you believe a lie.
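To make that concrete, here is a minimal sketch (mine, not from the talk's slides) of the safe sequence in Python; the fsync calls, including the one on the directory, are the steps everybody skips:

```python
import os

def safe_write(path, data):
    """Atomically replace `path` with `data` (bytes), so the file
    survives pressing the reset button. A sketch: real code would
    also loop on short writes."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)       # flush file contents to stable storage
                           # (on Mac OS X even this is not enough: you
                           # need fcntl F_FULLFSYNC, more on that soon)
    finally:
        os.close(fd)
    os.rename(tmp, path)   # atomic replace on POSIX filesystems
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)      # close and rename do NOT imply sync: the
                           # directory entry needs flushing too
    finally:
        os.close(dfd)
```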
So: write, close, rename is fail, you're going to eat data. Write, fsync, close, rename is much more win. The other thing to remember, of course, is that Apple hates you, and Mac OS X is a piece of shit on which fsync does not actually work. The only reason to use it is that you hate your data and want it to go away. If you're wondering why I'm bashing OS X, there's a very good reason: at some point you realize your database system is doing more transactions a second than the disk makes rotations a second, and you go, what's going on?

The POSIX standard doesn't help us either. An fsync that does nothing and returns success is a completely POSIX-compliant fsync implementation. So if you think you're fucked writing software, the standards guys haven't helped one bit. And before you go "oh my god, us open source crazy hippies are getting it all wrong": don't worry, the big people get it wrong as well. Win32's GetFileSize actually makes it impossible to find out whether an error occurred if your file is over 2^32 bytes: the value you get back may be your file's size, or it may be an error. And the replacement you're meant to use is GetFileSizeEx, as in ex-operating system, as in "I broke up with my operating system because it didn't tell me how errors worked". So not everything is simple, and people get it wrong writing a 15 line or 100 line piece of code against a POSIX API you would think was pretty well known and simple.

So let's think about the modern web. The modern web has even more parts, with lots of software written in even higher level languages, and sometimes you get parts written by what can only be described as monkeys. Even worse, you get some parts of the modern web written by people like me doing front end code: if you ever look at my PHP, it's very obvious it was written by a systems developer, not a web developer. Now think about the modern cloud. It has many, many, many parts, because now you're definitely talking about multiple machines, and since we can't get it right on a single machine, obviously we're going to get it right on many. Apart from the fact that "cloud" was the word invented because it was the only shape left in PowerPoint that we hadn't used for something, the extra layer of cloud abstraction obviously gets everything right. And a bunch of these new systems have what I would refer to as dubious care for your data.

So let's look at what we're talking about here. The people who've been researching and implementing this for years, the grand old database people, when they're not telling you a tale about something great they did in the 60s, will tell you about something called ACID. Which of course does not mean free acid in a family hotel; I don't know why someone decided to put that on the front of it. Acid and other such fun chemicals that you should totally play with as a kid. I am talking about atomicity, consistency, isolation and durability. And it turns out that even people who write a lot of code against relational database systems still don't always really grok it, and different systems work differently. So it's a really interesting thing to look at.

So let's talk about distributed systems. There's a great thing called Brewer's theorem, known as CAP, which is a much less humorous acronym.
The theorem is that it's impossible for a distributed algorithm (not a distributed system, a single algorithm) to provide all three of consistency, availability and partition tolerance. Consistency: all nodes see the data at the same time, so when you update one node, you see the update everywhere instantly. Availability: node failure doesn't prevent the survivors from continuing to serve everything, so a machine can fail and you're okay. And partition tolerance: cut the network in half, what happens? So that's the fun thing there.

So let's look at traditional relational database management systems, our RDBMSes, before we go on to the radically different new systems that are fully buzzword compliant and that all the cool young kids in backwards caps are using. Let's look at MySQL slash MariaDB. MariaDB is a fork of MySQL maintained by Monty Program, and we can essentially look at the two in the same light. We're going to talk about the InnoDB storage engine; I don't care about non-transactional engines. You need concurrency, you need transactions: go with it. So we're not going to talk about MyISAM, and there's an oft-repeated quote about the new databases we're going to hang out with shortly: "reinventing MyISAM". You know: non-transactional, single-threaded, not crash safe. Reinventing that is not a feature.

So, commit. When you type COMMIT in a database system and you get back success, that usually means, and in InnoDB it definitely means unless you've changed the configuration, that commit is commit to disk. COMMIT returning success means your transaction survives the "have you tried turning it off and on again" test: get the return from COMMIT, press the reset button, and your transaction is still there. Even on OS X, which tries really hard to screw you; there are pages of code in there just to deal with OS X.

You also get consistency. You get great things like MVCC, multi-version concurrency control, so you can get a consistent view of the database, which means you can actually take a backup that makes sense, unlike people doing fast filesystem-level backups who think those are consistent while software is running. So you actually get consistent backups, which is a really nice feature. And you can have constraints in the system to ensure your data is logically consistent from your application's point of view, which helps prevent naughty programmers writing crap data into your database, and helps prevent buggy front end code doing something wrong. So you have a nice level of constraints in there.

So let's talk about eating data in relational database systems. How do we lose data in the RDBMSes we're using for our web apps? Well, look at MySQL's replication. MySQL's replication is asynchronous: you commit to local disk. You do a transaction, you type COMMIT, you get back success, and that means that database server has written the transaction to local storage and it's safe across a crash. If the master vanishes, for example someone trips over the power cord (because for some reason you have power cords lying loose in your data center for people to trip over), what happens then? What can you do with a database cluster using async replication? Option one: convert the system to read-only until you fix the master and bring it back up. Lots of people do that. It's really simple.
Pretty much anyone can get that right, and it's really good. You could do a DRBD-type setup, a replicated block device underneath, so you fail over to another set of hardware when part of the hardware fails. And the other option, which people do for increasingly complex systems, is something called slave promotion: you have a master and a slave, the master fails, you name the slave the new master and have everyone replicate off it. There's a bunch of scripts and tools around to do that.

The interesting thing about slave promotion: remember how I said that commit goes to local disk and is then asynchronously replicated? When you do slave promotion, you can have transactions that committed on the master and returned success to the client, and then on master failure you do slave promotion and there are some transactions that clients think were committed but the slave does not have. So you lose some data. Therefore: data eaten. Brilliantly, asynchronously, replicated. This also does not get re-synced if you bring the old master back up; it doesn't work that way. Reconciling that doesn't really exist. Good luck.

So the usual reply is: but, but, but, synchronous replication will solve all that! Yeah, except for performance. To do synchronous replication, of course, you need high speed, low latency networks to make it work well. MySQL Cluster does synchronous replication, two-phase commit across a number of machines. And here's the thing: distributed databases are really easy to write. Distributed databases that are fault tolerant and survive node failure are very hard. Distributed systems: easy. Fault tolerant ones: hard.

So synchronous replication is something almost no one does, because they still want speed. MySQL Cluster does it, but it has disadvantages: it's a different storage engine, not suited to every workload. And you get a different consistency model from it, in that commit does not mean safe on local disk. Commit means the transaction is in the memory of multiple machines; the cluster does periodic flushing to disk, and commit means the transaction survives node failure. So you already have a different consistency model there. You could say everything is safe when individual machines crash, but if you lose power to your whole cluster at once, you will lose the last second or so of transactions. Note that what you get back is still consistent: it is an actually consistent point from that last second, not some random collection of whatever it decided to flush. So it does pass the turn-it-off-and-on-again test, in that if you power cycle your whole cluster, you get back a consistent state with a known window of things that may have gone away.

What about semi-synchronous replication? Semi-synchronous replication originally started in the Google patches to MySQL, and the Facebook patches, and it finally made it into a MySQL release recently. The idea is: commit to local disk, but don't return success to the client application until at least one replica has that part of the replication log. That means if the master fails, the slave has the transaction. So if the master fails and you do slave promotion, that slave will have every transaction for which the client got a response saying "yes, commit succeeded".
But of course what may have happened is that if you bring that old master back up, it may have transactions committed to its local disk for which the clients never got a response saying commit was successful: committed to disk on the master, but absent on the slave. So you don't bring the old master back up as-is; you wipe the machine and turn it into a new slave. So there's a different type of consistency going on there, but it does mean that every transaction a client saw acknowledged as committed is actually on disk somewhere.

In Drizzle, we're working on a bunch of stuff in replication, which will be pretty exciting. One of the things I'm really excited about is putting the replication log inside the database engine, so we save a whole bunch of fsyncs. It turns out that if you actually want crash-safe replication with a separate replication log in MySQL, it's about four or five fsyncs per transaction. That's way too many; it should be a single one. So that would be really nice to have.

Relational database systems are a well-known problem. Who here goes: yep, relational database systems, I know how to run those, really well-known problem? And when you have a really tricky problem to solve, who are you gonna call? You have an idea. The big advantage is that finding people who know how to handle these systems is really easy, and you also have lots of people who know how to make them scale.

So what about this other world? Rebels who think that having a well-defined, semi-kind-of-ish standardized language for doing generic queries, for doing new things with your data, is obviously a bad idea. It has some great things going for it. You know, it's buzzword compliant. It's more letters, so you have to type more. It introduces the lowercase character, where SQL is SCREAMING AT YOU AS ALWAYS, an assault to the eyeballs. My theory is that SQL isn't the actual problem here: existing relational database systems are the problem. The implementations are the problem, not having a nice generic query language. So you've had a bunch of systems throw the baby out with the bathwater, going "let's get rid of it all", when what they really wanted was a better implementation of a database.

And you get lovely examples of weird ways to query your data. For example: how do I query the database? It's not a database, it's a key-value store. Okay, it's not a database; how do I query it? Oh, it's easy: you write a distributed MapReduce function in Erlang. Did you just tell me to go fuck myself? I believe I did, Bob. Writing efficient distributed MapReduce functions in Erlang is obviously what everyone gets taught in college, so it's very easy to hire people who can do that efficiently, but having a generic query language is of course hard. So there's a usability point, and a knowledge and hiring point, that makes this tricky.

So what's the instant reply? Oh, but MongoDB is web scale. If you have not seen that YouTube video, just go watch it; "MongoDB is web scale" is brilliant. And it is of course a clearly bullshit argument. Many, many, many large-scale websites run large amounts of data and transactions per second through relational database management systems. They reliably work, they're a very known quantity, and they perform great. You do not need something 100% radically different to be able to do high performance.
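To illustrate that usability point, here's the same question asked twice: once as SQL, once as hand-rolled map/reduce over a key-value store. A toy, single-node Python sketch of mine (the real thing would be distributed, and in Erlang, which is rather the point):

```python
# The same question, two ways. SQL, one line, runs on any RDBMS:
#
#   SELECT venue, COUNT(*) FROM checkins WHERE user_id = 42 GROUP BY venue;
#
# versus writing the map and reduce steps yourself:

from collections import defaultdict

def map_fn(key, checkin):
    # emit (venue, 1) for every checkin belonging to user 42
    if checkin["user_id"] == 42:
        yield checkin["venue"], 1

def reduce_fn(venue, counts):
    return venue, sum(counts)

def run(store):
    buckets = defaultdict(list)
    for key, value in store.items():
        for k, v in map_fn(key, value):
            buckets[k].append(v)
    return [reduce_fn(k, vs) for k, vs in buckets.items()]

print(run({"c1": {"user_id": 42, "venue": "Joes"},
           "c2": {"user_id": 7, "venue": "Bar"}}))
```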
And of course, this is my greatest test for any system: you should be able to simply tell someone "have you tried turning it off and on again?" and have some assurance that doing that will not erase all their data, that you'll get something back. On this test, MySQL mostly passes. There are various situations where, if you pull the plug on MySQL, you'll end up with some stuff you have to repair manually. Yay. In Drizzle, we fixed a whole bunch of those. So, yay me. MySQL's replication, again, mostly passes: you can still get some corruption here and there, but you can usually hand-edit text files to fix it. MySQL's semi-sync does even better, and there are a bunch of patches out there in various community branches that make it much more solid. Postgres passes a lot as well. And Drizzle passes more so.

So let's look at the NoSQL cases. We know that if we turn our MySQL server off and on again, we have a pretty good chance of getting it back. If you write a POSIX app doing fsync, we know how that works. So what are we going to do with these? You look at them and you realize: oh my god, restarting the process after a crash is actually the equivalent of a file system check. I missed those, because all of us really want to run ext2 and have to check all that data every time the machine crashes, right? So my question is: does your database or data storage system have to do a consistency check, where you kind of hope to repair the damage?

This is, of course, different from log replay. If InnoDB (under MySQL, MariaDB, Drizzle, whatever) crashes, you restart it and there's a redo log that needs replaying, and it replays the log. You spend a bunch of time turning the database back into a consistent state. That's different: you end up in a known place and everything is consistent at that level. Though I do say to people: do not assume that it's instantaneous. I've seen it take hours if you really have a lot going on. Also interesting: re-warming your caches. If you have a 16 gig buffer pool holding all the random pages on disk that you actually access all the time, reading those back in can take a significant amount of time. So not only do you want recovery to happen quickly and deterministically, you also want to be able to start serving stuff out quickly again. That's an interesting problem.

And I promise the entire talk will not be ripping on MongoDB. So, an interesting thing about MongoDB, you know, it's web scale: the nodes themselves are not crash safe. Turning MongoDB off and on again leaves you with corrupted random shit. It doesn't even take that; you can just kill -9 the process. And the reaction, if you post "I kill -9'd MongoDB and now my database is unrecoverable and I have to restore from backup", is that you'll basically get told to bugger off, you've done something wrong. Which seems kind of weird to me. The other thing it does is memory-mapped IO, and if you've ever tried to detect IO errors through mmap'd IO: good luck. In MongoDB 1.8 they have got around to writing a journal, so you can actually replay it and get back to a recently consistent state. Problem is, of course, that's not the default, and no database system has ever gotten shit for not having its transactional engine as the default (cough, MySQL).
So think of it as running ext2 with the sync commands disabled as your data storage layer. But, and I'm surprised no one supporting MongoDB in the room has jumped up yet, you're meant to have replication! So you have a machine next door keeping a replica. And then you realize: oh wait, you're going to lose power to the whole rack of machines at once, so you really want the replica on a separate power supply unit in the data center. And oh, that could still be the same power feed, so really it should be in another data center. Oh wait, the city could flood and all my machines could go off at the same time, so I'd better put it in another city. Oh wait, they have links sharing power between those; I'd better put it on a different continent, because with a whole bunch of water around it you're pretty sure they're not sharing power lines, and it's not as if large chunks of the US are known to fall off the power grid for extended periods either. So you can do great replication across continents. Obviously the correct way to do it.

There's also the great detail that operations on a master are asynchronously replicated to the slave, so you still have the problem of clients who received "yes, I wrote that", and then the master goes away, you do slave promotion, and stuff you thought you had written has vanished. And that can literally give people the shits. Literally, as in: you check in on Foursquare, and halfway through your meal someone messages you: that place gives everyone food poisoning. And you go, crap. Two days later, when you want to make sure you never go back to that place again, you look in your Foursquare check-in history. Oh look, it's gone. Thanks, MongoDB: I have no idea which place I should never eat at again. You'd think Foursquare check-ins were not valuable data; it turns out they can be. Of course, you could write your data to /dev/null, which is definitely faster. It's web scale. Then again, someone could remove the /dev/null node from the /dev hierarchy and cause you some application problems. So, you know, sysadmins: make sure you still have /dev/null.

Cassandra is a log-structured, column-family data model, similar to BigTable's, and it's an eventually consistent system. So it's a bit of a different consistency model from your traditional RDBMS, as well as a different data model. It's designed for more OLTP-style things and serving interactive data; it's not a big data warehouse thing, it's for doing stuff interactively. I'd encourage people to check it out; it's a pretty cool system. And it has something really interesting: as part of its API, it has this concept of a consistency level. When you issue an operation to Cassandra, you say what consistency level you would like with it. You can say: I want to do a write here with consistency level ALL, which means you get back success only when all the nodes responsible for a copy of that data have written it. You can also use consistency level ONE, which is basically: as long as one place somewhere has written it, return okay. And you can easily start to think about what kind of data that's fine for. How many users are currently logged in? Who really cares. Whereas if you're doing a more important update, you want more copies of it out there.
You can also do consistency level QUORUM, which is basically a majority-rules approach. And consistency level ALL, for example, is where the eventual consistency comes into play. Say you have a three-node Cassandra system, you do a consistency level ALL write, and it manages to write to two nodes but fails on the third. You get failure back from the ALL write, because not everything has written it, but your update may still have been applied on those two nodes. What happens when you go to read that data at a consistency level that requires a majority? Say you read with consistency level QUORUM and two of the three nodes have the data. Cassandra goes: okay, majority rules, two of these have a more recent timestamp than this other node, and then it does a read repair to bring that other node up to date. So rather than "oh look, node failure happened, I have to perform a consistency operation over everything at once", it reconstructs consistency as you do reads, which is another interesting model. It makes node failure more lightweight and load-balances the recovery out over time. And there's something they refer to as consistency level ANY, which is useful for writes: as long as the write was received somewhere, I'm happy. But if that node explodes before the write is transmitted to another node, yep, that write's gone. So you have this nice flexibility around how much you care about your data, and you make the trade-off at the API level, which is kind of cool.

You'll also be asking: what happens if your whole cluster fails? On whole-cluster failure in a Cassandra system: basically, it writes things into an in-memory table and an in-memory log, and periodically flushes that log to disk. So if you turn everything off and back on again, as in pulling the power plug, you'll get some recent consistent version of everything. There is also an option for Cassandra that means writes are flushed to disk right then; with that, if you turn it off and on again by pulling the plug, you get back exactly the last write that was acknowledged to the client. So you have configurability over whether you're durable to machine failure or to whole-cluster failure, which is kind of nice.

CouchDB is another interesting one that's used a whole bunch. It's log structured, properly, traditionally log structured. Basically, every so often it writes a checkpoint at the end of the file, and it just keeps appending new things on the end, and no one ever runs out of disk space because disks are infinite. When you start it back up after a crash, it goes to the end of the file, scans back for the marker of the last consistent checkpoint, and goes from there. So recovery time is near instant, which is exactly what you want sometimes. The downside is that it has no auto-compaction; you just keep appending on the end. And that can give you another type of eating your data: your disk fills up. Because this is what your Twitter app was using for storage on your desktop. Happened to me. It's like: why? And this could, of course, happen on your server too. You have to worry about running out of disk.
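Here's a toy Python sketch of mine of that log-structured idea (nothing to do with CouchDB's actual file format): every write is an append, and crash recovery is just scanning for the last intact record:

```python
import json, os, zlib

class AppendOnlyStore:
    """Toy log-structured store: writes are appended, never overwritten.
    Recovery means stopping at the last record whose checksum matches."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self._recover()

    def put(self, key, value):
        rec = json.dumps([key, value]).encode()
        line = b"%08x %08x " % (len(rec), zlib.crc32(rec)) + rec + b"\n"
        with open(self.path, "ab") as f:
            f.write(line)
            f.flush()
            os.fsync(f.fileno())  # durable, at the cost of an fsync per write
        self.data[key] = value

    def _recover(self):
        if not os.path.exists(self.path):
            return
        with open(self.path, "rb") as f:
            for line in f:
                try:
                    length, crc, rec = line.split(b" ", 2)
                    rec = rec.rstrip(b"\n")
                    if int(length, 16) != len(rec) or zlib.crc32(rec) != int(crc, 16):
                        break  # torn tail write: stop at the last good record
                    key, value = json.loads(rec)
                    self.data[key] = value
                except ValueError:
                    break
```

Note that the file only ever grows; compaction is the bit this toy leaves out, which is exactly how the disk fills up.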
Another interesting thing about CouchDB: typically you query everything through views. Rather than just using a document ID, you create a view, which is a program (a map function), and query that instead. Views are not materialized; they're not created when you say "create view". They're created on the first read. This means the first query to a view is very slow, as it goes through all the documents and builds the view. And they're also updated on read: if you have a view that's accessed once a week, it's not updated on each document write, it's updated when you read from the view. So it doesn't exactly eat your data, but it does eat your user-perceived performance, and people have to do tricks around that. You have to be aware of how some of these systems work to really deliver an interactive experience. The old adage is that 200 milliseconds is about the threshold for something to be perceived as non-instant; if you're going to spend 10 minutes building the view before you can return the query to your user, that's very non-instant, and essentially the user will say "my data's suddenly vanished". Another downside is the common website thing, pagination: click next for the next 10 results. That's actually really hard to do in CouchDB. So there are a few awkward APIs. But it's fairly nice; desktopcouch wraps a whole bunch of things pretty nicely, and you get something that controls the daemon, which is kind of nice. The downside, of course, being no auto-compaction, and nobody ever gave Postgres shit for having to run VACUUM. So that's fine.

Has everyone heard of HBase? It's kind of really neat. HBase is a database system built on top of the Hadoop file system, and it has the really nice properties of being distributed and being able to do more data-archiving style queries. Writes into HBase go into a write-ahead log in memory, and periodically that's flushed out to the Hadoop file system. So if you turn everything off and back on again by pulling the plug, you get a recent consistent version. The redo log, the recovery log, is stored in the Hadoop file system, which is itself distributed, so you get backup copies of the log. Typically HBase is structured so that, for lack of a better term, a single master serves a given section of data: this node is in charge of serving this chunk. If that node fails, you need to start up another machine and have it assume that role. When that happens, it just goes to the Hadoop file system, asks "what's the latest log I need to run for recovery?", retrieves it, runs it, and you're done. There's also this cool idea around data centers: data from one HBase cluster can be fed into a replication log to be replicated to another data center. So it has some knowledge that data centers are far apart, and handles that. Is that a question? Nope, just a wave. Cool.

The trade-off in HBase, of course, is availability. If a region server, as HBase calls them, fails and it hasn't flushed its section of the redo log to HDFS, those transactions are lost. And on the availability side, you have to start up another region server and wait for it to recover, so it's not instantaneous failover like in some other systems.
Cassandra, on the other hand, will keep serving everything that remains, pretty much until you've killed every node in the system. It will keep serving as much as possible for as long as possible.

The other thing I'd like to talk about is transactions. Who here uses transactions in their database system and thinks they're brilliant? Yeah. A bunch of the NoSQL systems do away with them, because they're hard, or not performant, or they don't like the syntax or something. And some will go: but we've got atomic ops, you know, we've got compare-and-exchange. And of course everyone goes: yeah, it's really easy to build my hotel booking application using compare-and-exchange. That's why I write all my software in assembler, with cmpxchg8b as the primary way to synchronize between threads. Why would you use anything else? Atomic ops suck; people get them wrong. The more of this you push up to application programmers, the harder it is for everyone to get it right. Transactions, by contrast, are a very simple concept: type BEGIN, do a whole bunch of operations, and then COMMIT or ROLLBACK, which will either apply your operations or get rid of them and take you back to a consistent state. The transaction is applied fully or not at all. There: you can explain transactions to anyone in a bar, no matter what. (There's a minimal code sketch of exactly this below.) Getting actual consistency in your data with atomic ops, or with nothing at all, can be quite hard. Because remember: if you have a page doing a bunch of operations for a user, and the web server running that code crashes, on a non-transactional system the work is half applied, before your database layer has had anything go wrong with it at all, because your front ends are throwaway boxes that aren't on UPSes. On a transactional system it goes: oh look, the connection went away, I'll roll it back. Everything's consistent and everyone's happy. So transactions are a really useful feature, and a bunch of NoSQL systems throw them away, which is a bit unfortunate. Some of them do provide some level of this, which is kind of nice, because of course no one ever got crap for not having transactions. It's just wrong.

One thing here: I'm talking about a bunch of NoSQL systems and a bunch of consistency models. If this stuff were supremely obvious and everyone got it, no one would ever accept a conference talk about consistency models in various storage systems. Programmers get shit wrong. The easier something is to get right, the more likely it is to be right. It's not necessarily about being easy to use, because let's face it, SQL is sometimes hard to use and hard to make perform well; optimizing it is a black art in itself. The thing we want is for systems to be hard to misuse. Simple properties like: this is how transactions work; when you get COMMIT back, your data survives these specific types of failure. Having that simple to understand is a great benefit when you have large numbers of people writing code, usually very quickly, to do things like this. And you can then reliably explain what happens and what goes on.

The other thing I'd mention, which is not so much eating your data as moving it around your plate and feeding it to the dog (because, you know, it comes back eventually), is schema-free. Who thinks designing schemas for database systems is really annoying? Sure. But you do know where everything is.
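Since I just claimed you can explain transactions to anyone in a bar, here's the bar explanation as code: a minimal sketch of mine using SQLite, purely because it ships with Python (the point isn't any particular engine):

```python
import sqlite3

conn = sqlite3.connect("hotel.db", isolation_level=None)  # manage txns by hand
conn.execute("CREATE TABLE IF NOT EXISTS bookings (room INTEGER, guest TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS ledger (guest TEXT, amount INTEGER)")

def book_room(conn, room, guest, price):
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        cur.execute("INSERT INTO bookings VALUES (?, ?)", (room, guest))
        cur.execute("INSERT INTO ledger VALUES (?, ?)", (guest, price))
        cur.execute("COMMIT")    # both rows are in, durably...
    except Exception:
        cur.execute("ROLLBACK")  # ...or neither is. Never just one.
        raise
```

If the process dies between BEGIN and COMMIT (say the web server running this crashes mid-request), the journal rolls the half-done work back on the next open: nothing is ever half applied.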
So, back to schemas: with a schema, you have a design there. The risk with schema-free stuff is that you add bits on here and there, and you end up with the documentation for how your data is stored being your entire app. And you go: yeah, we're sure we saw that somewhere. I don't know, where was it? Where is it? So there is an advantage to schemas. The problem is how most systems implement schemas. Look at MySQL, for instance: ALTER TABLE. ALTER TABLE is blocking, and it copies the whole table. So it's not that schemas are the real annoyance here; it's that if you want to change the schema, it's going to take three days to copy all the data. Again, we have a problem with the implementation of systems. One of the great things about schema-free systems is that you can generally just add things on as you go, and we really need to fix databases so they can do that too, because right now it's annoying. Otherwise, there's the risk of "I know where everything is in my schema-free store, please do not touch". These things do require design; schema-free does not mean you get to skip that step.

Next, I'm going to say: naughty virtual machines. We noted we had enough problems just writing code against the simple POSIX API to get data to disk in the first place. Now, once you've got beyond that, you say: I'm not even using the POSIX API directly, I'm using a database system. I know it flushes to disk and it's all good; it calls fsync and everything. And then: we're going to run it in a virtualized environment! It's going to be great! Then your hypervisor decides to optimize your system for you, because it runs a lot faster if you disable the commands that flush data to disk. Ignoring write barriers and the like is obviously a feature that makes your VirtualBox system run fast, and probably other hypervisors too. So you have to be careful: running on top of environments like this, you can do everything correctly yourself, but the machine lies to you, and it will screw you on that front, and you can end up with very odd and subtle corruption. Or, you know, you could just be running OS X, where fsync can never really work, so you're equally screwed. Also look at systems like EC2: EC2 instances don't persist local storage, you have to use Elastic Block Store volumes for that. So you have to think about that as well; you always have to know what you're running on. Running on these new kinds of systems does require some thought.

And if you ever look at something and go, "wait, that's way too fast": you can actually do the math on how quickly the disk is plausibly rotating. A 7,200 RPM disk makes 120 revolutions a second, so it can't durably commit to the same place much more often than that; if you're seeing more, odds are something's awry. One machine is not automatically an order of magnitude faster. Have a look. SSDs, of course, make that math more interesting to work out.

So, in summary, I'd say there are two things that are going to eat your data. Bad software: software that's not written correctly, or that doesn't make it easy to get things right. And bad people, though really it's just that people make mistakes. You write code that isn't correct, or you don't follow the consistency model, or something is so horrendously complex that you can't get it right. Or someone says: look, if I mount the file system with nobarrier, all my file system operations are really quick! Oh look, if I run the database server with the no-sync option, it's really quick! And away you go.
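That "too fast to be true" check is easy to turn into a probe. A rough sketch of mine; the 7,200 RPM figure and the 512-byte record are assumptions, and an SSD will legitimately blow past this:

```python
import os, time

# A 7,200 RPM disk makes 120 revolutions a second, so a single spinning
# disk can't durably commit to the same spot much more than ~120 times a
# second. If this prints thousands, something in the stack (hypervisor,
# nobarrier mount, lying drive cache, OS X fsync) is not really flushing.
fd = os.open("probe.dat", os.O_WRONLY | os.O_CREAT, 0o644)
n = 200
start = time.time()
for _ in range(n):
    os.pwrite(fd, b"x" * 512, 0)
    os.fsync(fd)
elapsed = time.time() - start
os.close(fd)
print("%.0f durable writes/second" % (n / elapsed))
```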
So the two things you really have to watch are: what software you're using, and how that software is implemented. As in: is it providing tools that you, as people, can actually use to get things right? And this is the big thing: right tools, right job. Sometimes you don't care about the data. On the dot-point list in the abstract, I said I'd talk about the consistency model of memcached. It is a perfectly valid consistency and persistence model: there is no persistence. If the machine goes, that bit of data goes, but it's really fast. And sometimes that's what you care about. If you're just keeping a counter of how many people are logged in, you can easily recreate it by looking at what open sessions you have in the database, and memcached is really, really quick. Sometimes that's what you want. So: use the right software for the right thing, with the right consistency model for the right data. Or use something like Cassandra, which gives you config options on each operation, which means you have to think more at each step of the way, but sometimes you really don't care and sometimes you really do. Making that distinction is how you make performant applications, especially for large scale websites, or really any website, because they grow; everyone's going to have a million users next week. The scary thing is, you never think you're going to have a million users next week, and of course that means you do. Unless you plan for it, in which case you don't.

So think about it. Think about the different systems and what they're used for, then educate people and talk about: do we care about this data? What happens in the various failure modes? Thinking about failure modes is something no one ever does, and what's even worse, no one ever tests for them. My current campaign is that we should never, ever have any error handling code for memory allocation failure; instead, we should just dump core. Why? Because that is known to work. All those error handling paths, I can guarantee you, are never tested. Turning the machine off and on again and running recovery: that's tested. Crash-only software, in that way, is also pretty cool. And of course you might think the cloud is magical and everything's persisted everywhere. No: data can be deliciously eaten in the cloud as well. This is not something that's magically solved for us. And I believe we do have some time for questions.

So, some insane person wrote a NoSQL-type interface to InnoDB that runs as a MySQL server plugin. It's called HandlerSocket: a module you load into the MySQL server, which gives you two TCP ports, one for reading and one for writing. The beauty of this is that it works against the same InnoDB database you also have access to via SQL. So you can do very, very fast key operations (key being indexes): reads, updates, deletes through a very fast path that doesn't require parsing SQL, while accessing exactly the same data via SQL. And this is where I think the future is for relational databases: multiple interfaces. You have interfaces that are key-based, doing very simple operations very, very fast, saving the cost of parsing and optimizing SQL when you know the operations are small, and you still have this great flexible query language for GROUP BY and the like, which turns out to be really useful as well.
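The shape of that dual-interface idea, sketched in Python over SQLite. To be clear, this is not HandlerSocket's actual protocol or API, just the concept of a narrow key path and a generic SQL path over the same data:

```python
import sqlite3

class DualStore:
    """Same data, two interfaces. In the real HandlerSocket, the key
    path is a separate wire protocol into InnoDB that never touches
    the SQL parser; SQLite here still parses, so this only sketches
    the shape of the API, not the performance win."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")

    # narrow key-value fast path: fixed primary-key lookups only
    def get(self, key):
        row = self.conn.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return row[0] if row else None

    def put(self, key, value):
        with self.conn:  # transaction
            self.conn.execute(
                "INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))

    # full SQL for everything else: GROUP BY, ad-hoc queries, reporting
    def sql(self, query, args=()):
        return self.conn.execute(query, args).fetchall()

store = DualStore(":memory:")
store.put("user:42", "stewart")
print(store.get("user:42"))                  # key path
print(store.sql("SELECT COUNT(*) FROM kv"))  # generic query path
```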
So you have a dual approach there.

This is less of a question, more of a comment. I'd just like to thank you for doing this talk, because hopefully it highlights some of these issues. I've encountered far too many people jumping on this NoSQL bandwagon just because it's cool, without actually understanding the data models and what they're useful for. A couple of months ago there was a project going "we're going to use Membase" (Membase being like memcached with persistent storage tacked on), "key-value, it's awesome". I hadn't looked at it, so I took a look, and I said: you realize it doesn't have transactions, right? And they said: no, no, it's fine, it's got atomic operations. And I'm like...

Yeah, it makes it much harder to get right. I mean, no one's ever had something deadlock in software, and no one's ever had to insert a lock dependency checker into a large piece of software to find those problems. Locks and atomic ops are hard.

Okay, so what I've seen with people using NoSQL is that they often use it to try to avoid joins. Whereas if you have structured types, say in Postgres for instance, you can define a type which is a tuple and then use that as a column in other tables, and you can make it go turtles all the way down. Do you think more common implementations of that sort of thing would help avoid this NoSQL craze?

It's interesting. Sometimes denormalization is very good for performance, which is essentially what you're describing. Performing a join means you have two tables that are usually in different parts of the disk, because that's the naive way to implement table-based data storage: table one over here on disk, table two over there, and a join reads from here, then there, here, then there. And in case you hadn't worked it out, random IO sucks. So what you can do is denormalize the data for very common joins, just keep it updated in another table as well. This is a common thing people do in consulting: it's quite easy, just denormalize this table and then it's sequential reads; or use a tuple data type and then it's sequential reads, avoiding the join that way. Or you start to get systems (there are more around now) where people are trying to work out better ways of storing the data in a relational database so that joins become less expensive: actually analyze what's going on in the joins and change the storage layer underneath to magically put those rows sequentially on disk, which is also interesting. So again, it comes down to the implementation versus the conceptual model, and you can work around the implementation problems by using a tuple data type, or by actually materializing the denormalized data and even using triggers to keep it updated; there's a small sketch of that below.

You mentioned Cassandra has tunable read and write consistency. BigCouch, which is CouchDB clustering software that makes CouchDB do things like partitioning and sharding, also supports tunable read and write consistency.

So BigCouch, a CouchDB thing that does partitioning and large scale distributed couch stuff, also has read and write consistency options, which is excellent. I wasn't aware of that, so they should possibly write it up in the CouchDB docs.
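Going back to the denormalization answer for a moment, here's a small sketch of that last option: keeping a denormalized copy up to date with a trigger. SQLite syntax, my example; Postgres triggers differ in detail:

```sql
-- normalized: a join, i.e. scattered random reads on spinning disk
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (user_id INTEGER, item TEXT);

-- denormalized copy for the hot query: sequential reads, no join
CREATE TABLE orders_with_name (user_id INTEGER, name TEXT, item TEXT);

-- keep the copy up to date automatically
CREATE TRIGGER denorm AFTER INSERT ON orders BEGIN
  INSERT INTO orders_with_name
    SELECT NEW.user_id,
           (SELECT name FROM users WHERE id = NEW.user_id),
           NEW.item;
END;

-- the join...
SELECT u.name, o.item FROM users u JOIN orders o ON o.user_id = u.id;
-- ...and the denormalized, sequential-read equivalent
SELECT name, item FROM orders_with_name;
```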
There are like 10,000 of these NoSQL databases, and I know you're not going to be able to go into all of them, but I did want to get an update on Drizzle. What's going on with Drizzle?

I gave a talk at the data storage mini-conf, which I believe was recorded. Otherwise, I can talk afterwards and tell you the billion things that are going on.

Ajax has a question right up the back, too. Just because I feel this should be a fitness regime: running up and down stairs as much as possible. Don't fall on your face, it'll be really painful.

If fsync is this API that may not actually work, and I recognize people want to run on OS X because they're silly, have you considered going to the Linux kernel folks and saying: here are the synchronization primitives we actually need out of our file systems, and the guarantees we want? Or is it just a matter of, well, fsync has this different behavior on ext3 and ext4 and we're just going to cope with it?

Yeah, I'm sort of torn between two things here. One of which is, yes, there are different synchronization primitives I'd like. One thing I'd really like, because you could do some interesting things with it, is write barriers in user space. Then you could do interesting things where you want a recent-consistency property without necessarily needing every write flushed to disk, and that could fit nicely into the IO scheduler; that could be really useful. And then immediately the next thought is: but people insist on running on other platforms as well. So you get to the point where you must also be able to emulate this API in user space, and at some point there's a trade-off. But I do think there's scope for designing much better synchronization primitives for file systems. And if you look at, unfortunately, Microsoft: they've done some interesting stuff with transactional NTFS, which means you can actually run transactions against the file system, which gets you really interesting things like crash-safe system software upgrades. Has anyone ever pulled the plug in the middle of an apt-get dist-upgrade and got back a machine that booted and that you could log in to? Windows is actually meant to do that now: at any point in the middle of an upgrade, you can pull the plug and it works. So I think there's room for that kind of API as well. The problem with making a transactional file system API is that it's really, really hard not to break every app in the entire world; but by adding small new primitives like these, I think there is room. Part of it is just getting the rest of the database system code to a point we can be proud of, and then exploring some of these new things. I have a few ideas bouncing around that I would like to test as soon as I get that free 10 minutes, but yeah, we want to work on new stuff.

Last question. NoSQL is actually a really, really old idea. What's your opinion of Berkeley DB?

It's an interesting implementation of the idea. Personally, I always loved how all the software that used Berkeley DB kept on having corruption. Well, I think a software project using Subversion for revision control is a good reason not to use that software project. That's possibly more flippant. Berkeley DB is not bad; for what it does, it's not bad. It's an old thing, and the codebase is rather large. I think there are much better libraries for similar things now.
I would instead look at Tokyo Cabinet or something first and try to get the consistency right there, or TDB, or any of the other embedded things, simply because the Berkeley DB API is a bit odd to me; it feels like it was written a long time ago. But yeah, NoSQL is very much not new. Putting it behind a network service is recently fashionable again; people have done that for a long time as well. I don't mind Berkeley DB too much. I just have no applications for it.

I was just going to say we're all out of time, but I'm sure Stewart can answer questions over lunch.

Oh, any time during the conference. This is my last talk, I'm relaxed, I will chat geek all the time. Even better if beer is brought to me: best way to get a question answered.

I'd just like to thank Stewart for talking to us today. This is a small gift from linux.conf.au. Thank you. Thank you.