He is the CTO of NuoDB, a database start-up in Cambridge, Massachusetts. The company was founded by Jim Starkey, who was the inventor of MVCC back when he was at DEC, and who has been involved in the New England database community for many years now. Seth was an undergrad at Brown, but he and I did not overlap. He worked at Sun Research for a while, then traveled the country doing other things, right? You make it sound like, yes, yes, I traveled the country gaining wisdom and... About databases. Databases. So what he's going to do today is give a big-picture view of where he sees the future of databases going, and where NuoDB fits in that context. So thanks, Seth, for being here. Thank you. So, yeah, I heard that the title of this talk had changed, and I thought all I had to do was show up and give someone a t-shirt and I was done, but since he's not here, I will give a talk. So, yeah, thanks, Andy, and thanks, everyone, for coming out. I gather there's stiff competition in this time slot, with food and people who want to hire you and whatever else, so thank you for being here. I will try to make this entertaining. A couple of comments. As Andy said, NuoDB is a start-up in Kendall Square, over by MIT. If you make it to Cambridge, let us know. Come by, hang out. We'd love to chat, hear what you're up to, and show you around Cambridge. We've been working on NuoDB as a company for about four years, and before that Jim Starkey, our technical founder, was working on the project for two or three years on his own. And really, the genesis of this project was looking at what was happening in the database industry eight or ten years ago.
There was this trend away from transactions, this trend away from SQL, and this increasing acceptance that you simply cannot make a relational, transactionally consistent system scale effectively when you're thinking about scale-out models, when you're thinking about running across lots of infrastructure. And he looked at that, and I think, having spent his entire life working on relational databases, took a little bit of offense to it and said, well, that's nonsense. SQL is just a declarative programming language. ACID is just a set of constraints about what you want your data to do and what should happen in the case of failures. And you probably can make that scale. And that's really where this project comes from. I don't want to just talk about NuoDB itself today, for a couple of reasons. One, because this is a university, and in a former life I used to work in research, and I really enjoyed getting out and working with universities and not having to give pitches, actually getting to talk about technology. And so I'd really rather do that today. I will talk about what NuoDB is, because I think that's interesting and it gives you some context for some problems that people are looking at today. But I don't want to focus just on what NuoDB is. I want to talk a little bit about where I think the industry is and what some of the interesting problems are that people will be looking at, say, over the next five years. The other reason I don't want to talk too deeply about what NuoDB is is that, fundamentally, what we've built is a very different architecture. This is not a slightly different take on the front end of a MySQL implementation. This is not a slightly different take on how you do data replication. This really is a fundamentally different architecture than any other database has taken. And as such, it's a little bit hard in an hour to go through and talk about all of the details.
And so what I'm going to do is talk about some of the high-level pieces of what we do, some of the problems that we try to solve, and where I think the industry is going. I'm actually going to do a little bit of a live demo, which I realize in a database talk is a slightly tenuous idea, but I'm going to try it, and hopefully it won't be too boring. And I invite anyone who wants to talk with me after the talk, and really wants to understand the deep details that I glossed over, to come find me, because it's all public. We've written about it extensively. It's not something I'm hiding; it's just not something I think I can squeeze into this talk today. So with all of that as context, the title of today's talk is Current and Future Challenges in Data Management. What I want to do is talk a little bit about what I think some of the current challenges are. Why the database space is so interesting today. Why people are really excited about it. Why universities that have had no database group for a number of years, or have had database groups that were fairly inactive, are looking at the area again. Why this is, at a place like where you are today, for example, such an interesting space. And then where this is taking us: what the directions are, what some of the problems are that people maybe are, or are not yet, thinking about. And obviously the not-so-hidden agenda here is that, yes, we're always looking for people to hire, but we're also working on some problems that I think are interesting research problems. And so if there's something in here that really piques your curiosity, really gets you interested, if you see something in here that you think is not just about writing code but about solving something that fundamentally no one's looking at today, I'd love to talk with you more about it. So please come find me.
What are some of the current challenges that motivate database work today? Let's get this one out of the way right up front so we don't have to talk about it anymore. Cloud. Cloud is this big amorphous thing. No one really knows what it means. No one really cares. It's been ridiculously beaten to death in marketing. You know, it's an excuse to do almost anything these days. As far as I'm concerned, all people mean when they say they're doing cloud is that they're thinking about running in a scale-out model, not a scale-up model, right? You want to be able to get more transactional throughput, handle more users, be more fault tolerant, solve other problems like that by adding more machines. You don't replace a big machine with a bigger machine. Cloud also means that you have to be able to do that in an agile fashion, right? Saying that you're going to add another machine by shutting everything down, and then five hours later starting up N plus one machines, doesn't count. You have to be able to add that machine and keep running. So it's about scale. It's about agility. It can be public cloud, it can be on-premise private cloud, it can be a bunch of VMs sitting under your desk. I don't think that matters, right? The cloud model is really more about that agility: the ability to think about resource management and provisioning, the ability to scale out, and the ability to do all of that as efficiently and as easily as possible. Because people are trying to solve really hard problems on really tight deadlines. People are trying to cut costs where they can, and so these models are really important. And it turns out that for a lot of traditional database architectures, as probably everyone in the room knows, this is hard, right? So that's one thing that's interesting today. Another thing that's really interesting today is what's happening in the database landscape, right?
So eight or ten years ago, people started getting really interested in key-value stores and in systems that either eschewed SQL completely or only took on part of what SQL does; systems that were maybe doing transactions in a limited fashion, maybe doing no transactions, maybe providing transactions for some sets of data, but using this as an argument for saying: this is how you make systems scale. This is how you do that cloud stuff. And as a result, we have this explosion of databases. If you look at the number of databases you could name five years ago versus the number of things that are out there today, you all probably know the number better than I do, but it's like 5x more real systems, right? And I'm not going to say it's 5x the quality, or 5x the problems you can solve, but it's an enormous number of systems that people have built. And what happens after you spend years building all these different systems is that you have a moment where you step back and go, well, this is a pain. Because what now has to happen to solve a set of problems is that I'm running nine different databases, many of which are handling the same data sets. And I've got something that can do ACID, and something that can kind of do ACID, and something that can't do ACID. And this thing can kind of scale, and this thing can kind of do cloud, and this thing can kind of do this other thing. And that's really difficult. And so what we're starting to see is convergence, right? Look at Cassandra: what is Cassandra doing? Well, Cassandra is building CQL. And if you go talk to the DataStax people or the Cassandra leaders, they're all talking about how they're folding transactions back into the model, right? I mean, that's a query language and a transactionally consistent model. You're going to have people coming here over the next few weeks.
Each week it's going to be a different title? Like, are you going to come up with a different seminar title each week, or is it going to be the t-shirt thing? It's going to be... Okay. So from the speakers who come to this seminar week in and week out, you're going to hear some really interesting takes on this, right? They're going to be talking about key-value stores that do ACID. They're going to be talking about relational systems that are designed to handle JSON. They're going to be talking about indexes that are there to solve very different problems. They're going to be talking about trying to solve more than one problem in the same system. And this is really an important theme, right? And I realize that this flies in the face of what some database people will say. If you've heard Mike Stonebraker give a talk in the last 30 years, it's all about specialized, one-off, bespoke databases that solve individual problems. The problem is there are just too many of them right now. And what's happening is that people are trying to figure out how to get back, maybe not to just one database, but back to databases that can do more than one thing. And that's important for a lot of reasons. It's important for pragmatic reasons when you're thinking about all the stuff that you manage in an enterprise. It's important when you're thinking about how, as a developer, you pick the right tools. But it's also important because there are problems that can't be solved by running multiple databases, right? And so people are talking more and more about hybrid solutions. And I'll give you the example du jour that I hear a lot about, which goes by several names. One of them is operational intelligence. One is hybrid operational-analytic workloads. Gartner, the analyst group, in their infinite wisdom, they like to coin terms, have chosen to call this HTAP: Hybrid Transactional/Analytic Processing.
Whatever you call it, essentially what people want to be able to do today is this. They say: I've got my operational database, the database that's keeping track of real-time stuff, session information, account information, inventory, users, things that are happening in real time in a system. And traditionally, I'm going to dump that data periodically into something else, some kind of warehouse, something that can handle large amounts of data, something that can independently run lots of analytic queries and give you a lot of intelligence about what's happening in a system, right? And that's a useful model. The problem is that there's a whole class of problems it can't solve. If I'm sitting in my network operations center, and what I really want is a monitor up there that tells me when something is not running correctly in my system, that has to be real-time, right? Or maybe my business proposition to my customers, the way I make my product better, is that I give them real-time insight into what's working and what's not working with their applications and their users. Or I care about international law, and we'll talk about this one later because it's really interesting, but I care about international law, and I'm not allowed to take data that's in my operational data set and replicate it over into another place. I'm not allowed to maintain the same data set in multiple places. Or I just don't want to, because the burden on me in terms of maintaining audit and maintaining intelligence about where all my data is segmented becomes a wicked pain, right? So what I really want to do is have my operational database and do my analysis there. I don't want to move my data anywhere else. This seems like an obvious thing, but for most traditional relational databases, this is really hard.
Systems that are built on MVCC can do it a lot more easily, because you lose a lot of the inherent transaction-level conflict that you'd otherwise have, but you still have all kinds of problems in terms of disk thrashing, in terms of I/O, in terms of cache, in terms of how clients are connected, how CPU is being used. These kinds of hybrid workloads are increasingly important, and they go hand in hand with this notion of coming back toward common problems that databases can solve. And finally, a trend we've seen in the last five years, unfortunately, is that as we've had this explosion of databases, we've also had an explosion of complexity, right? And to be fair, the gold standard in the database world, Oracle, has always set the bar really low, or really high, depending on how you look at it. Has anyone here ever actually deployed an Oracle database in practice? Well, humans collectively can kind of do that, right? Because step one is you... Don't you need those mutated species called DBAs? That's right. Well, step one, you install the software. Step two, you scratch your head as the developers and the DBAs try to figure out what's going on. Step three, a dozen Oracle consultants come in and things get a little bit better, and then you iterate steps two and three a lot. And they're just screaming at you. Well, I'm not going to claim that's unique to Oracle. But we built these amazing technologies. We built really incredible technologies, right? And as developers and as computer scientists, part of what we do is understand that there's theory, and then there's a time to be practical and actually build things, right? And the NoSQL movement, in my mind, is really the latter. It's saying: theoretically, yes, maybe we could build transactional systems that scale out, but today we need solutions to problems. People are screaming to have their kitten blog accessible by ten times the number of users that are there today.
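As a side note on why MVCC makes those mixed workloads easier, here is a minimal, purely illustrative sketch of the general multi-version idea (a toy versioned store, not NuoDB's engine or any real system's): a long-running analytic read pins a snapshot and stays consistent while operational writes keep committing.

```python
# Toy multi-version store: each key maps to a list of (commit_ts, value)
# versions. A reader pins a snapshot timestamp, so a long analytic read
# neither blocks writers nor sees their later commits. This is only a
# sketch of the MVCC concept; names and structure are hypothetical.

class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> [(commit_ts, value), ...]
        self.clock = 0       # logical commit timestamp

    def write(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))

    def snapshot(self):
        return self.clock    # readers pin the current commit timestamp

    def read(self, key, snap_ts):
        # newest version committed at or before the pinned snapshot
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= snap_ts:
                return value
        return None

store = MVCCStore()
store.write("inventory:widget", 100)
snap = store.snapshot()               # analytic query starts here
store.write("inventory:widget", 97)   # operational writes keep going
assert store.read("inventory:widget", snap) == 100          # stable view
assert store.read("inventory:widget", store.snapshot()) == 97
```

The point of the sketch is just that reads against an old snapshot and writes of new versions never touch the same data, which is why the transaction-level conflict disappears; the I/O and cache contention the talk mentions, of course, does not.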
We've got to figure out how to make that work. And arguably that's not really a critical problem, but we treated it as a critical problem. And so we built systems that could handle that scale. And how did we do it? Well, we said, let's get rid of transactions, which, it turns out, are there to help you, right? They're actually a utility for developers. And so that made writing code harder. And then we said, let's shard things. And that makes code harder, because now your application has deep knowledge of your operations model, deep knowledge of what you're doing in your database. And, by the way, let's deploy this across lots of different systems and assume something about the operations model: that all the disks are the same size, or all the computers are the same kind, or they're all sitting on the same network. Let's assume that when they fail, they fail in very well-known ways, and that bringing things back online is something we don't have to do very often and don't really have to worry about, right? You start embedding all of these assumptions and all of this knowledge up and down the stack: into your application, into your operations, into how your database is actually built and run. You start tuning the hell out of your database, thinking about very specific details of the application workload against your disk or against your network. And you embed an enormous amount of complexity. And now you're trying to scale out, and, as probably most people in the room know, distributed systems are hard. Scaling out from one machine to multiple machines is really hard. And one of the things people are becoming more and more aware of is that if you don't think up front about the issues of complexity and usability and manageability, you're going to build incredible technology that no one can actually use effectively.
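To make the sharding point concrete, here is a hedged sketch of the kind of routing logic that ends up baked into application code under a hand-sharded design. All the names are hypothetical; the point is that the application now hard-codes the operations model, and that "just add a machine" quietly becomes a data-migration project.

```python
# Hypothetical hand-sharding: the application itself decides which
# database host owns each user, by hashing the key modulo the host count.
import hashlib

SHARDS = ["db-host-0", "db-host-1", "db-host-2"]

def shard_for(user_id, shards=SHARDS):
    # Every query path in the app has to route through something like this.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return shards[int(digest, 16) % len(shards)]

home = shard_for("alice")
assert home in SHARDS

# Naively adding a fourth machine remaps most keys (with hash-mod-N,
# roughly 3 of every 4 keys move when N goes from 3 to 4), so the
# "agile scale-out" story collapses into shuffling data between hosts:
moved = sum(
    1 for i in range(1000)
    if shard_for(f"user{i}") != shard_for(f"user{i}", SHARDS + ["db-host-3"])
)
print(f"{moved} of 1000 keys would move to a different host")
```

This is exactly the embedded assumption the talk is describing: the hash function, the host list, and the failure story all live in the application, not the database. (Consistent hashing reduces the reshuffling but doesn't remove the routing knowledge from the app.)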
And so this is something that people are really concerned about right now: the growing explosion of technologies, and also the growing explosion of complexity in using those technologies. And this is as much as anything. Go to Wall Street, go to one of the top 20 banks, sit down with their IT team and ask, what are the limiting factors you're looking at? This is actually one of the key limiting factors. It's not always the technology. I have sat in a room with someone who says, well, we have 90,000 database instances running here. 90,000 database instances. And I'm like, okay, which database? And they're like: yes. We have them all. Good. Well done. I'm like, what versions? Yes. We've got them all; every database, every version. What operating systems? And they're like, mm-hmm, yeah, ones that even test my knowledge of operating systems. I was actually talking with someone recently who asked me if we run on Tandem NonStop. I was like, oh, damn. They're like, yeah, we're running Tandem. We're running the new version. I'm like, really, you're running Tandem in test? Anyway. There are like three people in the room laughing. Thank you, the three people who understood why that's funny. We still sell NonStop, people. And by definition, they always will. You should read the paper. No, no, it's awesome stuff, right? I'm just saying, those people are looking at the systems they have today, and as much as anything, the problem they're trying to solve is just incredible complexity. So those are some of the things that I think are interesting today. And let me talk a little bit about NuoDB now, to give you a sense of why these are interesting problems to me and what one take on them is.
And for context, since I just talked a little about looking across the breadth of technologies, older and newer, just for kicks, this is our team at NuoDB. And I show this slide for two reasons. One, because I'm the young guy on the slide, and it makes me feel a lot better. When I come to a university, I'm usually the old person in the room, so I like this slide, because here I'm the youngest person by ten years. It makes me feel really good. The other reason is to make the point that where we come from at NuoDB is really a fundamentally different view of the world than that of people who were working at some company, saw a particular problem, and said, that looks cool, I want to go off and do it. When you go back to the databases that followed System R, and you look at the original four SQL databases, we have CEOs of three of the four up there. And I don't think we're going to get the co-CEOs of the company that's currently missing from this slide. For those of you following along at home: I used to joke about Larry at this point, and, actually, that makes me sad. It's the first time I can't. I always joke about Larry when I give this talk, and now I can't do it anymore; now I'm sad. Larry, he's still doing something, right? Well, he's probably more effective now. But this is the view we come at things from at NuoDB: we're really looking back over about 40 years of what people have done in the database industry, at what was important from a business point of view, what was important from a technology point of view, what things we really need to keep and what things we throw out and start over on. And that's why I've been at NuoDB now for four years.
And that's why I think this is a really interesting project: it's a fundamental rethink, respecting the past, respecting the lineage of where databases have come from, but trying to understand where we're headed as an industry. How do I phrase this? Broadly, most people in this room are in computer science, and most people in this room understand something broadly about databases and distributed systems. This slide is obviously a gross generalization; it's one of the few things I crib from our marketing team. But I use it to make a point, right? Traditionally, the ways people have tried to take database technologies and distribute them fall into one of a couple of buckets. When you think about things like Oracle or SQL Server, any of the big traditional database architectures, they all have their lineage from System R. They all descend from a database that was trying to do a few fundamental things, and was trying to understand, among other things: how do you build a database with very few resources to bring to bear? Not much disk, not much I/O, not much cache, not much memory, not much processing power. How do you solve some fundamentally heavy-lifting problems on very limited resources? And so I'm going to argue, again in gross terms, that when you look at something like Oracle, its core technology is essentially a file system. It's a mapping from pages on disk to something in memory, with the assumption that as long as you've got a lot of IOPS and really good bandwidth there, you can optimize very efficiently. That's why it works really well on a scale-up system, and that's why it doesn't work in a distributed architecture, right? So what choices do you have? Well, you can take the approach that Oracle has taken with something like RAC, and do a shared-disk system.
You can say, let's essentially make that disk path bigger. And it's not really one disk path; it's lots of disks, with other technology in the way there. But essentially it's: let's make the disk bigger, and then let's put some things across distributed nodes in front of it that do some of the tasks you need a database to do. So: handling client connections, handling lock contention, a few other tasks. But essentially it's still bottlenecked on the disk, right? And so this is something that is very expensive; it's fragile; it's very hard to deploy in practice in a fault-tolerant fashion; and it's not something you can distribute very widely. But if the problems you're trying to solve are around handling more client connections, this may be a great solution for you. And if you have a lot of money, this may be a great solution for you. The other approach is something we've already talked a little bit about today: the shared-nothing approach, or, increasingly, what people think of almost exclusively as a sharded approach. Essentially the approach of saying: let's take a database, split it into lots of sub-databases that don't really communicate with each other, and let them run independently. And it is also a good solution; it solves a lot of problems pragmatically. But it adds a great deal of complexity, because your application logic typically needs to understand something about this model. It limits what you can do in terms of a transactional model. It limits your capabilities in terms of data consistency. Failure models are much more complicated, because you now have disjoint sets of data spread across different machines, and unless everything fails at exactly the same moment, which, pro tip, rarely happens, it's very hard to understand what has failed when, and how you get back to a correct system.
And the argument, of course, is that you don't care, because you've thrown out consistency anyway. Haha. See what I did there? Well, I really apologize; usually my jokes are a lot funnier, but apparently they don't play well here. I was in France recently, giving this talk to an auditorium of 300 people, and I made a joke that I thought was going to be really funny. I got all blank stares, and I was like, well, I guess that joke doesn't translate well here in France. And then everyone roared with laughter, and I was like, well, that was interesting. I bet most people in this room have at least read the Spanner paper, or the F1 paper, or, is it Mesa, the new one? Is that right? Some 75% of people who read the Spanner paper don't understand it. No, but what's amazing is that there's like 25% of the average computer science population that can read it and understand it. Which is in itself, I think, a remarkable achievement, because there's a lot of complexity in there, and they've managed to capture it in a reasonable fashion. But I agree, it's not for everyone. If you have trouble sleeping tonight, I highly recommend downloading it and checking it out. Essentially, this is the core technology that Google relies on today. Ten years ago, Google started looking at problems of scale and said, relational databases aren't cutting it for us. And so eight, nine years ago they announced they had this thing called Bigtable, and that really was the start of the whole NoSQL revolution in a lot of people's minds. It was Google coming out and saying, this is the way to go forward, and everyone followed them. So now Google has gone back the other way and said: actually, you know what? Everything we need, everything we care about, has to sit on top of a transactionally consistent system. We can't do it otherwise. There is contention within Google. There is contention within Google, absolutely. There's great contention within Google.
But I submit to you that AdWords, the one product they have that makes money, has been completely rewritten to sit on top of a relational system. So I agree completely: there is great contention about whether this is the future path for everything they do at Google. But a number of their core technologies now rely on Spanner and F1, and it's a really cool technology. It's a really interesting system they've built. It does require piles of atomic clocks. It does require lots of really high-speed networks. It does require deep knowledge of a very specific network architecture that your schema maps to, and that you assume your queries map to, in very specific ways. And if you can do that, well done. For the rest of us, however, who don't want to invest in hardware that way, we've built what we think of as a different solution, a fourth solution here. This is something we call a durable distributed cache. Essentially, it's a peer-to-peer system. It's a system where there is no notion of ownership. There's no notion of a master. There's no notion of active-passive. There's no notion of explicit replication with delays of any kind. It's a system that presumes you want to solve all that stuff we talked about at the start of the talk, and that part of how you do that is by running in memory, with some kind of caching component. And so core to this architecture is that we're driving transactions in cache; we're running in memory. But you can't have ACID transactions without durability. And so it's a caching scheme that understands the rules of atomicity, consistency, and isolation, but also respects the fact that durability is a first-class component in any transactionally consistent system. And if you squint really hard, this is what our architecture looks like.
And I mean you have to squint really hard to make it look like this, and probably you shouldn't do that, because you'll also start seeing things floating around everywhere. But this is what we're really talking about. A database in our model is not any one host. It's not any one disk. It's not any one thing. A database is a collection of independent peers. And by independent, I really mean peers that are all equivalent, that all have exactly the same capabilities, that can do exactly the same things, and that can take over for any other peer. And so it really is a true distributed system, in the sense that it's a number of nodes that are each independent and doing their own thing, but know how to coordinate with each other. So it's fully symmetric. Everybody can do any role? With one caveat. And that's what you see in this green, brand-standard green, on top, and brand-standard gray on the bottom, which doesn't come out very well on these slides. We're a small database startup, but, sadly, we're just big enough that we have things like brand standards now. So I get to gripe about that in talks, at least. When you start a NuoDB peer process, you tell it to play one of two roles. You tell it to play what we call the role of a transaction engine, which are the ones on top, the green ones, or you tell it to play the role of a storage manager. It's the same executable; it has all the same functionality. It understands all the same rules about coordination, all the same rules about how you share cached information and how you drive transactions, but it's told to play one of these roles. And the transaction engine is playing the in-memory role. So these guys up here: completely in memory. No disk, no reliance on disk, no knowledge of disk. SQL clients connect to them. They understand the rules of atomicity, consistency, and isolation. They know how to drive transactions. They know how to parse SQL. They know how to do SQL optimization.
They understand how to form a query plan. They keep data in cache. When they need something they don't have in cache, they know how to coordinate with their peers to find it. And I mean any of their peers: if I take a miss on something, but the peer sitting next to me has it in memory, I can go there to get it. That solves a lot of basic problems. The other role one of these peers can play is that durability role. That's something a client never talks to. It doesn't care that the front end is SQL, doesn't care what's happening on the front end at all; it's there solely for the purpose of data durability. And there are a couple of interesting things you can already see on this slide. One of them is that, internally, what we've built is a distributed object system. It's not a SQL database. What we've built is a distributed object system that understands the rules of ACID transactions, and understands the problems you care about when you're talking about SQL, things like indexes, things like schema, but is fundamentally built around the assumption that what you're doing is sharing objects around the system. And what you care about is how you coordinate on them, how you cache them, how you maintain durability, how you maintain consistency and correctness in the system. And so when we're talking about storage, we're really talking about key-value. We're talking about just storing objects, which simplifies the storage tier dramatically. Right? And you're going to see another take on this in a few weeks, because a pretty cool guy named Ori is going to come in from FoundationDB. And he's going to talk about a similar idea: having a key-value store and then putting SQL on top of it. And I'll say this up front. First of all, when he comes in, you're going to know it's Ori because, between him and me, you get a person with a normal amount of hair. So, um, Ori's a good guy.
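The peer-coordination idea above can be sketched in a few lines. This is a deliberately simplified toy, not NuoDB's code: every peer runs the same class, "transaction engines" are memory-only, a "storage manager" also keeps a durable copy, and on a cache miss a transaction engine asks its peers' memory before falling back to durable storage. All names are hypothetical.

```python
# Toy sketch of symmetric peers with two roles. A real system would add
# versioning, invalidation, and network messaging; this only shows the
# lookup order: own cache -> any peer's cache -> a storage manager's disk.

class Peer:
    def __init__(self, role, peers):
        self.role = role                 # "TE" (memory only) or "SM" (durable)
        self.cache = {}                  # in-memory objects
        self.disk = {} if role == "SM" else None
        self.peers = peers               # shared list of all peers

    def put(self, key, value):
        self.cache[key] = value
        if self.role == "SM":
            self.disk[key] = value       # SMs make the object durable

    def get(self, key):
        if key in self.cache:            # hit in my own memory
            return self.cache[key]
        for peer in self.peers:          # miss: try any peer's memory first
            if peer is not self and key in peer.cache:
                self.cache[key] = peer.cache[key]
                return self.cache[key]
        for peer in self.peers:          # last resort: durable storage
            if peer.role == "SM" and key in peer.disk:
                self.cache[key] = peer.disk[key]
                return self.cache[key]
        return None

peers = []
te1, te2 = Peer("TE", peers), Peer("TE", peers)
sm = Peer("SM", peers)
peers.extend([te1, te2, sm])

sm.put("obj:42", "hello")
sm.cache.clear()                         # age the object out of SM's memory
assert te1.get("obj:42") == "hello"      # served from durable storage
assert te2.get("obj:42") == "hello"      # now served from te1's cache
```

Note how the roles differ only in whether `disk` exists; the lookup and coordination logic is identical in both, which is the "same executable, two roles" point from the slide.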
Hey, Ori, if you're watching this at home. He's also ex-Israeli military. Yeah, seriously, don't mess with the dude. So that's another take on this idea. But what I want to be clear on is that this is not a case where we've just said, here's SQL, and it's sitting on top of key-value. Fundamentally, there is this notion that I can be running a transaction here, a transaction here, and a transaction here, and I'm maintaining consistency correctly across all of them. Each of these can be working on the same data sets. Each of these understands the rules of the system. And so what I end up with, what that blue box there represents, is that all of these peers together express what looks to a client like a single logical database. There's no sharding. There's no notion of active/passive. There's a single JDBC connection string, or ODBC connection string, or .NET connection string, or PHP or Python or Ruby or whatever it is you want to program to our system in. There's a single connection string. And whether you're running a NuoDB database on your laptop, on your desktop, on a couple of machines, in a public cloud, across a couple of clouds, or around the world, it doesn't matter. It's a database, right? And that's one of the fundamental building blocks we use to solve some other interesting problems that you don't get from any other system out there. And I think that's kind of cool. Two other comments about what you're seeing on the screen in front of you. One is that blown-up box in the upper right corner. This is a system that understands objects, as I said. And that means it has some front-end notion of SQL. And by front end, what I really mean is that we don't care that it's SQL that we support. Tomorrow, if someone came along and said, I really want you to support something that's native to JSON, or I really want you to support SPARQL, God knows why you'd want that.
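The "front end is just a layer" point can be sketched in a few lines. Everything here is hypothetical, my toy illustration rather than anything NuoDB ships: different query dialects compile down to operations on the same underlying object engine, so swapping the language does not change the engine.

```python
# My own sketch of the "front end is just a layer" idea: different query
# languages parse down to operations on the same underlying object store.
# The toy parsers and toy "languages" here are hypothetical.

objects = {"user:1": {"name": "Ada"}}

def parse_sqlish(query):
    # stands in for a real SQL parser; handles only "GET <key>"
    _, key = query.split()
    return key

def parse_jsonish(query):
    # stands in for a JSON-native front end
    return query["get"]

FRONT_ENDS = {"sql": parse_sqlish, "json": parse_jsonish}

def execute(dialect, query):
    key = FRONT_ENDS[dialect](query)   # the front end varies...
    return objects[key]                # ...the object engine underneath does not

print(execute("sql", "GET user:1"))
print(execute("json", {"get": "user:1"}))
```

Both calls return the same object, which is the point: the dialect is a dispatch decision at the edge, not a property of the engine.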
But if I really want you to support SPARQL, or some other kind of language, that's fine. But as we'll talk about in a few slides, it's not just that you want that front-end language; it's that you want the whole architecture to understand the new problems you're trying to solve that motivate you to want that different front end. And that's where this gets interesting. Because this is an architecture that can adapt not just in the front-end language but fundamentally throughout, to understand the problems that are motivated by the model you're trying to support. The other interesting thing here is that management client up there. Something we've built into this product as a first-class citizen is a management, provisioning, and monitoring tier. When you install our software, and you can grab our software for free, you can install it on Windows, Linux, or Mac; you can go public cloud or your laptop. You can do mixed models: one database can span Windows and Linux; you can run on your laptop, a fast machine, a slow machine, a VM, whatever; it doesn't matter. When you install our software, you're not starting a database. What you're doing, in the cloud sense of the word, is provisioning that host, and as you install it on more machines, you're peering those machines together and provisioning more resources; you're increasing the pool of resources you can bring to bear to solve database problems. And then what our tools and our APIs give you is the ability to see what's actually available and decide where you're going to run which databases, how you're going to move things around, and how you're going to optimize for resource utilization. So this is a thing we built in to do management more efficiently and make it much simpler.
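That provisioning model, hosts enrolled into a pool first, databases placed onto hosts as a separate decision, can be sketched as follows. The `Pool` class and method names are my own invention for illustration, not NuoDB's management API.

```python
# Sketch (my illustration, not the real management API) of the provisioning
# model described above: installing the software enrolls a host into a
# resource pool; starting a database is a separate placement decision.

class Pool:
    def __init__(self):
        self.hosts = set()
        self.databases = {}   # database name -> hosts it runs on

    def provision(self, host):
        # the host is now available capacity, but runs nothing yet
        self.hosts.add(host)

    def start_database(self, name, on_hosts):
        missing = set(on_hosts) - self.hosts
        if missing:
            raise ValueError(f"hosts not provisioned: {sorted(missing)}")
        self.databases[name] = list(on_hosts)

pool = Pool()
for h in ["host-a", "host-b", "host-c"]:
    pool.provision(h)                       # three hosts in the pool
pool.start_database("blog", on_hosts=["host-a", "host-b"])
print(sorted(pool.hosts))                   # capacity available
print(pool.databases["blog"])               # where the database actually runs
```

The separation is the design choice being described: operators reason about the pool and placement, while developers only ever see the single logical database.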
It's also something we built in so that you can start to answer questions about security, because you can now talk end to end about the audit trail: not just who could do a SQL query, but who provisioned machines, who defined policies about where data lives, and who made decisions about whether something was allowed to run in a particular safe zone or unsafe zone. So you have much greater control over access to the system operationally; you can make interesting operational decisions, and you can make them separately from the developers who are actually writing code. Our marketing team likes to come up with big, bold, brash claims, and so this is the other marketing slide I've stolen, although I changed the title because the title was a little too big, bold, and brash. But when we talk about the core capabilities this architecture enables, what do we think they are? As I said earlier, we fundamentally stepped back, threw out existing architectures, started over, and built something that doesn't look like a traditional database, because there were problems we were trying to solve. Here are some of the baseline problems we think this lets you attack. One of them is scale-out: on demand, being able to add nodes to a database and just get more transactional throughput, and then, when you don't need those resources, shutting nodes down. So on demand you can scale a database out, and you can make it scale back. Why can we do that? Because when you're adding nodes in our system for additional transactional throughput, you're adding things that are in memory, and literally those nodes come up in tens of milliseconds in a running system. So this is something you can do very fast, something you can make decisions about, like: suddenly my blog is getting slammed and I need 10x capacity, so you spin up more machines, and when you don't need them, you spin them down. Does a transaction use the memory of all the machines? No: in today's
architecture, a transaction runs only on a single machine. Oh, sorry, sorry, I misunderstood your question. Yeah, I hate to do it, but can we talk about that after the talk? Because I'd have to explain the locking mechanism, and I'm not going to get to that in this talk. A transaction that's running can be holding locks in multiple places. This is a version-controlled system, so those are for write-write conflicts; this is an MVCC system. But answering that question actually requires me to answer something else first that's not obvious about this architecture, and that's going to take me twenty minutes, so let me punt on it and we'll come back to it. One of the nice things about a system designed to scale out dynamically like this is that you get another thing essentially for free: a feature people increasingly want, continuous availability. And I say continuous availability, not high availability, because high availability is about saying, I know the system is going to fail eventually, it's all going to go down, but let me try to minimize the amount of downtime I expect per year. When you're talking about continuous availability, you're talking about designing with the goal of never taking the system down. Now, in practice, systems will always fail. I'm not going to claim we've guaranteed 100% uptime, because that would be hubris, among other things, but that's the design goal: a system that never has to be taken down. And because we have this peer architecture, if you want to upgrade a machine, you can just shut down the processes on that machine and upgrade the OS or the hardware; you can upgrade the NuoDB software; you can go through and do rolling upgrades of our software. When something fails, you can bring something else online to compensate automatically; you can bring something up proactively and then start going through and doing rolling upgrades.
You've never lost any availability. It's designed with the goal of being able to do live upgrade. By the way, we can do live schema evolution in the system as well, because of that object model I talked about earlier; online changes to your schema happen in real time, also in this always-evolving model. So it's designed to never have to go down. Yes? Cache coherency? Awesome question. I'm not going to talk in this talk about how we maintain cache coherency; I'm going to give you a pointer at the end to our developer pages, where we talk about it in great detail, and again, I'm going to be hanging around today and I'm very happy to talk through it all. It's just too much to do in an hour, I apologize. But that is core to what makes this peer-to-peer system work; it's absolutely the right question to ask, and it's core to why we can do another really interesting thing: we can make one of these databases scale not just within a data center but across geographies. Increasingly, geo-distribution is something people talk about, not just for failure tolerance, not just for better uptime in case a meteor hits one of your data centers, but because people are distributed. You may have an application with users in more than one location, and you want to give them low-latency access to their data; or you may have users who travel, or users who share information across geographies. You want low-latency access, you want consistent access, and, if you can avoid it, you don't want to write your application to make assumptions about how you're deploying your database to solve these problems. I'm going to dig into this one in a second, and it'll get to part of the cache coherency question, because that one sounds a little bit magical, I know, and I don't want to just say, hey, geo-distribution, and move on. So I'm going to justify that we can do that one. The fourth one
that's kind of interesting is multi-tenancy. People increasingly care about multi-tenancy because they care about service deployment models, and multi-tenancy is really what you're talking about when you're talking about service deployment models. This peer-to-peer architecture lets us run a single database on a machine, or hundreds of databases on that machine, or thousands. What I really mean is that individual databases are served by separate peer processes that are physically isolated, writing to different locations, using different security credentials for their communication, and sitting in different processes, so they have OS-level isolation. So you get a great deal of flexibility in managing multiple databases, you get guarantees about the security model, and you get the ability to manage each of those databases independently in terms of the resources it uses. You could put a hundred databases on a host, and if one of them really needs more capacity, you just move it somewhere else, and you can do that live, without disrupting the application; you just move it and go on with your life. And the fifth capability is not really a core capability; it's, again, how we think about all of this: simplifying everything, even as we're building more complicated systems, and I'll give you a demo of that in just a second. We're tight on time, so I'm going to run through a couple of pictures. This is the scale-out picture we were talking about: the ability to add in-memory nodes on demand lets us scale out really effectively. This is the continuous-availability picture: the ability to start new nodes to compensate for something that has failed, and the ability to run with data durable in multiple locations, mean you're always getting ahead of failures and always able to run through and do upgrades. Then geo-distribution, and this one's pretty interesting. I said that these in-memory peers form a cache, and
what I didn't explain is that each of these peers is an on-demand cache. Each of the in-memory peer processes we're running, each of the places where transactions run, forms its cache based on the transactions running there. If I'm sitting in Boston and Andy's sitting here in Pittsburgh... I started to say CMU, and then I realized I was mixing one thing with another. Okay, there we go. So I'm sitting near MIT and he's sitting near CMU, and we have a database made up of peer processes running at MIT and at CMU, and he's doing work. He's looking at his account, I'm looking at my account, and look at the caches that are going to form in these two locations. It's still a single database; the rules of transactional consistency still apply; if he and I start contending on the same data, our database handles it. But if the thing you care about is low latency, this will work well for a large class of applications. And there are such applications: operational ones that manage session information, user information, accounts, logging; social applications; applications that follow the path of the sun, where someone in New York wakes up and starts looking at financial data and goes to sleep just as someone in London wakes up to do the same. For those kinds of applications, and that's a large fraction of the applications people actually want to distribute to multiple physical locations, what happens in our database is that naturally disjoint caches form. And because these disjoint caches form, and because the way we handle transactional consistency and cache coherency is based essentially on internal knowledge of where objects are replicated in cache, and again, I apologize that I don't have the time to go into all the details, part of what we've built is a self-describing name service for all of this data. So for all these objects in the system, everyone knows where the objects are
replicated, and only the places where an object is replicated care about participating in the protocols around write-conflict communication. And so what you end up with is a database that still looks like a single logical database. You don't have to have your application developer write special code; you don't have to configure anything ahead of time with knowledge of where your data will be accessed; the database naturally does the right thing, and you typically end up running at LAN latencies. One really important feature that helps here is something we call our commit protocol, and the commit protocol is one of the few tuning points we expose in our product. The commit protocol is the thing that lets you say: what is the contract between that in-memory tier and that data-durability tier that has to be met before I acknowledge back to a SQL client that a commit has succeeded? This is something we let you choose. You can pick something that looks like K-safety: you can say, send a message to K of my durable peers, tell them about this change, and I'm done. You can do that, and then say, actually, also make sure at least one of them has responded to say it's on disk, or two of them, or however many. You can also say, I need acknowledgment from one physical region, or two regions, or N regions. So essentially we let our users decide where they fall on the spectrum between super low latency and super data-safety paranoia, and on a database-by-database basis you decide where your database falls. As a result, you can tune this one parameter and get the safety guarantees and latency trade-offs you need. You said database by database, so you can't set it at a finer granularity within a database? This is the critical question. We've actually implemented it on a transaction-by-transaction basis. We then spent several hours in front of a whiteboard coming up with some
awesome scenarios where you can cause data to fail, or cause the database to fail, or cause data to get rolled back in unexpected ways; you can induce failures that result in certain transactions never having happened, or not having happened the way you expect. And so we decided we would start at the database level. I actually get that question a lot, and my answer to our prospective users is always: come to me with a concrete problem where you need it at a more granular level, and we'll work with you on how to do that in a safe way. And actually, I'll throw that out to this audience: that's an interesting research question. Honestly, it's equal parts database theory and human-computer interaction, and if anyone's interested, I'd really love to talk about ways to start exposing those kinds of knobs to a user in a way that's actually safe. That's a really interesting, subtle question. I have a live demo to show you, so I'm going to basically skip two slides, but if you're curious, you can go to the Google developer site and find an awesomely hilarious video of us doing a live benchmark on GCE. We started with zero nodes; over five minutes we went up to 32 VMs on GCE; we hit about two million. This is YCSB, not a very interesting transactional benchmark, but, you know, sub-millisecond latency on transactions as we scaled out. And it's awesome because at about five minutes in, the wifi on the Google campus goes down, and you get to watch us scramble like crazy, pretending everything's fine while they try to fix it. It's really entertaining. That's me. And then, on the other end of the spectrum, if you're interested in the multi-tenant part, we've done a lot of writing on our tech blog. We have some videos, and we did a project with HP on a piece of hardware they built called Moonshot. Moonshot, in its first version, is essentially 45 Atom
processors in a box. They came to us and said, we want you to demonstrate the biggest database ever done on small hardware, and we said, we can't do that, because you've got what amounts to 20 cell phones in there. But what we can do is run a whole lot of small databases and show you something about extreme efficiency. And this one's actually pretty cool, I think; it's really much more about resource efficiency and total cost than about what you can do with a single database. We've written about this in great detail on our tech blog, and I highly recommend checking it out. I'm going to do a really quick demonstration of something. Wow, I'm not going to do it, because the resolution just got bumped to nothing. Hey, technology. Okay. What you're actually seeing, and you're the first people to see this demoed, is a bleeding-edge developer build of... son of a... really? It's a good thing someone in the room is paying attention, thank you. So we've been working on a new release. We've got a beta version up on the website today; it'll be an actual finished release in probably about a month, but we haven't started any public previews of this, so you're getting a first view. This is a web UI we've been building to help with that simplification and out-of-the-box experience. It's built on top of core APIs, so if you don't want to use our web UI, we have a CLI; if you don't want to use the CLI, we have REST APIs, Python APIs, and Java APIs. So you can take all our monitoring and management capabilities and plug them into whatever tools you like. But this is just a simple view, and what you're looking at is live; this is running against, in this case, virtual machines on Amazon, and from a single point I'm able to monitor resource utilization across a set of systems. I'll explain what we're looking at in a second, but what I really want to show you is something we've been
building that I think people traditionally don't think of as a database problem, but which I'm going to argue is increasingly a database problem: how you actually orchestrate, and then automate, running databases across infrastructure. As systems get more and more complicated, as you start talking about running across geographies and across different kinds of resources, how do you solve these problems in a way where the database keeps doing what it's supposed to do, so your developers can actually solve application problems? We've built into this product something we call templates, and a template is essentially an SLA. It's a way of starting a database based on a problem you want to solve, and then a way of automating the process: not just monitoring that you're running at spec, but automatically reacting to failures or to new resources and keeping your database running at the levels you want. And you're really going to know this is a bleeding-edge live demo, because in about a minute and a half you're going to see a bug, and I'm going to explain why that bug is interesting. And Andy wanted me to make sure you all know that I'm a powerful CTO and I fired the guy who introduced the bug. So, we're going to do this. I'm going to start a database; I'm going to call it CMU; I'm going to pick a DBA username and password. If anyone in this room has used RDS on Amazon and wondered why you have to walk through five screens and answer about forty questions to start a database, this is what you wanted, just by the way. I'm going to pick the single-host template, say get a database running in one place, and click submit. It's going to... why is my... no... and so now I've got a database running. I didn't have to do a lot to make that happen. And as I said, this is a single logical database. It's got a JDBC connection string or whatever; it's just available; you can run SQL against it; it's fully durable. It's made up of two of those peer processes we were talking about: a transaction engine, which is the in-memory piece, and a storage manager, which is the durable piece. And it's great; I think it's pretty cool that you can just do that, get a database running, that easily. I could have done the same thing by invoking a REST call or using our Python API, so you can script this really easily if you just want to rinse and repeat and create lots of databases. But this isn't very interesting yet, because it's still running as two processes, which is good, on one host, which is less good. So now I'm going to say: over time, I'm ready to start taking this database into production. And so, live against this database, I'm going to say: don't run single-host; run in what we call a minimally redundant fashion, a fashion where any host could fail, any one of the Amazon VMs could fail, and the database is still fully available; I still have all my data available in its durable state, and I can still get to the database from that in-memory point of view. We've got a nice little redraw bug here in the console. And so that's what it's done: I've just changed that SLA, I've changed the template, and the system has gone off and reconfigured itself, and now it consists of four processes running across three hosts. The application did not change; in fact, if an application had been running against this database, it would have kept running, not caring that the database had reconfigured itself. But now I'm solving a different set of operational problems: I can handle different load, different capacity. And that's pretty cool, right? And I can keep going like this. I can now say, you know what, here's what I really want to do, because what I haven't told you is that we're running on Amazon and I've actually got VMs running in three Amazon regions: in Virginia, in California,
and Oregon, so all around the country. And now I'm going to go to my database and say: hey, CMU database, what I actually want you to be now is a geo-distributed database. We hit submit, and here's where the bug comes in, and here's why layered distributed systems are a great thing. What's going to happen is it's going to go off, think for a minute, and move the processes around, and then it claims it's now made up of 10 processes across 8 hosts, which it's not, actually. And I know it's not, because if we actually look, here's the fabulous JSON description of what it is, and we can see it's actually made up of 16 processes. So, hooray, APIs. For that matter, I can pull up the command line here and inspect at this level too: I can say summary, and let's find our little database. CMU, there it is. I realize that font is really small, but that's exactly what I wanted it to do: it expanded to use all 12 hosts I had running. And this is a great example of why you build things that are distributed: you assume things will fail, so you have insight at multiple layers of the system. So, kids, always do that, because some day you'll want to give a demo and discover that the bleeding-edge developer build has a bug, but you'll still be able to show what you wanted to show. That may not apply to everyone in this room, but some day it may, and then you'll be really glad you heard this. So what's the point? The point is that in a matter of two minutes I have taken a database, gotten it started on Amazon, scaled it from one machine to three machines, and scaled it from three machines to 12 machines in three regions, without disrupting the application, without changing anything about the model, and without changing anything about what the developer has to think about. It's still a SQL database; it's still transactionally consistent. And so when we go back
now to the slides from earlier, we had all of these topics we were talking about. And I realize that, even having started five minutes late, we're now five minutes over, so I'll wrap up. What I was talking about earlier was this pragmatic movement, where we were solving problems that way because we could not figure out how to do some fundamental things: we could not figure out how to scale a transactionally consistent system on demand; we could not make transactions scale effectively in this kind of cloud environment. But we're interested in how we simplify our deployments, we're interested in hybrid models, and we're interested in multi-model deployments. So where's the future going? Why do I think this is interesting? Because what I just demoed is, I think, pretty interesting; it turns out a lot of large companies today want this and are pretty excited by it. But I'm the CTO, and what I think about is where this takes us next, not what we do today. And that's why I'm here at CMU: because you are thinking about what's next, not what we do today. So, what's next? When you have building blocks like this, there are interesting problems you want to think about. One problem very much at the forefront of people's minds today is this question of multi-model: how does one of these databases handle hybrid workloads? Not just, how do I speak a different front-end language; not just, how do I layer in a different storage engine; but how do I think about it holistically? If I have a relational database, it's because I know the queries I'm going to run are relational, because my data models represent something tabular where you can connect the dots. That's how my queries are going to run; that's why I have indexes the way they are. And if I wanted it to be a graph database, what does that mean? Well, it means I need a different front-end language, probably talking about something
like RDF as a data format, and probably more interested in column than row. Why am I more interested in column than row? Because the model lends itself very well to the kinds of problems column stores tend to be optimized for: vertical partitioning, compression in very specific ways, and understanding something about the parallelization and distribution of queries in a way that SQL queries do not need to be, and often cannot as easily be, distributed or parallelized. So it's really not just about a front-end language or a storage model; it's about the fundamental problems you're trying to solve and what your architecture lets you do. And let me tell you, there are a lot of people out there today who are really interested in how, from one data set, they can think about both of those kinds of models and wrap a single transactional boundary around them. That's a really interesting problem that people care about today. Another really interesting problem: all that automation stuff is great, but now go to a telco and ask them why they talk about resiliency and not redundancy, and ask them how they actually do management. I was talking to a telco recently that handles, I can't tell you the number, but a lot of the world's traffic, with a lot of stuff running through their data centers. And by a lot of the world's traffic, I mean they do a lot more than Netflix does in this country, and a lot more than YouTube does in this country, and Netflix and YouTube represent 50% of this country's data on the public internet. They're able to manage their entire system from two data centers; there's a room, and in each room there are four monitors. How do they do that? They do it because everything in their system is driven by knowledge of policy and knowledge of SLAs. It's all on autopilot; you don't tell anyone anything's going on until something really bad is going on. But you build in up front the
ability to express the problems you want to solve and what the failure models are. And while traditionally these haven't been database problems, like I said earlier, increasingly it's on all of us as data people to understand that these are our problems to solve, and that the people who can embed this kind of intelligence into data management systems are solving the next generation of problems people care about. And where that gets really interesting is around the topic of data residency. If you've been to Europe recently, you've been to a place that looks like a single country but is actually many different countries with many different sets of laws. Let's say you wanted to put up a global service running across the EU, so that EU residents can get at their data wherever they travel within the EU. That would be really cool, except it turns out that if one of your users is German, their data has to stay on disk in Germany; if they're a French user, their data has to stay on disk in France; and, dear God, the Europeans don't want their data stored on disks in the US. And we're starting to feel maybe the same way about the rest of the world; Brazil is enacting laws, Russia is enacting laws, China is enacting laws. So on the one hand, people want databases that are global, that span these regions; on the other hand, if they're disjoint databases, how do you understand all the different data flows? How do you understand where data was copied? How do you audit it two years later, when someone comes back and says, can you prove to me that you really met the rules of this country? How do you do that if data is flowing in and out of lots of different databases? This is the problem, by the way, in every enterprise, every Fortune 500 shop I go into; this is the question everyone asks: how are you solving this? And if you want to try to bite off a really interesting next-generation
data problem, please come talk to me, because this is something we're starting to put together, something that really is a pure research project, and I'd love to get you involved, because this one's pretty exciting. dev.nuodb.com is where you'll find our discussion forums; it's also where you'll find our documentation, if you want to learn how to run the product and the details of our SQL and everything else; and it's where you'll find our engineering blog, where we write about how the core technology works. And that's my contact info, if you're interested in learning more about the technology, in coming to join us, or in coming to hang out in Boston. I apologize that I've gone ten minutes over, but hopefully this was interesting, so thank you, everyone. We have time for one very, very quick question, from anyone. Yes? Your database has processes playing one role for transactions and one for durability; if you offered a database without the durability piece, customers would really love it for things like user sessions, so is that a model you consider? We support that. You can actually run without that durability tier; you have to explicitly opt into it with a command-line flag that says, I'm going to run in a non-durable fashion. It's not something most of our customers are interested in, so we treat it more as a testing and thought-exercise thing, but if you're curious, I can show you how to do it. Thanks again.